iotaledger / inx-tendercoo

INX-Tendercoo enables a committee of validators to operate as a distributed Coordinator using Tendermint Core BFT consensus.
Apache License 2.0
1 stars 0 forks source link

No recovery after "broadcast did not pass CheckTx: invalid state" #47

Closed shufps closed 1 year ago

shufps commented 1 year ago

When restarting all tendercoos after the latest problems, there are logs like:

2022-12-25T09:52:48Z    INFO    Coordinator     Providing Tendermint ... done
2022-12-25T09:52:48Z    INFO    App     Loading core components ...
2022-12-25T09:52:48Z    INFO    App     Loading core components: INX ... done
2022-12-25T09:52:48Z    INFO    App     Loading core components: Coordinator ... done
2022-12-25T09:52:48Z    INFO    App     Loading core components: Shutdown ... done
2022-12-25T09:52:48Z    INFO    App     Loading plugins ...
2022-12-25T09:52:48Z    INFO    App     Loading plugin: Profiling ... done
2022-12-25T09:52:48Z    INFO    App     Executing core components ...
2022-12-25T09:52:48Z    INFO    App     Starting core component: INX ... done
2022-12-25T09:52:48Z    INFO    App     Starting core component: Coordinator ... done
2022-12-25T09:52:48Z    INFO    App     Starting core component: Shutdown ... done
2022-12-25T09:52:48Z    INFO    App     Executing plugins ...
2022-12-25T09:52:48Z    INFO    App     Starting plugin: Profiling ... done
2022-12-25T09:52:48Z    INFO    App     Starting background workers ...
2022-12-25T09:52:48Z    INFO    Coordinator     Starting Coordinator ...
2022-12-25T09:52:48Z    INFO    Coordinator     Coordinator started     {"publicKey": "d9922819a39e94ddf3907f4b9c8df93f39f026244fcb609205b9a879022599f2"}
2022-12-25T09:52:48Z    INFO    Coordinator     Starting Coordinator ... done
2022-12-25T09:52:48Z    INFO    INX     Starting NodeBridge ...
2022-12-25T09:52:48Z    INFO    Profiling       You can now access the profiling server using: http://0.0.0.0:6060/debug/pprof/
2022-12-25T09:52:48Z    INFO    Coordinator     Starting TangleListener ... done
2022-12-25T09:52:48Z    INFO    Coordinator     Starting Tendermint Node ...
2022-12-25T09:52:48Z    INFO    Tendermint      service start   {"msg": "Starting Node service", "impl": "Node"}
2022-12-25T09:52:48Z    INFO    Coordinator     proposing parent for milestone 7296251
2022-12-25T09:52:48Z    WARN    Coordinator     defaulting to last milestone as tip: failed to select tips: no tips available
2022-12-25T09:52:48Z    INFO    Tendermint      serve   {"module": "rpc-server", "msg": "Starting RPC HTTP server on 127.0.0.1:26657"}
2022-12-25T09:52:48Z    INFO    Tendermint      service start   {"module": "p2p", "msg": "Starting P2P Switch service", "impl": "P2P Switch"}
2022-12-25T09:52:48Z    INFO    Tendermint      service start   {"module": "evidence", "msg": "Starting Evidence service", "impl": "Evidence"}
2022-12-25T09:52:48Z    INFO    Tendermint      service start   {"module": "statesync", "msg": "Starting StateSync service", "impl": "StateSync"}
2022-12-25T09:52:48Z    INFO    Tendermint      service start   {"module": "blockchain", "msg": "Starting BlockchainReactor service", "impl": "BlockchainReactor"}
2022-12-25T09:52:48Z    INFO    Tendermint      service start   {"module": "blockchain", "msg": "Starting BlockPool service", "impl": "BlockPool"}
2022-12-25T09:52:48Z    INFO    Tendermint      service start   {"module": "consensus", "msg": "Starting Consensus service", "impl": "ConsensusReactor"}
2022-12-25T09:52:48Z    INFO    Tendermint      Reactor         {"module": "consensus", "waitSync": true}
2022-12-25T09:52:48Z    INFO    Tendermint      Saving AddrBook to file {"module": "p2p", "book": "tendermint/config/addrbook.json", "size": 9}
2022-12-25T09:52:48Z    INFO    Coordinator     Started Tendermint Node: Address=0e27bbbe79f7d010439c78273d1b35f1ce4cfaf0@0.0.0.0:26656, ConsensusPublicKey=d9922819a39e94ddf3907f4b9c8df93f39f026244fcb609205b9a879022599f2
2022-12-25T09:52:48Z    DEBUG   Coordinator     broadcast tx    {"parent": {"Index":7296251,"BlockID":"30ef42d339634a5d5c69a427869b06827765af80f5ed3eb4dd41c29506903130"}}
2022-12-25T09:52:48Z    WARN    Coordinator     failed to propose parent: invalid proposal: already applied
2022-12-25T09:52:48Z    INFO    Tendermint      Dialing peer    {"module": "p2p", "address": "367ca8b17443585a2d013da2916c10bfec29d7f7@54.219.238.115:26656"}
2022-12-25T09:52:48Z    WARN    Coordinator     broadcast did not pass CheckTx: invalid state

After the last line there are no further lines containing Coordinator.

Not even a couple of minutes ... It seems something is crashing silently in the Coordinator and never recovering.

It's not clear if this error normally shouldn't happen at all or there should be some way of cleanly recover from this :man_shrugging:

shufps commented 1 year ago

Hmm, the invalid proposal: already applied also happened here: https://github.com/iotaledger/inx-tendercoo/issues/45

Wollac commented 1 year ago

Unfortunately, this was to be expected. If the situation as described in https://github.com/iotaledger/inx-tendercoo/issues/45#issuecomment-1365812617 occurs, we have already written an invalid state to the Tendermint blockchain (all validator proposed invalid parents) and without resetting the blockchain there is no way to fix this.

An alternative would be to allow validators to propose more than one parent (and then always select the latest parent proposed) even though in a happy world it should never happen to propose more than one, we could consider this to make the design more robust (although it would probably allow an malicious validator to spam proposals)...

shufps commented 1 year ago

The recovery works but this is weird:

image

Yesterday the "invalid state" messages started on a single Coo, then on a second, then it got more on other nodes and now it seems they are getting less frequent.

Before the first "invalid state" there are regularely "submit failed but block is already present" but they are okay and no error.

Wollac commented 1 year ago

That is actually an interesting one and maybe/probably not related. The DeCoo tries to re-broadcast failed Txs after some time. There is just an overflow protection but when only a single tx fails it'll retry more or less indefinitely and an "Invalid State" will always stay invalid. Probably this re-trying policy needs to be changed, but it would still be interesting what initially triggered this... Do we still have the logs from back then?

shufps commented 1 year ago

seems to be fixed