gnolang / gno

Gno: An interpreted, stack-based Go virtual machine to build succinct and composable apps + Gno.land: a blockchain for timeless code and fair open-source
https://gno.land/

Validators cannot discover P2P peers when running as `StatefulSet` in k8s #2378

Closed mazzy89 closed 1 week ago

mazzy89 commented 1 week ago


Description

In a multi-node scenario, when a validator is started with a list of nodes configured under p2p.persistent_peers and p2p.seeds, the DNS lookup for those peers fails. Other products that try to reach their peers during the bootstrap phase, such as RabbitMQ, suffer from the same issue. See https://github.com/kubernetes/kubernetes/issues/92559#issuecomment-1196410671

Your environment

Steps to reproduce

Expected behaviour

The DNS lookup should succeed and the node should connect to the other peers.

Actual behaviour

The DNS lookup fails. The node appears to try a second time, but that attempt also fails because the DNS records are not ready yet.

Logs

2024-06-18T09:29:40.184Z    INFO    Starting multi  {"module": "proxy", "impl": "multi"}
2024-06-18T09:29:40.184Z    INFO    Starting localClient    {"module": "proxy", "module": "abci-client", "connection": "query", "impl": "localClient"}
2024-06-18T09:29:40.184Z    INFO    Starting localClient    {"module": "proxy", "module": "abci-client", "connection": "mempool", "impl": "localClient"}
2024-06-18T09:29:40.184Z    INFO    Starting localClient    {"module": "proxy", "module": "abci-client", "connection": "consensus", "impl": "localClient"}
2024-06-18T09:29:40.184Z    INFO    Starting EventStoreService  {"module": "eventstore", "impl": "EventStoreService"}
2024-06-18T09:29:40.184Z    INFO    ABCI Handshake App Info {"module": "consensus", "height": 0, "hash": "", "abci-version": "", "app-version": ""}
2024-06-18T09:29:40.184Z    INFO    ABCI Replay Blocks  {"module": "consensus", "appHeight": 0, "storeHeight": 0, "stateHeight": 0}
2024-06-18T09:29:40.187Z    INFO    Completed ABCI Handshake - Tendermint and App are synced    {"module": "consensus", "appHeight": 0, "appHash": ""}
2024-06-18T09:29:40.187Z    INFO    Version info    {"version": "v1.0.0-rc.0"}
2024-06-18T09:29:40.187Z    INFO    This node is a validator    {"module": "consensus", "addr": "g1e5cn4p8z7jhdylh98jmj8ugw2532lqx8e9kmw5", "pubKey": "gpub1pggj7ard9eg82cjtv4u52epjx56nzwgjyg9zpqh25w6ev6ww6lq70elf7ylvde3zqp06dlhhw7tj0cs4j3hpt3v5mfzgq0"}
2024-06-18T09:29:40.188Z    INFO    P2P Node ID {"module": "p2p", "ID": "g1k8telcwr2k88uw6zp57tqxcmujvqf2elxthgdl", "file": "/gnoland-data/secrets/node_key.json"}
2024-06-18T09:29:40.188Z    INFO    Adding persistent peers {"module": "p2p", "addrs": ["g1x6uuzyz0t50647wt8nduyxrlyduhj0yruk6vmr@devx-gnoland-val1-0:26657", "g1k8telcwr2k88uw6zp57tqxcmujvqf2elxthgdl@devx-gnoland-val2-0:26657", "g1vpmsut2s6z89rfyqzh5234xvcs5h2rtl238x8x@devx-gnoland-val3-0:26657"]}
2024-06-18T09:29:40.281Z    ERROR   Error in peer's address {"module": "p2p", "err": "error looking up host (devx-gnoland-val1-0): lookup devx-gnoland-val1-0 on 10.24.0.10:53: no such host"}
2024-06-18T09:29:40.282Z    ERROR   Error in peer's address {"module": "p2p", "err": "error looking up host (devx-gnoland-val3-0): lookup devx-gnoland-val3-0 on 10.24.0.10:53: no such host"}
2024-06-18T09:29:40.282Z    INFO    Starting Node   {"impl": "Node"}
2024-06-18T09:29:40.282Z    INFO    Starting P2P Switch {"module": "p2p", "impl": "P2P Switch"}
2024-06-18T09:29:40.282Z    INFO    Starting Reactor    {"module": "mempool", "impl": "Reactor"}
2024-06-18T09:29:40.282Z    INFO    Starting BlockchainReactor  {"module": "blockchain", "impl": "BlockchainReactor"}
2024-06-18T09:29:40.282Z    INFO    Starting BlockPool  {"module": "blockchain", "impl": "BlockPool"}
2024-06-18T09:29:40.282Z    INFO    Starting ConsensusReactor   {"module": "consensus", "impl": "ConsensusReactor"}
2024-06-18T09:29:40.282Z    INFO    ConsensusReactor    {"module": "consensus", "fastSync": true}
2024-06-18T09:29:40.283Z    INFO    Starting RPC HTTP server on [::]:26657  {"module": "rpc-server"}
2024-06-18T09:29:40.358Z    ERROR   Error in peer's address {"module": "p2p", "err": "error looking up host (devx-gnoland-val1-0): lookup devx-gnoland-val1-0 on 10.24.0.10:53: no such host"}
2024-06-18T09:29:40.358Z    ERROR   Error in peer's address {"module": "p2p", "err": "error looking up host (devx-gnoland-val3-0): lookup devx-gnoland-val3-0 on 10.24.0.10:53: no such host"}
2024-06-18T09:29:40.359Z    DEBUG   Ignore attempt to connect to ourselves  {"module": "p2p", "addr": "g1k8telcwr2k88uw6zp57tqxcmujvqf2elxthgdl@10.20.0.111:26657", "ourAddr": "g1k8telcwr2k88uw6zp57tqxcmujvqf2elxthgdl@0.0.0.0:26656"}
2024-06-18T09:29:41.314Z    DEBUG   Consensus ticker    {"module": "blockchain", "numPending": 0, "total": 0, "outbound": 0, "inbound": 0}

Proposed solution

The issue should be fixed by retrying the DNS lookup of the P2P peers multiple times. In a k8s environment, where there are many moving parts, retry with backoff is crucial to increase the chance of a successful connection.

mazzy89 commented 1 week ago

The code at https://github.com/gnolang/gno/blob/90aa89c28d3c8ad3b7d7b67d0256426bde6cfbc9/tm2/pkg/p2p/switch.go#L495 suggests that DNS lookup errors are ignored and the failing addresses are skipped. However, the end result is that

2024-06-18T09:29:44.285Z    DEBUG   Blockpool has no peers  {"module": "blockchain"}

there are no peers added.

mazzy89 commented 1 week ago

A workaround adopted by many similar upstream services that rely on a bootstrap phase to discover other peers is to set publishNotReadyAddresses: true on the Service. This solves the problem.

mazzy89 commented 1 week ago

Reopening the issue. It seems that even setting publishNotReadyAddresses: true in the Service does not help. Some nodes come up properly, while others fail. The overall bootstrap mechanism is not deterministic. I wonder whether retrying the DNS lookup would help.

mazzy89 commented 1 week ago

Gave it another try, and it seems that after a few seconds the node retries connecting to the peers, which by that point have DNS records available, and the lookup succeeds. We can close this.