ssvlabs / ssv

Secret-Shared-Validator (SSV) for Ethereum staking
https://ssv.network
GNU General Public License v3.0

Bug: Failure when testing tcp port - potential race condition? #199

Closed: yorickdowne closed this issue 3 years ago

yorickdowne commented 3 years ago

Tested on 0.0.9, the node can fail in the following way during startup. This happens rarely when running with an ingress network, and always when running with host mapping. More on that further down.

    make: go: No such file or directory
    Build /go/bin/ssvnode
    Build /config.yaml
    Build
    Command --config=/config.yaml
    Running node on address: 18.118.42.57)
    2021/07/18 16:54:06 starting SSV-Node:v0.0.9
    badger 2021/07/18 16:54:06 INFO: All 2 tables opened in 9ms
    badger 2021/07/18 16:54:06 INFO: Discard stats nextEmptySlot: 101
    badger 2021/07/18 16:54:06 INFO: Set nextTxnTs to 981
    2021-07-18T16:54:07.437456Z INFO    kv/badger.go:45 Badger db initialized   {"app": "SSV-Node"}
    2021-07-18T16:54:07.437638Z INFO    goclient/goclient.go:35 connecting to client... {"app": "SSV-Node", "component": "go-client", "network": "prater"}
    2021-07-18T16:54:08.129540Z INFO    goclient/goclient.go:48 successfully connected to client    {"app": "SSV-Node", "component": "go-client", "network": "prater", "name": "Standard (HTTP)", "address": "https://MYCC.MYDOMAIN"}
    2021-07-18T16:54:08.130202Z INFO    p2p/p2p.go:91   Ip Address  {"app": "SSV-Node", "ip": "10.0.31.50"}
    2021-07-18T16:54:08.157384Z INFO    p2p/p2p.go:111  listening on port   {"app": "SSV-Node", "id": "16Uiu2HAm4x9XwJQGWYSn7QiQ1BvteYoDBgvFj9Mk18JwWWrcQR1g", "port": "/ip4/10.0.31.50/tcp/13000"}
    2021-07-18T16:54:08.158221Z INFO    p2p/discovery.go:297    using external IP   {"app": "SSV-Node", "id": "16Uiu2HAm4x9XwJQGWYSn7QiQ1BvteYoDBgvFj9Mk18JwWWrcQR1g", "IP from config": "18.118.42.57", "IP": "18.118.42.57"}
    2021-07-18T16:54:08.159534Z INFO    p2p/discovery.go:222    ENR {"app": "SSV-Node", "id": "16Uiu2HAm4x9XwJQGWYSn7QiQ1BvteYoDBgvFj9Mk18JwWWrcQR1g", "enr": "enr:-Jy4QN3SG99EmpxyrjcDWoDTmjipMbzRjS7W_-XP3VyvP5oyUyhMcoOcP6FmGOoqEVCN6kWsFQRn3gzyyro_98PLqoMBh2F0dG5ldHOIAAAAAAAAAACCaWSCdjSCaXCEEnYqOYlzZWNwMjU2azGhAo2HwyI3rxrTne_7mUWgsZ9xOpn7m9Lb8ipXQmOLnwOvg3RjcIIyyIN1ZHCCLuA"}
    2021-07-18T16:54:08.160433Z ERROR   p2p/p2p.go:172  IP address is not accessible    {"app": "SSV-Node", "id": "16Uiu2HAm4x9XwJQGWYSn7QiQ1BvteYoDBgvFj9Mk18JwWWrcQR1g", "error": "dial tcp 18.118.42.57:13000: connect: connection refused"}
    panic: runtime error: invalid memory address or nil pointer dereference
    [signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x145a435]

    goroutine 1 [running]:
    github.com/bloxapp/ssv/network/p2p.New(0x1d78ea0, 0xc000040038, 0xc000b69c80, 0x2881c08, 0x6, 0xc0006a1ce0, 0x26, 0xc00069c870)
        /go/src/github.com/bloxapp/ssv/network/p2p/p2p.go:174 +0x1315
    github.com/bloxapp/ssv/cli/operator.glob..func1(0x276a880, 0xc00068ee60, 0x0, 0x1)
        /go/src/github.com/bloxapp/ssv/cli/operator/node.go:82 +0x7ba
    github.com/spf13/cobra.(*Command).execute(0x276a880, 0xc00068ee50, 0x1, 0x1, 0x276a880, 0xc00068ee50)
        /go/pkg/mod/github.com/spf13/cobra@v1.1.1/command.go:854 +0x2c2
    github.com/spf13/cobra.(*Command).ExecuteC(0x2769620, 0x7, 0x1a28fa8, 0x1)
        /go/pkg/mod/github.com/spf13/cobra@v1.1.1/command.go:958 +0x375
    github.com/spf13/cobra.(*Command).Execute(...)
        /go/pkg/mod/github.com/spf13/cobra@v1.1.1/command.go:895
    github.com/bloxapp/ssv/cli.Execute(0x1a2eebc, 0x8, 0x1d1fd00, 0x6)
        /go/src/github.com/bloxapp/ssv/cli/cli.go:29 +0xa5
    main.main()
        /go/src/github.com/bloxapp/ssv/cmd/ssvnode/main.go:16 +0x51
    make: *** [Makefile:61: start-node] Error 2

The setup looks like this. With `mode: host` commented out and traffic going through the ingress load balancer, this failure is rare; with host mode enabled, it happens every time.

    ports:
      - protocol: tcp
        published: 13000
        target: 13000
#        mode: host
      - protocol: udp
        published: 12000
        target: 12000
#        mode: host

This raises the question of whether it is really an SSV bug. I believe it is, because:

Rationale / why host mode is desirable: host mode speeds up connections, since traffic goes directly to the host instead of through the load balancer; it's a networking optimization. Also, failures like this should not happen regardless of network latency, and right now they do happen in both modes, just rarely without host mode.

stefa2k commented 3 years ago

I can't get the node to work anymore, regardless of docker-compose down, removing data, etc. It's completely stuck with setup like this: https://github.com/stereum-dev/ethereum2-docker-compose/blob/prater/compose-examples/lighthouse-only/override-examples/docker-compose.ssv-no-geth.override.yaml#L20 (no host-network, just a plain docker network with ports exposed)

yorickdowne commented 3 years ago

Unrelated to your issue: you may want to make sure port 12000 is udp, not tcp, in your compose file.

I think the only reason this works for me is that I am in docker swarm and there's a slight delay when querying the port through ingress, which allows the node to "come up" enough to respond. At least that's my theory. I can't see why it would fail intermittently with an overlay network and consistently with a host network, otherwise.

I think your setup is a host network: Plain docker doesn't have overlay networks and ingress routing.

yorickdowne commented 3 years ago

Resolved with 0.1.2