prestonvanloon commented 2 years ago

🐞 Bug Report

Description

p2p tests are extremely flaky. These tests typically pass around one in ten runs.

I ran this test with --test_strategy=exclusive --runs_per_test=500 overnight and the result is as follows:

//beacon-chain/p2p:go_default_test                                       FAILED in 1218 out of 1355 in 21.5s

Has this worked before in a previous version?

This has been an issue for some time. It passes in CI as this package is rarely changed and a passing test is cached between CI requests.

🔬 Minimal Reproduction

bazel or go test

bazel test //beacon-chain/p2p:go_default_test

🔥 Error

--- FAIL: TestVerifyConnectivity (0.00s)
    --- FAIL: TestVerifyConnectivity/Dialing_a_reachable_IP:_142.250.68.46:80 (0.00s)
        assertions.go:164: utils_test.go:33 Unexpected log found: IP address is not accessible
            Searched logs:
            [time="2022-01-31T15:45:05Z" level=warning msg="IP address is not accessible" address="142.250.68.46:80" error="dial tcp 142.250.68.46:80: connect: network is unreachable" prefix=p2p protocol=tcp
            ]

--- FAIL: TestService_BroadcastAttestationWithDiscoveryAttempts (4.09s)                                                                       
    broadcaster_test.go:363: Failed to receive pubsub within 4s                                                                               
    assertions.go:86: broadcaster_test.go:351 Unexpected error: context deadline exceeded

🌍 Your Environment

Operating System:

  
Linux - Ubuntu

What version of Prysm are you running? (Which release)

Latest develop commit

Anything else relevant (validator index / public key)?

patterns commented 2 years ago

I'm interested in helping on this task. I just followed the Contributer guide and got a local environment to finish bazelisk test //beacon-chain/node:go_default_test. So I think I have the baseline ready.

patterns commented 2 years ago

Reading the test in TestVerifyConnectivity, some initial thoughts are:

with google.com as the destination, there may be multiple hops to reach that proxy (or load balancer) and it can implement a back-off strategy (to fend off a flood of requests)
the verifyConnectivity function uses a static dial-timeout of one second which may be easily exceeded under high latency
at port 80 destination may be redirected to 443 (TLS protocol port)
at --runs_per_test=500 and 137 failures, something connects (we prefer 500 or 250 failures to be consistent)

The next steps I can attempt are to gather data to narrow down what is contributing to the failures. For example, choosing a destination that expressly accepts test requests, increasing delay between test run iterations, increasing dial-timeout, or try UDP port 53 (dns).

nisdas commented 2 years ago

Hey @patterns , thanks for taking a look at this ! Just assigned you the issue. It seems you already have made good progress on this :) , looking forward to the improvements to these tests.

patterns commented 2 years ago

Very appreciated, thanks! I think I have the TestVerifyConnectivity changes for a unit test. I borrowed most of the the code from the dial_test.go in the standard lib's net package and realized when looking there that they distinguish between available-external-network (and "local") for the test environment. That made me wonder whether TestVerifyConnectivity is meant to exercise the net.Dial() because it shouldn't be necessary to test that net.Dial() works. It made me think that the purpose of the test is really to verify/prove that a log entry is generated under the right circumstances (unreachable addresses). This seems reasonable because when I looked for where it is called, it is to do quick "health check" on the Host-Address config param (line#266 in service.go). And I think the param option is the p2p-host-IP:

--p2p-host-ip value The IP address advertised by libp2p. This may be used to advertise an external IP.

I need to confirm whether is correct. At the same time, I'll try to understand the second failing test (Discovery-Attempts one).

utils_test.go.txt

nisdas commented 2 years ago

Hey @patterns ,

That is correct, the TestVerifyConnectivity is used to verify that dials to external IPs work as expected.

prysmaticlabs / prysm

Flaky test //beacon-chain/p2p:go_default_test #10160

🐞 Bug Report

Description

Has this worked before in a previous version?

🔬 Minimal Reproduction

🔥 Error

🌍 Your Environment