laurentsenta / pl-github

0 stars 1 forks source link

EPIC1.1: As a go-libp2p maintainer I have all the tooling I need to create new interoperability tests on my machine and make sure they are used in the CI to test every release #1

Open laurentsenta opened 2 years ago

laurentsenta commented 2 years ago

Outcome:

My main goal with this task is to have at least a single ping test that we can implement in many libp2p versions & implementation which would let us demonstrate the basic of interop testing with testground.

Tasks:

laurentsenta commented 2 years ago

Next step:

laurentsenta commented 2 years ago

Notes:

We updated the test-plans/ping example a few days ago, but it fails with:

Step 15/22 : RUN cd ${PLAN_DIR}     && go env -w GOPROXY="${GO_PROXY}"     && CGO_ENABLED=${CgoEnabled} GOOS=linux GOARCH=amd64 go build -o ${PLAN_DIR}/testplan.bin ${BUILD_TAGS} ${TESTPLAN_EXEC_PKG}
 ---> Running in 06bbb71337a7
# github.com/libp2p/test-plans/ping
./main.go:127:25: cannot use ctx (type context.Context) as type config.Option in argument to libp2p.New
Apr 25 10:12:21.121748  INFO   build failed    {"plan": "ping", "groups": ["single"], "builder": "docker:go", "error": "docker build failed: The command '/bin/sh -c cd ${PLAN_DIR}     && go env -w GOPROXY=\"${GO_PROXY}\"     && CGO_ENABLED=${CgoEnabled} GOOS=linux GOARCH=amd64 go build -o ${PLAN_DIR}/testplan.bin ${BUILD_TAGS} ${TESTPLAN_EXEC_PKG}' returned a non-zero code: 2"}

When I run it on my machine with:

testground run single --plan libp2p/ping --testcase=ping --builder=docker:go --runner=local:docker --instances=10 --collect --build-cfg build_base_image=golang:1.17-buster

I think the code is invalid (we didn't upgrade the code), prep'ing a patch.

PR created here: https://github.com/libp2p/test-plans/pull/23

Patch is here: https://github.com/libp2p/test-plans/pull/23/commits/6010ebdcc121ae29b2c4484273a246a3d01c5792

Build is passing now, see test below:

laurentsenta commented 2 years ago

Test run attempt 1:

....
Apr 25 10:21:27.684825  INFO   2.1978s      START << single[000] (202090) >> {"plan":"ping","case":"ping","run":"c9j7butvqc7liju56i10","outputs_path":"/outputs","instances":10,"params":{"iterations":"5","max_latency_ms":"1000","secure_channel":"noise"},"group":"single","group_instances":10,"network":"16.0.0.0/16","start_time":"0001-01-01T00:00:00Z"}
Apr 25 10:21:27.685334  INFO   2.1984s    MESSAGE << single[003] (e88659) >> waiting for network to initialize
Apr 25 10:21:27.690673  INFO   2.2040s    MESSAGE << single[000] (202090) >> waiting for network to initialize
[BLOCKED HERE FOR A FEW MINUTES]

After 5 minutes: testground terminate --runner local:docker

then restart testground and try again.

laurentsenta commented 2 years ago

Test run attempt 2: Same issue. I know this test should pass so I'll reset my docker state:

Attempt 3: Stuck on network init too, got killed after 10 min "context deadline exceeded'. Rebooting machine.

Attempt 4: We need the latest SDK to run with testground master, upgrading with:

go get -u github.com/testground/sdk-go@v0.3.1-0.20211012114808-49c90fa75405 
go mod tidy

Stuck at:

Apr 25 11:47:57.234619  INFO   performing healthcheck on runner

Attempt 5: Run finish with error, connection reset by peer: http://localhost:8042/logs?task_id=c9j8lldvqc7irdvpai10

Apr 25 11:49:50.735793  INFO   0.6274s    MESSAGE << single[000] (c8ecb9) >> waiting for network to initialize
Apr 25 11:49:50.738112  INFO   0.6287s    MESSAGE << single[000] (c8ecb9) >> network initialisation successful
Apr 25 11:49:50.738170  INFO   0.6296s    MESSAGE << single[000] (c8ecb9) >> network initialization complete
Apr 25 11:49:50.738209  INFO   0.6306s    MESSAGE << single[000] (c8ecb9) >> claimed sequence numbers; global=2, group(single)=2
Apr 25 11:49:50.744234  INFO   0.6305s    MESSAGE << single[001] (034c56) >> 127.0.0.1 not in data subnet 16.1.0.0/16, ignoring
Apr 25 11:49:50.744311  INFO   0.6313s    MESSAGE << single[001] (034c56) >> detected data network IP: 16.1.0.2/16
Apr 25 11:49:50.745254  INFO   0.6310s    MESSAGE << single[000] (c8ecb9) >> started test instance; params: secure_channel=noise, max_latency_ms=1000, iterations=5
Apr 25 11:49:50.745297  INFO   0.6323s    MESSAGE << single[000] (c8ecb9) >> 127.0.0.1 not in data subnet 16.1.0.0/16, ignoring
Apr 25 11:49:50.745349  INFO   0.6328s    MESSAGE << single[000] (c8ecb9) >> 192.18.0.7 not in data subnet 16.1.0.0/16, ignoring
Apr 25 11:49:50.745383  INFO   0.6332s    MESSAGE << single[000] (c8ecb9) >> detected data network IP: 16.1.0.3/16
Apr 25 11:49:50.985970  INFO   0.8778s    MESSAGE << single[001] (034c56) >> my listen addrs: [/ip4/16.1.0.2/tcp/44701]
Apr 25 11:49:51.093992  INFO   0.9862s    MESSAGE << single[000] (c8ecb9) >> my listen addrs: [/ip4/16.1.0.3/tcp/33289]
Apr 25 11:49:51.191568  INFO   1.0784s      ERROR << single[000] (c8ecb9) >> panic: send on closed channel
Apr 25 11:49:51.191621  INFO   1.0785s      ERROR << single[000] (c8ecb9) >> 
Apr 25 11:49:51.191696  INFO   1.0786s      ERROR << single[000] (c8ecb9) >> goroutine 32 [running]:
Apr 25 11:49:51.191715  INFO   1.0786s      ERROR << single[000] (c8ecb9) >> github.com/testground/sdk-go/sync.(*DefaultClient).responsesWorker(0xc000812000)
Apr 25 11:49:51.191732  INFO   1.0786s      ERROR << single[000] (c8ecb9) >>  /go/pkg/mod/github.com/testground/sdk-go@v0.3.1-0.20211012114808-49c90fa75405/sync/client_conn.go:43 +0x285
Apr 25 11:49:51.191752  INFO   1.0786s      ERROR << single[000] (c8ecb9) >> created by github.com/testground/sdk-go/sync.newClient
Apr 25 11:49:51.191772  INFO   1.0786s      ERROR << single[000] (c8ecb9) >>  /go/pkg/mod/github.com/testground/sdk-go@v0.3.1-0.20211012114808-49c90fa75405/sync/client.go:118 +0x1e7
Apr 25 11:49:51.197629  INFO   1.0899s       FAIL << single[001] (034c56) >> failed to dial QmXs5kcZeWfyAK5TZ8k3uQpRGSN94w6Fam6um4iwoxBhjP:
  * [/ip4/16.1.0.3/tcp/33289] failed to negotiate security protocol: read tcp4 16.1.0.2:44701->16.1.0.3:33289: read: connection reset by peer
Apr 25 11:49:51.234489  INFO   1.1213s INCOMPLETE << single[000] (c8ecb9) >> Hello this is me.
Apr 25 11:49:51.275686  INFO   deleting containers {"runner": "local:docker", "run_id": "c9j8lldvqc7irdvpai10", "ids": ["c8ecb9f75790fea8e21eba347d999608bc642051b39ed6bc16099e35a4d6966f", "034c5666ef9866c8a9b45b0eb3bc1a014a1066efc8b447c1a95141930caf7256"]}
Apr 25 11:49:51.275759  INFO   deleting container  {"runner": "local:docker", "run_id": "c9j8lldvqc7irdvpai10", "id": "034c5666ef9866c8a9b45b0eb3bc1a014a1066efc8b447c1a95141930caf7256"}
Apr 25 11:49:51.275762  INFO   deleting container  {"runner": "local:docker", "run_id": "c9j8lldvqc7irdvpai10", "id": "c8ecb9f75790fea8e21eba347d999608bc642051b39ed6bc16099e35a4d6966f"}
Apr 25 11:49:51.695494  WARN   run finished in error   {"run_id": "c9j8lldvqc7irdvpai10", "plan": "ping", "case": "ping", "runner": "local:docker", "instances": 2, "error": "2 nodes failed"}

Attempt 6: Passed http://localhost:8042/logs?task_id=c9j8m6lvqc7irdvpai1g

Attempt 7: Failed (connection reset by peer)

http://localhost:8042/logs?task_id=c9j8mbdvqc7irdvpai20

Apr 25 11:51:10.896014  INFO   0.7233s    MESSAGE << single[001] (d127e5) >> claimed sequence numbers; global=2, group(single)=2
Apr 25 11:51:10.896342  INFO   0.7240s    MESSAGE << single[001] (d127e5) >> started test instance; params: secure_channel=noise, max_latency_ms=1000, iterations=5
Apr 25 11:51:10.896363  INFO   0.7239s    MESSAGE << single[000] (ab6232) >> 192.18.0.8 not in data subnet 16.1.0.0/16, ignoring
Apr 25 11:51:10.898144  INFO   0.7256s    MESSAGE << single[000] (ab6232) >> detected data network IP: 16.1.0.2/16
Apr 25 11:51:10.898550  INFO   0.7260s    MESSAGE << single[001] (d127e5) >> 127.0.0.1 not in data subnet 16.1.0.0/16, ignoring
Apr 25 11:51:10.899346  INFO   0.7269s    MESSAGE << single[001] (d127e5) >> 192.18.0.7 not in data subnet 16.1.0.0/16, ignoring
Apr 25 11:51:10.899938  INFO   0.7275s    MESSAGE << single[001] (d127e5) >> detected data network IP: 16.1.0.3/16
Apr 25 11:51:11.117454  INFO   0.9447s    MESSAGE << single[001] (d127e5) >> my listen addrs: [/ip4/16.1.0.3/tcp/40599]
Apr 25 11:51:11.233782  INFO   1.0608s    MESSAGE << single[000] (ab6232) >> my listen addrs: [/ip4/16.1.0.2/tcp/34687]
Apr 25 11:51:11.323531  INFO   1.1311s      ERROR << single[000] (ab6232) >> panic: send on closed channel
Apr 25 11:51:11.323648  INFO   1.1312s      ERROR << single[000] (ab6232) >> 
Apr 25 11:51:11.323683  INFO   1.1312s      ERROR << single[000] (ab6232) >> goroutine 68 [running]:
Apr 25 11:51:11.323699  INFO   1.1312s      ERROR << single[000] (ab6232) >> github.com/testground/sdk-go/sync.(*DefaultClient).responsesWorker(0xc0007d6000)
Apr 25 11:51:11.323714  INFO   1.1312s      ERROR << single[000] (ab6232) >>  /go/pkg/mod/github.com/testground/sdk-go@v0.3.1-0.20211012114808-49c90fa75405/sync/client_conn.go:43 +0x285
Apr 25 11:51:11.323729  INFO   1.1313s      ERROR << single[000] (ab6232) >> created by github.com/testground/sdk-go/sync.newClient
Apr 25 11:51:11.323753  INFO   1.1313s      ERROR << single[000] (ab6232) >>  /go/pkg/mod/github.com/testground/sdk-go@v0.3.1-0.20211012114808-49c90fa75405/sync/client.go:118 +0x1e7
Apr 25 11:51:11.331099  INFO   1.1575s       FAIL << single[001] (d127e5) >> failed to dial QmQy5Qd3KiubssawG6g1aHMSbYBrb2T1LfFYLtzxK6YUEy:
  * [/ip4/16.1.0.2/tcp/34687] failed to negotiate security protocol: read tcp4 16.1.0.3:40599->16.1.0.2:34687: read: connection reset by peer
laurentsenta commented 2 years ago

Running the ping test fails with a connection reset by peer:

Apr 25 11:53:59.949352  INFO   0.7741s    MESSAGE << single[000] (833a6d) >> my listen addrs: [/ip4/16.1.0.2/tcp/33205]
Apr 25 11:54:00.023808  INFO   0.8489s    MESSAGE << single[001] (c8358f) >> my listen addrs: [/ip4/16.1.0.3/tcp/44305]
Apr 25 11:54:00.054531  INFO   0.8676s      ERROR << single[001] (c8358f) >> panic: send on closed channel
Apr 25 11:54:00.054606  INFO   0.8677s      ERROR << single[001] (c8358f) >> 
Apr 25 11:54:00.054635  INFO   0.8677s      ERROR << single[001] (c8358f) >> goroutine 83 [running]:
Apr 25 11:54:00.054658  INFO   0.8677s      ERROR << single[001] (c8358f) >> github.com/testground/sdk-go/sync.(*DefaultClient).responsesWorker(0xc0001cc150)
Apr 25 11:54:00.054680  INFO   0.8677s      ERROR << single[001] (c8358f) >>  /go/pkg/mod/github.com/testground/sdk-go@v0.3.1-0.20211012114808-49c90fa75405/sync/client_conn.go:43 +0x285
Apr 25 11:54:00.054701  INFO   0.8678s      ERROR << single[001] (c8358f) >> created by github.com/testground/sdk-go/sync.newClient
Apr 25 11:54:00.054734  INFO   0.8678s      ERROR << single[001] (c8358f) >>  /go/pkg/mod/github.com/testground/sdk-go@v0.3.1-0.20211012114808-49c90fa75405/sync/client.go:118 +0x1e7
Apr 25 11:54:00.061246  INFO   0.8862s       FAIL << single[000] (833a6d) >> failed to dial QmdN1UNLv3kz5AmZc8n7sc7jU71Xr5WMY25oQHGCaB392F:
  * [/ip4/16.1.0.3/tcp/44305] failed to negotiate security protocol: read tcp4 16.1.0.2:33205->16.1.0.3:44305: read: connection reset by peer
Apr 25 11:54:00.105339  INFO   0.9184s INCOMPLETE << single[001] (c8358f) >> Hello this is me.
Apr 25 11:54:00.129215  INFO   deleting containers {"runner": "local:docker", "run_id": "c9j8nllvqc7irdvpai4g", "ids": ["833a6d0b1a5b3c6bbcac78a72f83f694fdfde38ff267c264c27f07b5b24de62b", "c8358fafcf594aa9f4348b8b10723646c2f5cdf3a88b3ec6beaa8d30c359f5d0"]}
Apr 25 11:54:00.129280  INFO   deleting container  {"runner": "local:docker", "run_id": "c9j8nllvqc7irdvpai4g", "id": "c8358fafcf594aa9f4348b8b10723646c2f5cdf3a88b3ec6beaa8d30c359f5d0"}
Apr 25 11:54:00.129352  INFO   deleting container  {"runner": "local:docker", "run_id": "c9j8nllvqc7irdvpai4g", "id": "833a6d0b1a5b3c6bbcac78a72f83f694fdfde38ff267c264c27f07b5b24de62b"}
Apr 25 11:54:00.419212  WARN   run finished in error   {"run_id": "c9j8nllvqc7irdvpai4g", "plan": "ping", "case": "ping", "runner": "local:docker", "instances": 2, "error": "2 nodes failed"}

Something looks weird here: Both node are trying to call the other. I think this predicate is wrong: https://github.com/libp2p/test-plans/blob/8497c05980adb9651aad78b15d515bf97f947434/ping/main.go#L214-L228 Probably due to a change in the sync

That was incorrect: the test is valid.

laurentsenta commented 2 years ago

I keep getting the error:

Apr 25 12:12:52.713158  INFO   2.0471s    MESSAGE << single[001] (851971) >> my listen addrs: [/ip4/16.1.0.3/tcp/45695]
Apr 25 12:12:52.780498  INFO   2.1142s    MESSAGE << single[000] (6a66ef) >> GlobalSeq: 2, hence skip connecting to peers: [{QmcQ1BLQBt3HnjnJACRHN87ZendJ5z4yPTcUhWuusRSYVT: [/ip4/16.1.0.3/tcp/45695]}]
Apr 25 12:12:52.780506  INFO   2.1141s    MESSAGE << single[001] (851971) >> GlobalSeq: 1, hence connecting to peers: [{QmZJjWHkcaz2gQVisg211tgXT8mVRkanpfkD8HP2UQyT42: [/ip4/16.1.0.2/tcp/45355]}]
Apr 25 12:12:52.780657  INFO   2.1150s    MESSAGE << single[000] (6a66ef) >> done dialling my peers
Apr 25 12:12:52.782550  INFO   2.1093s      ERROR << single[001] (851971) >> panic: send on closed channel
Apr 25 12:12:52.782629  INFO   2.1094s      ERROR << single[001] (851971) >> 
Apr 25 12:12:52.782747  INFO   2.1096s      ERROR << single[001] (851971) >> goroutine 72 [running]:
Apr 25 12:12:52.782836  INFO   2.1096s      ERROR << single[001] (851971) >> github.com/testground/sdk-go/sync.(*DefaultClient).responsesWorker(0xc0007aa0e0)
Apr 25 12:12:52.782894  INFO   2.1097s      ERROR << single[001] (851971) >>  /go/pkg/mod/github.com/testground/sdk-go@v0.3.1-0.20211012114808-49c90fa75405/sync/client_conn.go:43 +0x285
Apr 25 12:12:52.782924  INFO   2.1097s      ERROR << single[001] (851971) >> created by github.com/testground/sdk-go/sync.newClient
Apr 25 12:12:52.782987  INFO   2.1098s      ERROR << single[001] (851971) >>  /go/pkg/mod/github.com/testground/sdk-go@v0.3.1-0.20211012114808-49c90fa75405/sync/client.go:118 +0x1e7

Which looks due to sdk-go actually. Response Worker lives in sdk-go/sync/client_conn.go.

I'm not sure how to get a full trace of every go process running. I guess this is caused by signaling the MustSignalAndWait("connected") to the sync service.

I get the following logs in the sync service:

Apr 25 12:20:53.552491  WARN    websocket closed unexpectedly: failed to read JSON message: failed to get reader: failed to read frame header: EOF
Apr 25 12:20:53.555507  INFO    client closed connection

It's hard to tell what is the source of this error.

Attempt 10: Clear containers (docker rm ...) and restart testground. Run test => pass. Then it starts failing again.

Attempt 11: Restart and re-run and it starts failing immediately. Trying another clear + restart then run testground healthcheck --runner local:docker --fix before starting the test (will boot all the other containers so they have time to init).

Immediate fail.

Attempt 12: (skipping many other attempts in between TBH)

Not trying to run the healthcheck, The test is in failed state despite ending successfully:

http://localhost:8042/logs?task_id=c9j99ulvqc7m1evkon1g

Attempt 13: I get JSON error in the sync service logs and immediate panic in the test logs.

laurentsenta commented 2 years ago

Will work on https://github.com/laurentsenta/pl-github/issues/10 for a while so we have more people to test this out.

laurentsenta commented 2 years ago

Note: running the test plans on a dedicated ubuntu machine works consistently, so (my) macos setup might be the issue here.

=> update, now it fails with a panic. We need some tools to debug the go sdk.

laurentsenta commented 2 years ago

Note: there is a PR in libp2p test plan for the whole feature. We're going to "accept" instability for now

laurentsenta commented 2 years ago

Note: I retried running the tests with the current k8s cluster, but it failed because of

 ---> Running in 0be5bca8f945
Using go proxy: http://localhost:8081
go mod download: lukechampine.com/blake3@v1.1.7: reading http://localhost:8081/lukechampine.com/blake3/@v/v1.1.7.info: 404 Not Found
    server response:
    go mod download -json lukechampine.com/blake3@v1.1.7:
    {
        "Path": "lukechampine.com/blake3",
        "Version": "v1.1.7",
        "Error": "lukechampine.com/blake3@v1.1.7: unrecognized import path \"lukechampine.com/blake3\": https fetch: Get \"https://lukechampine.com/blake3?go-get=1\": dial tcp 23.92.16.31:443: connect: no route to host"
    }
May  2 09:54:29.223214  INFO   build failed    {"plan": "ping", "groups": ["single"], "builder": "docker:go", "error": "docker build failed: The command '/bin/sh -c echo \"Using go proxy: ${GO_PROXY}\"     && cd ${PLAN_DIR}     && go env -w GOPROXY=\"${GO_PROXY}\"     && go mod download' returned a non-zero code: 1"}

Will raise this later.