Open laurentsenta opened 2 years ago
Random panics on tests:
While trying a few tweaks with Piotr, we cleaned up a few logs and added a sleep to identify where the ping issue comes from.
It turns out that sdk-go panics at the same moment we log the message, in the instance that is doing the sleeping:
Apr 26 14:42:20.384615 INFO 2.1270s MESSAGE << single[001] (7ed2e0) >> 192.18.0.8 not in data subnet 16.0.0.0/16, ignoring
Apr 26 14:42:20.384652 INFO 2.1270s MESSAGE << single[001] (7ed2e0) >> detected data network IP: 16.0.0.3/16
Apr 26 14:42:25.492001 INFO 7.2337s MESSAGE << single[000] (794fb5) >> my listen addrs: [/ip4/16.0.0.2/tcp/38975]
Apr 26 14:42:25.641978 INFO 7.3836s MESSAGE << single[001] (7ed2e0) >> my listen addrs: [/ip4/16.0.0.3/tcp/34021]
Apr 26 14:42:25.694571 INFO 7.4363s MESSAGE << single[000] (794fb5) >> done dialling my peers
Apr 26 14:42:25.694582 INFO 7.4362s MESSAGE << single[001] (7ed2e0) >> A, {QmRS9UyhNGWj5yJ2JNNya7KLH4fKEhQ8XyYzM7wvg99SiQ: [/ip4/16.0.0.2/tcp/38975]}
Apr 26 14:42:25.694832 INFO 7.4363s MESSAGE << single[001] (7ed2e0) >> Dial peer: QmRS9UyhNGWj5yJ2JNNya7KLH4fKEhQ8XyYzM7wvg99SiQ
Apr 26 14:42:25.695027 INFO 7.4364s MESSAGE << single[001] (7ed2e0) >> STARTED SLEEPING, {QmRS9UyhNGWj5yJ2JNNya7KLH4fKEhQ8XyYzM7wvg99SiQ: [/ip4/16.0.0.2/tcp/38975]}
Apr 26 14:42:25.696648 INFO 7.4392s ERROR << single[001] (7ed2e0) >> panic: send on closed channel
Apr 26 14:42:25.696754 INFO 7.4393s ERROR << single[001] (7ed2e0) >>
Apr 26 14:42:25.696837 INFO 7.4394s ERROR << single[001] (7ed2e0) >> goroutine 45 [running]:
Apr 26 14:42:25.696910 INFO 7.4395s ERROR << single[001] (7ed2e0) >> github.com/testground/sdk-go/sync.(*DefaultClient).responsesWorker(0xc0000ea2a0)
Apr 26 14:42:25.696990 INFO 7.4395s ERROR << single[001] (7ed2e0) >> /go/pkg/mod/github.com/testground/sdk-go@v0.3.1-0.20211012114808-49c90fa75405/sync/client_conn.go:43 +0x285
Apr 26 14:42:25.697070 INFO 7.4396s ERROR << single[001] (7ed2e0) >> created by github.com/testground/sdk-go/sync.newClient
Apr 26 14:42:25.697204 INFO 7.4398s ERROR << single[001] (7ed2e0) >> /go/pkg/mod/github.com/testground/sdk-go@v0.3.1-0.20211012114808-49c90fa75405/sync/client.go:118 +0x1e7
Apr 26 14:42:25.805524 INFO 7.5481s INCOMPLETE << single[001] (7ed2e0) >>
(ubuntu server, task_id=c9k09hiel22hrtvt0ps0)
So it looks like the code might fail around lines 233 and 234 here:
In another run, node 000 sleeps and node 001 panics at the same time:
Apr 26 14:50:21.556652 INFO 2.0287s MESSAGE << single[001] (87ae7a) >> Hello friends test 001
Apr 26 14:50:21.556886 INFO 2.0290s MESSAGE << single[001] (87ae7a) >> 127.0.0.1 not in data subnet 16.0.0.0/16, ignoring
Apr 26 14:50:21.556919 INFO 2.0290s MESSAGE << single[001] (87ae7a) >> 192.18.0.8 not in data subnet 16.0.0.0/16, ignoring
Apr 26 14:50:21.556946 INFO 2.0291s MESSAGE << single[001] (87ae7a) >> detected data network IP: 16.0.0.3/16
Apr 26 14:50:26.701199 INFO 7.1725s MESSAGE << single[000] (8f6e1f) >> my listen addrs: [/ip4/16.0.0.2/tcp/34265]
Apr 26 14:50:26.770436 INFO 7.2419s MESSAGE << single[001] (87ae7a) >> my listen addrs: [/ip4/16.0.0.3/tcp/35305]
Apr 26 14:50:26.804009 INFO 7.2755s MESSAGE << single[001] (87ae7a) >> done dialling my peers
Apr 26 14:50:26.804284 INFO 7.2759s MESSAGE << single[000] (8f6e1f) >> A, {QmPcscAmcDrxnMuKvQzWUc9QZbMkdGv54iZJX8bmhjp3vY: [/ip4/16.0.0.3/tcp/35305]}
Apr 26 14:50:26.804433 INFO 7.2760s MESSAGE << single[000] (8f6e1f) >> Dial peer: QmPcscAmcDrxnMuKvQzWUc9QZbMkdGv54iZJX8bmhjp3vY
Apr 26 14:50:26.804559 INFO 7.2761s MESSAGE << single[000] (8f6e1f) >> STARTED SLEEPING, {QmPcscAmcDrxnMuKvQzWUc9QZbMkdGv54iZJX8bmhjp3vY: [/ip4/16.0.0.3/tcp/35305]}
Apr 26 14:50:26.807311 INFO 7.2796s ERROR << single[001] (87ae7a) >> panic: send on closed channel
Apr 26 14:50:26.807432 INFO 7.2798s ERROR << single[001] (87ae7a) >>
Apr 26 14:50:26.807614 INFO 7.2800s ERROR << single[001] (87ae7a) >> goroutine 57 [running]:
Apr 26 14:50:26.807690 INFO 7.2800s ERROR << single[001] (87ae7a) >> github.com/testground/sdk-go/sync.(*DefaultClient).responsesWorker(0xc0002ea000)
Apr 26 14:50:26.807766 INFO 7.2801s ERROR << single[001] (87ae7a) >> /go/pkg/mod/github.com/testground/sdk-go@v0.3.1-0.20211012114808-49c90fa75405/sync/client_conn.go:43 +0x285
Apr 26 14:50:26.807837 INFO 7.2802s ERROR << single[001] (87ae7a) >> created by github.com/testground/sdk-go/sync.newClient
Apr 26 14:50:26.807931 INFO 7.2803s ERROR << single[001] (87ae7a) >> /go/pkg/mod/github.com/testground/sdk-go@v0.3.1-0.20211012114808-49c90fa75405/sync/client.go:118 +0x1e7
Apr 26 14:50:26.917452 INFO 7.3898s INCOMPLETE << single[001] (87ae7a) >>
Apr 26 14:50:56.804922 INFO 37.2763s MESSAGE << single[000] (8f6e1f) >> STOPPED SLEEPING, {QmPcscAmcDrxnMuKvQzWUc9QZbMkdGv54iZJX8bmhjp3vY: [/ip4/16.0.0.3/tcp/35305]}
Apr 26 14:51:01.806263 INFO 42.2778s MESSAGE << single[000] (8f6e1f) >> FAILED CONNECT, {QmPcscAmcDrxnMuKvQzWUc9QZbMkdGv54iZJX8bmhjp3vY: [/ip4/16.0.0.3/tcp/35305]}
Apr 26 14:51:01.806660 INFO 42.2783s FAIL << single[000] (8f6e1f) >> failed to dial QmPcscAmcDrxnMuKvQzWUc9QZbMkdGv54iZJX8bmhjp3vY:
* [/ip4/16.0.0.3/tcp/35305] dial tcp4 16.0.0.3:35305: i/o timeout
Apr 26 14:51:03.749465 INFO deleting containers {"runner": "local:docker", "run_id": "c9k0d9qel22hrtvt0pu0", "ids": ["8f6e1fdfa5c8958f2f8e42ecfaac6b8726a5e287c1deee43bb47e12ac4ff4260", "87ae7a231f00db2fc586b10392cfb9d358064440e9c50d4177a37291e23243ff"]}
Apr 26 14:51:03.749525 INFO deleting container {"runner": "local:docker", "run_id": "c9k0d9qel22hrtvt0pu0", "id": "87ae7a231f00db2fc586b10392cfb9d358064440e9c50d4177a37291e23243ff"}
Apr 26 14:51:03.749538 INFO deleting container {"runner": "local:docker", "run_id": "c9k0d9qel22hrtvt0pu0", "id": "8f6e1fdfa5c8958f2f8e42ecfaac6b8726a5e287c1deee43bb47e12ac4ff4260"}
Apr 26 14:51:05.899545 WARN run finished in error {"run_id": "c9k0d9qel22hrtvt0pu0", "plan": "ping", "case": "ping", "runner": "local:docker", "instances": 2, "error": "2 nodes failed"}
(ubuntu server, task_id=c9k0d9qel22hrtvt0pu0)
sync service:
Apr 26 14:50:26.834164 WARN websocket closed unexpectedly: failed to read JSON message: failed to get reader: failed to read frame header: EOF
This is the panic error that happens when you run the ping test in: https://github.com/libp2p/test-plans/pull/23