Closed jlevison closed 5 years ago
More clarification regarding issue 3:
Currently in the system we have various protections against deadlocks with timeouts.. if an operation takes obscene amount of time - we timeout/panic
This behavior is right for production (e2e) but IMO is wrong for acceptance and component tests
If we reach a deadlock in these tests, I prefer to let the system deadlock and reach the test timeout panic which will print stack traces of all goroutines. These stack traces are the main thing that helps with debugging deadlocks. By panicking manually first we miss out on the chance to easily get these stack traces which are essential to debugging
Re: logs - there's a tradeoff, as in most cases you don't want all of the logs. Is it problematic to manually change (and not commit) the logger definition? if so, maybe we can add an external log config file (but I really hate this idea!) or - better - pass logger override directives via environment variables?
Re: timeout - we need to split the acceptance suite; there are the essential tests, that check logic, which need to have a short timeout; then there are stress tests that fuck up gossip so see how the system behaves; these tests run 100 times, so they need a longer timeout; I think that what we need is to run these tests in a separate build phase than normal acceptance. Thoughts?
Re: 3 - please explain what's the current flaw? I'm not sure I follow what exactly you would like to see. I'm not sure we can use the public api's blocking behavior in all use cases - am I wrong?
re: re: 3, see https://circleci.com/gh/orbs-network/orbs-network-go/2456 log is below but if the test itself timed out, we should have gotten all the goroutines and not just the paniced one, probably the problem with that failure is a system deadlock and we cannot debug it now Moreover, we cannot reproduce it as its a flakiness issue :/
Attaching to project_orbs-acceptance_1
orbs-acceptance_1 | + source ./test.common.sh
orbs-acceptance_1 | + '[' '!=' ]
orbs-acceptance_1 | + NO_LOG_STDOUT=true go test ./test/acceptance -count 100 -timeout 20m -failfast
orbs-acceptance_1 | + check_exit_code_and_report
orbs-acceptance_1 | + export 'EXIT_CODE=1'
orbs-acceptance_1 | + '[' 1 '!=' 0 ]
orbs-acceptance_1 | + grep -B 150 -A 15 -- FAIL: test.out
orbs-acceptance_1 | + cat fail.out
orbs-acceptance_1 | --- FAIL: TestCreateGazillionTransactionsWhileTransportIsDelayingRandomMessages (1.04s)
orbs-acceptance_1 | panic: timed out waiting for transaction with hash ac3178c9826cc697d104c0d5f603c7fab18aecd2b1d435dcffd89530d7ae2b77 [recovered]
orbs-acceptance_1 | panic: timed out waiting for transaction with hash ac3178c9826cc697d104c0d5f603c7fab18aecd2b1d435dcffd89530d7ae2b77
orbs-acceptance_1 |
orbs-acceptance_1 | goroutine 12266 [running]:
orbs-acceptance_1 | testing.tRunner.func1(0xc4203c4690)
orbs-acceptance_1 | /usr/local/go/src/testing/testing.go:742 +0x29d
orbs-acceptance_1 | panic(0x986b40, 0xc420439240)
orbs-acceptance_1 | /usr/local/go/src/runtime/panic.go:502 +0x229
orbs-acceptance_1 | github.com/orbs-network/orbs-network-go/test/harness/services/blockstorage/adapter.(*inMemoryBlockPersistence).WaitForTransaction(0xc420244740, 0xa87480, 0xc420244640, 0xc420b06400, 0x20, 0x20, 0x3b9aca00, 0x20)
orbs-acceptance_1 | /go/src/github.com/orbs-network/orbs-network-go/test/harness/services/blockstorage/adapter/memory_persistence.go:57 +0x37f
orbs-acceptance_1 | github.com/orbs-network/orbs-network-go/test/harness.(*inProcessNetwork).WaitForTransactionInStateForAtMost(0xc420355e50, 0xa87480, 0xc420244640, 0x0, 0xc420b06400, 0x20, 0x20, 0x3b9aca00)
orbs-acceptance_1 | /go/src/github.com/orbs-network/orbs-network-go/test/harness/network.go:85 +0xa5
orbs-acceptance_1 | github.com/orbs-network/orbs-network-go/test/harness.(*inProcessNetwork).WaitForTransactionInState(0xc420355e50, 0xa87480, 0xc420244640, 0x0, 0xc420b06400, 0x20, 0x20)
orbs-acceptance_1 | /go/src/github.com/orbs-network/orbs-network-go/test/harness/network.go:81 +0x73
orbs-acceptance_1 | github.com/orbs-network/orbs-network-go/test/acceptance.sendTransfersAndAssertTotalBalance(0xa87480, 0xc420244640, 0xa898e0, 0xc420355e50, 0xc4203c4690, 0x64)
orbs-acceptance_1 | + grep -B 150 -A 150 -- 'test timed out' test.out
orbs-acceptance_1 | + cat timed.out
orbs-acceptance_1 | + '[' '!' -s fail.out ]
orbs-acceptance_1 | + mkdir -p logs
orbs-acceptance_1 | + cp fail.out test.out timed.out logs
orbs-acceptance_1 | + exit 1
project_orbs-acceptance_1 exited with code 1
See PR to mitigate the issue with point 3: https://github.com/orbs-network/orbs-network-go/pull/382
The idea is to keep the short timeouts so we fail faster, but find deadlocking goroutines by printing all stack traces in this case
@jlevison @talkol are we good? can we close this issue?
some of the problems we have with acceptance right now are: