orbs-network / orbs-network-go

Orbs node virtual chain core reference implementation in Go
MIT License
48 stars 12 forks source link

Acceptance changes required #352

Closed jlevison closed 5 years ago

jlevison commented 6 years ago

some of the problems we have with acceptance right now are:

  1. it does not output logs, very difficult to debug it
  2. the timeout of 20 minutes is too long in CI, it needs to be much lower (takes around 70secs on my machine with 100 count, but maybe the CI is running something more complex)
  3. the network harness has a waiting mechanism which is flawed conceptually - if something blocks, the test should die on timeout - and then we get all the goroutines of the system in the log and we can debug - there should not be a 'wait-at-most' function there that panics manually, or if it panics manually, it should output proper logs so we understand why the system froze
  4. the waiting 'issue' may not be limited to the network harness but may also be coming from the block tracker
talkol commented 6 years ago

More clarification regarding issue 3:

Currently in the system we have various protections against deadlocks with timeouts.. if an operation takes obscene amount of time - we timeout/panic

This behavior is right for production (e2e) but IMO is wrong for acceptance and component tests

If we reach a deadlock in these tests, I prefer to let the system deadlock and reach the test timeout panic which will print stack traces of all goroutines. These stack traces are the main thing that helps with debugging deadlocks. By panicking manually first we miss out on the chance to easily get these stack traces which are essential to debugging

electricmonk commented 6 years ago

Re: logs - there's a tradeoff, as in most cases you don't want all of the logs. Is it problematic to manually change (and not commit) the logger definition? if so, maybe we can add an external log config file (but I really hate this idea!) or - better - pass logger override directives via environment variables?

Re: timeout - we need to split the acceptance suite; there are the essential tests, that check logic, which need to have a short timeout; then there are stress tests that fuck up gossip so see how the system behaves; these tests run 100 times, so they need a longer timeout; I think that what we need is to run these tests in a separate build phase than normal acceptance. Thoughts?

Re: 3 - please explain what's the current flaw? I'm not sure I follow what exactly you would like to see. I'm not sure we can use the public api's blocking behavior in all use cases - am I wrong?

jlevison commented 6 years ago

re: re: 3, see https://circleci.com/gh/orbs-network/orbs-network-go/2456 log is below but if the test itself timed out, we should have gotten all the goroutines and not just the paniced one, probably the problem with that failure is a system deadlock and we cannot debug it now Moreover, we cannot reproduce it as its a flakiness issue :/

Attaching to project_orbs-acceptance_1
orbs-acceptance_1  | + source ./test.common.sh
orbs-acceptance_1  | + '['  '!='  ]
orbs-acceptance_1  | + NO_LOG_STDOUT=true go test ./test/acceptance -count 100 -timeout 20m -failfast
orbs-acceptance_1  | + check_exit_code_and_report
orbs-acceptance_1  | + export 'EXIT_CODE=1'
orbs-acceptance_1  | + '[' 1 '!=' 0 ]
orbs-acceptance_1  | + grep -B 150 -A 15 -- FAIL: test.out
orbs-acceptance_1  | + cat fail.out
orbs-acceptance_1  | --- FAIL: TestCreateGazillionTransactionsWhileTransportIsDelayingRandomMessages (1.04s)
orbs-acceptance_1  | panic: timed out waiting for transaction with hash ac3178c9826cc697d104c0d5f603c7fab18aecd2b1d435dcffd89530d7ae2b77 [recovered]
orbs-acceptance_1  |    panic: timed out waiting for transaction with hash ac3178c9826cc697d104c0d5f603c7fab18aecd2b1d435dcffd89530d7ae2b77
orbs-acceptance_1  | 
orbs-acceptance_1  | goroutine 12266 [running]:
orbs-acceptance_1  | testing.tRunner.func1(0xc4203c4690)
orbs-acceptance_1  |    /usr/local/go/src/testing/testing.go:742 +0x29d
orbs-acceptance_1  | panic(0x986b40, 0xc420439240)
orbs-acceptance_1  |    /usr/local/go/src/runtime/panic.go:502 +0x229
orbs-acceptance_1  | github.com/orbs-network/orbs-network-go/test/harness/services/blockstorage/adapter.(*inMemoryBlockPersistence).WaitForTransaction(0xc420244740, 0xa87480, 0xc420244640, 0xc420b06400, 0x20, 0x20, 0x3b9aca00, 0x20)
orbs-acceptance_1  |    /go/src/github.com/orbs-network/orbs-network-go/test/harness/services/blockstorage/adapter/memory_persistence.go:57 +0x37f
orbs-acceptance_1  | github.com/orbs-network/orbs-network-go/test/harness.(*inProcessNetwork).WaitForTransactionInStateForAtMost(0xc420355e50, 0xa87480, 0xc420244640, 0x0, 0xc420b06400, 0x20, 0x20, 0x3b9aca00)
orbs-acceptance_1  |    /go/src/github.com/orbs-network/orbs-network-go/test/harness/network.go:85 +0xa5
orbs-acceptance_1  | github.com/orbs-network/orbs-network-go/test/harness.(*inProcessNetwork).WaitForTransactionInState(0xc420355e50, 0xa87480, 0xc420244640, 0x0, 0xc420b06400, 0x20, 0x20)
orbs-acceptance_1  |    /go/src/github.com/orbs-network/orbs-network-go/test/harness/network.go:81 +0x73
orbs-acceptance_1  | github.com/orbs-network/orbs-network-go/test/acceptance.sendTransfersAndAssertTotalBalance(0xa87480, 0xc420244640, 0xa898e0, 0xc420355e50, 0xc4203c4690, 0x64)
orbs-acceptance_1  | + grep -B 150 -A 150 -- 'test timed out' test.out
orbs-acceptance_1  | + cat timed.out
orbs-acceptance_1  | + '[' '!' -s fail.out ]
orbs-acceptance_1  | + mkdir -p logs
orbs-acceptance_1  | + cp fail.out test.out timed.out logs
orbs-acceptance_1  | + exit 1
project_orbs-acceptance_1 exited with code 1
talkol commented 6 years ago

See PR to mitigate the issue with point 3: https://github.com/orbs-network/orbs-network-go/pull/382

The idea is to keep the short timeouts so we fail faster, but find deadlocking goroutines by printing all stack traces in this case

electricmonk commented 6 years ago

@jlevison @talkol are we good? can we close this issue?