uber / cadence

Cadence is a distributed, scalable, durable, and highly available orchestration engine to execute asynchronous long-running business logic in a scalable and resilient way.
https://cadenceworkflow.io
MIT License
7.97k stars 773 forks source link

Suppress test failures due to logs after tests complete #6067

Closed Groxx closed 1 month ago

Groxx commented 1 month ago

"Fixes" test failures like this, by sweeping them under the rug:

2024/05/28 21:39:29 ----- Done -----
2024/05/28 21:39:29 Schema setup complete
PASS
coverage: 27.3% of statements in github.com/uber/cadence/client/..., github.com/uber/cadence/common/..., github.com/uber/cadence/host/..., github.com/uber/cadence/service/..., github.com/uber/cadence/tools/...

panic: Log in goroutine after TestIntegrationSuite has completed: 2024-05-28T21:39:39.501Z  DEBUG   Selected default store shard for tasklist   {"store-shard": "NonShardedStore", "wf-task-list-name": "9985f719-4b6a-4f0a-97c7-41a9e00d2414", "logging-call-at": "sharding_policy.go:100"}

goroutine 72245 [running]:
testing.(*common).logDepth(0xc0016ee680, {0xc00695ae00, 0xd5}, 0x3)
    /usr/local/go/src/testing/testing.go:1028 +0x6d4
testing.(*common).log(...)
    /usr/local/go/src/testing/testing.go:1010
testing.(*common).Logf(0xc0016ee680, {0x4a884b4, 0x2}, {0xc0021324b0, 0x1, 0x1})
    /usr/local/go/src/testing/testing.go:1061 +0xa5
go.uber.org/zap/zaptest.testingWriter.Write({{0x662be10?, 0xc0016ee680?}, 0xa0?}, {0xc003ecf400, 0xd6, 0x400})
    /go/pkg/mod/go.uber.org/zap@v1.13.0/zaptest/logger.go:130 +0x11e
go.uber.org/zap/zapcore.(*ioCore).Write(0xc0030ad920, {0xff, {0xc18db1a6dddeef12, 0xa3b5c069bd, 0x8268f40}, {0x0, 0x0}, {0x4af6670, 0x29}, {0x0, ...}, ...}, ...)
...

When the test completes, it will simply log to stderr rather than the test logger:

❯ go test -count 1 -v ./...
=== RUN   TestLoggerShouldNotFailIfLoggedLate
--- PASS: TestLoggerShouldNotFailIfLoggedLate (0.00s)
PASS
2024-05-29T16:20:50.742-0500    INFO    COULD FAIL TEST "TestLoggerShouldNotFailIfLoggedLate", logged too late: too late, orig{"logging-call-at": "testlogger_test.go:41", "log_stack": "github.com/uber/cadence/common/log/testlogger.(*fallbackTestCore).Write\n\t/User...
2024-05-29T16:20:50.742-0500    INFO    COULD FAIL TEST "TestLoggerShouldNotFailIfLoggedLate", logged too late: too late, with{"actor-id": "testing", "logging-call-at": "testlogger_test.go:42", "log_stack": "github.com/uber/cadence/common/log/testlogger.(*fallbackTestCore).Write\n\t/User...
ok      github.com/uber/cadence/common/log/testlogger   0.586s

Ignoring the correctness part of the problem, this gives us the best of both worlds:

I am not overly fond of this, but some of our largest tests are quite flaky due to these logs, and that really isn't helping anything. Seems like a reasonable tradeoff, and we can move to an opt-in model eventually instead of applying it everywhere like this PR does.

Groxx commented 1 month ago

unfortunately not sure which test is producing these, AFAICT they're not testlogger-using, so I've just replaced everything I can find:

panic: Log in goroutine after TestIntegrationSuite has completed: 2024-05-28T22:34:12.141Z  ERROR   Persistent store operation failure  {"component": "matching-engine", "wf-domain-name": "integration-test-domain-a76efb8a-48f9-47fb-a74a-9e91f7713bd7", "wf-task-list-name": "integration-continue-as-new-workflow-timeout-test-tasklist", "wf-task-list-type": 0, "store-operation": "update-task-list", "error": "UpdateTaskList operation failed. Error: session has been closed", "logging-call-at": "task_reader.go:224"}
goroutine 74122 [running]:
testing.(*common).logDepth(0xc0014a2820, {0xc001b0c1c0, 0x1bb}, 0x3)
    /usr/local/go/src/testing/testing.go:1028 +0x6d4
testing.(*common).log(...)
    /usr/local/go/src/testing/testing.go:1010
testing.(*common).Logf(0xc0014a2820, {0x4a89a74, 0x2}, {0xc00732a290, 0x1, 0x1})
    /usr/local/go/src/testing/testing.go:1061 +0xa5
go.uber.org/zap/zaptest.testingWriter.Write({{0x7f1dc0d489c0?, 0xc0014a2820?}, 0x80?}, {0xc006340800, 0x1bc, 0x400})
    /go/pkg/mod/go.uber.org/zap@v1.13.0/zaptest/logger.go:130 +0x11e
go.uber.org/zap/zapcore.(*ioCore).Write(0xc008adac00, {0x2, {0xc18db4d9086c398b, 0x9eec277b0a, 0x826bf60}, {0x0, 0x0}, {0x4adb3e2, 0x22}, {0x0, ...}, ...}, ...)
    /go/pkg/mod/go.uber.org/zap@v1.13.0/zapcore/core.go:90 +0x193
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc005404210, {0xc008284400, 0x3, 0x4})
    /go/pkg/mod/go.uber.org/zap@v1.13.0/zapcore/entry.go:216 +0x210
go.uber.org/zap.(*Logger).Error(0xc008581080, {0x4adb3e2, 0x22}, {0xc008284400, 0x3, 0x4})
    /go/pkg/mod/go.uber.org/zap@v1.13.0/logger.go:203 +0x68
github.com/uber/cadence/common/log/loggerimpl.(*loggerImpl).Error(0xc0073c5620, {0x4adb3e2, 0x22}, {0xc0071f9480, 0x2, 0x2})
    /cadence/common/log/loggerimpl/logger.go:150 +0x19f
github.com/uber/cadence/service/matching/tasklist.(*taskReader).getTasksPump(0xc003ada000)
    /cadence/service/matching/tasklist/task_reader.go:224 +0xe17
github.com/uber/cadence/service/matching/tasklist.(*taskReader).Start.func2()
    /cadence/service/matching/tasklist/task_reader.go:141 +0x98
created by github.com/uber/cadence/service/matching/tasklist.(*taskReader).Start in goroutine 74130
    /cadence/service/matching/tasklist/task_reader.go:139 +0x38e
FAIL    github.com/uber/cadence/host    682.736s
FAIL

If this commit doesn't work, I'll have to see if I screwed up the implementation.

gonna rerun a few times and check output, I believe I should see "bad" logs or it's probably just lacking evidence that it fixed anything.

codecov[bot] commented 1 month ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 68.16%. Comparing base (5f45da5) to head (01b0a67). Report is 2 commits behind head on master.

Additional details and impacted files [see 13 files with indirect coverage changes](https://app.codecov.io/gh/uber/cadence/pull/6067/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=uber) ------ [Continue to review full report in Codecov by Sentry](https://app.codecov.io/gh/uber/cadence/pull/6067?dropdown=coverage&src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=uber). > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=uber) > `Δ = absolute (impact)`, `ø = not affected`, `? = missing data` > Powered by [Codecov](https://app.codecov.io/gh/uber/cadence/pull/6067?dropdown=coverage&src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=uber). Last update [5f45da5...01b0a67](https://app.codecov.io/gh/uber/cadence/pull/6067?dropdown=coverage&src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=uber). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=uber).
Groxx commented 1 month ago

.... I think I'm going to sit on this for a while. I tried to construct a test that would show if I implemented this correctly, and it pretty randomly fails or succeeds in super duper confusing ways. Even several seconds later, when waiting in a different test entirely, it still doesn't always mark t.done = true in the test for some reason, and doesn't always notice the log. Which seems insane from the implementation.

Not sure if my Go is screwing up or blending caches or if this log-detecting is just fundamentally unreliable for some reason.


edit: well. it's fundamentally unreliable. https://github.com/golang/go/issues/67701 I'm not sure if this is something that'll change or not. it may fail too many tests...