Flaky e2e test search matches telemetry

dianabarsan commented 2 weeks ago

Describe the issue

New telemetry tests appear to be flaking. Example run: https://github.com/medic/cht-core/actions/runs/11702688507/job/32592094267

dianabarsan commented 2 weeks ago

I had to re-run this job 6 times today: https://github.com/medic/cht-core/actions/runs/11717271778

dianabarsan commented 2 weeks ago

cc @m5r

dianabarsan commented 2 weeks ago

More sample runs (from other branches) where this test fails:

https://github.com/medic/cht-core/actions/runs/11720994250/job/32650105429?pr=9611
https://github.com/medic/cht-core/actions/runs/11720993712/job/32650884227?pr=9611

It looks like this is blocking PRs from getting merged. I'm suggesting we disable this test and prioritize stabilizing it before re-enabling.

dianabarsan commented 2 weeks ago

I've rerun this test at least 10 times and it has failed every time. I will disable it pending a fix.

m5r commented 1 week ago

I haven't been able to reproduce the issue locally so far. I tried throttling Chrome with wdio's browser.throttleCPU(8), and I also tried loading my actual CPU by making CouchDB re-index views so that it used >90% of my CPU.

dianabarsan commented 1 week ago

Are you running the whole e2e suite when trying to reproduce?

m5r commented 1 week ago

No, I tried running:

only the describe('search matches telemetry', ...)
only the telemetry.wdio-spec.js file
the telemetry.wdio-spec.js file + one or two additional test files

I'll run the whole suite and see if it triggers the bug more consistently than I have seen so far.

m5r commented 1 week ago

That doesn't seem to influence the flakiness of the telemetry tests. Locally, these 3 target accuracy tests failed repeatedly across most runs, but they seem to run fine in the CI:

[chrome 130.0.6723.116 linux #0-75] » /tests/e2e/default/targets/target-accuracy.wdio-spec.js
[chrome 130.0.6723.116 linux #0-75] Target accuracy
[chrome 130.0.6723.116 linux #0-75]    ✓ should save target document on first calculation
[chrome 130.0.6723.116 linux #0-75]    ✓ should save target document when targets change
[chrome 130.0.6723.116 linux #0-75]    ✓ should not save target document when editing counted contact
[chrome 130.0.6723.116 linux #0-75]    ✓ should not save target document when adding report for counted contact
[chrome 130.0.6723.116 linux #0-75]    ✖ should save target document when deleting counted contact (5 retries)
[chrome 130.0.6723.116 linux #0-75]    ✖ should save target doc once when getting many changes through replication (5 retries)
[chrome 130.0.6723.116 linux #0-75]    ✓ should only create one target doc
[chrome 130.0.6723.116 linux #0-75]    ✖ should handle old format of the rules-state-store (5 retries)

These two failed once

[chrome 130.0.6723.116 linux #0-28] » /tests/e2e/default/enketo/pregnancy-complete-a-delivery.wdio-spec.js
[chrome 130.0.6723.116 linux #0-28] Contact Delivery Form
[chrome 130.0.6723.116 linux #0-28]    ✓ Complete a delivery: Process a delivery with a live child and facility birth, verify that the past pregnancy card is present and the report was created,verify that the child registered during birth is created and displayed the proper information,verify that the targets page is updated
[chrome 130.0.6723.116 linux #0-28]    ✓ open, submit and edit (no changes) default delivery form
[chrome 130.0.6723.116 linux #0-28]    ✖ open, submit and edit default delivery form (5 retries)

[chrome 130.0.6723.116 linux #0-36] » /tests/e2e/default/enketo/submit-photo-upload-form.wdio-spec.js
[chrome 130.0.6723.116 linux #0-36] Submit Photo Upload form
[chrome 130.0.6723.116 linux #0-36]    ✖ "before all" hook for Submit Photo Upload form
[chrome 130.0.6723.116 linux #0-36]
[chrome 130.0.6723.116 linux #0-36] 1 failing (10m 0.1s)
[chrome 130.0.6723.116 linux #0-36]
[chrome 130.0.6723.116 linux #0-36] 1) Submit Photo Upload form "before all" hook for Submit Photo Upload form
[chrome 130.0.6723.116 linux #0-36] Timeout
[chrome 130.0.6723.116 linux #0-36] Error: Timeout
[chrome 130.0.6723.116 linux #0-36]     at listOnTimeout (node:internal/timers:581:17)
[chrome 130.0.6723.116 linux #0-36]     at processTimers (node:internal/timers:519:7)

But other than that, I couldn't reproduce the telemetry bug. I'll see if I can trigger it in the CI using this branch.

dianabarsan commented 1 week ago

Could it have been something on that date?

m5r commented 1 week ago

Now that's interesting, I managed to reproduce it by mocking the Date object and going back to the same point in time as the CI failures. Thanks for the idea! I'm sorting out a fix
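For anyone wanting to try the same approach, here is a minimal sketch of pinning "now" to a chosen day by swapping the global Date. It is not the code used in the actual test; the helper name mockDate and the chosen instant are assumptions for illustration.

```javascript
// Hypothetical helper (not from the cht-core test suite): pins "now" to a
// fixed instant by replacing the global Date with a subclass.
const RealDate = Date;

function mockDate(fixedNow) {
  const fixedTime = fixedNow.getTime();

  class MockDate extends RealDate {
    constructor(...args) {
      if (args.length === 0) {
        // No arguments means "now": return the pinned instant instead.
        super(fixedTime);
      } else {
        // Explicit arguments behave exactly like the real Date.
        super(...args);
      }
    }

    static now() {
      return fixedTime;
    }
  }

  globalThis.Date = MockDate;
  // Return a restore function so the mock does not leak into other tests.
  return () => {
    globalThis.Date = RealDate;
  };
}

// Local-time constructor: months are 0-based, so 10 means November.
const restore = mockDate(new RealDate(2024, 10, 6, 12, 0, 0));
const today = new Date();
console.log(today.getFullYear(), today.getMonth() + 1, today.getDate()); // 2024 11 6
restore();
```

In a browser-driven test the override would have to run inside the page rather than in the Node test process, so treat this only as an outline of the idea.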

m5r commented 1 week ago

I got around to it, and it was a disappointingly dumb bug in how the telemetry docs are fetched in the test. Telemetry databases have the date in their name, and the telemetry service doesn't pad the date's digits with leading zeros. For November 6th 2024, the service formats the date as 2024-11-6, while the test code formatted it as 2024-11-06 🤦‍♂ I don't know if not padding the digits is ISO 8601 compliant, but that's another issue. CI is running with the fixed test code.
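The mismatch can be sketched as follows. The two formatters and the telemetry- name prefix are hypothetical stand-ins, not the actual service or test code; the sketch also assumes the month is left unpadded in the same way as the day.

```javascript
// Hypothetical formatter mirroring the behaviour described above: no zero padding.
function unpaddedDate(date) {
  return [date.getFullYear(), date.getMonth() + 1, date.getDate()].join('-');
}

// Hypothetical formatter matching what the test originally assumed: zero-padded.
function paddedDate(date) {
  const pad = (n) => String(n).padStart(2, '0');
  return [date.getFullYear(), pad(date.getMonth() + 1), pad(date.getDate())].join('-');
}

// Hypothetical database name prefix, for illustration only.
const dbName = (dateString) => `telemetry-${dateString}`;

// Local-time constructor: months are 0-based, so 10 means November.
const nov6 = new Date(2024, 10, 6);
const nov12 = new Date(2024, 10, 12);

console.log(dbName(unpaddedDate(nov6))); // telemetry-2024-11-6
console.log(dbName(paddedDate(nov6)));   // telemetry-2024-11-06
console.log(unpaddedDate(nov6) === paddedDate(nov6));   // false
console.log(unpaddedDate(nov12) === paddedDate(nov12)); // true
```

Under these assumptions the two names only diverge when the day (or month) has a single digit, which would explain why the failures depended on the date and could not be reproduced a few days later.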

medic / cht-core

Flaky e2e test search matches telemetry #9622