medic / cht-core

The CHT Core Framework makes it faster to build responsive, offline-first digital health apps that equip health workers to provide better care in their communities. It is a central resource of the Community Health Toolkit.
https://communityhealthtoolkit.org
GNU Affero General Public License v3.0
469 stars 217 forks source link

Workflows repeatedly fail in e2e tests for muting and unmuting #9391

Open witash opened 2 months ago

witash commented 2 months ago

Describe the issue for an offline user should mute and unmute a clinic while "offline", without processing in between and for an offline user should mute and unmute a person while "offline", with processing in between in /tests/e2e/default/transitions/client-side-muting.wdio-spec.js fail for about 2/3 workflow runs

[chrome 128.0.6613.84 linux #0-2] Muting 
[chrome 128.0.6613.84 linux #0-2]     for an online user
[chrome 128.0.6613.84 linux #0-2]        ✓ should not process client-side when muting as an online user
[chrome 128.0.6613.84 linux #0-2] 
[chrome 128.0.6613.84 linux #0-2]     for an offline user
[chrome 128.0.6613.84 linux #0-2]        ✓ should not process muting client-side if not enabled
[chrome 128.0.6613.84 linux #0-2]        ✖ should mute and unmute a person while "offline", with processing in between
[chrome 128.0.6613.84 linux #0-2]        ✖ "after each" hook for for an offline user
[chrome 128.0.6613.84 linux #0-2]     
[chrome 128.0.6613.84 linux #0-2] 2 passing (1m 50.4s)
[chrome 128.0.6613.84 linux #0-2] 1 failing
[chrome 128.0.6613.84 linux #0-2]
[chrome 128.0.6613.84 linux #0-2] 1) for an offline user should mute and unmute a person while "offline", with processing in between
[chrome 128.0.6613.84 linux #0-2] element ("#header-dropdown li.sync-status .success") still not displayed after 20000ms
[chrome 128.0.6613.84 linux #0-2] Error: element ("#header-dropdown li.sync-status .success") still not displayed after 20000ms
[chrome 128.0.6613.84 linux #0-2]     at async syncAndWaitForSuccess (/home/runner/work/cht-core/cht-core/tests/page-objects/default/common/common.wdio.page.js:329:3)
[chrome 128.0.6613.84 linux #0-2]     at async Object.sync (/home/runner/work/cht-core/cht-core/tests/page-objects/default/common/common.wdio.page.js:350:3)
[chrome 128.0.6613.84 linux #0-2]     at async updateClientSideMutingSettings (/home/runner/work/cht-core/cht-core/tests/e2e/default/transitions/client-side-muting.wdio-spec.js:225:9)
[chrome 128.0.6613.84 linux #0-2]
[chrome 128.0.6613.84 linux #0-2] 2) for an offline user "after each" hook for for an offline user
[chrome 128.0.6613.84 linux #0-2] element ("#header-dropdown li.sync-status .success") still not displayed after 20000ms
[chrome 128.0.6613.84 linux #0-2] Error: element ("#header-dropdown li.sync-status .success") still not displayed after 20000ms
[chrome 128.0.6613.84 linux #0-2]     at async syncAndWaitForSuccess (/home/runner/work/cht-core/cht-core/tests/page-objects/default/common/common.wdio.page.js:329:3)
[chrome 128.0.6613.84 linux #0-2]     at async Object.sync (/home/runner/work/cht-core/cht-core/tests/page-objects/default/common/common.wdio.page.js:350:3)
[chrome 128.0.6613.84 linux #0-2]     at async Context.<anonymous> (/home/runner/work/cht-core/cht-core/tests/e2e/default/transitions/client-side-muting.wdio-spec.js:266:7)

https://github.com/medic/cht-core/actions/runs/10610097929/job/29407417054

Because it does pass 1/3 of the time, the assumption is that the test is broken, rather than the feature, but need to confirm.

Describe the improvement you'd like Fix the test. First thing to try is increasing the timeout, although would be better to fix the underlying issue; 20s should be plenty.

garethbowen commented 2 months ago

@witash I think this is a duplicate as the error is exactly the same as in this issue: https://github.com/medic/cht-core/issues/8936#issuecomment-2283448932 which, if so, is probably caused by #7375

If so, please connect with @dianabarsan as she's started looking in to this and can probably give you some tips on how to go about fixing it.

dianabarsan commented 2 months ago

I've been meaning to change how the service worker is updated for such a long time. I'm down to put some time into it for next week.

dianabarsan commented 2 months ago

@witash is right, I just reran this test 10 times and it kept failing. I ended up disabling it on those branches.

dianabarsan commented 2 months ago

It appears that one of the muting forms fails to submit - the submit report is stuck on loading: image

dianabarsan commented 2 months ago

Ok. I dug further into this:

On CI this fails with: image

Looking at browser logs I see:

[error] Console Api Event: [{"type":"object","className":"ye","description":"ye","objectId":"-5119646230267331109.9.1","preview":{"type":"object","description":"ye","overflow":false,"properties":[{"name":"stack","type":"string","value":"TypeError: Failed to fetch\n    at https://localhos…at ne.run (https://localhost/polyfills.js:1:6297)"},{"name":"message","type":"string","value":"Failed to fetch"},{"name":"docId","type":"string","value":"_local/RKqlK_XzqTqVNyk1pzNeHg=="},{"name":"result","type":"object","value":"Object"}]}}]
[error] Console Api Event: [{"type":"string","value":"Error replicating to remote server"},{"type":"object","className":"ye","description":"ye","objectId":"-5119646230267331109.9.2","preview":{"type":"object","description":"ye","overflow":false,"properties":[{"name":"stack","type":"string","value":"TypeError: Failed to fetch\n    at https://localhos…at ne.run (https://localhost/polyfills.js:1:6297)"},{"name":"message","type":"string","value":"Failed to fetch"},{"name":"docId","type":"string","value":"_local/RKqlK_XzqTqVNyk1pzNeHg=="},{"name":"result","type":"object","value":"Object"}]}}]
[error] Console Api Event: [{"type":"string","value":"Error replicating to remote server"},{"type":"object","className":"ye","description":"ye","objectId":"-5119646230267331109.9.3","preview":{"type":"object","description":"ye","overflow":false,"properties":[{"name":"stack","type":"string","value":"TypeError: Failed to fetch\n    at https://localhos…at ne.run (https://localhost/polyfills.js:1:6297)"},{"name":"message","type":"string","value":"Failed to fetch"},{"name":"docId","type":"string","value":"_local/RKqlK_XzqTqVNyk1pzNeHg=="},{"name":"result","type":"object","value":"Object"}]}}]
[error] Console Api Event: [{"type":"object","className":"ye","description":"ye","objectId":"-5119646230267331109.9.4","preview":{"type":"object","description":"ye","overflow":false,"properties":[{"name":"stack","type":"string","value":"TypeError: Failed to fetch\n    at https://localhos…H.invoke (https://localhost/polyfills.js:1:10947)"},{"name":"message","type":"string","value":"Failed to fetch"},{"name":"docId","type":"string","value":"_local/RKqlK_XzqTqVNyk1pzNeHg%3D%3D"},{"name":"result","type":"object","value":"Object"}]}}]
[error] Console Api Event: [{"type":"string","value":"Error replicating to remote server"},{"type":"object","className":"ye","description":"ye","objectId":"-5119646230267331109.9.5","preview":{"type":"object","description":"ye","overflow":false,"properties":[{"name":"stack","type":"string","value":"TypeError: Failed to fetch\n    at https://localhos…H.invoke (https://localhost/polyfills.js:1:10947)"},{"name":"message","type":"string","value":"Failed to fetch"},{"name":"docId","type":"string","value":"_local/RKqlK_XzqTqVNyk1pzNeHg%3D%3D"},{"name":"result","type":"object","value":"Object"}]}}]
[error] Console Api Event: [{"type":"string","value":"Error replicating to remote server"},{"type":"object","className":"ye","description":"ye","objectId":"-5119646230267331109.9.6","preview":{"type":"object","description":"ye","overflow":false,"properties":[{"name":"stack","type":"string","value":"TypeError: Failed to fetch\n    at https://localhos…H.invoke (https://localhost/polyfills.js:1:10947)"},{"name":"message","type":"string","value":"Failed to fetch"},{"name":"docId","type":"string","value":"_local/RKqlK_XzqTqVNyk1pzNeHg%3D%3D"},{"name":"result","type":"object","value":"Object"}]}}]

And when I check api logs, do find a request for this _local doc that got interrupted (?):

2024-09-04T13:23:53.251 REQ: c7b252f8-da30-49f7-9acd-d4d13a782f01 172.18.0.1 - PUT /medic/_local/RKqlK_XzqTqVNyk1pzNeHg%3D%3D HTTP/1.0
2024-09-04T13:23:53.271 DEBUG: Checking for a configured outgoing message service 
2024-09-04T13:23:53.312 RES: c7b252f8-da30-49f7-9acd-d4d13a782f01 172.18.0.1 - PUT /medic/_local/RKqlK_XzqTqVNyk1pzNeHg%3D%3D HTTP/1.0 - - - ms

I'm not sure about the reason for this interruption. I'm not seeing this request in Haproxy.

dianabarsan commented 2 months ago

This means that this is unrelated to https://github.com/medic/cht-core/issues/7375

garethbowen commented 2 months ago

[error] Console Api Event: [{"type":"string","value":"Error replicating to remote server"}

I'm wondering if it's related to throttling the browser to offline then online again. Maybe going online is failing silently, or the await is resolving too quickly? If we're confident it's a testing only issue, then we could experiment with adding a pause after going back online, or even better waiting for a remote request to succeed before being confident that we're online again.

dianabarsan commented 2 months ago

I've moved this to 4.12.

related to throttling the browser to offline then online again

This could be. I've just implemented a retry and the error proved non persistent.

Locally. I had geolocation timing out, which caused the endless loading I detailed above. This seems to not affect the test when run in CI.

lorerod commented 3 weeks ago

Moved to 4.15.0 so as not to hold up the release.

tatilepizs commented 1 week ago

@dianabarsan are you working on fixing this test? I was planning on investigating to see if I could fix it, but first, I wanted to ensure I would not interfere.

dianabarsan commented 1 week ago

I tried ... got frustrated and set off to greener pastures. Have a knack at it!

latin-panda commented 5 days ago

Moving it 4.16