opensafely-core / ehrql

ehrQL: the electronic health record query language for OpenSAFELY
https://docs.opensafely.org/ehrql/

Overnight generative tests are locking up and then timing out #1796

Open evansd opened 9 months ago

evansd commented 9 months ago

The overnight tests appear to start running OK, complete a small number of batches (5 in one case, 8 in another), and then lock up and sit there for several hours until GitHub times them out, e.g.:

image

(Note the difference between the last two timestamps.)

This has happened twice in a row as of the time of writing. Current list of runs is at: https://github.com/opensafely-core/ehrql/actions/workflows/generative-tests.yml

This comes just after merging these changes:

Which is a bit suspicious, although I can't think what in those changes would cause this kind of behaviour.

evansd commented 9 months ago

After occurring twice in a row, this behaviour hasn't reappeared in the subsequent four runs, so it may just have been a GitHub glitch rather than a consequence of any changes we made. I'll leave this ticket open a bit longer and then close it if we don't see this happening again.

evansd commented 8 months ago

This is still happening quite regularly. It doesn't render the tests useless as they do still run correctly more often than not. But it does mean we're doing less testing than we might otherwise be and also that I (as the last person to touch the scheduled action definition) get spurious email notifications when this happens. (We don't get Slack notifications because there's no compute time left in which to send them.)

It's easy to identify this behaviour in the logs: it's every failed action whose runtime is just a few seconds over 6 hours. The list of runs is at: https://github.com/opensafely-core/ehrql/actions/workflows/generative-tests.yml
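For anyone wanting to pick these runs out automatically, the rule above could be sketched roughly as follows. This is only an illustration, not anything that exists in the repo: the record fields (`conclusion`, `run_started_at`, `updated_at`) are modelled on what GitHub's workflow-run API returns, the one-minute tolerance is my guess at what "a few seconds over" should allow, and the sample runs are made up.

```python
from datetime import datetime, timedelta

SIX_HOURS = timedelta(hours=6)
# Assumption: "a few seconds over" is taken to mean within one minute of the limit.
TOLERANCE = timedelta(minutes=1)


def parse_ts(value):
    """Parse an ISO 8601 UTC timestamp such as '2024-01-01T07:00:12Z'."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")


def looks_like_lockup(run, limit=SIX_HOURS, tolerance=TOLERANCE):
    """Return True for a failed run whose runtime is just over the time limit."""
    if run["conclusion"] != "failure":
        return False
    runtime = parse_ts(run["updated_at"]) - parse_ts(run["run_started_at"])
    return limit <= runtime <= limit + tolerance


# Made-up sample records, purely for illustration.
runs = [
    {"id": 1, "conclusion": "failure",
     "run_started_at": "2024-01-01T01:00:00Z", "updated_at": "2024-01-01T07:00:12Z"},
    {"id": 2, "conclusion": "success",
     "run_started_at": "2024-01-02T01:00:00Z", "updated_at": "2024-01-02T04:30:00Z"},
    {"id": 3, "conclusion": "failure",
     "run_started_at": "2024-01-03T01:00:00Z", "updated_at": "2024-01-03T01:20:00Z"},
]

locked_up = [run["id"] for run in runs if looks_like_lockup(run)]
print(locked_up)
```

Run 3 is deliberately a genuine, quick test failure, so that the filter is shown separating real failures from timeouts rather than just flagging every failed run.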