iterative / cml

♾️ CML - Continuous Machine Learning | CI/CD for ML
http://cml.dev
Apache License 2.0

Investigate buffering issues on GitHub runners #1301

Open 0x2b3bfa0 opened 1 year ago

0x2b3bfa0 commented 1 year ago

#1299 is a duct-tape workaround for what seems to be a buffering issue in the GitHub self-hosted runner's output, which prevents our logic from detecting in a timely manner when the runner transitions to a busy state; see https://github.com/iterative/cml/issues/1255#issuecomment-1367099955 for context.
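For readers unfamiliar with the log-parsing approach, here is a minimal sketch of what "detecting state from runner output" can look like. It is not CML's actual implementation: the `RunnerStateTracker` class is hypothetical, and the marker strings (`Running job:` and `Job ... completed with result:`) are assumptions about the runner's log format. It illustrates why buffering matters: no transition can be observed until the complete line has actually been delivered to the parser.

```typescript
// Hypothetical sketch: track runner state from its output stream.
// The marker patterns below are assumptions about the runner's log format.
type RunnerState = "idle" | "busy";

class RunnerStateTracker {
  state: RunnerState = "idle";
  lastChange: number;
  private pending = ""; // partial line carried over between chunks
  private now: () => number;

  constructor(now: () => number = Date.now) {
    this.now = now;
    this.lastChange = now();
  }

  // Feed one chunk of output; a chunk may end in the middle of a line.
  feed(chunk: string): void {
    this.pending += chunk;
    const lines = this.pending.split(/\r?\n/);
    // The last element is an incomplete line (or ""), so keep it for later.
    this.pending = lines.pop() ?? "";
    for (const line of lines) {
      this.handleLine(line);
    }
  }

  private handleLine(line: string): void {
    if (/Running job:/.test(line)) {
      this.transition("busy");
    } else if (/Job .+ completed with result:/.test(line)) {
      this.transition("idle");
    }
  }

  private transition(next: RunnerState): void {
    if (next !== this.state) {
      this.state = next;
      this.lastChange = this.now();
    }
  }
}

// Usage: the first chunk ends mid-line, so nothing is detected yet.
const tracker = new RunnerStateTracker();
tracker.feed("2022-12-30 18:00:00Z: Running job");
const stateAfterPartialLine = tracker.state; // still "idle"
tracker.feed(": train\n");
const stateAfterFullLine = tracker.state; // now "busy"
```

If the upstream process holds a line in its own buffer, a parser like this cannot see it, no matter how carefully the chunks are reassembled on the receiving side.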

dacbd commented 1 year ago

Hmm, I suspect this is the source of my failed termination issues on cml-playground. I'd say I have observed the opposite, though: the runner detects the first job and enters its busy state, but fails to detect the job finishing and never enters an idle state.

tasdomas commented 1 year ago

Is there no better way of detecting the runner state, other than parsing logs?

dacbd commented 1 year ago

We also call the GitHub API, but that has its own issues, and it's only used as a final validation.
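As a rough illustration of "final validation", the sketch below combines the log-derived state with the `busy` flag that GitHub's `GET /repos/{owner}/{repo}/actions/runners` endpoint reports for each runner. The `canTerminate` helper, its parameters, and the choice to trust the logs when the runner is missing from the listing are all assumptions made for this example, not a description of CML's code.

```typescript
// Subset of the fields of one entry in the `runners` array returned by
// GET /repos/{owner}/{repo}/actions/runners.
interface RunnerInfo {
  name: string;
  status: string; // e.g. "online" or "offline"
  busy: boolean;
}

// Hypothetical helper: decide whether a runner may be shut down.
function canTerminate(
  logSaysIdle: boolean, // state derived from parsing the runner output
  idleForMs: number, // how long the logs have shown the runner as idle
  idleTimeoutMs: number, // configured idle timeout
  runners: RunnerInfo[], // listing fetched from the GitHub API
  runnerName: string
): boolean {
  // The logs are the primary signal: bail out early if they show activity.
  if (!logSaysIdle || idleForMs < idleTimeoutMs) {
    return false;
  }
  const runner = runners.find((r) => r.name === runnerName);
  // Assumption: if the runner is absent from the listing, nothing
  // contradicts the logs, so termination is allowed.
  return runner === undefined || !runner.busy;
}
```

With this arrangement a stale "idle" reading caused by delayed output can still be vetoed by the API, although it does nothing for the opposite case, where the logs stay stuck on "busy".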

0x2b3bfa0 commented 1 year ago

> the runner detects the first job and enters its busy state but fails to detect the job finishing and never enters an idle state

Sounds, indeed, like a buffering issue. 🤔 Node.js's treatment of file descriptors is, *cough*, very particular, and anything is possible.

0x2b3bfa0 commented 1 year ago

> Is there no better way of detecting the runner state, other than parsing logs?

At least, none known to me. 😰