jupyterhub / binderhub

Run your code in the cloud, with technology so advanced, it feels like magic!
https://binderhub.readthedocs.io

If build pod is deleted check status instead of assuming Succeeded #1525

Closed · manics closed this 2 years ago

manics commented 2 years ago

Sometimes a build fails but BinderHub still attempts to launch the image.

I'm not sure if this is a race condition due to the pod exiting before BinderHub receives an error log from repo2docker, but it occurs every time I try to build https://gist.github.com/manics/3c1a4babe6e9c368ee995c703fc0a2a9 on mybinder.org:

[screenshot]

The log shows that postBuild fails with a non-zero error code as expected, but BinderHub continues to attempt to launch it... and obviously fails since the image wasn't built!

With this change, the pod status/phase is checked instead of always being assumed to be Succeeded.

This ensures BinderHub will not attempt to launch the image, but I don't think it correctly updates the front-end UI to an error status; instead the progress bar is stuck on Building. (Ignore the rest of the build logs, I'm playing with podman :smile:) [screenshot]
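For context, here is a minimal sketch of the idea, not the actual BinderHub diff: the function and callback names (`finish_build`, `send_progress`) and the `BuildStatus` values are illustrative assumptions, and `pod` is assumed to be a `kubernetes.client.V1Pod`-like object.

```python
from enum import Enum


class BuildStatus(Enum):
    """Illustrative build states; BinderHub's real ProgressEvent.BuildStatus may differ."""
    COMPLETED = "completed"
    FAILED = "failed"


def finish_build(pod, send_progress):
    """Report a final status based on the build pod's terminal phase.

    A deleted or failed pod must not be reported as a success, otherwise
    BinderHub tries to launch an image that was never built.
    """
    phase = pod.status.phase if pod and pod.status else None
    if phase == "Succeeded":
        send_progress(BuildStatus.COMPLETED)
    else:
        # Pod was deleted, evicted, or exited non-zero (e.g. postBuild failed):
        # surface it as a failure instead of assuming success.
        send_progress(BuildStatus.FAILED)
```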

minrk commented 2 years ago

Great job tracking this down! It's been a mystery to me for ages.

I think the failure to update is because of the use of the string 'failure' instead of 'failed' here. Everywhere else, we use 'failed' to indicate that the build has failed.

I'm not sure if there's a deliberate distinction between 'failure' and 'failed'? But only 'failed' is handled in the js.

We might reduce the likelihood of bugs like this if we used the ProgressEvent.BuildStatus enum as a str enum and sent the string itself. Then it would be easier to keep track of all the valid values. A rough sketch of what that could look like is below.
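A hedged sketch of that suggestion; the member names and the `emit_progress`/`send_event` helpers are illustrative, not BinderHub's actual definitions:

```python
from enum import Enum


class BuildStatus(str, Enum):
    # Illustrative values; the real ProgressEvent.BuildStatus may differ.
    RUNNING = "running"
    BUILT = "built"
    FAILED = "failed"  # the only failure string the front-end JS handles


def emit_progress(send_event, status: BuildStatus, message: str = ""):
    # Because BuildStatus subclasses str, the enum member carries the exact
    # wire string, so a stray 'failure' can't be sent by accident and all
    # valid values live in one place.
    send_event({"phase": status.value, "message": message})
```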

minrk commented 2 years ago

> Do you think it's worth standardising on "failed" in repo2docker?

Yeah, if these are supposed to mean something to computers and not just humans, they should probably be consistent. Unless there's a useful distinction, my guess is it was an oversight.

manics commented 2 years ago

I'll need to dig into the failing/hanging tests.

manics commented 2 years ago

And now action-k3s-helm is failing: https://github.com/k3s-io/k3s/releases/download/latest/sha256sum-amd64.txt isn't working.

consideRatio commented 2 years ago

> And now action-k3s-helm is failing: https://github.com/k3s-io/k3s/releases/download/latest/sha256sum-amd64.txt isn't working.

Oh =/ That should go away soon; they publish the checksums via another workflow triggered by the release.


Never mind, I'm not sure what's going wrong. They haven't cut a release or anything similar in the last hour or so.

manics commented 2 years ago

Tests are passing after a rebase, though I don't know why they were failing before.

consideRatio commented 2 years ago

I saw a 503 from GitLab, so maybe there are intermittent issues with GitLab?