oda-hub / oda-bot


Add a way to trigger redeployment of a specific project #35

Open · dsavchenko opened this issue 10 months ago

dsavchenko commented 10 months ago

If the project failed to deploy, this is written in the status and no redeployment attempts happen until a new commit. Currently I sometimes trigger a rebuild with an empty commit, but it would be good to have a way to do it without touching the repo. Any ideas @volodymyrss?

volodymyrss commented 10 months ago

Why do you need to trigger it? Is there some change that provides the fix?

The deployment version can include both the repository and the bot version, so if the bot is improved you can redeploy.
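For illustration, a minimal sketch of that idea (hypothetical names, not the actual oda-bot code):

```python
# Hypothetical sketch: derive the deployment version from both the
# repository state and the bot version, so that upgrading the bot changes
# the version and triggers a redeployment even without a new commit.
import subprocess

BOT_VERSION = "1.2.3"  # assumed to come from the bot's package metadata

def deployment_version(repo_path: str) -> str:
    # current commit of the project repository
    commit = subprocess.check_output(
        ["git", "rev-parse", "--short", "HEAD"], cwd=repo_path, text=True
    ).strip()
    return f"{commit}-bot{BOT_VERSION}"

def needs_redeploy(repo_path: str, deployed_version: str) -> bool:
    # redeploy when the computed version differs from the recorded one
    return deployment_version(repo_path) != deployed_version
```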

If the issue was temporary, the bot should retry periodically, with a changing period.

It's declarative, and the bot reconciles like Flux controllers do.
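Something like this, roughly (a hypothetical sketch of a reconcile loop with a growing retry period; the function names and periods are made up, not the actual bot implementation):

```python
# Hypothetical sketch of a reconcile loop that retries failed deployments
# with an increasing, capped period, in the style of a Flux controller.
import time

BASE_PERIOD = 60          # seconds between reconcile passes
MAX_RETRY_PERIOD = 3600   # cap for the retry backoff

def reconcile_forever(projects, deploy, get_status):
    retry_after = {}  # project -> (next_attempt_time, current_backoff)

    while True:
        now = time.time()
        for project in projects:
            if get_status(project) == "deployed":
                retry_after.pop(project, None)
                continue
            next_time, backoff = retry_after.get(project, (0, BASE_PERIOD))
            if now < next_time:
                continue  # wait until the current backoff expires
            try:
                deploy(project)
                retry_after.pop(project, None)
            except Exception:
                # retry later with a doubled period, up to the cap
                retry_after[project] = (
                    now + backoff, min(backoff * 2, MAX_RETRY_PERIOD)
                )
        time.sleep(BASE_PERIOD)
```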

dsavchenko commented 10 months ago

> The deployment version can include both the repository and the bot version, so if the bot is improved you can redeploy.

That's a good idea

> If the issue was temporary, the bot should retry periodically, with a changing period.

It's OK, but I didn't want the bot to spam the user with "failed" emails. Probably, sending an email only on the first failure and then only on success could be the way to go.
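Something like this as the policy, perhaps (just a sketch, with a hypothetical helper and state names):

```python
# Hypothetical sketch of the proposed notification policy: email on the
# first failure, stay silent on repeated failures, and email again only
# when the deployment finally succeeds.
from typing import Optional

def should_notify(previous_state: Optional[str], new_state: str) -> bool:
    if new_state == "failed":
        return previous_state != "failed"   # only the first failure in a row
    if new_state == "deployed":
        return previous_state == "failed"   # success after earlier failure(s)
    return False
```

That way repeated retries of the same failed deployment produce a single email, and the user hears back once when it recovers.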

volodymyrss commented 10 months ago

> If the issue was temporary, the bot should retry periodically, with a changing period.
>
> It's OK, but I didn't want the bot to spam the user with "failed" emails. Probably, sending an email only on the first failure and then only on success could be the way to go.

Ideally, only system issues would be retried, i.e. issues which cannot be interpreted as user-caused container build issues. For those, a user email might not be needed: users can do little with such messages.

But even if it's hard to distinguish, user emails should in any case be throttled as you say: one per failed deployment.

dsavchenko commented 10 months ago

It's probably possible to distinguish them. Kaniko propagates the exit code of the build, and it can be obtained from the failed pod status. When a pod is killed due to some platform problem, it will most probably have exit code 137 (still to be explored, ref). If the exit codes produced by build errors do not fall into the >128 range, we can use them to distinguish the cause of the failure. Also, if we know these exit codes well enough, we can make a further improvement by using a pod failure policy in the job definition.
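Roughly what I have in mind, as a sketch using the kubernetes Python client (the >128 classification is an assumption still to be verified, not something we do yet):

```python
# Hypothetical sketch: read the build container's exit code from the failed
# pod and classify the failure. Exit codes above 128 usually mean the process
# was killed by a signal (137 = 128 + SIGKILL, typical for OOM or node-level
# problems), so treat those as system failures worth retrying; anything else
# is assumed to be a user-caused build error.
from kubernetes import client, config

def classify_pod_failure(pod_name: str, namespace: str) -> str:
    config.load_incluster_config()  # or load_kube_config() outside the cluster
    pod = client.CoreV1Api().read_namespaced_pod(pod_name, namespace)

    for cs in pod.status.container_statuses or []:
        term = cs.state.terminated
        if term is not None and term.exit_code != 0:
            return "system" if term.exit_code > 128 else "user"
    return "unknown"
```

If the build-error exit codes turn out to be well defined, the same ranges could also go into a podFailurePolicy on the build job, so Kubernetes itself stops retrying user-caused failures.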

(This is more of a note for myself. All these changes go to nb2workflow)