Closed novemberkilo closed 1 year ago
Thanks for this PR. My main comment is about extracting the retry loop in a recursive function, so that we can retry 3 or 4 times instead of once.
@novemberkilo
Looking at existing failures in docs CI, I also see (for example in https://docs.ci.ocaml.org/job/2023-04-10/161235-voodoo-prep-11de8a) Disconnected: Switch turned off
.
I don't know where it's coming from but won't be handled by the current scheme as it doesn't appear in the logs.
One thing I suspect is the let** _ = Current_ocluster.Connection.run_job ... in
body won't execute if run_job
returns an error. In that case the error should also be considered as temporary and go in the retry loop.
@TheLortex I have updated the PR along the lines of your feedback. I am now trying to figure out how best to test this and I'm wondering how to mock the various things that go into Prep.build
... or whether refactoring the function so that I can avoid (most of) the mocking is the way to go. I'd appreciate any thoughts you may have on the matter. Thank you :)
Following a suggestion from @TheLortex this PR introduces a naive way of retrying on temporary failures. The point is to introduce a small change and see if the retry mechanism addresses #103 - if it does, we can proceed towards a solution that is more robust.