JeffreyDevloo closed this issue 7 years ago
I wonder why that list is empty in the first place. We need to make sure that we understand the use case before we can handle this situation.
Issue is NOT FIXED by the above commit. I could not reproduce the issue, nor could I theoretically find a way to get into the situation where the index error is raised. I've added additional logging to help pinpoint where the problem is situated if we ever manage to reproduce it.
@kvanhijf if we can't reproduce the issue, what should be checked by QA to verify your code changes?
@wimpers: good question :) Nothing, I guess, because it's quite impossible to reproduce, if possible at all.
But code changes were made, so what were they for? Should they be reverted?
Code changes have been made to prevent the index-out-of-range error: a TimeoutError will be thrown instead IF we ever get into that path again. This means the initial job will never be launched and should be retried by the customer. To reproduce, the assumption is that it has something to do with starting a configure_disk job, restarting the workers on some node, and triggering another configure_disk job. But I didn't manage to trigger the path.
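For clarity, a minimal sketch of the kind of defensive change described above. The function and variable names here (`first_pending_slot`, `pending_slots`) are illustrative assumptions, not the actual code from the commit; the point is only the pattern of replacing a bare `list[0]` access with an explicit `TimeoutError` when the list is unexpectedly empty:

```python
def first_pending_slot(pending_slots):
    """Return the first pending slot for the configure_disk job.

    Previously, `pending_slots[0]` raised IndexError when the list
    was unexpectedly empty (the suspected race condition). Raising a
    TimeoutError instead means the job fails cleanly before launch
    and can simply be retried by the customer.
    """
    if not pending_slots:
        # Fail cleanly instead of crashing with IndexError.
        raise TimeoutError('No pending slot available; job not launched, please retry')
    return pending_slots[0]
```

With this guard, the caller sees a well-defined TimeoutError in the race-condition path rather than an unhandled IndexError.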
@wimpers, @saelbrec, it's a race condition between tasks on Celery, and indeed almost impossible to reproduce. But the code showed a path that could cause the race condition, and that path is now handled more correctly.
We have not encountered the error during two and a half months of installations and disk-role assignments on our nightly builds and on our manual setups. Therefore I am inclined to close this, given that the issue cannot be reproduced.
Latest packages at the time of writing:
We saw this on a hyperconverged node during role configuration: