Open fabien-nisol opened 8 months ago
Hi,
I would like to bump this ticket because I think that the logic here is flawed:
else if ( pool.resizeErrors && pool.currentDedicatedNodes==0 ) {
    throw new IllegalStateException("Azure Batch pool '${pool.id}' has resize errors")
}
In this scenario, the pool may already have many spot (low-priority) nodes running and simply cannot scale further, for example because a quota has been reached. I do not think this is a reason to refuse to start a job: the nodes that are already running might have enough capacity.
I suggest changing it to:
else if ( pool.resizeErrors && pool.currentDedicatedNodes==0 && pool.currentDedicatedNodes==0 ) {
throw new IllegalStateException("Azure Batch pool '${pool.id}' has resize errors and no agents are available")
}
Is there a typo in this expression? pool.resizeErrors && pool.currentDedicatedNodes==0 && pool.currentDedicatedNodes
Sorry, I meant:
else if ( pool.resizeErrors && pool.currentDedicatedNodes==0 && pool.currentLowPriorityNodes==0 ) {
    throw new IllegalStateException("Azure Batch pool '${pool.id}' has resize errors and no agents are available")
}
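To make the difference between the two conditions concrete, here is a minimal standalone sketch. It is written in plain Java so it can run on its own (the plugin itself is Groovy), and `PoolState`, `ResizeErrorCheck` and both method names are hypothetical stand-ins, not plugin code. Only the three pool fields referenced above are modeled.

```java
// Hypothetical stand-in for the pool object; only the fields used in the
// snippets above are modeled. This is not the plugin's real class.
class PoolState {
    final String id;
    final boolean resizeErrors;
    final int currentDedicatedNodes;
    final int currentLowPriorityNodes;

    PoolState(String id, boolean resizeErrors,
              int currentDedicatedNodes, int currentLowPriorityNodes) {
        this.id = id;
        this.resizeErrors = resizeErrors;
        this.currentDedicatedNodes = currentDedicatedNodes;
        this.currentLowPriorityNodes = currentLowPriorityNodes;
    }
}

class ResizeErrorCheck {
    // Condition quoted at the top of the thread: resize errors and no dedicated nodes.
    static boolean currentCheckFails(PoolState pool) {
        return pool.resizeErrors && pool.currentDedicatedNodes == 0;
    }

    // Proposed condition: fail only when no node of either kind is running.
    static boolean proposedCheckFails(PoolState pool) {
        return pool.resizeErrors
                && pool.currentDedicatedNodes == 0
                && pool.currentLowPriorityNodes == 0;
    }

    public static void main(String[] args) {
        // A pool that hit a resize error but still has 8 low-priority nodes running.
        PoolState pool = new PoolState("my-pool", true, 0, 8);
        System.out.println("current check fails:  " + currentCheckFails(pool));
        System.out.println("proposed check fails: " + proposedCheckFails(pool));
    }
}
```

For a pool with resize errors, no dedicated nodes and eight low-priority nodes, the current condition rejects the pool while the proposed one lets the job start. Both conditions still reject a pool that has resize errors and no nodes at all.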
I see. I'd suggest submitting a pull request to address that.
New feature
The Azure Batch plugin already has a feature that allows node pool allocation failures to be retried.
This works when a failure, e.g. a quota being exceeded, happens while a pipeline is already running, i.e. after the pool was successfully validated during setup.
The problem is that if the node pool is already in a failed state when the pipeline starts, the "checkPool" method used during initialization of the batch service makes the starting pipeline fail right away, bypassing the retry policy.
Usage scenario
In our case, we want the retryPolicy to prevail in such cases. Our retry policy is set up to allow this error to be retried, because in theory the situation should resolve by itself as the already running pipelines release pressure on the node pool.
Suggested implementation
Remove the resize error check from the checkPool method, or make it possible to disable it via configuration. This should allow the pipeline to start, at which point the resize error will be caught and retried.
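As a rough illustration of the "disable via configuration" variant, here is a minimal standalone sketch in plain Java (the plugin itself is Groovy). Everything in it is an assumption made for illustration: `PoolInfo`, `ConfigurablePoolCheck` and the `failOnResizeErrors` flag are hypothetical names, and no such setting is known to exist in the plugin.

```java
// Hypothetical stand-in for the pool object; not the plugin's real class.
class PoolInfo {
    final String id;
    final boolean resizeErrors;
    final int currentDedicatedNodes;

    PoolInfo(String id, boolean resizeErrors, int currentDedicatedNodes) {
        this.id = id;
        this.resizeErrors = resizeErrors;
        this.currentDedicatedNodes = currentDedicatedNodes;
    }
}

class ConfigurablePoolCheck {
    // Hypothetical option: when false, the start-up check ignores resize errors.
    private final boolean failOnResizeErrors;

    ConfigurablePoolCheck(boolean failOnResizeErrors) {
        this.failOnResizeErrors = failOnResizeErrors;
    }

    // Mirrors the condition quoted in this thread, gated by the hypothetical flag.
    void checkPool(PoolInfo pool) {
        if (failOnResizeErrors && pool.resizeErrors && pool.currentDedicatedNodes == 0) {
            throw new IllegalStateException(
                    "Azure Batch pool '" + pool.id + "' has resize errors");
        }
        // Otherwise the pipeline is allowed to start; a resize error that
        // surfaces later is left to the retry policy.
    }

    public static void main(String[] args) {
        PoolInfo failedPool = new PoolInfo("my-pool", true, 0);

        // Check disabled: start-up proceeds even though the pool has resize errors.
        new ConfigurablePoolCheck(false).checkPool(failedPool);
        System.out.println("check disabled: pipeline may start");

        // Check enabled: same behaviour as today, the pipeline fails immediately.
        try {
            new ConfigurablePoolCheck(true).checkPool(failedPool);
        } catch (IllegalStateException e) {
            System.out.println("check enabled: " + e.getMessage());
        }
    }
}
```

Keeping the check on by default would preserve today's behaviour, while users who rely on a retry policy could opt out of it.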