nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io
Apache License 2.0
2.76k stars 629 forks source link

Let azure batch checks optionaly be ignored #4792

Open fabien-nisol opened 8 months ago

fabien-nisol commented 8 months ago

New feature

There is already a feature in the azure batch plugin that will allow node pool allocation failures to be retried.

This works if a failure, e.g. a quota being breached, is happening while a pipeline is running and it successfully validated during setup.

The problem is that if the node pool is already failed when the pipeline is starting, the "checkPool" method used during initialization of the batch service will make any starting pipeline fail right away, bypassing the retry policy

Usage scenario

In our case, we want the retryPolicy to prevail in such cases. Our retry policy is set up to allow this error to be retried, because in theory the situation should resolve by itself while the already running pipeline release pressure on the node pool

Suggest implementation

Remove the resize error check from the checkPool method, or make it de-activable by configuration. This should allow the pipeline to go start, at which point we'll catch the resize error and it will be retried.

luanjot commented 4 months ago

Hi,

I would like to bump this ticket because I think that the logic here is flawed:

https://github.com/nextflow-io/nextflow/blob/master/plugins/nf-azure/src/main/nextflow/cloud/azure/batch/AzBatchService.groovy#L578

        else if ( pool.resizeErrors && pool.currentDedicatedNodes==0 ) {
            throw new IllegalStateException("Azure Batch pool '${pool.id}' has resize errors")
        }

In this scenario, the system might have many running spot instances and it just cannot scale further because there is no quota, for example. I do not think this is a reason to not start a job. The nodes might have enough capacity.

I suggest to change it for:

        else if ( pool.resizeErrors && pool.currentDedicatedNodes==0 && pool.currentDedicatedNodes==0 ) {
            throw new IllegalStateException("Azure Batch pool '${pool.id}' has resize errors and no agents are available")
        }
pditommaso commented 4 months ago

Is there a type in this expression pool.resizeErrors && pool.currentDedicatedNodes==0 && pool.currentDedicatedNodes ?

luanjot commented 4 months ago

Sorry, I meant

else if ( pool.resizeErrors && pool.currentDedicatedNodes==0 && pool.currentLowPriorityNodes==0 ) {
            throw new IllegalStateException("Azure Batch pool '${pool.id}' has resize errors and no agents are available")
        }
pditommaso commented 4 months ago

I see. I'd suggest to submit a pull request to address that