nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io
Apache License 2.0

Azure Batch being out of capacity causes Nextflow to fail. #3257

Open adamrtalbot opened 2 years ago

adamrtalbot commented 2 years ago

Bug report

When a pipeline runs in an Azure region where the requested virtual machine size is out of capacity, Nextflow dies immediately.

Expected behavior and actual behavior

It should submit the autoscale formula, then allow Azure Batch to handle the rescaling.
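To make the expected behavior concrete, here is a minimal sketch of a pool check that tolerates capacity-related resize errors instead of throwing. It is not Nextflow's actual `AzBatchService.checkPool` code and does not use the Azure SDK: `ResizeErrorSketch`, `PoolCheckSketch`, `decide`, and the code `ExampleCapacityCode` are all hypothetical names chosen for illustration. Which real resize error codes indicate a capacity shortage would need to be confirmed against the Azure Batch documentation.

```java
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a pool resize error; NOT the Azure SDK type.
final class ResizeErrorSketch {
    final String code;
    final String message;

    ResizeErrorSketch(String code, String message) {
        this.code = code;
        this.message = message;
    }
}

final class PoolCheckSketch {
    // Placeholder for capacity-related codes. "ExampleCapacityCode" is made up;
    // the real codes must be looked up in the Azure Batch documentation.
    static final Set<String> CAPACITY_CODES = Set.of("ExampleCapacityCode");

    enum Action { PROCEED, WAIT_FOR_AUTOSCALE, FAIL }

    // Decide what to do with the resize errors reported for a pool:
    // - no errors: carry on submitting tasks
    // - only capacity errors: keep the pool and let autoscale retry later
    // - anything else: fail, as the current behavior does for every error
    static Action decide(List<ResizeErrorSketch> errors) {
        if (errors.isEmpty()) {
            return Action.PROCEED;
        }
        for (ResizeErrorSketch error : errors) {
            if (!CAPACITY_CODES.contains(error.code)) {
                return Action.FAIL;
            }
        }
        return Action.WAIT_FOR_AUTOSCALE;
    }

    public static void main(String[] args) {
        List<ResizeErrorSketch> errors =
            List.of(new ResizeErrorSketch("ExampleCapacityCode", "region out of capacity"));
        System.out.println(decide(errors));
    }
}
```

With a decision like this, only the `FAIL` branch would raise the `IllegalStateException` shown in the program output below; the `WAIT_FOR_AUTOSCALE` branch would log a warning and leave the rescaling to Azure Batch.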

Steps to reproduce the problem

Unfortunately, Azure is fairly opaque about how many machines are available in a region. I have reached out to them to see if they can help.

Program output

Caused by:
  Azure Batch pool 'nf_worker_pool' has resize errors

java.lang.IllegalStateException: Azure Batch pool 'nf_worker_pool' has resize errors
    at nextflow.cloud.azure.batch.AzBatchService.checkPool(AzBatchService.groovy:484)
    at nextflow.cloud.azure.batch.AzBatchService.getOrCreatePool(AzBatchService.groovy:536)
    at nextflow.cloud.azure.batch.AzBatchService.submitTask(AzBatchService.groovy:271)
    at nextflow.cloud.azure.batch.AzBatchTaskHandler.submit(AzBatchTaskHandler.groovy:93)
    at nextflow.processor.TaskPollingMonitor.submit(TaskPollingMonitor.groovy:196)
    at nextflow.processor.TaskPollingMonitor.submitPendingTasks(TaskPollingMonitor.groovy:562)
    at nextflow.processor.TaskPollingMonitor.submitLoop(TaskPollingMonitor.groovy:387)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:568)
    at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:107)
    at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:323)
    at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1268)
    at groovy.lang.MetaClassImpl.invokeMethodClosure(MetaClassImpl.java:1048)
    at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1142)
    at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1035)
    at groovy.lang.Closure.call(Closure.java:412)
    at groovy.lang.Closure.call(Closure.java:406)
    at groovy.lang.Closure.run(Closure.java:493)
    at java.base/java.lang.Thread.run(Thread.java:833)

Environment

Additional context

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

pditommaso commented 1 year ago

@adamrtalbot is this still an issue?

adamrtalbot commented 1 year ago

Yep, as far as I'm aware. It's a really hard thing to test because you need to exhaust all the machines in the region. FPGAs perhaps?

mouzkolit commented 10 months ago

Are there any updates on this particular issue? We have also run into this error multiple times. I thought it might be possible to use error strategies to resolve this and retry, but I can't find a specific error code.

adamrtalbot commented 10 months ago

Not really - we haven't got a reliable way of recreating it. Once we have that, it shouldn't be too hard to fix. If you've run into it multiple times, could you provide a step-by-step for recreating the issue here?