nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io
Apache License 2.0
[Nextflow + Azure Batch] Unable to find size for VM name 'Standard_E4ads_v5' and location 'germanywestcentral' #5076

Open landroutsosAIE opened 5 months ago

landroutsosAIE commented 5 months ago

Bug report

I am trying to use Azure Batch with Nextflow and Seqera, but I can't start any job: Nextflow reports that it cannot find a VM with the given name in the given region, which suggests the VM name or the location name is wrong, or that no VM of that name exists in that region.

Expected behavior and actual behavior

I am running this command as a test to check my Azure batch config at the Nextflow level:

nextflow run nf-core/rnaseq -profile test,docker -c .nextflow/azure_batch_19_06.config --outdir "az://firstcontainer/testrun_19_06/" -w "az://firstcontainer/work_19_06" -with-tower

My config file:

// Scale formula to use low-priority nodes only.
lowPriorityScaleFormula = '''
    lifespan = time() - time("{{poolCreationTime}}");
    interval = TimeInterval_Minute * {{scaleInterval}};
    $samples = $PendingTasks.GetSamplePercent(interval);
    $tasks = $samples < 70 ? max(0, $PendingTasks.GetSample(1)) : max($PendingTasks.GetSample(1), avg($PendingTasks.GetSample(interval)));
    $targetVMs = $tasks > 0 ? $tasks : max(0, $TargetLowPriorityNodes/2);
    targetPoolSize = max(0, min($targetVMs, {{maxVmCount}}));
    $TargetLowPriorityNodes = lifespan < interval ? {{vmCount}} : targetPoolSize;
    $TargetDedicatedNodes = 0;
    $NodeDeallocationOption = taskcompletion;
'''

process {
    executor = 'azurebatch'
    container = 'nfcore/rnaseq:latest'
    queue = 'Standard_E4_2ads_v5'
    withLabel:process_low {queue = 'Standard_E4_2ads_v5'}
    withLabel:process_medium {queue = 'Standard_E8_4ads_v5'}
    withLabel:process_high {queue = 'Standard_E16_8ads_v5'}
    withLabel:process_high_memory {queue = 'Standard_E32_16ads_v5'}
}
azure {
    storage {
        accountName = "<myaccountname>"
        accountKey = "<myaccountkey>"
    }
    batch {
        location = "germanywestcentral"
        accountName = "<mybatchname>"
        accountKey = "<myaccountkey>"

        autoPoolMode = false
        allowPoolCreation = true
        deletePoolsOnCompletion = true

        pools {
            Standard_E4_2ads_v5 {
                autoScale = true
                vmType = 'Standard_E4-2ads_v5'
                vmCount = 2
                maxVmCount = 20
                scaleFormula = lowPriorityScaleFormula
            }
            Standard_E8_4ads_v5 {
                autoScale = true
                vmType = 'Standard_E8-4ads_v5'
                vmCount = 2
                maxVmCount = 20
                scaleFormula = lowPriorityScaleFormula
            }
            Standard_E16_8ads_v5 {
                autoScale = true
                vmType = 'Standard_E16-8ads_v5'
                vmCount = 2
                maxVmCount = 20
                scaleFormula = lowPriorityScaleFormula
            }
            Standard_E32_16ads_v5 {
                autoScale = true
                vmType = 'Standard_E32-16ads_v5'
                vmCount = 2
                maxVmCount = 10
                scaleFormula = lowPriorityScaleFormula
            }
        }
    }
}

The expected behavior was for the rnaseq test to run correctly on Seqera, using Azure Batch for job scheduling and compute resource management, but it can't access the VMs I am specifying.

My Azure Batch quota is the following: 256 for the EADSv5 VM series.

Program output

The error is this:

ERROR ~ Error executing process > 'NFCORE_RNASEQ:RNASEQ:PREPARE_GENOME:UNTAR_SALMON_INDEX (salmon.tar.gz)'

Caused by:
  Unable to find size for VM name 'Standard_E4ads_v5' and location 'germanywestcentral'

The error in .nextflow.log is this:

Jun-19 10:01:08.286 [FileTransfer-9] DEBUG nextflow.file.FilePorter - Copying foreign file https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357074_1.fastq.gz to work dir: az://firstcontainer/work_19_06/stage-9cbe4492-a38b-4ffc-963e-534fe37e66e5/21/aa88e373263b112da5b5b5205d6d4a/SRR6357074_1.fastq.gz
Jun-19 10:01:08.304 [Task submitter] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
  task: name=NFCORE_RNASEQ:RNASEQ:PREPARE_GENOME:UNTAR_SALMON_INDEX (salmon.tar.gz); work-dir=az://firstcontainer/work_19_06/7c/3aa463c2576a48a891e0ee4c1e5e1c
  error [java.lang.IllegalArgumentException]: Unable to find size for VM name 'Standard_E4ads_v5' and location 'germanywestcentral'
Jun-19 10:01:08.313 [FileTransfer-8] DEBUG nextflow.file.FilePorter - Copying foreign file https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357076_2.fastq.gz to work dir: az://firstcontainer/work_19_06/stage-9cbe4492-a38b-4ffc-963e-534fe37e66e5/44/698042ca0fd803daa2d7363806d8b9/SRR6357076_2.fastq.gz
Jun-19 10:01:08.313 [Task submitter] ERROR nextflow.processor.TaskProcessor - Error executing process > 'NFCORE_RNASEQ:RNASEQ:PREPARE_GENOME:UNTAR_SALMON_INDEX (salmon.tar.gz)'

Caused by:
  Unable to find size for VM name 'Standard_E4ads_v5' and location 'germanywestcentral'

java.lang.IllegalArgumentException: Unable to find size for VM name 'Standard_E4ads_v5' and location 'germanywestcentral'
        at nextflow.cloud.azure.batch.AzBatchService.memoizedMethodPriv$getVmTypeStringString(AzBatchService.groovy:237)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:107)
        at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:323)
        at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1254)
        at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1030)
        at org.codehaus.groovy.runtime.InvokerHelper.invokePogoMethod(InvokerHelper.java:1036)
        at org.codehaus.groovy.runtime.InvokerHelper.invokeMethod(InvokerHelper.java:1019)
        at org.codehaus.groovy.runtime.InvokerHelper.invokeMethodSafe(InvokerHelper.java:97)
        at nextflow.cloud.azure.batch.AzBatchService$_closure5.doCall(AzBatchService.groovy)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:107)
        at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:323)
        at org.codehaus.groovy.runtime.metaclass.ClosureMetaClass.invokeMethod(ClosureMetaClass.java:274)
        at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1030)
        at groovy.lang.Closure.call(Closure.java:427)
        at org.codehaus.groovy.runtime.memoize.Memoize$MemoizeFunction.lambda$call$0(Memoize.java:137)
        at org.codehaus.groovy.runtime.memoize.ConcurrentCommonCache.getAndPut(ConcurrentCommonCache.java:137)
        at org.codehaus.groovy.runtime.memoize.ConcurrentCommonCache.getAndPut(ConcurrentCommonCache.java:113)
        at org.codehaus.groovy.runtime.memoize.Memoize$MemoizeFunction.call(Memoize.java:136)
        at nextflow.cloud.azure.batch.AzBatchService.getVmType(AzBatchService.groovy)
        at nextflow.cloud.azure.batch.AzBatchService.specFromPoolConfig(AzBatchService.groovy:542)
        at nextflow.cloud.azure.batch.AzBatchService.specForTask(AzBatchService.groovy:608)
        at nextflow.cloud.azure.batch.AzBatchService.getOrCreatePool(AzBatchService.groovy:615)
        at nextflow.cloud.azure.batch.AzBatchService.submitTask(AzBatchService.groovy:320)
        at nextflow.cloud.azure.batch.AzBatchTaskHandler.submit(AzBatchTaskHandler.groovy:91)
        at nextflow.processor.TaskPollingMonitor.submit(TaskPollingMonitor.groovy:196)
        at nextflow.processor.TaskPollingMonitor.submitPendingTasks(TaskPollingMonitor.groovy:565)
        at nextflow.processor.TaskPollingMonitor.submitLoop(TaskPollingMonitor.groovy:390)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:107)
        at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:323)
        at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1254)
        at groovy.lang.MetaClassImpl.invokeMethodClosure(MetaClassImpl.java:1042)
        at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1128)
        at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1030)
        at groovy.lang.Closure.call(Closure.java:427)
        at groovy.lang.Closure.call(Closure.java:406)
        at groovy.lang.Closure.run(Closure.java:498)
        at java.base/java.lang.Thread.run(Thread.java:829)

Environment

What could be the issue here? Thanks in advance!

adamrtalbot commented 4 months ago

Hi @landroutsosAIE, it doesn't seem to appear in this list of VMs by region. I've updated the list here:

https://github.com/nextflow-io/nextflow/pull/5100

related to https://github.com/nextflow-io/nextflow/issues/2994

landroutsosAIE commented 4 months ago

Hello @adamrtalbot. Thank you for your help. I will wait for the pull request to be accepted. I have another problem with the same pipeline. I changed the Azure Batch config and now it works until the Salmon quant step. It stops with exit status 1, and the real error shows up in the command.log file:
Unable to download path: https://<ourblobstorage>/test_run_07_01/7a/4ac8cca20ea799d9c65917be044a73/salmon

So it can't download the salmon folder from the previous step.

We are also running this pipeline in Seqera, and of the four identical tasks, one succeeded, two failed with exit status 1, and one failed with exit status 137 (which I suppose is a RAM problem). We are using a maximum memory of 256 GB.
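For context on the exit status 137 mentioned above: by shell convention, an exit status above 128 usually means the process was terminated by a signal, with the signal number equal to the status minus 128. So 137 corresponds to signal 9 (SIGKILL), which is what the Linux out-of-memory killer sends, although other things can send SIGKILL too. The following is a minimal Python sketch of that arithmetic; the function name `describe_exit_status` is hypothetical and not part of Nextflow or Azure Batch.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Explain a shell-style exit status, decoding signal-based codes above 128."""
    if status == 0:
        return "success"
    if status > 128:
        # Convention: exit status = 128 + signal number.
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"failed with exit status {status}"


if __name__ == "__main__":
    for status in (0, 1, 137):
        print(status, "->", describe_exit_status(status))
```

A SIGKILL result is only a hint that memory may be the cause; the node's logs would be needed to confirm it.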

adamrtalbot commented 4 months ago

That's unusual. Does the blob directory include the expected file? Does resume work? I presume the task that exited with error code 137 was running on a machine with 256 GB of storage?

When using Seqera Platform, you shouldn't need to specify any of this configuration. I would try removing anything that configures the storage and batch accounts.

landroutsosAIE commented 4 months ago

Yes, the folder exists in the blob directory. I think the problem is with my Azure Batch config for Nextflow: it didn't use the high-memory process VM series that I assigned. I am now running the pipeline with only the high-memory process VM (with 256 GB RAM) and I will get back to you. I am using Seqera (the -with-tower parameter) only for better monitoring of my pipeline.

pditommaso commented 4 months ago

Can this be considered solved by https://github.com/nextflow-io/nextflow/pull/5100?

adamrtalbot commented 4 months ago

Currently getting error 😱 :

ERROR ~ Error executing process > 'sayHello (3)'

Caused by:
  Cannot find a VM for task 'sayHello (3)' matching these requirements: type=Standard_E4-2ads_v5, cpus=1, mem=-, location=useast

adamrtalbot commented 4 months ago

After adding some logging, I can see it's failing to find the Azure VMs in a region:

Jul-03 11:17:09.642 [Task submitter] DEBUG n.c.azure.batch.AzBatchTaskHandler - [AZURE BATCH] Submitting task sayHello (4) - work-dir=az://scidev-useast/aa/a13d6b287c3b1e178396256bce01be
Jul-03 11:17:10.120 [Task submitter] DEBUG n.cloud.azure.batch.AzBatchService - [AZURE BATCH] guessing best VM given location=useast; cpus=1; mem=null; family=Standard_E4-2ads_v5
Jul-03 11:17:10.120 [Task submitter] DEBUG n.cloud.azure.batch.AzBatchService - [AZURE BATCH] Finding best VM given location=useast; cpus=1; mem=null; family=Standard_E4-2ads_v5
Jul-03 11:17:10.121 [Task submitter] WARN  n.cloud.azure.batch.AzBatchService - [AZURE BATCH] Unable to find Azure VM names for location: useast
Jul-03 11:17:10.121 [Task submitter] DEBUG n.cloud.azure.batch.AzBatchService - [AZURE BATCH] Found 0 VM types in location useast
Jul-03 11:17:10.121 [Task submitter] DEBUG n.cloud.azure.batch.AzBatchService - [AZURE BATCH] Listing VM families
Jul-03 11:17:10.121 [Task submitter] DEBUG n.cloud.azure.batch.AzBatchService - [AZURE BATCH] Found 0 VM types matching the criteria

adamrtalbot commented 4 months ago

Idiot

useast vs eastus. Going to add another check for that 🤦

Done: https://github.com/nextflow-io/nextflow/pull/5108
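
As an illustration of the kind of check described above (this is not the code from the linked pull request, and not Nextflow's implementation), a location name can be validated against a list of known regions before any VM lookup, with close matches suggested for typos such as `useast`. The sketch below is in Python; `KNOWN_LOCATIONS` is a small illustrative subset of Azure region names, and `check_location` is a hypothetical helper.

```python
import difflib

# Illustrative subset of Azure region names; a real check would use the full list.
KNOWN_LOCATIONS = [
    "eastus",
    "eastus2",
    "westus",
    "westeurope",
    "northeurope",
    "germanywestcentral",
]


def check_location(location: str, known=KNOWN_LOCATIONS) -> str:
    """Return the location if it is known, otherwise raise with suggestions."""
    if location in known:
        return location
    # Fuzzy-match the unknown name against the known regions.
    suggestions = difflib.get_close_matches(location, known, n=3)
    hint = f" Did you mean: {', '.join(suggestions)}?" if suggestions else ""
    raise ValueError(f"Unknown Azure location '{location}'.{hint}")


if __name__ == "__main__":
    print(check_location("germanywestcentral"))
    try:
        check_location("useast")
    except ValueError as err:
        print(err)
```

Failing early with a message about the location separates a mistyped region from a VM size that is genuinely missing in a valid region, which the "Found 0 VM types" log lines above do not make obvious.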