nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io
Apache License 2.0

Resume not working on Azure Batch #3605

Open alliemclean opened 1 year ago

alliemclean commented 1 year ago

Bug report

The resume option works locally but not with Azure Batch. I've gone through all the troubleshooting steps and can't find the reason the cache is invalidated.

Expected behavior and actual behavior

resume should use the cache and not re-process cached jobs

Steps to reproduce the problem

link to input

nextflow run pipeline_test.nf \
 --name "resume_test_azure" \
 --in az://pipeline/input/GCA_000001215.4_current_ids.vcf.gz \
 --lines 10000 \
 --outdir az://pipeline/resume_results/ \
 -w az://pipeline/resume_working/ \
 -c azure.config \
 -dump-hashes \
 -resume

Program output

Uploaded logs for local and batch runs, run command lines, and config files: test_resume.tar.gz

Environment

Additional context

Resume works with local execution but not with Azure. The cache hashes are all the same as far as I can tell.

Resume is also working on Batch (same account, same settings) with a different pipeline.
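Since the command line above already uses -dump-hashes, one way to pin down which input is invalidating the cache is to diff the per-task hash entries between a fresh run and the resumed run. A minimal sketch (log file names are placeholders; the grep pattern assumes the "cache hash:" lines that -dump-hashes writes to the log):

# run twice, keeping separate log files (the run options are elided here)
nextflow -log fresh.log run pipeline_test.nf ... -dump-hashes
nextflow -log resumed.log run pipeline_test.nf ... -dump-hashes -resume

# extract the per-task hash entries and compare them
grep 'cache hash:' fresh.log > fresh.hashes
grep 'cache hash:' resumed.log > resumed.hashes
diff fresh.hashes resumed.hashes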

abhi18av commented 1 year ago

Thanks for letting us know about this issue, @alliemclean. Allow me some time to circle back once I've investigated this further.

abhi18av commented 1 year ago

Hey @alliemclean, circling back on this one. After a lot of attempts, unfortunately I'm not seeing this occur in my environment.

[screenshot of the run output]

That being said, I did tweak the configuration file to remove the file share, which shouldn't affect anything in the pipeline since it isn't referenced in the analysis.

alliemclean commented 1 year ago

Hi @abhi18av, I tried removing the file share and it still didn't resume. I'm having the same problem with two out of three pipelines I'm running. The other one resumes one of the jobs in the pipeline but not the rest. The difference between the re-run jobs and the correctly resumed job is that the resumed one has only one output file, whereas the others have multiple. Could this be the issue? Is there a way to check whether the cache has a problem with the process's output?

Could there be anything within our Azure environment that is causing this difference in behavior?

abhi18av commented 1 year ago

Mmm, that might be the case, but it could also depend on the design of those two pipelines.

The difference between the re-run jobs and the correctly resumed job is that the resumed one has only one output file, whereas the others have multiple.

Hmm, I don't see why this should break the caching behavior 🤔

Is the pipeline you are running a public NF one? Perhaps an nf-core pipeline that I could use to test things as well?

CC @vsmalladi for any input / prior experience here.

alliemclean commented 1 year ago

It's not an nf-core one, but I suppose I could see if I can reproduce it with one, although the example I've given is complete and sufficient to trigger the issue on my end. I can try to reproduce it with this module, snpsift_split, when I have time.

vsmalladi commented 1 year ago

@alliemclean, can you quote the az:// parts?

nextflow run pipeline_test.nf \
 --name "resume_test_azure" \
 --in "az://pipeline/input/GCA_000001215.4_current_ids.vcf.gz" \
 --lines 10000 \
 --outdir "az://pipeline/resume_results/" \
 -w "az://pipeline/resume_working/" \
 -c azure.config \
 -dump-hashes \
 -resume

mawasak commented 1 year ago

@alliemclean Are you running different pipelines (with and without -resume) from the same folder? Nextflow gets some values from the .nextflow/history file, so if that's the case, try running the Azure pipeline from a different folder.

Also, try running the pipeline with the -trace nextflow option, e.g. nextflow -trace nextflow run hello.nf -resume .... Then check .nextflow.log for the line "Cacheable folder=" (is it a path or null?).
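For example (a sketch; hello.nf stands in for the actual pipeline script):

nextflow -trace nextflow run hello.nf -resume
# then pull the cache trace line out of the log
grep 'Cacheable folder' .nextflow.log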

alliemclean commented 1 year ago

This is the second run and it didn't use the cache. I tried it in its own folder. Here are some parts of the log I thought might be relevant:

[two screenshots of log excerpts]

mawasak commented 1 year ago

I think the problem may be connected with the task's exit code. Only 0 is a success code, and the exit code is one of the components that determines whether a task can be treated as cacheable.

The log in the first screenshot is OK.
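One way to double-check a task's exit status is the .exitcode file that Nextflow writes into each task work directory (a sketch; the hash path is illustrative, and for an az:// work dir the file has to be fetched from blob storage first):

# prints the recorded exit status; anything other than 0 makes the task non-cacheable
cat work/6f/d30552c750ac633db6590e57493f5f/.exitcode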

alliemclean commented 1 year ago

It was successful, but I ran it many times, so this one could have been after an unsuccessful run. Here's another try after a successful run, but Cacheable folder=null. Is there a reason it may not pick it up? The actual run folder doesn't have the az:// prefix:

Apr-21 17:33:45.559 [Actor Thread 6] TRACE nextflow.processor.TaskProcessor - Poison pill arrived; port: 0
Apr-21 17:33:45.562 [Actor Thread 5] TRACE nextflow.processor.TaskProcessor - Process state changed to: StateObj[submitted: 1; completed: 0; poisoned: true ] -- finished: false
Apr-21 17:33:45.736 [Actor Thread 3] TRACE nextflow.processor.TaskProcessor - [split (1)] Store dir not set -- return false
Apr-21 17:33:47.253 [Actor Thread 3] TRACE nextflow.processor.TaskProcessor - [split (1)] Cacheable folder=null -- exists=false; try=1; shouldTryCache=false; entry=null
Apr-21 17:33:47.725 [Actor Thread 3] TRACE nextflow.processor.TaskProcessor - [split (1)] actual run folder: /pipeline/pipeline_name/6f/d30552c750ac633db6590e57493f5f

FabianDK commented 1 year ago

Hi everyone,

I have the same issue: a very simple script run with Azure Batch is not being cached.

In my case, I believe my Nextflow run identifies the previously successful workflow run, but for some reason does not find the output file (.txt), although it is in the workDir (on the Azure storage).

Jun-28 17:51:41.619 [Actor Thread 5] TRACE nextflow.processor.TaskProcessor - [MAKE_TEST_FILE_HOST] Missed cache > Missing output file(s) `*.txt` expected by process `MAKE_TEST_FILE_HOST` -- folder: /nextflow-scratch/work/2d/10f3f31f8dd2a93a04895f2d6dfcd7
Jun-28 17:51:41.620 [Actor Thread 5] TRACE nextflow.file.FileSystemPathFactory - File system path factories: [nextflow.cloud.azure.file.AzPathFactory@436bd4df]

One explanation could be that Nextflow is looking at a local workDir folder instead of the one in the cloud (az://nextflow-scratch/work/2d/10f3f31f8dd2a93a04895f2d6dfcd7).

adamrtalbot commented 1 year ago

I may have found the problem here. Could the people seeing this confirm which type of storage account they are using? Are they using Data Lake Gen2 storage?

If so, could you have a look inside the cache dir on the Azure storage and tell me whether there are any contents?

FabianDK commented 1 year ago

We are using ADLS Gen 2. The cache folder (.nextflow/cache) is only generated locally and does not appear on the Azure storage. Could that be the problem?

The workDir and its contents are present on the Azure storage (config setting: workDir = "az://blob_container/folder/nf_work/") and so are the publishDir/copy output files (config setting: params.outdir = "az://blob_container/folder/data_tmp/nf_output"). We are not using the fileShares option.
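For reference, the storage settings described above amount to roughly this config fragment (a sketch; container and folder names are taken from this comment, credentials omitted):

workDir = "az://blob_container/folder/nf_work/"                    // task work directory on blob storage
params.outdir = "az://blob_container/folder/data_tmp/nf_output"    // publishDir target
// note: no azure.storage.fileShares option is set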

After running a more complex RNAseq workflow, it is interesting to note that only the first two processes are cached. It's unclear why exactly those are recognized as "already ran" while all the others are not.

pditommaso commented 1 year ago

I'm not 100% sure that Data lake storage is fully compatible with Blob storage

https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction#built-on-azure-blob-storage

adamrtalbot commented 1 year ago

We are using ADLS Gen 2. The cache folder (.nextflow/cache) is only generated locally and does not appear on the Azure storage. Could that be the problem?

Yes, I think so. I will generate a reproducible example shortly.

As Paolo says, Azure Data Lake is not 100% compatible with Blob storage despite being built on top of it, and this breaks Nextflow's caching system. Unfortunately, Microsoft glosses over this in their documentation.

vsmalladi commented 1 year ago

@adamrtalbot, I think it might have to do with the hierarchical namespace. We could set up an engineering meeting and see if we can solve this.

adamrtalbot commented 1 year ago

I'm fairly sure it's because of the empty .azure_blob_dir files in the Nextflow cache. Syncing in either direction breaks it on Data Lake. I haven't had a chance to make an example today.
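One way to spot these marker blobs from the command line (a sketch; assumes an authenticated Azure CLI, and the account and container names are placeholders):

az storage blob list \
    --account-name mystorageaccount \
    --container-name output \
    --prefix '.nextflow/cache/' \
    --query "[?ends_with(name, '.azure_blob_dir')].name" \
    --output tsv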

adamrtalbot commented 1 year ago

If you run a Nextflow task on Azure using Data Lake Gen2 storage, it will work fine, but Nextflow will be unable to back up the cache:

nextflow -log .nextflow.cache.log -config-ignore-includes plugin nf-tower:cache-backup >> >(tee -a $NXF_OUT_FILE) 2>&1 
Saving cache: .nextflow/cache/7dde5148-558d-4e73-8bda-3cc86c176099 => az://output/scratch/13OxksF29KdZ90/.nextflow/cache/7dde5148-558d-4e73-8bda-3cc86c176099
WARN: Failed to backup resume metadata to remote store path: az://output/scratch/13OxksF29KdZ90/.nextflow/cache/7dde5148-558d-4e73-8bda-3cc86c176099 — cause: java.nio.file.FileAlreadyExistsException: az://output/scratch/13OxksF29KdZ90/.nextflow/cache/7dde5148-558d-4e73-8bda-3cc86c176099/b08f9190-09b6-40b4-8858-424b85f13070/.azure_blob_dir

If I clear out the remote directory and delete all the .azure_blob_dir files, it works fine:

nextflow -log .nextflow.cache.log -config-ignore-includes plugin nf-tower:cache-backup >> >(tee -a $NXF_OUT_FILE) 2>&1 
Saving cache: .nextflow/cache/7dde5148-558d-4e73-8bda-3cc86c176099 => az://output/scratch/13OxksF29KdZ90/.nextflow/cache/7dde5148-558d-4e73-8bda-3cc86c176099

However, Nextflow has added lots of .azure_blob_dir files back to the folder. If I run again, I get another error, even though the .azure_blob_dir files do not exist at the source:

nextflow -log .nextflow.cache.log -config-ignore-includes plugin nf-tower:cache-backup >> >(tee -a $NXF_OUT_FILE) 2>&1 
Saving cache: .nextflow/cache/7dde5148-558d-4e73-8bda-3cc86c176099 => az://output/scratch/13OxksF29KdZ90/.nextflow/cache/7dde5148-558d-4e73-8bda-3cc86c176099
WARN: Failed to backup resume metadata to remote store path: az://output/scratch/13OxksF29KdZ90/.nextflow/cache/7dde5148-558d-4e73-8bda-3cc86c176099 — cause: com.azure.storage.blob.models.BlobStorageException: Status code 400, "<?xml version="1.0" encoding="utf-8"?><Error><Code>InvalidUri</Code><Message>The requested URI does not represent any resource on the server.
RequestId:e9d8006a-d01e-0004-0a35-c9f35b000000
Time:2023-08-07T13:42:33.3380342Z</Message></Error>"

I can't say for sure this is the problem, but it seems likely. For now, use a normal Azure Blob storage account, and only use Data Lake if you don't mind losing the resume function.

bentsherman commented 1 year ago

Adam, can you try a run with the cloud cache? This saves the cache directly in object storage, with no need for the cache backup/restore commands. To use it, simply set the NXF_CLOUDCACHE_PATH environment variable:

export NXF_CLOUDCACHE_PATH="az://my-bucket/cache"

adamrtalbot commented 1 year ago

OK, here's a first run of the Nextflow hello pipeline with the cloud cache, then a second run with -resume: nextflow.azuredatalake.2.log nextflow.azuredatalake.1.log

bentsherman commented 1 year ago

Sorry, I only just realized that the cloud cache isn't being auto-enabled. I just made a PR to fix it, but for now you also have to enable the plugin explicitly:

plugins {
    id 'nf-cloudcache'
}

You should see the following line in your log:

DEBUG nextflow.cache.CacheFactory - Using Nextflow cache factory: nextflow.cache.CloudCacheFactory

adamrtalbot commented 1 year ago

Take 2!

nextflow.azuredatalake.2.log nextflow.azuredatalake.1.log

bentsherman commented 1 year ago

Looks good

Aug-07 18:52:05.304 [Actor Thread 8] INFO  nextflow.processor.TaskProcessor - [df/934fdb] Cached process > sayHello (3)
Aug-07 18:52:05.566 [Actor Thread 10] INFO  nextflow.processor.TaskProcessor - [39/2edd7e] Cached process > sayHello (4)
Aug-07 18:52:05.574 [Actor Thread 6] INFO  nextflow.processor.TaskProcessor - [94/3c76ee] Cached process > sayHello (2)
Aug-07 18:52:05.656 [Actor Thread 4] INFO  nextflow.processor.TaskProcessor - [61/695cf7] Cached process > sayHello (1)

bentsherman commented 1 year ago

Paolo is preparing a launcher image so that we can use the cloud cache in Tower... it will likely become the default for all cloud providers, so it should resolve this issue. Nextflow-only users can also use the cloud cache with the settings above.

FabianDK commented 1 year ago

I have just tried the solution you proposed and it works. 1) I added the plugins to the config:

plugins {
        id 'nf-azure'
        id 'nf-cloudcache'
}

2) I ran the workflow with version 23.07.0-edge (with resume = true in the config):

NXF_VER=23.07.0-edge NXF_CLOUDCACHE_PATH="az://path_on_adls2" nextflow -log local_log_folder/nf_log run local/workflow.nf -c local/nextflow.config

The caching then works. Thank you for the solution!

Another comment I have regards the 'nf-cloudcache' plugin. We usually develop workflows locally on premises, with one config profile for local execution and another for Azure Batch execution.

When the 'nf-cloudcache' plugin is specified in the config and we do a local execution without specifying a cloud cache dir, we get: ERROR ~ NXF_CLOUDCACHE_PATH must be defined when using the path-based cache store.

A solution would be to make plugins conditional on a chosen config profile, but this is currently not possible.

bentsherman commented 1 year ago

Hi @FabianDK, we just made a change to do exactly this: the plugin will be enabled automatically if NXF_CLOUDCACHE_PATH is set. That will be available in the next release.
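Once that release is available, the invocation above should reduce to something like this (a sketch, untested; paths are placeholders):

# no plugins block needed: setting the variable enables the cloud cache plugin automatically
NXF_CLOUDCACHE_PATH="az://blob_container/folder/cache" nextflow run local/workflow.nf -c local/nextflow.config -resume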

adamrtalbot commented 5 months ago

@alliemclean is still reporting this error on Slack. I've been unable to reproduce it with the following example (test data from the original post).

main.nf:

#!/usr/bin/env nextflow
/**********************************************************************
 * Parameter defaults
 **********************************************************************/
params.lines=100000
params.in = "az://container/GCA_000001215.4_current_ids.vcf.gz"
params.name = "test_name"

/**********************************************************************
 * Splits a VCF file in multiple files.
 *   One file per params.lines variants
 **********************************************************************/
process split {

    machineType "Standard_E*d_v5" // Not important, just for our system

    cpus 1
    memory '1GB'

    container 'lethalfang/tabix:1.10'
    cache 'deep'

    input:
    path(f)

    output:
    file("${params.name}_*.vcf")

    shell:
    """
    echo "Decompressing ${f}"
    bgzip -cd ${f} > temp.vcf
    sed -n '/^#/p' temp.vcf > header.vcf
    echo "Splitting temp.vcf"
    split -l ${params.lines} temp.vcf vcf_
    echo "Done splitting"

    counter=0
    for file in vcf_*
    do
        sed -i '/^#/d' \$file
        cat header.vcf \$file > ${params.name}_\$counter.vcf
        let counter=counter+1
    done
    """
}
/*********************************************************************
 *  Main workflow
 *********************************************************************/
workflow {

    // General genomics inputs
    def vcf = Channel.fromPath(params.in, checkIfExists: true)

    main:

        // Split the VCF by number of lines (ie., 50000)
        split(vcf)
}

nextflow.config

process.executor = 'azurebatch'
azure {
    storage {
        accountName = 'mystorageaccount'
    }
    batch {
        location = 'eastus'
        accountName = 'mybatchaccount'
        copyToolInstallMode = 'node'
        autoPoolMode = true
        allowPoolCreation = true
        deletePoolsOnCompletion = false
        auto {
            autoScale = false
            vmCount = 1
        }
    }
    activeDirectory {
        servicePrincipalId     = "$AZURE_DIRECTORY_TENANT_ID"
        servicePrincipalSecret = "$AZURE_SERVICE_PRINCIPAL_SECRET"
        tenantId               = "$AZURE_APPLICATION_TENANT_ID" 
    }
}
workDir = 'az://work/'

landroutsosAIE commented 3 months ago

Hello guys! We fixed our initial problem of not being able to resume by using NXF_CLOUDCACHE_PATH, but now we can't resume the pipeline because the pools are deleted upon completion. We are getting the error: Pool not in active state. I understand that is normal behaviour, since the pool where the process ran has been deleted. We don't want to have a lot of pools active at the same time, since the Azure costs add up, but we would love to be able to resume any pipelines that didn't finish. Do you have any ideas how to fix this resume error?

adamrtalbot commented 3 months ago

This will re-use the existing pools while not costing you any additional money:
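The configuration snippet itself is not included in this excerpt. A plausible sketch, built from options already shown earlier in the thread plus the documented maxVmCount pool option, is to keep the pool alive but let it autoscale down when idle (a Batch pool scaled to zero nodes incurs no compute cost):

azure {
    batch {
        autoPoolMode = true
        allowPoolCreation = true
        deletePoolsOnCompletion = false   // keep the pool so a -resume run can submit to it
        auto {
            autoScale = true              // scale nodes down automatically when the pool is idle
            vmCount = 1
            maxVmCount = 4                // assumption: tune to your workload
        }
    }
}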