alliemclean opened this issue 1 year ago
Thanks for letting us know about this issue @alliemclean , allow me some time to circle back once I've investigated this further.
Hey @alliemclean , circling back on this one. After a lot of attempts, unfortunately I'm not seeing this occur in my environment.
That being said, I did tweak the configuration file to remove the fileshare, which shouldn't affect anything in the pipeline since it isn't referenced in the analysis.
Hi @abhi18av I tried removing the fileshare and it still didn't resume. I'm having the same problem with 2 out of 3 pipelines I'm running. The other one resumes one of the jobs in the pipeline but not the rest. The difference between the re-run jobs and the correctly resumed job is that the one that resumed only has 1 output file whereas the others have multiple. Could this be the issue? Is there a way to check whether the cache has a problem with the process's output?
Could there be anything within our azure environment that is causing the difference in this behavior?
Mmm, that might be the case - but it could also depend on the design of those two pipelines.
The difference between the re-run jobs and the correctly resumed job is that the one that resumed only has 1 output file whereas the others have multiple.
Hmm, I don't see why this should break the caching behavior 🤔
Is it a public NF pipeline which you are running? Perhaps an nf-core one which I could use to test things as well?
CC @vsmalladi for any inputs/prior-experience here?
It's not an nf-core one, but I suppose I could see if I can reproduce it with one, although the example I've given is complete and sufficient to cause the issue on my end. I can try to reproduce with this module, snpsift_split, when I have time.
@alliemclean can you quote the az parts
nextflow run pipeline_test.nf \
    --name "resume_test_azure" \
    --in "az://pipeline/input/GCA_000001215.4_current_ids.vcf.gz" \
    --lines 10000 \
    --outdir "az://pipeline/resume_results/" \
    -w "az://pipeline/resume_working/" \
    -c azure.config \
    -dump-hashes \
    -resume
@alliemclean Are you running different pipelines (with and without "-resume") from the same folder? Nextflow gets some values from the .nextflow/history file, so if that's the case, try running the Azure pipeline from a different folder.
Try running the pipeline with the parameter "-trace nextflow", e.g. nextflow -trace nextflow run hello.nf -resume .... Then check .nextflow.log for the line "Cacheable folder=" (is it a path or null)?
This is the second run and it didn't use the cache. I tried it in its own folder. Here are some parts of the log I thought might be relevant:
I think the problem may be connected with the exitCode for the task. Only "0" is a success code, and the exit code is one of the components that allows the task to be treated as cacheable.
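The exit-code gating described above can be illustrated with a minimal sketch (plain Python, not Nextflow's actual hashing code; the function name and fields are hypothetical): a task whose exit status is non-zero never yields a reusable cache entry, so -resume re-runs it.

```python
import hashlib

def task_cache_key(script, input_paths, exit_code):
    """Conceptual sketch: a task result is only reusable if it succeeded.

    Illustrative only -- not Nextflow's real implementation.
    """
    if exit_code != 0:
        # A failed task is never treated as cacheable, so -resume re-runs it.
        return None
    h = hashlib.md5()
    h.update(script.encode())
    for p in sorted(input_paths):
        h.update(p.encode())
    return h.hexdigest()
```

In this sketch, the same script and inputs with exit code 0 always produce the same key, while any non-zero exit code disqualifies the task from caching entirely.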
The log in the first screenshot is ok.
It was successful, but I ran it many times, so this one could have been after an unsuccessful run. Here's another try after a successful run, but Cacheable folder=null. Is there a reason it might not pick it up? The actual run folder doesn't have the az:// prefix:
Apr-21 17:33:45.559 [Actor Thread 6] TRACE nextflow.processor.TaskProcessor -
Hi everyone,
I have the same issue. I'm running a very simple script with Azure Batch that is not being cached.
In my case, I believe that my nextflow run identifies the previously successfully run workflow, but for some reason does not find the output file (.txt) although it is in the workdir (on the azure storage).
Jun-28 17:51:41.619 [Actor Thread 5] TRACE nextflow.processor.TaskProcessor - [MAKE_TEST_FILE_HOST] Missed cache > Missing output file(s) `*.txt` expected by process `MAKE_TEST_FILE_HOST` -- folder: /nextflow-scratch/work/2d/10f3f31f8dd2a93a04895f2d6dfcd7
Jun-28 17:51:41.620 [Actor Thread 5] TRACE nextflow.file.FileSystemPathFactory - File system path factories: [nextflow.cloud.azure.file.AzPathFactory@436bd4df]
I assume an explanation could be that nextflow is trying to look at a local workDir folder instead of the one in the cloud (az://nextflow-scratch/work/2d/10f3f31f8dd2a93a04895f2d6dfcd7).
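This hypothesis can be sketched in a few lines (illustrative Python, not Nextflow's actual path resolver): if the az:// scheme is lost from the work directory path, the lookup falls back to the local filesystem, and the expected output file is never found.

```python
from urllib.parse import urlparse

def storage_backend(path):
    """Illustrative sketch of the hypothesis above (not Nextflow code):
    a path whose scheme has been stripped resolves against the local
    filesystem, so an output that only exists in Azure Blob storage
    registers as a missed cache."""
    scheme = urlparse(path).scheme
    return "azure-blob" if scheme == "az" else "local"
```

Note how the workdir logged in the "Missed cache" line above has no scheme, which under this sketch would route the lookup to the local disk.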
I may have found the problem here, could the people seeing this confirm which type of storage account they are using? Are they using Data Lake gen2 storage?
If so, could people have a look inside the cache dir on Azure storage and tell me if there are any contents?
We are using ADLS Gen 2. The cache folder (.nextflow/cache) is only generated locally and does not appear on the Azure storage. Could that be the problem?
The workDir and its contents are present on the Azure storage (config setting: workDir = "az://blob_container/folder/nf_work/"), and so are the publishDir/copy output files (config setting: params.outdir = "az://blob_container/folder/data_tmp/nf_output"). We are not using the fileShares option.
After running a more complex RNAseq workflow, it is interesting to note that only the first 2 processes are cached. It's unclear why exactly those are recognized as "already ran" while all the others are not.
I'm not 100% sure that Data Lake storage is fully compatible with Blob storage.
We are using ADLS Gen 2. The cache folder (.nextflow/cache) is only generated locally and does not appear on the Azure storage. Could that be the problem?
Yes, I think so. I will generate a reproducible example shortly.
As Paolo says, Azure Data Lake is not 100% compatible with Blob storage despite being built on top of it, and this breaks Nextflow's caching system. Unfortunately, Microsoft glosses over this in their documentation.
@adamrtalbot , I think it might have to do with the hierarchical namespace. We could set up an engineering meeting and see if we can solve this.
I'm fairly sure it's because of the .azure_blob_dir empty files in the Nextflow cache. Syncing either way breaks it on Data Lake. I haven't had a chance to make an example today.
If you run a Nextflow task on Azure using Data Lake gen 2 storage, it will work fine, but Nextflow will be unable to back up the cache:
nextflow -log .nextflow.cache.log -config-ignore-includes plugin nf-tower:cache-backup >> >(tee -a $NXF_OUT_FILE) 2>&1
Saving cache: .nextflow/cache/7dde5148-558d-4e73-8bda-3cc86c176099 => az://output/scratch/13OxksF29KdZ90/.nextflow/cache/7dde5148-558d-4e73-8bda-3cc86c176099
WARN: Failed to backup resume metadata to remote store path: az://output/scratch/13OxksF29KdZ90/.nextflow/cache/7dde5148-558d-4e73-8bda-3cc86c176099 — cause: java.nio.file.FileAlreadyExistsException: az://output/scratch/13OxksF29KdZ90/.nextflow/cache/7dde5148-558d-4e73-8bda-3cc86c176099/b08f9190-09b6-40b4-8858-424b85f13070/.azure_blob_dir
If I clear out the remote directory and delete all .azure_blob_dir files, it works fine:
nextflow -log .nextflow.cache.log -config-ignore-includes plugin nf-tower:cache-backup >> >(tee -a $NXF_OUT_FILE) 2>&1
Saving cache: .nextflow/cache/7dde5148-558d-4e73-8bda-3cc86c176099 => az://output/scratch/13OxksF29KdZ90/.nextflow/cache/7dde5148-558d-4e73-8bda-3cc86c176099
However, Nextflow has added lots of .azure_blob_dir files to the folder. If I run again, I get another error even though the .azure_blob_dir files do not exist at source:
nextflow -log .nextflow.cache.log -config-ignore-includes plugin nf-tower:cache-backup >> >(tee -a $NXF_OUT_FILE) 2>&1
Saving cache: .nextflow/cache/7dde5148-558d-4e73-8bda-3cc86c176099 => az://output/scratch/13OxksF29KdZ90/.nextflow/cache/7dde5148-558d-4e73-8bda-3cc86c176099
WARN: Failed to backup resume metadata to remote store path: az://output/scratch/13OxksF29KdZ90/.nextflow/cache/7dde5148-558d-4e73-8bda-3cc86c176099 — cause: com.azure.storage.blob.models.BlobStorageException: Status code 400, "<?xml version="1.0" encoding="utf-8"?><Error><Code>InvalidUri</Code><Message>The requested URI does not represent any resource on the server.
RequestId:e9d8006a-d01e-0004-0a35-c9f35b000000
Time:2023-08-07T13:42:33.3380342Z</Message></Error>"
I can't say for sure this is the problem, but it seems likely. For now, use a normal Azure Blob storage account, and only use a Data Lake storage account if you don't mind losing the resume function.
Adam, can you try a run with the cloud cache? This will save the cache directly in object storage, with no need to use the cache backup/restore commands. To use it, simply set the NXF_CLOUDCACHE_PATH environment variable:
export NXF_CLOUDCACHE_PATH="az://my-bucket/cache"
OK, first run of Nextflow hello with the cloud cache, then run again with -resume:
nextflow.azuredatalake.2.log
nextflow.azuredatalake.1.log
Sorry, I only just realized that the cloud cache isn't being auto-enabled. I just made a PR to fix it, but for now you also have to enable the plugin explicitly:
plugins {
id 'nf-cloudcache'
}
You should see the following line in your log:
DEBUG nextflow.cache.CacheFactory - Using Nextflow cache factory: nextflow.cache.CloudCacheFactory
Looks good
Aug-07 18:52:05.304 [Actor Thread 8] INFO nextflow.processor.TaskProcessor - [df/934fdb] Cached process > sayHello (3)
Aug-07 18:52:05.566 [Actor Thread 10] INFO nextflow.processor.TaskProcessor - [39/2edd7e] Cached process > sayHello (4)
Aug-07 18:52:05.574 [Actor Thread 6] INFO nextflow.processor.TaskProcessor - [94/3c76ee] Cached process > sayHello (2)
Aug-07 18:52:05.656 [Actor Thread 4] INFO nextflow.processor.TaskProcessor - [61/695cf7] Cached process > sayHello (1)
Paolo is preparing a launcher image so that we can use the cloud cache in Tower... likely it will become the default for all cloud providers, so should resolve this issue. Nextflow-only users can also use the cloud cache with the above settings.
I have just tried the solution you proposed and it works: 1) I added the plugins in the config:
plugins {
id 'nf-azure'
id 'nf-cloudcache'
}
2) I ran the workflow with version 23.07.0 (with resume = true in the config):
NXF_VER=23.07.0-edge NXF_CLOUDCACHE_PATH="az://path_on_adls2" nextflow -log local_log_folder/nf_log run local/workflow.nf -c local/nextflow.config
The caching then works. Thank you for the solution!
Another comment that I have is regarding the 'nf-cloudcache' plugin. We are usually developing workflows locally on premise with a config profile option for local execution, and another profile for azure batch execution.
When the 'nf-cloudcache' plugin is specified in the config and we do a local execution without specifying a cloud cache dir, we get: ERROR ~ NXF_CLOUDCACHE_PATH must be defined when using the path-based cache store.
A solution would be to make plugins conditional on a chosen config profile, but this is currently not possible.
Hi @FabianDK , we just made a change to do exactly this: the plugin will be enabled automatically if NXF_CLOUDCACHE_PATH is set. That will be available in the next release.
@alliemclean is still reporting this error on Slack. I've been unable to reproduce it with the following example (test data from original post).
main.nf:
#!/usr/bin/env nextflow
/**********************************************************************
* Parameter defaults
**********************************************************************/
params.lines = 100000
params.in = "az://container/GCA_000001215.4_current_ids.vcf.gz"
params.name = "test_name"
/**********************************************************************
* Splits a VCF file in multiple files.
* One file per params.lines variants
**********************************************************************/
process split {
machineType "Standard_E*d_v5" // Not important, just for our system
cpus 1
memory '1GB'
container 'lethalfang/tabix:1.10'
cache 'deep'
input:
path(f)
output:
file("${params.name}_*.vcf")
shell:
"""
echo "Decompressing ${f}"
bgzip -cd ${f} > temp.vcf
sed -n '/^#/p' temp.vcf > header.vcf
echo "Splitting temp.vcf"
split -l ${params.lines} temp.vcf vcf_
echo "Done splitting"
counter=0
for file in vcf_*
do
sed -i '/^#/d' \$file
cat header.vcf \$file > ${params.name}_\$counter.vcf
let counter=counter+1
done
"""
}
/*********************************************************************
* Main workflow
*********************************************************************/
workflow {
// General genomics inputs
def vcf = Channel.fromPath(params.in, checkIfExists: true)
main:
// Split the VCF by number of lines (ie., 50000)
split(vcf)
}
nextflow.config
process.executor = 'azurebatch'
azure {
storage {
accountName = 'mystorageaccount'
}
batch {
location = 'eastus'
accountName = 'mybatchaccount'
copyToolInstallMode = 'node'
autoPoolMode = true
allowPoolCreation = true
deletePoolsOnCompletion = false
auto {
autoScale = false
vmCount = 1
}
}
activeDirectory {
servicePrincipalId = "$AZURE_DIRECTORY_TENANT_ID"
servicePrincipalSecret = "$AZURE_SERVICE_PRINCIPAL_SECRET"
tenantId = "$AZURE_APPLICATION_TENANT_ID"
}
}
workDir = 'az://work/'
Hello guys! We fixed our initial problem of not being able to resume using the NXF_CLOUDCACHE_PATH, but now we can't resume the pipeline because the pools are deleted upon completion.
We are having the error: Pool
This will re-use the existing pools while not costing you any additional money: set azure.batch.deletePoolsOnCompletion = false so that Nextflow does not delete the pools and can reuse them after each run.
Bug report
The -resume option works locally but not with Azure Batch. I've gone through all the troubleshooting steps and can't find the reason the cache is invalidated.
Expected behavior and actual behavior
-resume should use the cache and not re-process cached jobs.
Steps to reproduce the problem
link to input
Program output
Uploaded logs for local and batch runs, run command lines, and config files: test_resume.tar.gz
Environment
Additional context
-resume works with local execution but not with Azure. The cache hashes are all the same as far as I can tell.
-resume is also working on Azure Batch (same account, same settings) with a different pipeline.