nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io
Apache License 2.0

Error trying to pull Azure Open Data sets #2595

Closed vsmalladi closed 2 years ago

vsmalladi commented 2 years ago

Bug report

Expected behavior and actual behavior

Expected to be able to provide path to Azure open data sets and download using the https path.

However it tries to resolve using the sas token and azure blob storage account provided in the config

Steps to reproduce the problem

Use the nf-core/sarek repo and use the following genomes.config

'custom' {
  fasta                   = "https://datasetpublicbroadref.blob.core.windows.net/dataset/hg38/v0/Homo_sapiens_assembly38.fasta?sv=2020-04-08&si=prod&sr=c&sig=DQxmjB4D1lAfOW9AxIWbXwZx6ksbwjlNkixw597JnvQ%3D"
  snpeff_db               = 'GRCh38.86'
  species                 = 'homo_sapiens'
  vep_cache_version       = '99'
}

nextflow run sarek/main.nf --igenomes_ignore --genomes_base 'az://' --tools HaplotypeCaller --genome 'custom'

Program output

Error executing process > 'BuildFastaFai (Homo_sapiens_assembly38.fasta?sv=2020-04-08&si=prod&sr=c&sig=DQxmjB4D1lAfOW9AxIWbXwZx6ksbwjlNkixw597JnvQ%3D)'

Caused by: Process BuildFastaFai (Homo_sapiens_assembly38.fasta?sv=2020-04-08&si=prod&sr=c&sig=DQxmjB4D1lAfOW9AxIWbXwZx6ksbwjlNkixw597JnvQ%3D) terminated with an error exit status (1)

Command executed:

samtools faidx Homo_sapiens_assembly38.fasta?sv=2020-04-08\&si=prod\&sr=c\&sig=DQxmjB4D1lAfOW9AxIWbXwZx6ksbwjlNkixw597JnvQ%3D

Command exit status: 1

Command output: (empty)

Command wrapper: Unable to download path: https://havocdata.blob.core.windows.net/work/stage/b4/7771dc4451be7737063833b9d7674c/Homo_sapiens_assembly38.fasta?sv=2020-04-08&si=prod&sr=c&sig=DQxmjB4D1lAfOW9AxIWbXwZx6ksbwjlNkixw597JnvQ=

Work dir: az://work/f8/2092c2c8afb6c25d0b4391ab5680f3

Tip: when you have fixed the problem you can continue the execution adding the option -resume to the run command line

Environment

Additional context

(Add any other context about the problem here)

pditommaso commented 2 years ago

Nextflow uses azcopy to pull data into the container, but I have no idea why it is not working for open datasets.

https://github.com/nextflow-io/nextflow/blob/c3677c31126ecdcb095a51f1bfe278be1a842011/plugins/nf-azure/src/main/nextflow/cloud/azure/file/AzBashLib.groovy#L51-L65

vsmalladi commented 2 years ago

Yeah, I will test with the newest edge release once it's out.

pditommaso commented 2 years ago

There are no changes in this regard relating to this problem. I'm wondering if there's some specific azcopy option needed to access public data.

vsmalladi commented 2 years ago

Yeah, I can look at the code further.

abhi18av commented 2 years ago

Guys, I'm looking into this one and here are some observations

  1. Using an authenticated azcopy (after azcopy login), I was able to download the file (with the SAS token).
  2. Interestingly, the name of the downloaded file was simply Homo_sapiens_assembly38.fasta, as opposed to the Homo_sapiens_assembly38.fasta?sv=2020-04-08&si=prod&sr=c&sig=DQxmjB4D1lAfOW9AxIWbXwZx6ksbwjlNkixw597JnvQ%3D shown in the initial comment - not sure why this might be happening 🤔

Will keep you posted here in case I find the root cause.
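The renaming observed above is consistent with azcopy treating only the path component of the URL as the blob name, while the file name Nextflow stages under still carries the SAS query string. A minimal sketch of the difference (the URL below is illustrative, not the real dataset URL):

```shell
# Sketch: the staged name derived from the full URL keeps the '?sv=...' SAS
# suffix, while azcopy writes the file under the plain blob name. Stripping
# everything from the first '?' recovers the plain name.
url='https://host.blob.core.windows.net/dataset/Homo_sapiens_assembly38.fasta?sv=2020-04-08&si=prod&sig=XXXX'
staged_name=$(basename "$url")   # keeps the '?sv=...' query string
plain_name=${staged_name%%\?*}   # Homo_sapiens_assembly38.fasta
echo "$plain_name"
```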

pditommaso commented 2 years ago

That could explain it; we may need to add an azcopy login in the script initialization

https://github.com/nextflow-io/nextflow/blob/202b5c9c93a8972231c938d9a646d09e8a790424/plugins/nf-azure/src/main/nextflow/cloud/azure/file/AzBashLib.groovy#L33-L33

vsmalladi commented 2 years ago

We shouldn't need to log in, since the SAS token is passed as part of the URL. We should be able to do a wget like any other public URL, right?
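One wrinkle with any plain-URL fetch: when a SAS URL is written literally on a command line, each unquoted `&` is the shell's background operator and cuts the URL off at the first query parameter (the escaped `\&` in the failing samtools command above shows the query string surviving into the script). A small illustration with a hypothetical URL:

```shell
# Sketch: quoting keeps the full SAS query string intact as one argument,
# which is what a wget/curl-style fetch of a public blob would need.
sas_url='https://host/file.fasta?sv=2020-04-08&si=prod&sig=XXXX'
printf '%s\n' "$sas_url"
```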

abhi18av commented 2 years ago

Yup, azcopy (without azcopy login) doesn't need any authentication and downloads the file as expected:

(base)~/projects/_scratch$ azcopy copy 'https://datasetpublicbroadref.blob.core.windows.net/dataset/hg38/v0/Axiom_Exome_Plus.genotypes.all_populations.poly.hg38.vcf.gz?sv=2020-04-08&si=prod&sr=c&sig=DQxmjB4D1lAfOW9AxIWbXwZx6ksbwjlNkixw597JnvQ%3D' ./
INFO: Scanning...
INFO: Any empty folders will not be processed, because source and/or destination doesn't have full folder support

Job ae1a4f3f-6ebf-0142-5af8-950765e5c2eb has started
Log file is located at: /home/abhinav/.azcopy/ae1a4f3f-6ebf-0142-5af8-950765e5c2eb.log

0.0 %, 0 Done, 0 Failed, 1 Pending, 0 Skipped, 1 Total, 2-sec Throughput (Mb/s): 12.2113

Job ae1a4f3f-6ebf-0142-5af8-950765e5c2eb summary
Elapsed Time (Minutes): 0.0667
Number of File Transfers: 1
Number of Folder Property Transfers: 0
Total Number of Transfers: 1
Number of Transfers Completed: 1
Number of Transfers Failed: 0
Number of Transfers Skipped: 0
TotalBytesTransferred: 3053999
Final Job Status: Completed

(base) ~/projects/_scratch$ ls
Axiom_Exome_Plus.genotypes.all_populations.poly.hg38.vcf.gz  

@vsmalladi, would it be possible for you to share the .command.run and .command.sh from the relevant work directory az://work/f8/2092c2c8afb6c25d0b4391ab5680f3?

vsmalladi commented 2 years ago

@abhi18av Will need to rerun as I deleted that work directory.

I wonder if this is part of a bigger discussion of how to download data from multiple blob storage accounts with multiple SAS tokens.
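For context, the nf-azure plugin currently takes a single storage account and SAS token in the configuration (option names per the Nextflow Azure docs; values below are placeholders), which is why blobs in a second account can only be reached via plain HTTPS URLs:

```groovy
azure {
    storage {
        accountName = '<storage-account>'   // placeholder
        sasToken    = '<sas-token>'         // placeholder
    }
}
```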

vsmalladi commented 2 years ago

@abhi18av the newest version is trying to stage the file but can't. I've uploaded the nextflow.log. There is no .command.run or .command.sh in the stage/work directory.

abhi18av commented 2 years ago

Thanks @vsmalladi for sharing these, but the errors here seem to be different from the ones mentioned in the first comment https://github.com/nextflow-io/nextflow/issues/2595#issue-1115515572

Pasting here some crucial data points:

Feb-03 11:04:06.085 [main] DEBUG nextflow.cli.Launcher - $> nextflow run nf-core/sarek -c /ARQUIVOS/data/azure --igenomes_ignore --genomes_base 'az://genomas-raros/sarek' --genome custom --input 'az://genomas-raros/sarek_azure.tsv' --outdir 'az://genomas-raros/results' --tools HaplotypeCaller -w 'az://genomas-raros/work' -profile docker

Feb-03 11:04:06.149 [main] INFO nextflow.cli.CmdRun - N E X T F L O W ~ version 21.10.6

Feb-03 11:04:09.702 [main] INFO org.pf4j.AbstractPluginManager - Start plugin 'nf-azure@0.11.2'

Feb-03 11:04:07.413 [main] DEBUG nextflow.scm.AssetManager - Git config: /root/.nextflow/assets/nf-core/sarek/.git/config; branch: master; remote: origin; url: https://github.com/nf-core/sarek.git

Feb-03 15:50:20.371 [main] DEBUG nextflow.processor.TaskRun - Unable to dump error of process 'GenotypeGVCFs (1543_18-chr1_228608365-248946422)' -- Cause: java.nio.file.NoSuchFileException: az://genomas-raros/work/98/efb791f051a3ce620e3ab59960621f/.command.err
Feb-03 15:50:20.392 [main] DEBUG nextflow.processor.TaskRun - Unable to dump output of process 'GenotypeGVCFs (1543_18-chr1_228608365-248946422)' -- Cause: java.nio.file.NoSuchFileException: az://genomas-raros/work/98/efb791f051a3ce620e3ab59960621f/.command.out
Feb-03 15:50:20.522 [main] ERROR nextflow.script.WorkflowMetadata - Failed to invoke `workflow.onComplete` event handler
java.lang.NullPointerException: Cannot invoke method size() on null object
    at org.codehaus.groovy.runtime.NullObject.invokeMethod(NullObject.java:91)
    at org.codehaus.groovy.runtime.callsite.PogoMetaClassSite.call(PogoMetaClassSite.java:44)
    at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:47)
    at org.codehaus.groovy.runtime.callsite.NullCallSite.call(NullCallSite.java:34)
    at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:47)
    at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:125)
    at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:130)
    at Script_c3d27a41$_runScript_closure159.doCall(Script_c3d27a41:3955)
    at Script_c3d27a41$_runScript_closure159.doCall(Script_c3d27a41)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

...
...
vsmalladi commented 2 years ago

@abhi18av sorry, I uploaded the wrong log; I was debugging another person's log. Uploading a new nextflow.log.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

abhi18av commented 2 years ago

I recently saw a change here https://github.com/nextflow-io/nextflow/issues/2918 dealing with the query params for files sourced via an HTTP(S) location, which might address this functionality, unless I'm mistaken.

Worth testing again as soon as the latest edge is out.