vsmalladi closed this issue 2 years ago
Nextflow uses azcopy to pull data into the container, but I have no idea why it's not working for open datasets.
Yeah, I will test with the newest edge release once it's out.
There are no changes in this regard relating to this problem. Wondering if there's some specific azcopy option to access public data.
Yeah, I can look at the code further.
Guys, I'm looking into this one and here are some observations.
Using azcopy (after azcopy login), I was able to download the file (with the SAS token). It was saved as Homo_sapiens_assembly38.fasta, as opposed to the Homo_sapiens_assembly38.fasta?sv=2020-04-08&si=prod&sr=c&sig=DQxmjB4D1lAfOW9AxIWbXwZx6ksbwjlNkixw597JnvQ%3D shown in the initial comment - not sure why this might be happening 🤔 Will keep you posted here in case I find the root cause.
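For context, the broken process name in the initial comment comes from staging the blob with its SAS query string still attached to the filename. A minimal sketch of deriving a clean staging name from such a URL (my own illustration, not Nextflow's actual staging code) could look like:

```python
from urllib.parse import urlsplit, unquote
import posixpath

def stage_name(url: str) -> str:
    """Derive a local staging filename from a blob URL by dropping the SAS query string."""
    path = urlsplit(url).path  # drops '?sv=...&sig=...'
    return unquote(posixpath.basename(path))

url = ("https://datasetpublicbroadref.blob.core.windows.net/dataset/hg38/v0/"
       "Homo_sapiens_assembly38.fasta?sv=2020-04-08&si=prod&sr=c&sig=abc%3D")
print(stage_name(url))  # Homo_sapiens_assembly38.fasta
```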
That could explain it; we may need to add an azcopy login in the script initialization.
Shouldn't need to log in, since the SAS token is passed if it's a URL. Should be able to do a wget like any URL that's public, right?
Yup, azcopy (without azcopy login) doesn't need any authentication and downloads the file as expected:
(base)~/projects/_scratch$ azcopy copy 'https://datasetpublicbroadref.blob.core.windows.net/dataset/hg38/v0/Axiom_Exome_Plus.genotypes.all_populations.poly.hg38.vcf.gz?sv=2020-04-08&si=prod&sr=c&sig=DQxmjB4D1lAfOW9AxIWbXwZx6ksbwjlNkixw597JnvQ%3D' ./
INFO: Scanning...
INFO: Any empty folders will not be processed, because source and/or destination doesn't have full folder support
Job ae1a4f3f-6ebf-0142-5af8-950765e5c2eb has started
Log file is located at: /home/abhinav/.azcopy/ae1a4f3f-6ebf-0142-5af8-950765e5c2eb.log
0.0 %, 0 Done, 0 Failed, 1 Pending, 0 Skipped, 1 Total, 2-sec Throughput (Mb/s): 12.2113
Job ae1a4f3f-6ebf-0142-5af8-950765e5c2eb summary
Elapsed Time (Minutes): 0.0667
Number of File Transfers: 1
Number of Folder Property Transfers: 0
Total Number of Transfers: 1
Number of Transfers Completed: 1
Number of Transfers Failed: 0
Number of Transfers Skipped: 0
TotalBytesTransferred: 3053999
Final Job Status: Completed
(base) ~/projects/_scratch$ ls
Axiom_Exome_Plus.genotypes.all_populations.poly.hg38.vcf.gz
@vsmalladi, possible for you to share the .command.run and .command.sh from the relevant work directory az://work/f8/2092c2c8afb6c25d0b4391ab5680f3?
@abhi18av Will need to rerun as I deleted that work directory.
I wonder if this is part of a bigger discussion of how to download data from multiple blob storage accounts with multiple SAS tokens.
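One way to frame that discussion: a per-account token map, with public open-data accounts simply left unsigned. This is purely a hypothetical sketch of the idea (account names and tokens are made up), not anything Nextflow provides today:

```python
from urllib.parse import urlsplit

# Hypothetical per-account SAS registry; names and tokens are illustrative only.
SAS_TOKENS = {
    "myprivateaccount": "sv=2021-06-08&sr=c&sig=abc123",
    # Public open-data accounts (e.g. datasetpublicbroadref) get no entry.
}

def sign_url(url: str) -> str:
    """Append the matching SAS token, unless the URL is already signed or the account is public."""
    parts = urlsplit(url)
    token = SAS_TOKENS.get(parts.hostname.split(".")[0])
    if token is None or parts.query:
        return url
    return f"{url}?{token}"

print(sign_url("https://myprivateaccount.blob.core.windows.net/data/ref.fasta"))
# https://myprivateaccount.blob.core.windows.net/data/ref.fasta?sv=2021-06-08&sr=c&sig=abc123
```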
nextflow.log @abhi18av the newest version is trying to stage the file but can't. Uploaded the log. No .command.run or .command.sh in the stage/work directory.
Thanks @vsmalladi for sharing these, but the errors here seem to be different from the ones mentioned in the first comment https://github.com/nextflow-io/nextflow/issues/2595#issue-1115515572
Posting here some crucial data points:
Feb-03 11:04:06.085 [main] DEBUG nextflow.cli.Launcher - $> nextflow run nf-core/sarek -c /ARQUIVOS/data/azure --igenomes_ignore --genomes_base 'az://genomas-raros/sarek' --genome custom --input 'az://genomas-raros/sarek_azure.tsv' --outdir 'az://genomas-raros/results' --tools HaplotypeCaller -w 'az://genomas-raros/work' -profile docker
Nextflow version: 21.10.6
Feb-03 11:04:06.149 [main] INFO nextflow.cli.CmdRun - N E X T F L O W ~ version 21.10.6
nf-azure plugin version: 0.11.2
Feb-03 11:04:09.702 [main] INFO org.pf4j.AbstractPluginManager - Start plugin 'nf-azure@0.11.2'
Feb-03 11:04:07.413 [main] DEBUG nextflow.scm.AssetManager - Git config: /root/.nextflow/assets/nf-core/sarek/.git/config; branch: master; remote: origin; url: https://github.com/nf-core/sarek.git
The .command.err and .command.out files could not be retrieved:
Feb-03 15:50:20.371 [main] DEBUG nextflow.processor.TaskRun - Unable to dump error of process 'GenotypeGVCFs (1543_18-chr1_228608365-248946422)' -- Cause: java.nio.file.NoSuchFileException: az://genomas-raros/work/98/efb791f051a3ce620e3ab59960621f/.command.err
Feb-03 15:50:20.392 [main] DEBUG nextflow.processor.TaskRun - Unable to dump output of process 'GenotypeGVCFs (1543_18-chr1_228608365-248946422)' -- Cause: java.nio.file.NoSuchFileException: az://genomas-raros/work/98/efb791f051a3ce620e3ab59960621f/.command.out
The workflow.onComplete failure is probably the mail trigger mechanism: https://github.com/nf-core/sarek/blob/68b9930a74962f3c42eee71f51e6dd2646269199/main.nf#L3879
Feb-03 15:50:20.522 [main] ERROR nextflow.script.WorkflowMetadata - Failed to invoke `workflow.onComplete` event handler
java.lang.NullPointerException: Cannot invoke method size() on null object
at org.codehaus.groovy.runtime.NullObject.invokeMethod(NullObject.java:91)
at org.codehaus.groovy.runtime.callsite.PogoMetaClassSite.call(PogoMetaClassSite.java:44)
at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:47)
at org.codehaus.groovy.runtime.callsite.NullCallSite.call(NullCallSite.java:34)
at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:47)
at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:125)
at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:130)
at Script_c3d27a41$_runScript_closure159.doCall(Script_c3d27a41:3955)
at Script_c3d27a41$_runScript_closure159.doCall(Script_c3d27a41)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
...
...
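The stack trace above boils down to size() being invoked on a null object inside the onComplete closure. Expressed in Python terms (a hypothetical stand-in, not sarek's actual Groovy code), the class of fix is a guard before taking the size:

```python
def on_complete(attachments):
    # 'attachments' is a hypothetical stand-in for whatever value is null
    # in the sarek onComplete handler; guard before calling len()/size().
    if attachments is None:
        return 0
    return len(attachments)

print(on_complete(None))        # 0
print(on_complete(["a", "b"]))  # 2
```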
@abhi18av sorry, uploaded the wrong log. Was debugging another person's log. Uploading the nextflow.log now.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
I recently saw a change here https://github.com/nextflow-io/nextflow/issues/2918 dealing with the query params for files sourced via an HTTP(S) location, which might address this functionality, unless I'm mistaken.
Worth testing again as soon as the latest edge is out.
Bug report
Expected behavior and actual behavior
Expected to be able to provide a path to Azure open datasets and download it using the HTTPS path.
However, it tries to resolve it using the SAS token and Azure blob storage account provided in the config.
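For reference, that SAS token and storage account come from the nf-azure configuration block, along the lines of the following sketch (placeholder values; check the Nextflow Azure documentation for the exact keys in your version):

```groovy
azure {
    storage {
        accountName = '<storage-account>'  // placeholder
        sasToken    = '<sas-token>'        // placeholder; applied to az:// paths
    }
}
```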
Steps to reproduce the problem
Use the nf-core/sarek repo with the following genomes.config:
nextflow run sarek/main.nf --igenomes_ignore --genomes_base 'az://' --tools HaplotypeCaller --genome 'custom'
Program output
Error executing process > 'BuildFastaFai (Homo_sapiens_assembly38.fasta?sv=2020-04-08&si=prod&sr=c&sig=DQxmjB4D1lAfOW9AxIWbXwZx6ksbwjlNkixw597JnvQ%3D)'
Caused by: Process
BuildFastaFai (Homo_sapiens_assembly38.fasta?sv=2020-04-08&si=prod&sr=c&sig=DQxmjB4D1lAfOW9AxIWbXwZx6ksbwjlNkixw597JnvQ%3D)
terminated with an error exit status (1)

Command executed:
samtools faidx Homo_sapiens_assembly38.fasta?sv=2020-04-08\&si=prod\&sr=c\&sig=DQxmjB4D1lAfOW9AxIWbXwZx6ksbwjlNkixw597JnvQ%3D
Command exit status: 1
Command output: (empty)
Command wrapper: Unable to download path: https://havocdata.blob.core.windows.net/work/stage/b4/7771dc4451be7737063833b9d7674c/Homo_sapiens_assembly38.fasta?sv=2020-04-08&si=prod&sr=c&sig=DQxmjB4D1lAfOW9AxIWbXwZx6ksbwjlNkixw597JnvQ=
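Note that in the staged URL above the signature ends in a literal '=' where the original SAS had '%3D'; if the sig value is decoded like this before being reused, the signature no longer matches what Azure expects. A small illustration of round-tripping the value back to its percent-encoded form (my own sketch, not Nextflow code):

```python
from urllib.parse import quote, unquote

sig = "DQxmjB4D1lAfOW9AxIWbXwZx6ksbwjlNkixw597JnvQ%3D"  # as issued in the SAS
decoded = unquote(sig)          # ends in a literal '=' (what the staged URL shows)
print(quote(decoded, safe=""))  # re-encodes back to ...JnvQ%3D
```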
Work dir: az://work/f8/2092c2c8afb6c25d0b4391ab5680f3
Tip: when you have fixed the problem you can continue the execution adding the option
-resume
to the run command line

Environment
Additional context
(Add any other context about the problem here)