ncbi / sra-tools

SRA Tools

fasterq-dump works for protected dbGaP data? #287

Closed rebexxxxx closed 3 years ago

rebexxxxx commented 4 years ago

I have been trying to download protected dbGaP data via an AWS instance and am experiencing extremely slow download speeds. I have tried a variety of instances (with different network speeds; the fastest reported is 25 MBPS), but it takes at least 40 minutes per SRR ID, which seems really slow to me. I'm using the command: fastq-dump --split-3 SRR*******. I have been told to try out the new fasterq-dump tool, but I read that it is not compatible with dbGaP data. Is this still true? I downloaded and installed version 2.10.2 and tried fasterq-dump, but I get errors, which leads me to believe I can't use it with dbGaP data. Thanks!

kwrodarmer commented 4 years ago

Depending upon the exact runs you are accessing, fasterq-dump-2.10.2 may segfault. This has been addressed with a new version released today. See our download page for 2.10.3.

If you continue to have issues, please write back and we'll help you get there.

wraetz commented 4 years ago

The low speed does not originate from the network speed: 40 minutes per SRR is absolutely normal. The SRR archive is a special compressed database format. If you ask for data in FASTQ format, the equivalent of a view is extracted from this database, a process that involves decompression and other transformations.

The newer tool 'fasterq-dump' tries to speed this process up by trading space for speed; in other words, you need a lot of temporary space while the transformation is performed. It also uses more threads than fastq-dump.

There has never been a difference between fastq-dump and fasterq-dump regarding dbGaP data. Both tools rely on the same library underneath, so both are capable of handling dbGaP data. If you see errors with either tool, they may originate from network problems. The best solution is to first download the SRA archive to your local storage with the 'prefetch' tool, then perform the format conversion with 'fasterq-dump'.
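A minimal sketch of that two-step workflow (the accession, scratch path, and thread count here are illustrative; --temp and --threads are fasterq-dump's scratch-directory and thread-count options):

# 1) download the compressed SRA archive to local storage
prefetch SRR9951664

# 2) convert to FASTQ locally; point --temp at a disk with plenty of free space
fasterq-dump --split-3 --threads 6 --temp /path/to/scratch SRR9951664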

rebexxxxx commented 4 years ago

> If you continue to have issues, please write back and we'll help you get there.

OK, thanks for your help thus far. I downloaded 2.10.3; when I run this command: fasterq-dump SRR9951664

I produce this error:

2020-02-18T19:02:51 fasterq-dump.2.10.2 err: query unauthorized while resolving query within virtual file system module - failed to resolve accession 'SRR9951664' - Access denied - please request permission to access phs001709 / GRU in dbGaP ( 403 )
Query SRR9951664: Error 403 Access denied - please request permission to access phs001709 / GRU in dbGaP
2020-02-18T19:02:51 fasterq-dump.2.10.2 err: invalid accession 'SRR9951664'

This is an SRR ID I know I can download; I am able to download it using fastq-dump, but fasterq-dump will not work.

kwrodarmer commented 4 years ago

Please pick up version 2.10.3 before going forward.

Next, can you show the command line you are using? With the 2.10.x series, our command line has changed in terms of how you specify access permissions.

rebexxxxx commented 4 years ago

> Please pick up version 2.10.3 before going forward.
>
> Next, can you show the command line you are using? With the 2.10.x series, our command line has changed in terms of how you specify access permissions.

So I have the .tar file downloaded for version 2.10.3; I unzipped it, then configured my project using the command: vdb-config --import prj_23539.ngc. But when I try to use fasterq-dump, I see that it's using version 2.10.2. I'm not sure what I'm doing incorrectly?

do you want a screenshot of the error?

kwrodarmer commented 4 years ago

Okay, so the first thing is to no longer import the ngc file. Instead you specify it on the command line and keep the ngc file around for every use.

rebexxxxx commented 4 years ago

> Please pick up version 2.10.3 before going forward.
>
> Next, can you show the command line you are using? With the 2.10.x series, our command line has changed in terms of how you specify access permissions.

[Screenshot: terminal output, 2020-02-18 2:38 PM]
rebexxxxx commented 4 years ago

ok, so then do I need to create a specific project file (similar to the project file that importing the ngc file would produce) then keep the ngc file in that folder and run all commands from there?

kwrodarmer commented 4 years ago

Yes, I realize that our documentation only mentions prefetch using the ngc file this way... we will have to correct that.

With 2.10.3, you no longer need to cd to the dbGaP directory, but it's alright if you do. Run the command as

$ fasterq-dump --ngc <path-to-ngc> SRR9951664 

and that should get you farther. @wraetz - any other comments?

rebexxxxx commented 4 years ago

> Yes, I realize that our documentation only mentions prefetch using the ngc file this way... we will have to correct that.
>
> With 2.10.3, you no longer need to cd to the dbGaP directory, but it's alright if you do. Run the command as
>
> $ fasterq-dump --ngc <path-to-ngc> SRR9951664
>
> and that should get you farther. @wraetz - any other comments?

Now I'm getting this error:

[Screenshot: terminal output, 2020-02-18 2:46 PM]
kwrodarmer commented 4 years ago

I'm very sorry - for the NGC file, the correct option was --ngc rather than --perm.

rebexxxxx commented 4 years ago

Now I get this error:

[Screenshot: terminal output, 2020-02-18 2:51 PM]
kwrodarmer commented 4 years ago

The reason for my confusion was that you are operating in the cloud. If you know how to obtain a JWT from dbGaP as a "passport" or "cart", these make use of the --perm option and would get you to the copy of data stored within AWS. When using the NGC, you are pulling data in from NCBI.

rebexxxxx commented 4 years ago

> The reason for my confusion was that you are operating in the cloud. If you know how to obtain a JWT from dbGaP as a "passport" or "cart", these make use of the --perm option and would get you to the copy of data stored within AWS. When using the NGC, you are pulling data in from NCBI.

Do I need to get a JWT from dbGaP? fasterq-dump is still not working for me.

kwrodarmer commented 4 years ago

We will assign an engineer to look at your issue. Please be patient! Thanks.

skripche commented 4 years ago

Hello,

To get the benefit of downloading data into an AWS instance, you need to configure the toolkit to report the instance identity. To do that, run vdb-config -i, select "AWS", and accept reporting the instance identity.

To test that everything is working, please go to the Run Selector (do not log in to your account for now): https://www.ncbi.nlm.nih.gov/Traces/study/?

If you do not log in, you will see a test dbGaP project that you can use for testing. Select the project and you will see all of its runs. I suggest selecting the single run "SRR1219879", as it is small and easy to test with. Once you select that one run, the "JWT" button will become active. Click the button and it will download the JWT to your computer. The easiest way to transfer the JWT to your instance is to open it in Notepad/Wordpad, copy the contents, then open a text editor on your instance and paste the contents (I called my file cart.jwt).

At this point you can use this command: ./fasterq-dump --perm cart.jwt --split-3 SRR1219879

The JWT has a 1-hour expiration for security, so if you do not start downloading within 1 hour, it will expire and you will need to regenerate it. However, if you start downloading with the JWT and the entire data set you're downloading takes longer than 1 hour, the download will not be interrupted. You should also be aware that you can only download the accessions you selected when creating the cart.
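Putting those steps together, a hedged end-to-end sketch (the accession and file name mirror the example above; this assumes an AWS instance in a supported region):

# one-time, interactive: select "AWS" and accept reporting the instance identity
vdb-config -i

# paste the JWT downloaded from the Run Selector into cart.jwt, then:
fasterq-dump --perm cart.jwt --split-3 SRR1219879

# the JWT expires after 1 hour, so regenerate it if the download hasn't started by then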

darwinchangz commented 4 years ago

Hello. Unrelated user with a similar issue. I currently work on a company computational cluster, so not a cloud instance. I have a similar issue where the command fails when using the project .ngc file. Here are both the ngc and perm options used:


fasterq-dump --ngc ~/wget/prj_22175_D26509.ngc -O ~/Zheng_SCC SRR2194571
Failed to call external services.

fasterq-dump --perm ~/Zheng_SCC/cart.jwt -O ~/Zheng_SCC/ SRR2194571 
Currently, --perm can only be used from inside a cloud computing environment.
Please run inside of a supported cloud computing environment, or get an ngc file from dbGaP and reissue the command with --ngc <ngc file> instead of --perm <perm file>.
kwrodarmer commented 4 years ago

The correct invocation is to use the --ngc option with an ngc token, not a jwt; the latter is only usable right now from within the cloud. Please run with that option and show us the response.

darwinchangz commented 4 years ago

sorry @kwrodarmer, the first line of text was not caught in the code block. I used our dbGaP ngc file for the --ngc option.

kwrodarmer commented 4 years ago

Can you show the response? In fact, can you just run it and capture both the command and its output?

darwinchangz commented 4 years ago
[Screenshot: terminal output, 2020-02-19 1:52 PM]
kwrodarmer commented 4 years ago

Please write to us at sra-tools@ncbi.nlm.nih.gov .

darwinchangz commented 4 years ago

sent an email

cemalley commented 4 years ago

Hi all, I am also accessing dbGaP data on AWS, and here is what is working for me. You guys at SRA/NCBI are slow replying to email or don't reply at all. Can you make updating and unifying the documentation pages a priority?

  1. First of all, there is no cart download available in the SRA Run Selector when I search for my project - it's greyed out - so I have to use a list of SRR accession IDs.

  2. Set up vdb-config to import the ngc file, and set the download directory to a single location (not in root) where there is a lot of storage. The download directory should be set to the same place everywhere possible in the configuration. In the Tools tab, "download to public repository" has to be checked; also, "accept charges" on AWS has to be checked. (vdb-config -i)

  3. Here's an example of getting one SRA file. While sitting in the directory indicated in vdb-config, I used prefetch: prefetch -c SRR###

  4. Stay in the vdb-configured directory and convert SRA to FASTQ: fastq-dump -I --split-files sra/SRR###.sra. It doesn't work if I just give the SRR ID; it claims I don't have access. It wants the relative path, I guess.

  5. I parallelized this into chunks of 15 at a time; I noticed there may be network issues if I try to do more. I put the following R output into a commands text file.

# R: print one detached-screen command per accession; save the output as commands.txt
for (i in accessions){
  cat(paste0('screen -dm bash -c \'prefetch -c ', i, ' ; fastq-dump -I --split-files --skip-technical sra/',i,'.sra\' ', '\n'))
}
# AWS tends to timeout without screen running.
# shell: run the generated commands, 15 at a time
parallel -j 15 < commands.txt

Monitoring the transfer with htop. I'm using fastq-dump and prefetch version 2.10.0.
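For reference, a pure-shell sketch of the same batching idea. Hedged: SRR_Acc_List.txt (one accession per line) and the script name are assumptions, and letting parallel do the throttling replaces the screen-per-job trick above, so the whole script runs inside a single screen session instead:

# batch_dump.sh - run prefetch + dump per accession, at most 15 concurrent jobs
# launch as: screen -dm bash batch_dump.sh  (so an SSH drop doesn't kill it)
parallel -j 15 "prefetch -c {} && fastq-dump -I --split-files --skip-technical sra/{}.sra" :::: SRR_Acc_List.txt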

darwinchangz commented 4 years ago

You have to select all the files in the Run Selector to create a cart; selecting files un-greys the box. You need to use the ngc or cart options when running fastq-dump, at least as far as I know. They specifically say that release 2.10.2 is when they allowed AWS/GCP controlled-access downloads via cart (https://github.com/ncbi/sra-tools/blob/master/README.md).

Not saying that sratoolkit has poor documentation, but it's pretty clear why you have to go with this roundabout way of downloading samples.

skripche commented 4 years ago

@cemalley I apologize for the documentation being outdated; we are working on it right now. Please make sure to use the 2.10.2 or 2.10.3 version of the toolkit.

Please refer to the above post on how to use the JWT to get access to dbGaP data in the cloud: you will get faster transfer speeds, and you will be able to skip the decrypt step if you use the JWT instead of the NGC. Also, you will not need to throttle yourself, as the data is no longer moving from our servers at Bethesda. This makes it very easy to simply run "fasterq-dump --perm cart.jwt SRR####" and generate the fastq file quickly without using prefetch. If you wish to use an NGC file to download data, at this time it would be better to use the 2.9.6 version, as we have identified a bug in 2.10.3 and are actively working to resolve it.

If you are working in AWS, you need to create your instance in us-east-1; other regions incur egress charges, for which you need to provide payment information and accept the charges in the VDB config. If the payment information and the acceptance to pay are not both provided, then your data is transferred from Bethesda, is encrypted, and you will need your NGC file. If you go to the configuration step of the wiki, you will see that, as long as you are in the correct region, you only need to report your instance identity to stream data from the cloud buckets instead of from the Bethesda servers.

Please be aware that NGC files are no longer imported into the SRA Toolkit configuration and need to be specified on the command line every time. However, you no longer need to be in the workspace area that used to be created. This should make it easier to run the toolkit from any directory you wish; you simply need to modify the command line in your script.
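In other words (the paths and project file name here are illustrative):

# no vdb-config --import, no special workspace: pass the ngc file on each invocation
fasterq-dump --ngc /home/me/keys/prj_12345.ngc -O /data/fastq SRR#######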

skripche commented 4 years ago

Hello,

The 2.10.4 version of the toolkit has been released and it fixes the NGC file problems that were occurring with downloading SRA data from dbGaP.

Please update your toolkit and let us know if the problem has been resolved. Thank you.

darwinchangz commented 4 years ago
[Screenshot: terminal output, 2020-02-26 9:03 AM]

Seems like the issue is the same. Do I need to recompile ngs/ncbi-vdb again? The only issue when compiling was:

chmod: cannot access `libutf8proc.so.2.2.0': No such file or directory

I had to download libutf8proc.so.2.2.0 and move that specific file over to sra-tools/tools/driver-tool/utf8proc

kwrodarmer commented 4 years ago

Did you build sra-tools or download it? Seems like you built it.

If so, you should ALWAYS update ncbi-vdb first, because most of the functionality is in VDB.

darwinchangz commented 4 years ago

when compiling ncbi-vdb, I get:

/usr/bin/ld: /home/shared/cbc/local/lib/libz.a(crc32.o): relocation R_X86_64_32 against `.rodata' can not be used when making a shared object; recompile with -fPIC

This was something I didn't face in 2.10.3, so I'm not sure how to solve this

kwrodarmer commented 4 years ago

Is there any possibility you could use the pre-compiled binaries?

klymenko commented 4 years ago

@darwinchangz , 2.10.4 version of the toolkit is available on https://github.com/ncbi/sra-tools/wiki/01.-Downloading-SRA-Toolkit

Your fasterq-dump command does not fail on 2.10.4.

darwinchangz commented 4 years ago

Currently I am on a cluster that operates on glibc-2.12, and sratoolkit requires glibc-2.14. I'll see what I can do, but I didn't have the above error before.

darwinchangz commented 4 years ago

I was able to locally install glibc-2.14, and it works! Thanks

kwrodarmer commented 4 years ago

great!

darwinchangz commented 4 years ago

Upon further inspection, I realized that a temporary folder was being made, which I had taken to mean the command was working. When I run fasterq-dump, a temporary folder is created in my current directory, but no fastq files are being made. prefetch and fastq-dump both work, so I know it's not a problem with the setup. This issue arises even with accessions that don't require an ngc file, like SRR000001.

[changd3@n6 test]$ fastq-dump SRR000001
Read 470985 spots for SRR000001
Written 470985 spots for SRR000001
[changd3@n6 test]$ mkdir test_1 && cd test_1 && fasterq-dump SRR000001

fastq-dump does the job, but fasterq-dump doesn't work at all. It just pauses.

kwrodarmer commented 4 years ago

Have you read https://github.com/ncbi/sra-tools/wiki/HowTo:-fasterq-dump ?

darwinchangz commented 4 years ago

Using the -p option, I'm stuck at join :| 0.00%

I've read the options and am only using the most basic ones, so I'm not entirely sure why it's not moving forward.

kwrodarmer commented 4 years ago

Probably the most important part is to ensure you have enough scratch space for fasterq-dump. I'm not saying I can guess whether you do or don't: just that there are a number of conditions that can contribute to poor performance.

fasterq-dump can be up to 10x faster, but if the necessary conditions described in the "how-to" are not available, it may not be the right tool. I can only guess what is going on.
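A quick way to sanity-check those conditions before running (a hedged sketch: the scratch path is illustrative, and vdb-dump --info is used here only to report the accession's size):

# how big is the accession, and how much room do the output and scratch disks have?
vdb-dump SRR000001 --info
df -h . /path/to/scratch

# point the scratch directory at the roomy disk and watch progress
fasterq-dump -p --temp /path/to/scratch --split-3 SRR000001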

cemalley commented 4 years ago

Hi again. My instance is in us-east-1 (North Virginia) and I have downloaded from dbGaP before, but the machine lost connectivity partway through and I had to reboot it. The toolkit was not working anymore: prefetch did nothing. I updated to 2.10.4 and followed the wiki instructions carefully. I put my AWS IAM accessKey.csv in the location specified under credentials, and I made new directories for user-repository and process-local. I have 16 TB available on this volume. But it is still wrongly configured. When I use @skripche's example with a freshly generated cart file from the Run Selector, here is the error:

[root@ip-## dump2]# fasterq-dump --perm cart.jwt --split-3 SRR1219879
2020-02-28T13:18:52 fasterq-dump.2.10.4 err: query unauthorized while resolving query within virtual file system module - failed to resolve accession 'SRR1219879' - Not in cart file you provided ( 403 )
Query SRR1219879: Error 403 Not in cart file you provided
2020-02-28T13:18:52 fasterq-dump.2.10.4 err: invalid accession 'SRR1219879'

The minimal example from the instructions:

[root@ip-## dump2]# fasterq-dump SRR000001
2020-02-28T13:43:41 fasterq-dump.2.10.4 err: encryption key not found while identifying column within <INVALID-MODULE> module - error with https open 'https://sra-pub-run-7.s3.amazonaws.com/SRR000001/SRR000001.4'
2020-02-28T13:43:41 fasterq-dump.2.10.4 err: invalid accession 'SRR000001'

And this is the error when using my ngc file for my dbGaP project (X'ing out what looks like a hash key):

[root@ip-## dump2]# fasterq-dump --ngc prj_22947_D23467.ngc SRR3583886
2020-02-28T13:16:11 fasterq-dump.2.10.4 err: empty while validating file within network system module - error with https open 'https://trace.ncbi.nlm.nih.gov/Traces/sdlr/sdlr.cgi?jwt=eyJXX...'
2020-02-28T13:16:11 fasterq-dump.2.10.4 err: invalid accession 'SRR3583886'

So something is still wrong with the configuration. Here is the prefetch output too (keys shortened):

[root@ip-## dump2]# prefetch SRR1219879

2020-02-28T13:34:15 prefetch.2.10.4 int: empty while validating file within network system module - cannot open remote file: https://trace.ncbi.nlm.nih.gov/Traces/sdlr/sdlr.cgi?jwt=eyJ...
2020-02-28T13:34:16 prefetch.2.10.4: 1) Downloading 'SRR1219879'...
2020-02-28T13:34:16 prefetch.2.10.4:  Downloading via https...
2020-02-28T13:34:16 prefetch.2.10.4 int: self NULL while reading file within network system module - Cannot KStreamRead: https://trace.ncbi.nlm.nih.gov/Traces/sdlr/sdlr.cgi?jwt=eyJ...
2020-02-28T13:34:16 prefetch.2.10.4:  https download failed
2020-02-28T13:34:16 prefetch.2.10.4: 1) failed to download SRR1219879

For what it's worth I tried version 2.9.4 of the toolkit too. Thanks in advance for any suggestions.

yatongli commented 4 years ago

Hello, I am using version 2.10.5 and have similar issues. Below are the commands I tried, with their errors in parentheses:

sratoolkit.2.10.5-ubuntu64/bin/fasterq-dump SRR1219879
(Error 403 Access denied - please request permission to access phs000710 / UR in dbGaP; invalid accession 'SRR1219879')

sratoolkit.2.10.5-ubuntu64/bin/fasterq-dump --ngc ncbi/prj_24162.ngc SRR1219879
(fasterq-dump.2.10.5 err: invalid accession 'SRR1219879')

Thank you for your time and help!

cemalley commented 4 years ago

Hi @yatongli, the following format and steps worked for me. I saved them in a gist: https://gist.github.com/cemalley/5951e146899b4fc0edfc8a4d5474bc9b. It only worked for me with a cart file and prefetch, followed by a fasterq-dump command run from inside the sra folder created during sratoolkit setup. I'll paste the gist here. Hope this works for you.

# download cart file from NCBI Trace.
# copy paste contents into a cart.jwt named file on the AWS server.
# sratoolkit should be installed on the server.
# cd to the sra folder.

prefetch --perm cart.jwt SRR1 SRR2 SRR3

# prefetch can only handle about 100-120 listed SRR IDs at a time. the cart is only valid for 1 hour, so I set up batches of prefetch commands in screen sessions.

# once .sra files are downloaded, I run fasterq-dump. here is one example for one file.

cd sra/ #((MUST BE SITTING INSIDE THE sra FILES DIRECTORY))
fasterq-dump --ngc ../prj_00000.ngc --split-3 --skip-technical SRR1.sra

# I ran each of these in a screen via parallel.

#R:
#for (i in to_dl){
#  cat(paste0('screen -dm bash -c \'prefetch -c ', i, ' ; fastq-dump -I --split-files --skip-technical sra/',i,'.sra\' ', '\n'))
#}

## each one looks like:
screen -dm bash -c 'fasterq-dump --ngc ../prj_000000.ngc --split-3 --skip-technical SRR1.sra'

# The ngc file must be referenced relatively and I have to still sit in the sra folder. Each one takes at least 40 minutes to download.
yatongli commented 4 years ago

Hello @cemalley, thanks a lot for your message!

I tried to use "prefetch --perm cart.jwt SRR1" with version 2.10.5, but got the error message "Currently, --perm can only be used from inside a cloud computing environment. Please run inside of a supported cloud computing environment, or get an ngc file from dbGaP and reissue the command with --ngc <ngc file> instead of --perm <perm file>."

So I switched to using --ngc, which seemed to have worked, and I have the SRR1.sra file stored. Then I cd'ed into the directory containing the .sra files and used "fasterq-dump --ngc xxx.ngc SRR1.sra", but again got the error message "invalid accession".

Regardless, thanks a lot for your time and help!

kwrodarmer commented 4 years ago

When you are outside of AWS us-east-1 or GCP us, you must access dbGaP using the --ngc option, sending the NGC file you obtained from dbGaP.

ONLY IF you are within AWS us-east-1 or GCP us should you use the --perm option with a JWT passport obtained from dbGaP or a JWT cart obtained from the SRA run selector.
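Side by side (the file names are the illustrative ones used earlier in this thread):

# outside AWS us-east-1 / GCP us: use the NGC key obtained from dbGaP
fasterq-dump --ngc prj_XXXXX.ngc SRR#######

# inside AWS us-east-1 / GCP us: use a JWT passport or Run Selector cart
fasterq-dump --perm cart.jwt SRR#######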

yatongli commented 4 years ago

Thank you @kwrodarmer! Yes, that makes sense. However, when I use fasterq-dump --ngc xxx.ngc SRR (or SRR.sra), I still receive the error message "invalid accession". What could be the reason for this? Thank you for your time and help!

kwrodarmer commented 4 years ago

I invite you to send a more detailed message to sra-tools@ncbi.nlm.nih.gov for us to examine the exact usage and help you out. We recommend this in general with dbGaP usage to avoid leaking any privileged information.

yatongli commented 4 years ago

Thank you @kwrodarmer! I sent an email with detailed information ("Issues with sratools v2.10.5 fasterq-dump") about 12 hours ago to sra-tools@ncbi.nlm.nih.gov. I look forward to the response. Thank you for your time and help!

kwrodarmer commented 4 years ago

Thank you - I'll ask for it to be forwarded.

bounlu commented 4 years ago

I have the same issue as @yatongli. I was downloading files just fine until a few days ago:

tail -n+2 SRR_Acc_List.txt | parallel -j 20 "if [ ! -f {}_1.fastq.gz ]; then fasterq-dump {} --ngc prj_xxx.ngc && gzip {}*fastq && echo {} >> done; fi"

Now I get the below errors for the same dataset:

2020-05-13T03:58:00 fasterq-dump.2.10.5 err: empty while validating file within network system module - error with https open 'https://gap-download.be-md.ncbi.nlm.nih.gov/sragap/A80EC63F-3F2B-41AF-B4EC-F6235EE96F27/SRR1796877'
2020-05-13T03:58:00 fasterq-dump.2.10.5 err: invalid accession 'SRR1796877'
gzip: SRR1796877*fastq: No such file or directory

This happens for all the remaining files to be downloaded. It seems there is some technical issue: either the files were moved, the server changed, or a bug was introduced.

Could you please check and let me know how to fix this?

kwrodarmer commented 4 years ago

Whenever there are network errors: blockages, timeouts, etc., you will be happiest to first prefetch your data and then run fasterq-dump.

We have reported the network issues to our networking team.
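Applied to the failing run above (a hedged sketch; the ngc file name is the illustrative one from the failing command):

# fetch the encrypted archive first; prefetch can simply be re-run until it succeeds
prefetch --ngc prj_xxx.ngc SRR1796877

# then convert locally, where the network is no longer involved
fasterq-dump --ngc prj_xxx.ngc SRR1796877
gzip SRR1796877*.fastq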