ncbi / sra-tools

SRA Tools

ERROR: Current preference is set to retrieve SRA Normalized Format files with full base quality scores. #786

Closed zyh2016 closed 1 year ago

zyh2016 commented 1 year ago

I have a question about using prefetch 3.0.2 to download on a Linux server. I keep getting this message: "Current preference is set to retrieve SRA Normalized Format files with full base quality scores." How can I fix it? Should I update my sra-tools?

cmatKhan commented 1 year ago

I am intermittently having the same problem. More often than not, the prefetch command outputs the following without actually downloading anything:

prefetch-orig.3.0.2 --max-size 10000000000 --ngc ../my_cred.ngc mykart.krt 
Downloading kart file 'mykart.krt'
Checking sizes of kart files...

2023-03-04T22:41:10 prefetch-orig.3.0.2: Current preference is set to retrieve SRA Normalized Format files with full base quality scores.
2023-03-04T22:41:11 prefetch-orig.3.0.2: Current preference is set to retrieve SRA Normalized Format files with full base quality scores.
2023-03-04T22:41:13 prefetch-orig.3.0.2: Current preference is set to retrieve SRA Normalized Format files with full base quality scores.
2023-03-04T22:41:14 prefetch-orig.3.0.2: Current preference is set to retrieve SRA Normalized Format files with full base quality scores.
...

I see this posted in the currently open #768 with the comment 'use the current toolkit', which I believe is what I have and what the OP is using. This is also mentioned in the closed issues #744, #765, and #693.

stineaj commented 1 year ago

Does the command ever complete or exit? Are you ending the process manually? You can add -v up to three times to get a higher verbosity level to monitor if the process is active or stalled.
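For example, reusing the command from the comment above (same credential and kart file placeholders):

prefetch --max-size 10000000000 --ngc ../my_cred.ngc -vvv mykart.krt

The timestamps on the verbose messages should make it clear whether the process is still making progress.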

cmatKhan commented 1 year ago

The command does complete after ~3000 lines, which is about the number of files I am expecting to pull using this particular cart file.

When it completes, there is no data in the directory -- nothing has been downloaded.

When I add -vvv, I get this additional information:

2023-03-07T16:53:53 prefetch-orig.3.0.2: 'tools/ascp/disabled': not found in configuration
2023-03-07T16:53:53 prefetch-orig.3.0.2: Checking 'ascp'
2023-03-07T16:53:53 prefetch-orig.3.0.2: 'ascp': not found
2023-03-07T16:53:53 prefetch-orig.3.0.2: Checking 'ascp'
2023-03-07T16:53:53 prefetch-orig.3.0.2: 'ascp': not found
2023-03-07T16:53:53 prefetch-orig.3.0.2: Checking '/usr/bin/ascp'
2023-03-07T16:53:53 prefetch-orig.3.0.2: '/usr/bin/ascp': not found
2023-03-07T16:53:53 prefetch-orig.3.0.2: Checking '/usr/bin/ascp'
2023-03-07T16:53:53 prefetch-orig.3.0.2: '/usr/bin/ascp': not found
2023-03-07T16:53:53 prefetch-orig.3.0.2: Checking '/opt/aspera/bin/ascp'
2023-03-07T16:53:53 prefetch-orig.3.0.2: '/opt/aspera/bin/ascp': not found
2023-03-07T16:53:53 prefetch-orig.3.0.2: Checking '/opt/aspera/bin/ascp'
2023-03-07T16:53:53 prefetch-orig.3.0.2: '/opt/aspera/bin/ascp': not found
2023-03-07T16:53:53 prefetch-orig.3.0.2: Checking '/home/chasem/.aspera/connect/bin/ascp'
2023-03-07T16:53:53 prefetch-orig.3.0.2: '/home/chasem/.aspera/connect/bin/ascp': not found
2023-03-07T16:53:53 prefetch-orig.3.0.2: Checking '/home/chasem/.aspera/connect/bin/ascp'
2023-03-07T16:53:53 prefetch-orig.3.0.2: '/home/chasem/.aspera/connect/bin/ascp': not found
2023-03-07T16:53:53 prefetch-orig.3.0.2: KClientHttpOpen - connected from '65.254.100.81' to locate.ncbi.nlm.nih.gov (165.112.7.16) 
2023-03-07T16:53:53 prefetch-orig.3.0.2: Setting up SSL/TLS structure
2023-03-07T16:53:53 prefetch-orig.3.0.2: Performing SSL/TLS handshake... 
2023-03-07T16:53:53 prefetch-orig.3.0.2: KClientHttpOpen - verifying CA cert 
2023-03-07T16:53:53 prefetch-orig.3.0.2: Verifying peer X.509 certificate...
2023-03-07T16:53:53 prefetch-orig.3.0.2: Reading from server...
2023-03-07T16:53:53 prefetch-orig.3.0.2: Reading from server...

...

2023-03-07T16:53:59 prefetch-orig.3.0.2: Current preference is set to retrieve SRA Normalized Format files with full base quality scores.
2023-03-07T16:53:59 prefetch-orig.3.0.2: KClientHttpOpen - connected from '65.254.100.81' to locate.ncbi.nlm.nih.gov (165.112.7.16) 
2023-03-07T16:53:59 prefetch-orig.3.0.2: Setting up SSL/TLS structure
2023-03-07T16:53:59 prefetch-orig.3.0.2: Performing SSL/TLS handshake... 
2023-03-07T16:53:59 prefetch-orig.3.0.2: KClientHttpOpen - verifying CA cert 
2023-03-07T16:53:59 prefetch-orig.3.0.2: Verifying peer X.509 certificate...
2023-03-07T16:53:59 prefetch-orig.3.0.2: Reading from server...

...

I am following the instructions that I can find for downloading this software and then running vdb-config -i. The instructions say that no settings are required in the config interface. If there are instructions on what ascp and aspera are, and whether that is the problem, I'd appreciate being pointed in the right direction. As far as I know, since the last time I used this command (about 2 or 3 weeks ago), when it worked, nothing has changed on my system.

edit: in looking for the ascp line, I see it mentioned in this issue: #255

stineaj commented 1 year ago

ascp is the copy program from the Aspera Connect software. It is still used extensively for uploads to dbGaP and SRA, but recent changes to our distribution hardware make Aspera not much of a benefit for users downloading data. These messages come up first simply because the logic of the prefetch program is to try ascp first and, if it is not available, to use HTTPS instead.
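If the repeated ascp checks are a distraction in the logs, you can also force the HTTPS transport so prefetch skips the ascp probe entirely, e.g. (reusing the credential and kart file names from above):

prefetch -t http --ngc ../my_cred.ngc mykart.krt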

I assume you are getting a long list of "Reading from server..." messages? If so that should indicate the files are transferring.

Where are you expecting the downloaded files to be stored? Did you specify a user-repository in the cache menu of vdb-config?

cmatKhan commented 1 year ago

(screenshot attached)

When I used this previously, the files were downloaded into my $PWD, which is /scratch/mblab/chasem/dbgap.

I am running the command right now, and yes, there is a long list of "Reading from server..." messages.

However, nothing is downloading into my $PWD, or into the ncbi_cache directory (which is in my $PWD):

[chasem@login dbgap]$ tree -L 1 .
.
├── all_cart_prj33268_202303061451.krt
├── ncbi_cache
├── phs000007.v33.pht000905.v5.p14.ctpericard1_2005s.var_report.xml
├── prj_33268.ngc
├── sra_prefetch.sh
├── sratoolkit.3.0.2-ubuntu64
└── test_cart_prj33268_202303030808.krt

and the ncbi_cache:

[chasem@login dbgap]$ tree ncbi_cache/
ncbi_cache/
├── files
├── nannot
├── refseq
├── sra
└── wgs

5 directories, 0 files
klymenko commented 1 year ago

Running prefetch with -vv will print the destination path where each file is being downloaded.
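For example (credential and kart names are placeholders), keeping the verbose output in a file so the destination lines can be searched afterwards:

prefetch -vv --ngc prj_33268.ngc mycart.krt 2>&1 | tee prefetch.log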

cmatKhan commented 1 year ago

I attached above the result of running prefetch with -vvv. Presumably -vv is a subset of that.

However, after a less than enlightening back and forth with the help-desk, this is what seems to be occurring:

Since I can successfully download subsets of the cart file in question, the following seems to be true:

  1. There is a max limit on the size of the cart itself -- not just the individual file size.
    1. This max limit is not documented, as far as I can tell
  2. When the cart file exceeds this limit, prefetch fails silently

The recommendation from the help desk amounts to breaking the cart into smaller, "reasonably sized" chunks.

That 1TB is considered large is good to know -- that was not my assumption, given the type of data these servers handle. How many parts that 1-3TB cart needs to be broken into is apparently subject to guess and check.

If it is the case that there is a max limit on the cart size (not on individual files, whose size limit I am aware is adjustable with --max-size), then here are some feature requests:

  1. The website interface should not allow a user to create a cart file that exceeds the maximum cart size limit

    1. That limit should be written at the top of the file selector site, something like, "cart files may not exceed <some limit>".
    2. If a user selects so many files that the maximum size is exceeded, then when the user clicks "create cart", pop up a modal that says, "The total size of the files selected exceeds the maximum allowed cart size of <some limit>. Some files must be removed before a cart file may be created."
  2. As a backup to that, the prefetch command should not fail silently on this. It should check the cart size (possibly through iteration, though it seems like this is information that could simply be included as metadata in the cart file format, whatever that is) and then fail with an error message that says, "The cart file describes a set of files which together exceed the maximum single-submission download size of <some limit>. To address this, create multiple cart files, none of which describes a set of files that exceeds <some limit>."

klymenko commented 1 year ago

prefetch does not have a limit on the size of the cart.

stineaj commented 1 year ago

To expand on the last comment: there is no limit on how much prefetch can read from the cart file. But user experience has shown there may be a functional limit needed to prevent timeouts. Unfortunately, that behavior varies from user to user and between datasets. In the past we have focused primarily on the efficiency of the client-server interactions as the solution, but @cmatKhan raises good points about user-interface changes that could improve the download process.

cmatKhan commented 1 year ago

@stineaj, I appreciate that database management, particularly on a scale like this and with truly sensitive data, is hard. I am not trying to nitpick. But I am trying to get data, and this has so far been excruciatingly difficult.

Presumably a timeout implies that at least one download starts.

That doesn't describe what is actually happening.

What is occurring is that prefetch works on certain cart files -- for example, one that we created which describes files no greater than 1GB. But another cart file -- for instance, one that describes a large number of files which in total come to somewhere between 1 and 2 TB -- with the same settings (including --max-size set large enough to effectively turn that limit off) produces effusive logs with -vvv, none of which indicate ERROR or WARNING, and then seemingly exits successfully without error. However, nothing is downloaded.

I want to be clear: I have set --max-size to anywhere between 100 and 10000 GB just to eliminate that as a possible cause of failure. In one of the code chunks I sent back to the help-desk, I accidentally omitted that line, and was shamed for it. But I have not actually at any point omitted that setting in trying to grab this data, and unless I am greatly mistaken, any number that is greater than the largest file size, regardless of how "unreasonably large" that number is (e.g. 10000GB), does not matter. I assume this is an if-condition, and I am setting a threshold; that is all.

If there are "unreasonably large" individual files in your database that I shouldn't grab, that is certainly not something that I bear a responsibility of policing.

I am now having seemingly a different issue.

I am running a cart that describes genotype data with files in the range of 100GB to 150GB (part of the guess-and-check strategy to figure out what "reasonable sized chunks" means).

It has been running for close to 3 hours, with this printing to stdout:

(screenshot of the repeated stdout output attached)

Edit: I should say, I am running this on an HPC cluster running Ubuntu 20.04. The connection speeds are both stable and fast (154 megabytes/s) on our end.

At the very least, what would be helpful here is a bit better feedback from prefetch about what is happening so that I can start to diagnose it. If the cart is beyond the recommended limit, maybe a WARNING, from the sounds of it. If the server has timed out, prefetch should somehow fail (I realize that might take some effort to program).

As it stands, it is extremely hard to discern from the logging what is happening in prefetch.

Edit 2: The cart describing files between 100GB and 150GB did actually download something -- 2 files out of I'm not sure how many; possibly 2. It took 4 hours, during which the lines shared in the image above were continually printed to stdout. The actual download time was negligible.

cmatKhan commented 1 year ago

And now I am once again trying a different-size cart -- this time, one with files between 1GB and 10GB and no binary files -- and I am getting the following once again:

2023-03-07T16:53:59 prefetch-orig.3.0.2: Current preference is set to retrieve SRA Normalized Format files with full base quality scores.
2023-03-07T16:53:59 prefetch-orig.3.0.2: KClientHttpOpen - connected from '65.254.100.81' to locate.ncbi.nlm.nih.gov (165.112.7.16) 
2023-03-07T16:53:59 prefetch-orig.3.0.2: Setting up SSL/TLS structure
2023-03-07T16:53:59 prefetch-orig.3.0.2: Performing SSL/TLS handshake... 
2023-03-07T16:53:59 prefetch-orig.3.0.2: KClientHttpOpen - verifying CA cert 
2023-03-07T16:53:59 prefetch-orig.3.0.2: Verifying peer X.509 certificate...
2023-03-07T16:53:59 prefetch-orig.3.0.2: Reading from server...

That is printed to stdout repeatedly, and then prefetch exits successfully without having downloaded anything. This is the same issue that the OP originally reported, and the same issue that I have now had repeatedly with carts of many different sizes and compositions.

Just to reiterate -- this same command just worked (after 4 hours of "Reading from server..." printing to stdout) on a cart describing files between 100GB and 150GB.

cmd

$ ./sratoolkit.3.0.2-ubuntu64/bin/prefetch-orig.3.0.2 -vvv -X 5000GB --ngc prj_33268.ngc genotypes_1GB_to_10GB_no_binary/genotypes_1GB_to_10GB_no_binary_cart_prj33268_202303081712.krt

Edit: After submitting that 3 times in a row with the same result -- nothing downloaded, no errors -- I submitted the exact same command, with the same cart, a 4th time and hey! It worked.

There is something wrong. At the very least, the thing that is wrong is the logging.

stineaj commented 1 year ago

First, there should be no shame from anyone for omitting a needed switch in a command or in the description of unexpected behavior. I offer my apologies for your software experience and your service experience. And very importantly I am glad to hear your download is working.

The max file size switch is there mostly to ensure that users understand how much they will be downloading and that the destination has sufficient space. Before adding that switch, one of the primary bug reports we received was that user storage could not accommodate the size of the downloads.

In this case the timeout I am referring to is not necessarily from the download itself but instead from the webservice that tells the toolkit the file locations for download requests. The storage location for a file could be on commercial cloud providers or NCBI servers for either public or protected access submissions. In certain situations large cart files can lead to a timeout from that location service.

We will look into the information you provided through the trouble ticket and use that to improve the service. Thank you.

cmatKhan commented 1 year ago

I'm going to push back on characterizing this as "working", though occasionally one of the commands does go through and dbGaP deigns to serve me some data.

For the vast majority of attempts, prefetch fails to "fetch" anything at all, even though it exits successfully and no WARNING or ERROR codes are reported at the highest verbosity log setting.

It is also totally unclear when waiting for 4 hours whether or not a download will ever actually begin, or if it is just stuck.

I think calling this "working" is a stretch.

klymenko commented 1 year ago

Add --order kart to your prefetch command. It will reduce the wait time before the first actual download starts.
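For example, added to the command shared earlier in this thread:

./sratoolkit.3.0.2-ubuntu64/bin/prefetch-orig.3.0.2 -vvv -X 5000GB --order kart --ngc prj_33268.ngc genotypes_1GB_to_10GB_no_binary/genotypes_1GB_to_10GB_no_binary_cart_prj33268_202303081712.krt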

taylordm commented 1 year ago

I'm trying to d/l phenotype files -- only 2MB -- and getting the same problems as the people above. Is there any obvious solution? I installed the latest SRA toolkit just today, and I know how to d/l files from SRA (I've done it before), but nothing is coming back from the prefetch command using the ngc and krt files.

Edit: Ah, I think I can see a problem. I am not using ascp, and it's trying to use the HTTP protocol to d/l the files. However, I am behind an institutional firewall, which may be preventing the d/l through KClientHttpOpen. I'll install ascp and report back in this comment.

Edit^2: Nope, it's not the ascp installation. Still getting no d/l, and no report on where the d/l is being put. It seems to recognize that I have ascp.

I'll try this from my laptop instead of the server.

cmatKhan commented 1 year ago

@taylordm None of the solutions offered by the NCBI help desk, or here, consistently addressed this issue.

However, I can confirm that prefetch sometimes works. So far, I have been able to get some of the data I have been wanting by manually resubmitting the same command until it gives up the data.

I've been trying cart files of various compositions to try to figure out whether that affects anything. As far as I can tell, it has little effect, with the caveat that the cart file I originally tried, which describes somewhere between 1 and 2 TB, has never pulled data.

re: ascp -- I can't make sense out of what the docs say here, but maybe this will make sense to you:

sratools docs: Avoid using ascp directly for downloads

I believe the log messages re: ascp occur no matter what. What I can say for sure is that whenever my downloads have been successful, they have occurred over HTTPS.

I would encourage you to reach out to the NCBI help desk. Don't let them gaslight you and say that prefetch "works". That it mostly doesn't work, at least for certain data (maybe it's the data type? Are you accessing dbGaP data?), is a problem.

What I have been harping on is that, at the very least, more informative logs would be helpful. I don't know what the issue is, of course, and fixing it may be intensive. But a message in the log giving some hint that prefetch at least recognizes that failing to fetch anything is an error would be nice.
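In the meantime, the workaround on my end is to wrap prefetch and treat "exited 0 but nothing new appeared" as a failure myself. A rough sketch, using my ngc file plus a placeholder kart name, and assuming downloads land under the current directory:

#!/usr/bin/env bash
# rerunnable wrapper: fail loudly if prefetch exits 0 but no new files appear
set -euo pipefail

before=$(find . -type f | wc -l)   # crude before/after file count
./sratoolkit.3.0.2-ubuntu64/bin/prefetch-orig.3.0.2 -vvv -X 5000GB --ngc prj_33268.ngc mycart.krt
after=$(find . -type f | wc -l)

if [ "$after" -le "$before" ]; then
    echo "ERROR: prefetch exited 0 but nothing appears to have been downloaded" >&2
    exit 1
fi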

klymenko commented 1 year ago

Aspera is being disabled. prefetch tries its best, but it downloads from the servers. If the prefetch download fails, rerun the same command and prefetch will try to continue or retry the failed downloads.

klymenko commented 1 year ago

Do you need more help?

Han-Cao commented 1 year ago

Add --order kart to your prefetch command. It will reduce the wait time before the first actual download starts.

I have the same issue when downloading more than 1000 files from a cart, and --order kart fixed it.

danafton commented 1 year ago

Experiencing similar issues as others here. What a terribly cumbersome and buggy platform dbGaP is.

In any case, the workaround I have found for downloading a large number of files, which seems to help, is to add "--rows 1-100" at the end of my prefetch command and then cycle through higher row numbers. I.e., I am manually batching things out because in the year 2023 we can't have a computer do this for us.
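The manual batching ends up looking roughly like this -- a sketch, with the step size, total row count, and ngc/kart file names as placeholders for my own cart:

# walk through the cart 100 rows at a time
TOTAL=3000   # roughly the number of files the cart describes
STEP=100
for start in $(seq 1 $STEP $TOTAL); do
    end=$((start + STEP - 1))
    prefetch --ngc my_cred.ngc --rows "${start}-${end}" mycart.krt
done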

cmatKhan commented 1 year ago

@danafton I struggled and struggled with this. I was given some bad advice from NCBI (for instance, that my upper limit on max file size was "unreasonable". This is of course nonsense: the only way to disable the max file limit is to set a large number, and the size of that number doesn't matter if the goal is effectively infinite. My network can handle it -- the problem wasn't on my side).

What I found is that, despite repeatedly not working, if you are persistent and keep resubmitting, eventually it does work.

I did end up breaking up my karts into smaller parts. Keeping the plain-text tsv of the kart, along with the kart file itself, is a good idea, of course. It makes auditing what you have retrieved across a bunch of small submissions easier.

amstilp commented 1 year ago

@cmatKhan I am having similar issues downloading phenotype data from dbGaP -- not even large genomic files -- with a max cart size of ~65 MB. prefetch will often fail to download a single file, or will sometimes report that it has downloaded a file without actually downloading it. In these cases, it never exits with an error code (as it should, in my opinion).

Since I need to script this, I will likely have to run prefetch in a loop until all files are actually downloaded. Were you able to get a tsv of the files in the kart file from the kart file itself, and if so, how?
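The loop I have in mind is roughly the following -- a sketch, with the expected file count, ngc file, and kart name as placeholders:

# keep re-running prefetch until the expected number of files is present
EXPECTED=50        # number of files the cart describes
MAX_TRIES=20
for i in $(seq 1 "$MAX_TRIES"); do
    prefetch --ngc my_cred.ngc mycart.krt || true
    have=$(find . -type f ! -name '*.krt' ! -name '*.ngc' | wc -l)   # crude count; adjust to your layout
    [ "$have" -ge "$EXPECTED" ] && break
    echo "attempt $i: $have of $EXPECTED files present; retrying" >&2
done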

cmatKhan commented 1 year ago

Edit: In re-reading, I realized I did misunderstand the question:

No, I wasn't able to create a tsv from the kart. I don't know what the kart format actually is (not a bad question for NCBI). But the dbGaP site gives you the option of creating either a kart file or a tsv. It is a very good idea to create both, in my opinion, and then use the tsv to audit.
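For the audit, something simple like this is enough -- a sketch that assumes the tsv export has a header row and a column containing file names (which column that is depends on the export):

# list file names from the tsv that are not yet present in the download directory
# NOTE: $2 is a guess at the file-name column; adjust for your tsv
comm -23 \
  <(awk -F'\t' 'NR > 1 {print $2}' mycart.tsv | sort) \
  <(ls . | sort)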

amstilp commented 1 year ago

Thanks @cmatKhan. I saw that I could export a tsv from the "Files table" button above, but I was hoping you'd found a clever way to extract it from the kart file. Alas.

OlivierCoen commented 11 months ago

Hi @cmatKhan, I had exactly the same issue as yours. Prefetch was working inconsistently on SRR ids corresponding to small datasets. My assumption is that it must have been a configuration issue, but I could not figure out what the problem was. I managed to get prefetch to work every time by using the Docker image provided by NCBI (https://hub.docker.com/r/ncbi/sra-tools) and following the setup guidelines here: https://github.com/ncbi/sra-tools/wiki/SRA-tools-docker. Running prefetch inside the container works like a charm every time I use it. It's more complex than using prefetch directly, but at least I could download the files I needed.
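Roughly how I run it -- the mount path and accession here are just examples, and the exact invocation is covered in the wiki page linked above:

# run prefetch from the NCBI container, writing into the current directory
docker run --rm -v "$PWD:/workspace" -w /workspace ncbi/sra-tools prefetch SRR000001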

cmatKhan commented 11 months ago

Interesting -- thank you!

I would not be surprised if you see inconsistent performance in the container, too -- my solution ended up being (per NCBI) 'just keep re-submitting. maybe someday it will work'. Lo and behold, it did (still inconsistently). It seems like this problem has to do with the server rather than user config.

But, whatever witchcraft makes it work, fantastic.

OlivierCoen commented 11 months ago

Actually, I just realized that the SRA toolkit version used in the Docker image is 3.0.1, while the version I was using (outside the container) was 3.0.7 (currently the latest). I downloaded and installed SRA toolkit version 3.0.1 with the provided script (https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/3.0.7/setup-apt.sh), after changing the version from 3.0.7 to 3.0.1 in the script file. I did not configure the SRA toolkit any further. Now it's working well, like in the container :)
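In case it helps anyone, this is roughly what I did -- the sed call just swaps the version string inside the script, as described above:

# fetch the installer script and point it at 3.0.1 instead of 3.0.7
wget https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/3.0.7/setup-apt.sh
sed -i 's/3\.0\.7/3.0.1/g' setup-apt.sh
bash setup-apt.sh   # may need sudo depending on your system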

Edit: Actually, version 3.0.1 eventually worked inconsistently too ;) but I feel like it works more often than version 3.0.7 (I've not kept stats on it, though). It's definitely an issue with NCBI's servers.

MagpiePKU commented 10 months ago

Hi,

I would like to second that I have encountered the same problem with the 3.0.7 binary.

We tested with:

/path_to_sratoolkit.3.0.7-centos_linux64/bin/prefetch -X 5000GB -vvv -t http -f yes SRR10156643

Then it failed without saving anything.

The output was:

2023-12-10T17:58:29 prefetch.3.0.7: heartbeat = 60000 Milliseconds
2023-12-10T17:58:29 prefetch.3.0.7: Seeding the random number generator
2023-12-10T17:58:29 prefetch.3.0.7: Loading CA root certificates
2023-12-10T17:58:29 prefetch.3.0.7: Configuring SSl defaults

2023-12-10T17:58:40 prefetch.3.0.7: Current preference is set to retrieve SRA Normalized Format files with full base quality scores.