ncbi / sra-tools

SRA Tools

Inconsistencies with retrieving SRX data from different sources #934

Closed nasjr08 closed 1 month ago

nasjr08 commented 1 month ago

I am aware of two ways of retrieving .sra files from NCBI sources. The first is from the AWS bucket:

aws s3 cp s3://sra-pub-run-odp/sra/SRR1119486/SRR1119486 .
fastq-dump --gzip --split-files SRR1119486

The second is via an ftp link using wget: ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR111/SRR1119486/SRR1119486.sra

It appears that the two downloaded files differ in size; the file from the FTP link appears to be missing quality control metrics, and this is reflected in the corresponding fastqc.html reports (happy to provide them if it helps). Is this intentional or a mistake? Which of the two am I better off using?

durbrow commented 1 month ago

The run data stored on-prem by NCBI (e.g. NCBI's FTP site) do not have base quality scores. The run data stored by Amazon ODP do have base quality scores. This is intentional.


> Which of the two am I better off using?

Only you can answer that. If you don't need the base quality scores, then the files without them are probably better for you, inasmuch as they are smaller and will consume less bandwidth and storage space. If you don't know whether you need the base quality scores, then you probably do not need them.

nasjr08 commented 1 month ago

Thank you!

stineaj commented 1 month ago

You can see a bit more info about the differences between those two here: https://ncbiinsights.ncbi.nlm.nih.gov/2021/10/19/sra-lite/ We don't publish our data storage model because it changes with some regularity based on usage and cost, and there are also transitional states. Our current model has the Normalized format in AWS and the Lite format at NCBI and GCP. That is always subject to change, which is why we use the SDL (SRA Data Locator) and SRA Toolkit to handle storage model changes as seamlessly as possible for users.
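
Since the storage location does not reliably tell you which format you received, a rough check on the dumped FASTQ can help. The sketch below is not part of SRA Toolkit: `count_quality_chars` is a hypothetical helper, and it assumes that a Lite-derived FASTQ shows only one or two distinct quality characters while a Normalized one shows many. It counts the distinct characters on the quality lines of an uncompressed FASTQ with four lines per record.

```shell
#!/bin/sh
# count_quality_chars FILE
# Hypothetical helper (not an SRA Toolkit command).
# Prints the number of distinct characters found on the quality lines
# (every 4th line) of an uncompressed, 4-lines-per-record FASTQ file.
# Pass "-" as FILE to read from standard input.
count_quality_chars() {
  awk 'NR % 4 == 0 {
         n = length($0)
         for (i = 1; i <= n; i++) {
           c = substr($0, i, 1)
           if (!(c in seen)) { seen[c] = 1; distinct++ }
         }
       }
       END { print distinct + 0 }' "$1"
}
```

For the gzipped output produced by the command in the original post, something like `gzip -dc SRR1119486_1.fastq.gz | count_quality_chars -` should work. A very small count suggests simplified quality scores; treat the result as a hint, not proof of the file's origin.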