ncbi / sra-tools

SRA Tools

SRR vs SRX files #213

Closed twohrdrive closed 5 years ago

twohrdrive commented 5 years ago

Hello,

In a given biosample on NCBI, there appear to be many runs and experiments associated with any SRA.

I am interested in one organism within a larger Biosample which had its transcriptome sequenced using paired-end Illumina technology.

If you navigate to the page for my organism, there are multiple links: one starts with "SRR" and the other with "SRX".

prefetch and fastq-dump work for either of these accessions. When I run fastq-dump with --split-3, I get two fastq files for the two mates, and no file for singletons, for both the SRR and SRX archives. However, when I use grep -c "@" SRR#####, I get a larger count than when I grep the SRX##### database that I downloaded. What is the difference between an SRR and an SRX database, and which one should I use for de novo transcriptome assembly of a single species within a larger Biosample?

Best,

-A

twohrdrive commented 5 years ago

UPDATE: if I invoke grep -c "length" on the fastq files, as "length" is a part of the sequence description in fastq, I get 16348924 counts on BOTH the _1.fastq and _2.fastq files from the SRX database.

If I use the same grep command on the SRR database, I get exactly twice that amount for each file (32697848 counts). What does this mean?

kwrodarmer commented 5 years ago

Hi @twohrdrive - it would help us to give you a more specific answer if we knew the SRX in question. If this is confidential, you can email us at sra-tools@ncbi.nlm.nih.gov, although technically this is not a tool issue.

In the SRA object model, an SRX is a container of one or more SRRs. The only objects that hold data are SRRs. Version 2.9.6 of the tools is able to resolve simple SRXs that have only one SRR (expected to be the most common case in modern submissions). Knowing the specific SRX would help me reproduce your counting experiment.

twohrdrive commented 5 years ago

Hi, Thanks for the reply.

No, it is not confidential. The SRX number is: SRX2691251 and the page can be viewed here:

https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR5396442

kwrodarmer commented 5 years ago

In your initial report you stated that you had different results between the SRX and the SRR. I was unable to duplicate your results in that regard, and in fact there is no rational explanation for how it could be. SRX2691251 is a container for SRR5396442, and when downloaded they produce exactly the same object. This is because the only real object is the SRR.

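As far as the different counts of '@' between paired ends, keep in mind that the quality scores are ASCII-encoded. In the Phred+33 encoding, the character '@' (ASCII 64) represents a phred score of 31, so '@' can appear anywhere in a quality line, not just at the start of a header line.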
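To make the encoding concrete, here is a minimal shell sketch that decodes a single quality character. It assumes the Phred+33 offset; the variable names are only illustrative.

```shell
# Decode one FASTQ quality character to its phred score.
# Assumption: the file uses the Phred+33 offset.
qual_char='@'

# POSIX printf: a leading single quote in the argument yields the character's numeric code.
ascii=$(printf '%d' "'$qual_char")

# Subtract the encoding offset to recover the score.
phred=$((ascii - 33))

echo "char=$qual_char ascii=$ascii phred=$phred"
```

This prints `char=@ ascii=64 phred=31`, which is why a quality string can contain '@' and inflate a plain `grep -c '@'` count.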

So to explain what I saw:

$ grep -c '@' SR*.fastq 
SRR5396442_1.fastq:13684193 
SRR5396442_2.fastq:15293234 
SRX2691251_1.fastq:13684193 
SRX2691251_2.fastq:15293234 

Clearly, the SRX and SRR are identical (and they must be). But it is curious that the paired ends appear to give different counts.

If we remember that the '@' has to be at the beginning of a line, we can try:

$ grep -c '^@' SR*.fastq 
SRR5396442_1.fastq:8264729 
SRR5396442_2.fastq:8353503 
SRX2691251_1.fastq:8264729 
SRX2691251_2.fastq:8353503 

and this gives closer numbers, but still not identical, because a quality line can itself begin with '@'. Keeping with the grep theme, the following command counts only the header lines, which both start with '@' and contain "length":

$ for f in SR*.fastq; do echo -n "$f:"; grep '^@' "$f" | grep -c length; done
SRR5396442_1.fastq:8174462 
SRR5396442_2.fastq:8174462 
SRX2691251_1.fastq:8174462 
SRX2691251_2.fastq:8174462 

And so you see, your tests were fooled by ASCII-encoded phred.
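As an alternative to grep, records can be counted by line position instead of by content. The sketch below assumes every record occupies exactly four lines (no wrapped sequence or quality lines); the accession SRR0000000 and the sample reads are made up for illustration.

```shell
# Build a tiny two-record FASTQ; the second record's quality line starts with '@'.
# SRR0000000 is a hypothetical accession used only for this demonstration.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
@SRR0000000.1 1 length=8
ACGTACGT
+SRR0000000.1 1 length=8
IIIIIIII
@SRR0000000.2 2 length=8
TTGCATGC
+SRR0000000.2 2 length=8
@IIIIII@
EOF

# Naive count: every line starting with '@', which includes the last quality line.
naive=$(grep -c '^@' "$tmp")

# Positional count: total lines divided by four (assumes 4-line records).
records=$(awk 'END { print NR / 4 }' "$tmp")

# Cross-check: only lines in header position (1st, 5th, ...) that start with '@'.
headers=$(awk 'NR % 4 == 1 && /^@/' "$tmp" | wc -l | tr -d ' ')

echo "naive=$naive records=$records headers=$headers"
rm -f "$tmp"
```

On this sample the output is `naive=3 records=2 headers=2`: the naive count is off by one because of the quality line that begins with '@', while both positional counts agree on two records.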