Closed twohrdrive closed 5 years ago
UPDATE: if I invoke grep -c "length" on the fastq files, as "length" is a part of the sequence description in fastq, I get 16348924 counts on BOTH the _1.fastq and _2.fastq files from the SRX database. If I use the same grep command on the SRR database, I get exactly twice that amount for each file (32697848 counts). What does this mean?
Hi @twohrdrive - it would help us to give you a more specific answer if we knew the SRX in question. If this is confidential, you can email us at sra-tools@ncbi.nlm.nih.gov, although technically this is not a tool issue.
In the SRA object model, an SRX is a container of one or more SRRs. The only objects having data are SRRs. Version 2.9.6 of the tools is able to resolve simple SRXs that have only one SRR (expected to be the most common case in modern submissions). Knowing the specific SRX would help me reproduce your counting experiment.
Hi, Thanks for the reply.
No, it is not confidential. The SRX number is: SRX2691251 and the page can be viewed here:
In your initial report you stated that you had different results between the SRX and the SRR. I was unable to duplicate your results in that regard, and in fact there is no rational explanation for how it could be. SRX2691251 is a container for SRR5396442, and when downloaded they produce exactly the same object. This is because the only real object is the SRR.
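As for the different counts of '@' between paired ends, keep in mind that the quality scores are ASCII-encoded. The character '@' is ASCII 64, which in the Phred+33 encoding corresponds to a quality score of 31, so '@' also appears throughout the quality lines.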
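To make the encoding concrete, here is a minimal sketch that converts a single quality character to its score. It assumes the Phred+33 offset, and phred33_score is a hypothetical helper name, not part of sra-tools.

```shell
# Convert one quality character to its Phred score, assuming the Phred+33 offset.
# phred33_score is a hypothetical helper name, not part of sra-tools.
phred33_score() {
    # In POSIX printf, a leading quote makes %d print the character's numeric code.
    ascii=$(printf '%d' "'$1")
    echo $((ascii - 33))
}
```

For example, phred33_score '@' prints 31, since '@' is ASCII 64 and 64 - 33 = 31.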
So to explain what I saw:
$ grep -c '@' SR*.fastq
SRR5396442_1.fastq:13684193
SRR5396442_2.fastq:15293234
SRX2691251_1.fastq:13684193
SRX2691251_2.fastq:15293234
Clearly, the SRX and SRR are identical (and they must be). But it is curious that the paired ends appear to give different counts.
If we remember that the '@' has to be at the beginning of a line, we can try:
$ grep -c '^@' SR*.fastq
SRR5396442_1.fastq:8264729
SRR5396442_2.fastq:8353503
SRX2691251_1.fastq:8264729
SRX2691251_2.fastq:8353503
and this gives closer numbers, but still not identical. Keeping with the grep theme, the following command counts the '@' lines properly:
$ for f in SR*.fastq; do echo -n "$f:"; grep '^@' "$f" | grep -c length; done
SRR5396442_1.fastq:8174462
SRR5396442_2.fastq:8174462
SRX2691251_1.fastq:8174462
SRX2691251_2.fastq:8174462
And so you see, your tests were fooled by ASCII-encoded phred.
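An alternative that does not depend on the word "length" appearing in the description is to count by line position. The sketch below assumes each record is exactly four unwrapped lines (header, sequence, '+' separator, quality); both function names are hypothetical helpers, not part of sra-tools. As a side note, 16348924 from the update is exactly 2 x 8174462, which would be consistent with the description being repeated on the '+' line of each record, although that layout is an assumption here rather than something confirmed in this thread.

```shell
# Count FASTQ records, assuming four unwrapped lines per record:
# header, sequence, '+' separator, quality.
count_fastq_records() {
    # Only line 1 of each 4-line group is treated as a header.
    awk 'NR % 4 == 1 && /^@/ { n++ } END { print n + 0 }' "$1"
}

# Count quality lines (line 4 of each group) that begin with '@'.
# These are the lines that inflate a naive grep -c '^@'.
count_at_quality_lines() {
    awk 'NR % 4 == 0 && /^@/ { n++ } END { print n + 0 }' "$1"
}
```

Running count_fastq_records on each of the four files should agree with the 8174462 shown above if the files follow the four-line layout, and count_at_quality_lines should account for the gap between that number and the grep -c '^@' result.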
Hello,
In a given biosample on NCBI, there appear to be many runs and experiments associated with any SRA.
I am interested in one organism within a larger Biosample which had its transcriptome sequenced using paired-end Illumina technology.
If you navigate to the page for my organism, there are multiple links; one starts with "SRR" and the other with "SRX".
prefetch and fastq-dump work for either of these accessions. When I use fastq-dump and invoke --split-3, I get two fastq files for the two pairs, with no file for singletons, for both the SRR and SRX archives. However, when I use grep -c "@" SRR#####, I get a larger number of counts than if I grep the SRX##### database that I downloaded. What is the difference between an SRR and SRX database, and which one should I use for de novo transcriptome assembly from a single species within a larger Biosample?
Best,
-A