Original FastQ headers - Githubissues

ncbi / sra-tools

SRA Tools

Other

1.13k stars 247 forks source link

Original FastQ headers #130

Closed hp2048 closed 6 years ago

hp2048 commented 6 years ago

Hi We have downloaded SRA file from https://sra-download.ncbi.nlm.nih.gov/traces/era14/ERR/ERR1019/ERR1019069.

We are using prebuilt fastq-dump (v2.9.0) sratoolkit.2.9.0-centos_linux64/bin/fastq-dump.2.9.0.

The command to get FastQ files from the SRA file was as follows:

fastq-dump -v 1 --split-3 -F -X 3 -Z ERR1019069.sra

RESULT:

@1
GGTTGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGGTTAGGGTTAGGGTTAGGGTTAGGGTT
+1
@@BDFFFBFFHFDEHHIGIFHFGIJCDCGHGBHHIHI?DFFHIFGFCHH=CHIIG.=AHHH;?D?C>6;;?CBBBB(5?#####################
@1
CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTA
+1
CCCFFFFFHHHHHJJIIJIJJJGGGIHIJJJIGGIIJIFJJJJIJJJIFHGCGHJIEGGIHHGC@@BEDEEECDBD(=BD<<CB?A?<38A+293(88<<
@2
CCTAACCCTAACCCCTAACCCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCCAACCCCTACCCCCTACCTCTAACCATAACCCTAACCCTAAC
+2
CCCFFFFFHHHHGJIGEGIGHH=H;EFDH7?BBFFC;3BDGHE3BGCCHHGII2@CHEH#########################################
@2
TAGGGTTGGGTTAGGGTTAGGGTTAGGGTTGGGTTAGGGTTAGGGTTAGGGGTAGGGTTGGGTTAGGGGTAGGGTTAGGGTTAGGGTTGGGGGTAGGGTT
+2
@@@FFDDFHHHHHGEIEECHHH9EBFHI*8@FI6B=FCC=;CEGC)8(6@C#################################################
@3
GTTAGGGTTAGGGTTAGGGTTGGGTTAGGGTTGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTGTT
+3
@@@DFFFDFFDHHEGGHJJ@FHHI?F=FIJ@FHIE;;?BDGBC@GFH.=@@EEAHFBFF@CBACD;;?ACB(9<<<A>ABA?89???C?+8<A?A#####
@3
CTAACCCTAACCCTAACCCTAACCCTAACCCTACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTACCCAAACCCCTAACCCCAAC
+3
CCCFFFFFHHHHHJJJJJJJIIIJIIJJJJICHIJJECHHIJ<FFGGH;CFHEG9FGGI9(=7AE?67)>@2655?########################

I have also used --defline-seq '$ac $si $sn $sg $ri $rn' with no avail in retrieving standard Illumina headers.

Could you please let us know how to get original headers like the following?

@HWUSI-EAS100R:6:73:941:1973#0/1

kwrodarmer commented 6 years ago

These names are not stored in the SRA, so there is no special command that can be used to produce them.

hp2048 commented 6 years ago

Hi,
Thank you for your quick response. I tried the following: wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR101/004/ERR1019034/ERR1019034_1.fastq.gz The headers in the file is as follows for the first sequence: @ERR1019034.1 HS2000-715_428:7:1312:17045:49660/1 CTAACCCTAAC.....................CTAACCCTAACCCTAAC And for the same file from NCBI SRA using the following command: ../sratoolkit.2.9.0-centos_linux64/bin/fastq-dump -F -X 2 -Z ERR1019034 I get the following header line: @1 CTAACCCTAAC......................GGGTTAGGGTTAGGGT

In this case it does look like SRA doesn't have the information but ENA does. Is it really the case or am I missing something here?

Kind regards Hardip

kwrodarmer commented 6 years ago

The file was originally submitted to the ENA, who recorded it as fastq with original read names. When mirrored to NCBI, the read names were stripped.

It may be worth explaining a bit of the rationale behind this. Those that try to interpret the read names in some way are assigning special meaning to a string that is entirely 100% free form, and whose only clear interpretation is as an opaque identifier; containing exactly as much information as the serial number used within the SRA. Of course, people are trying to store little bits of significant information in the string that they can later retrieve. The problem is that every sequencing center does this a little differently, and the names are only guaranteed to be significant to the original submitter.

The SRA stores NGS information, not format. The read names can only be taken as opaque identifiers, since we have no reliable way of determining what information they could carry so as to extract it and store it in our database. This is a case of using what amounts to a comment field in an extremely ambiguous format and trying to assign special significance to it. We spent the first several years of the SRA's development in vain trying to process read names in order to retain any information that could be perceived. Had people left the names as assigned by their sequencing system, then we would have continued to process and store the names; made possible by being able to know how names were generated and formatted. But a very high percentage of submitters were altering the names, perhaps using a perl script or the like, in order to assign their own magical significance to the reads, and as a result would break our loading process. For the first years of the SRA, the number 1 cause of rejecting submissions was based on badly formatted read names that we couldn't pick apart.

The alternative to trying to pick apart the names was to store them as opaque tags. This is what most people assume is done. However, to do so would be a huge overhead in storage. Not many places around the globe store as much raw data as NCBI, or have to pay for it in disk purchases and maintenance, electrical, cooling, and 24x7 personnel. To spend so much to store a data series that by file format definition is no more informative than a serial number could not justify the millions of additional dollars.

So please be aware that while people may find read names a convenient place to store their most interesting tidbits of data, an automated loading pipeline has no way of knowing whether the names are significant or contain your grandmother's secret cooking recipe. We discard them and replace them.

There is a legitimate use of these tags in terms of synchronizing processing between separate pipelines; i.e. by using the read names to identify original sequence. But for that purpose, the original read also serves as a wonderful sequence identifier.

So yes, it is the case that original files, or fastq generated directly from the original files, will preserve the read names. The SRA stopped preserving them years ago when we could no longer justify the expense of attempting to manage them.

hp2048 commented 6 years ago

Dear Thank you so much for the detailed response. I truly appreciate how NCBI has allowed all researchers to store their valuable data and made it available for wider use. Your efforts in maintaining the system that you maintain is truely heroic and significant.

Coming back to the point of FastQ headers, I just wanted to see if I am missing something in terms of improper use of the SRA toolkit and therefore headers are missing. Your answer is very clear in helping me understand that SRA has not stored header values in this case and it is a great help.

My reasons for the header information was to extract machine, flow cell and lane IDs from FastQ files. Illumina's bcl2fastq incorporates machine, flow cell and lane information in the headers. If by any luck, it would have been stored then I could use these informations for identifying lane/machine/flow cell level PCR and optical duplicates or any biases that may be present due to machine's components. All these information can be useful in refining variant call sets.

Once again, I understand that the information is not stored for very valid reasons as you mentioned. I appreciate your time in responding to my queries.

Hardip

hp2048 commented 6 years ago

I have obtained the required response and therefore I am closing this issue. Kind regards Hardip

bartgrantham commented 6 years ago

@kwrodarmer, is there somewhere on NCBI's site that discusses this design decision in more detail? I assume that this decision was not taken lightly, since it has huge implications for users comparing sets of reads when the unique identifier is lost.

kwrodarmer commented 6 years ago

I don't know that we've put it up on the NCBI site (although you should know that this GitHub node is maintained by NCBI), but I did start an explanation that may be expanded over time at

https://github.com/ncbi/sra-tools/wiki/Read-Names

You're right that the decision was not taken lightly. There have been implications for users who are making reasonable use of read names, but there were also implications for the SRA as a whole that we battled against over a period of years. It's unlikely that most users would have any awareness of this other side, unless they were also submitters and had experienced the difficulty of ETL.

The bottom line is that there are important bits of information being stored in a field that has several possible firm formats along with an infinite number of arbitrary but fully legal formats, which makes ETL impractical. It is difficult to understand why valuable information would be squirreled away in a name rather than given its own proper field in whatever format is being used.

Take a look and by all means make any additional comments here.

rwhetten commented 5 years ago

@kwrodarmer One case in which information in the read header is not equivalent to a random serial number is output from the 10x Genomics pipeline, where the header contains barcode information that establishes links among multiple reads in a read cloud. Is there any way to recover this barcode information from SRA files containing data from 10x Genomics experiments?

hanasusak commented 3 years ago

I am downloading 10x single cell fastq files, but they are useless without read header as @rwhetten already pointed out. I am missing the barcode info that is stored in the header, and therefore can not run cellranger software (as it quits with the message 'stage error: Invalid read Qname!')

ETaSky commented 2 years ago

I just came across this old post because I am scratching my head why SRA replaces the original header of submitted fastq file. I can understand the rationale of the design. However, one significant problem because of this is one cannot verify if a pair of reads from 2 fastq files (forward and reverse) is indeed a pair. Can we trust the order that SRA generates are the original order of the sequences?

To make it worse, on ENA website, they would merge forward and reverse together (although it is possible to download submitted files), the reads in those merged files seems not in particular order. So basically there is no way that the reads can be separated back into forward and reverse reads. Consequently, the reads cannot be properly used therefore wasting all those storage space.

The file was originally submitted to the ENA, who recorded it as fastq with original read names. When mirrored to NCBI, the read names were stripped.

It may be worth explaining a bit of the rationale behind this. Those that try to interpret the read names in some way are assigning special meaning to a string that is entirely 100% free form, and whose only clear interpretation is as an opaque identifier; containing exactly as much information as the serial number used within the SRA. Of course, people are trying to store little bits of significant information in the string that they can later retrieve. The problem is that every sequencing center does this a little differently, and the names are only guaranteed to be significant to the original submitter.

The SRA stores NGS information, not format. The read names can only be taken as opaque identifiers, since we have no reliable way of determining what information they could carry so as to extract it and store it in our database. This is a case of using what amounts to a comment field in an extremely ambiguous format and trying to assign special significance to it. We spent the first several years of the SRA's development in vain trying to process read names in order to retain any information that could be perceived. Had people left the names as assigned by their sequencing system, then we would have continued to process and store the names; made possible by being able to know how names were generated and formatted. But a very high percentage of submitters were altering the names, perhaps using a perl script or the like, in order to assign their own magical significance to the reads, and as a result would break our loading process. For the first years of the SRA, the number 1 cause of rejecting submissions was based on badly formatted read names that we couldn't pick apart.

The alternative to trying to pick apart the names was to store them as opaque tags. This is what most people assume is done. However, to do so would be a huge overhead in storage. Not many places around the globe store as much raw data as NCBI, or have to pay for it in disk purchases and maintenance, electrical, cooling, and 24x7 personnel. To spend so much to store a data series that by file format definition is no more informative than a serial number could not justify the millions of additional dollars.

So please be aware that while people may find read names a convenient place to store their most interesting tidbits of data, an automated loading pipeline has no way of knowing whether the names are significant or contain your grandmother's secret cooking recipe. We discard them and replace them.

There is a legitimate use of these tags in terms of synchronizing processing between separate pipelines; i.e. by using the read names to identify original sequence. But for that purpose, the original read also serves as a wonderful sequence identifier.

So yes, it is the case that original files, or fastq generated directly from the original files, will preserve the read names. The SRA stopped preserving them years ago when we could no longer justify the expense of attempting to manage them.

kwrodarmer commented 2 years ago

However, one significant problem because of this is one cannot verify if a pair of reads from 2 fastq files (forward and reverse) is indeed a pair. Can we trust the order that SRA generates are the original order of the sequences?

The reads are joined on read name. We detect and join pairs exactly as you would. We record read orientation.

ETaSky commented 2 years ago

However, one significant problem because of this is one cannot verify if a pair of reads from 2 fastq files (forward and reverse) is indeed a pair. Can we trust the order that SRA generates are the original order of the sequences?

The reads are joined on read name. We detect and join pairs exactly as you would. We record read orientation.

Thanks for your reply. I understand what you are talking about. But the issue I am discussing here is about validation. If you look like the sequencing run ERR1972104 on SRA. I have spent tons of time to figure out what is wrong with this file.

You will find a lot of sequences don't have a pair. But if you check the original submitted files using fastp, the read pairs checked out.

The problem is probably that the original submitter messed up with their own sequence header. When you use those names to join reads, you will end up with a lot of unpaired reads. But within the original header, you can use the original illumina sequencing header (which is deleted in SRA generated fastq) to match the pair, which is not possible if using SRA generated fastq file.

durbrow commented 2 years ago

A great deal of (programmer and CPU) effort goes into correctly identifying and matching up mates, but no amount of automated "intellegence" can properly deal with all the ways that humans can invent to mess it up (or reliably detect that it isn't working).

As for storing the original names in the online archive, this is unlikely to ever happen. The storage is not free. It's just too many bits to allocate to so little information. Plus, the originals can still be requested.

hp2048 commented 2 years ago

I understand entirely the reasons for stripping read headers. SRA is an archive, and researchers worldwide use it as an archive. Thank you to NCBI, ENA and DDBJ (INSDC) for making this happen. My two cents on the subject:

Regarding the originals can still be requested.: unfortunately, this is not the case because researchers use SRA as the archive for their data and don't store them on-prem.

An alternate solution can be archiving files as-is. Then programmer and CPU hours can be saved to offset costs associated with increased storage. Also, this will make data submitters responsible for their actions, as they should be, and their actions will be visible to the community.

Perhaps a costly solution can store string before space in fastq headers in a separate file linked to the SRA converted read index. If end users desire, they can reconstruct headers using this file.

I am guilty as much as anyone in reformatting fastq headers when high-throughput sequencing became a thing. However, there was a general lack of standards, and we were all learning to identify the path forward. However, my understanding is that the sheer size of data, complexities of experiments, availability of "good" standards, trained and competent workforce, increased software compatibility and availability of good quality software, among others, has changed the way researchers manage their data. Data is valued and generally stored with good care. Perhaps not my place to say this, but SRA may revisit the decision to strip read names and establish a cost-benefit analysis with new data submissions.

I fully support NCBI and other international repositories alike for taking care of critical data for the benefit of humanity, and 100% agree that some tough decisions always have to be made.

ETaSky commented 2 years ago

I understand entirely the reasons for stripping read headers. SRA is an archive, and researchers worldwide use it as an archive. Thank you to NCBI, ENA and DDBJ (INSDC) for making this happen. My two cents on the subject:

Regarding the originals can still be requested.: unfortunately, this is not the case because researchers use SRA as the archive for their data and don't store them on-prem.

An alternate solution can be archiving files as-is. Then programmer and CPU hours can be saved to offset costs associated with increased storage. Also, this will make data submitters responsible for their actions, as they should be, and their actions will be visible to the community.

Perhaps a costly solution can store string before space in fastq headers in a separate file linked to the SRA converted read index. If end users desire, they can reconstruct headers using this file.

I am guilty as much as anyone in reformatting fastq headers when high-throughput sequencing became a thing. However, there was a general lack of standards, and we were all learning to identify the path forward. However, my understanding is that the sheer size of data, complexities of experiments, availability of "good" standards, trained and competent workforce, increased software compatibility and availability of good quality software, among others, has changed the way researchers manage their data. Data is valued and generally stored with good care. Perhaps not my place to say this, but SRA may revisit the decision to strip read names and establish a cost-benefit analysis with new data submissions.

I fully support NCBI and other international repositories alike for taking care of critical data for the benefit of humanity, and 100% agree that some tough decisions always have to be made.

I guess what @durbrow means by "Plus, the originals can still be requested." is that on NCBI SRA, you can use the Cloud Data Delivery Services to get the original submitted data (although there seems not to be an explicit notification on NCBI on this), and if the data were originally submitted to ENA, you can also downloaded the submitted files. I am not sure if I understand the situation correctly.

durbrow commented 2 years ago

Yes, as @ETaSky says, the original data submitted to NCBI is available by request. As for saving on CPU/programmer time and spending it on online storage, my thinking is that it is better (more efficient in total) for NCBI to do the work once than to shift the burden to the community. Moreover, if we who deal with this problem daily have trouble getting it right, imagine how much more difficulty our users have with it. It is quite likely that any savings on programmer time would be offset by support questions for the curators about how some submitter mangled their read names.

rwhetten commented 2 years ago

It would be helpful to the users of 10X Genomics data to make it clear that the original data can be requested from NCBI. Unlike most sequence data, 10X Genomics headers contain information that is essential to the value of the data, so the standard SRA practice of removing headers destroys much of the value of the data. Knowing that the original format of the data is available by request is important.

On Wed, May 18, 2022, 10:35 AM Kenneth Durbrow @.***> wrote:

Yes, as @ETaSky https://github.com/ETaSky says, the original data submitted to NCBI is available by request. As for saving on CPU/programmer time and spending it on online storage, my thinking is that it is better (more efficient in total) for NCBI to do the work once than to shift the burden to the community. Moreover, if we who deal with this problem daily have trouble getting it right, imagine how much more difficulty our users have with it. It is quite likely that any savings on programmer time would be offset by support questions for the curators about how some submitter mangled their read names.

— Reply to this email directly, view it on GitHub https://github.com/ncbi/sra-tools/issues/130#issuecomment-1130098770, or unsubscribe https://github.com/notifications/unsubscribe-auth/AATEEIPCC36G655NKYO56IDVKT53HANCNFSM4FAXVURQ . You are receiving this because you were mentioned.Message ID: @.***>

ETaSky commented 2 years ago

Upon further checking a few datasets on SRA that I am trying to download. Unfortunately, the format and option to obtain the original submitted files are not uniform across different SRA records. One study, the original submitted files are available only through FTP site; another one, the original submitted files appeared to be able to access through SRA Cloud Data Delivery. Also, I am not particular sure if we can select the format of data to deliver during the Cloud Data Delivery request page (since the study I tested only delivered the SRA modified ones). I really hope SRA could have a uniform mechanism for the access of original data.

durbrow commented 2 years ago

I am not particular sure if we can select the format of data to deliver during the Cloud Data Delivery

You will get it in the form the submitter sent it.

I really hope SRA could have a uniform mechanism for the access of original data.

This is likely impossible. By neccesity, the location of submitted data changes over time. For example, they are sent to NCBI and are on NCBI storage systems at least until processed (or later as submitters can request to hold their data until their studies are published). After being processed, they may sit on NCBI storage systems until they are copied to cloud cold storage locations and registered in CDDS. You should also consider that at times the volume of data submitted to SRA can be so large as to overwhelm our processing pipelines, backup systems, or our outbound pipes to cloud providers. Suffice it to say, it can take some (often unpredictable) amount of time for submissions to enter a state where their locations are unchanging.

bartgrantham commented 2 years ago

Moreover, if we who deal with this problem daily have trouble getting it right, imagine how much more difficulty our users have with it.

I've seen researchers lose months of precious time because they misunderstood the mutation SRA performs on datasets and how to use sra-tools to work with it, wasting bandwidth, storage, and compute in the process. It's heartbreaking to witness.

Why this is framed as an ETL problem (as described in https://github.com/ncbi/sra-tools/wiki/Read-Names), as opposed to simply archiving bits isn't clear. What processing is required (other than what's required for the SRA storage scheme)? That the data isn't served exactly as provided conflicts with the "principle of least surprise", and this has proved to be a much larger burden to the community than I think NCBI appreciates.

The SRA's position appears to be: because fastq is such a loose format with unpredictable formatting in the sequence id, for online access you choose to only store sequence and quality data and throw away everything else. If this is an accurate summation I think this should suffice as an explanation. Subtly implying that people are somehow doing it wrong by storing information in the sequence id doesn't contribute to people being able to use your service.

To state the obvious, the choice of SRA to alter data makes it impossible for researchers to exactly reproduce the output of an analysis, especially in the case where critical information such as barcodes are embedded in the sequence id (such as 10X data). Providing clear direction for how a user can request original files (from Cloud Data Delivery Services?) is important. I don't know a single researcher who would prefer immediate SRA data over waiting for original archived fastq's.

As an aside, I've been biting my tongue on this thread for almost 4 years. The discussion above around how people use the fastq format comes across as patronizing, and even vaguely insulting. People expect an archive to manage data in a transparent way and SRA does not do this. Perhaps SRA is misnamed, or the mandate to handle other responsibilities besides simple storage complicates matters. Regardless, for researchers struggling to work with the SRA's limitations it comes across as tone deaf.

I mentioned above that I don't think NCBI appreciates the burden SRA creates for the community. Undoubtedly SRA has enabled countless projects and collaborations that would have been impossible otherwise. But when people have trouble and come across a thread like this they tend to not engage and simply find another solution. There's a "you're doing it wrong" tone in the explanations above that tends to shut conversations down, rather than serve as an explanation of a difficult technical choice and how it can be worked around.