Closed lkwhite closed 7 months ago
Hi @lkwhite,
We generally discourage users from emitting FASTQ, and ask people to output as SAM/BAM because FASTQ has some fairly major downsides:
I understand this can be a nuisance however if you are dealing with other people's data. In your view what is the reason that the community uses FASTQ instead of BAM?
Hi Mike,
I understand where you're coming from, but I'd say expecting the bioinformatics community to make a clean switch from FASTQ as the default format for unaligned data vs. BAM files is a major lift. There are whole decades of tools built around FASTQ as the default standard for unaligned data. Tools for QC, preprocessing, and aligning data almost universally accept FASTQs as input, while only some accept uBAMs. I've showed multiple biologists how to rebasecall their data in the past 6 months, and all of them were expecting a FASTQ as output and would chose it if they knew the option was available. In people's heads it's just going to be the starting point for data analysis.
Separately, there's no clear community standard yet for what constitutes "raw" data when uploading to GEO or SRA (even just upload fastqs and fast5s was nonintuitive), but the default RNA-seq templates in those databases currently expect each sample has a compressed fastqs containing "raw" data (this is very Illumina centric, where raw means sequence info), and then some other "processed" data files that may be aligned bams or .txt files of gene expression counts, etc.
But again, the headers from live basecalling would work fine, this is a perfectly acceptable output, and the reproducibility aspect merits the file size increase: `PAU71608_pass_dffec1fe_d2ea83c4_0.fastq @07f04698-5b0c-422f-a8ae-04752b107827 runid=d2ea83c41e3698331a233f7467ce8cb6db25f519 read=15 ch=2185 start_time=2024-02-01T12:32:05.290568-07:00 flow_cell_id=PAU71608 protocol_group_id=20240201_BestProject_004standards sample_id=BestRNA parent_read_id=07f04698-5b0c-422f-a8ae-04752b107827 basecall_model_version_id=rna004_130bps_hac@v3.0.1 CAGUAGUUGGUGGAGCGUUUGCUAAGGUAUGGGGUAUGGUAGUCAAUUCCGAAAUAUUCAUUCUUUAUAUUUUAUUAUUAUUAUCAUUCAUUUCAUUUUUACAUUCCC +
To echo what @lkwhite has said, although BAM may be a better standard format, this is not something universally accepted by the bioinformatics community and it will take a while to convince people that it is indeed better (and for the tools to be updated for use with BAM files). History has many examples of standards that seemed better at the time (e.g. Betamax), but were not adopted by the community for any number of reasons. Importantly, there needs to be a coherent ONT-wide approach and the EPI2ME workflows currently published by ONT do not support BAM files as the input.
There probably needs to be a strategic conversation at ONT to decide how best to deal with this, but there should not be a fragmented approach where some developers are deciding to implement a standard that is not supported by other data processing tools published by the same organisation. It is probably also best to consult with the community what the best approach would be as my assumption is that ONT does not have the bandwidth to develop all the tools currently available from scratch and there will be no buy in from open source developers if they are not a party to these conversations.
To answer @vellamike's question "In your view what is the reason that the community uses FASTQ instead of BAM?". Pretty much every single data processing tool available uses fastq as the input...
Perhaps a middle ground, is that when --emit-fastq
is used, the bam read groups are just added into the header so that something like minimap2 -y
can read them into the resulting bam, much like using samtools fastq -T '*'
would pull them into the fastq. Your default is still bam, and you give users a choice while maintaining consistency across formats.
I agree with the points made by others. Tools, data repositories, and people, expect fastqs when dealing with sequencing data. That said, the subject of the issue isn't whether ONT should output BAMs or fastqs. Rather, when fastqs are made, and they will be because I'm sure the amount of people who never make fastqs from their ONT data is very small, it'd be best if they contained all the header info from live basecalling.
there's no clear community standard yet for what constitutes "raw" data
This is another point of concern. Surprisingly, at least from my perspective, I have to convince people to keep their pod5 files and not just their fastqs. Otherwise, I find that most people (non-bioinformaticians) just take the fastqs out of the fastq_pass
folder and dump the rest of the data output :fearful:. That being the case, they'll also fail to describe how they went from pod5 to fastq, so it's better from a reproducibility perspective to have that info in the fastq.
Besides, I thought the discussion of having the model in the fastq header was solved a few years back when I asked Clive about it, and he said ont would do it, and then it was done and included. Then Dorado came out and it was taken away again.
Please put it back in.
Thank you all for the valuable feedback. We agree that the Dorado fastq header should match MinKNOW and include the basecalling model information. We will include this change in the next major release. Thanks again for your input on this!
This feature has been added with dorado v0.6.0. The fastq records now contain an RG
tag which has the model name.
I see this issue was closed but want to echo @jfnjdoh's request that
dorado basecaller --emit-fastq
output all of the same header info that would be produced by live basecalling during a sequencing run. At a minimum, this output should always include the basecalling model used.While I understand there is a workaround to emit this info using samtools described in the issue above, not having this be the default behavior presents a major issue for research reproducibility and publicly archived sequence data (which will always include FASTQs, but may only sometimes include pod5s). If I download someone else's ONT fastq data from GEO or ENA, it should always include which model was used to generate those basecalls.