fastq or bam format - Githubissues

nanoporetech / dorado

Oxford Nanopore's Basecaller

https://nanoporetech.com/

fastq or bam format #363

Closed Meer9234 closed 11 months ago

Meer9234 commented 12 months ago

Hello,

What is the best format to use when rebasecalling? I read that .BAM files are sequence alignment maps whilst fastq usually comes from sequencing instruments. So if I just basecall and not align my sequences should I not use fastq format? Or is .BAM still fine? What is the difference and why is .BAM preferred (as it is the default output format)?

Just for my foundational understanding of bioinformatics. Kind regards.

tijyojwad commented 12 months ago

The default output of dorado is an unaligned BAM because dorado puts a lot of useful information about each read in the BAM tags (e.g. qs tag stores the mean qscore of the read). This BAM can then be used to generate a summary of the whole dataset using the dorado summary command. And as you mentioned, if alignment is enabled in dorado, then the BAM contains alignment information too.

However, if you're not interested in any of those, then passing --emit-fastq to dorado will generate a fastq output without any of the tags and alignment information.

If you're working with unaligned reads and don't care about tags, then fastq sounds sufficient for your use case.
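
A minimal sketch of the two output modes described above, wrapped in a small shell function. The function name `basecall`, the model name, and the file and directory names are placeholders chosen for illustration and do not come from this thread; only `dorado basecaller`, `--emit-fastq`, and `dorado summary` are mentioned above.

```shell
# basecall FORMAT MODEL INPUT
# Hypothetical wrapper: writes reads to stdout either as unaligned BAM
# (dorado's default, tags included) or as FASTQ (no tags, no alignment info).
basecall() {
    case "$1" in
        bam)   dorado basecaller "$2" "$3" ;;
        fastq) dorado basecaller "$2" "$3" --emit-fastq ;;
        *)     echo "unknown format: $1 (expected bam or fastq)" >&2; return 2 ;;
    esac
}

# Example usage (model name and paths are assumptions; adjust to your data):
#   basecall bam   dna_r10.4.1_e8.2_400bps_sup@v4.2.0 pod5/ > calls.bam
#   dorado summary calls.bam > summary.tsv     # summary of the dataset from the BAM
#   basecall fastq dna_r10.4.1_e8.2_400bps_sup@v4.2.0 pod5/ > calls.fastq
```

Keeping the BAM as the primary output and deriving FASTQ later is probably the safer choice, since the tags cannot be recovered from a FASTQ written with --emit-fastq.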

Meer9234 commented 12 months ago

Thank you! That clarified things :)

Meer9234 commented 12 months ago

I will close it

Meer9234 commented 12 months ago

Actually I have another question:

I did duplex basecalling on a v10.4.1 flowcell with the ligation kit v14 and around 20% of my reads got rebasecalled in Duplex, with better quality. So if I understand correctly, the two simplex strands and the duplex strand are now all present in the .bam file. Will this matter in the assembly? Or will this lead to a bias in the assembly? And is flye equipped to deal with this?

Thanks! Kind regards

tijyojwad commented 12 months ago

The duplex reads will cover the same genomic region as their corresponding simplex parents, so using both will increase coverage for the areas that have a duplex read. I don't think it will hurt the assembly, but you can use the dx:-1 tag to filter out simplex reads that have duplex offspring (details here)

sklages commented 12 months ago

You may lose some longer reads when filtering out dx:-1. If you are just interested in genome reconstruction, then I'd go for the complete dataset; if you do some SNP calling or something similar, I'd probably go for the dx:-1-filtered data.

The best approach would be to run two assemblies, compare the results, and then decide which dataset to use.
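
One possible way to build the dx:-1-filtered dataset, as a sketch only: it assumes the tag appears as `dx:i:-1` in SAM text (as printed by `samtools view`), and the function name `drop_simplex_parents` and the file names are made up for illustration.

```shell
# Drop simplex reads that have duplex offspring (dx:i:-1) from a SAM text stream.
# Header lines (starting with "@") pass through unchanged; optional tags start
# at column 12 in SAM, so only those fields are checked.
drop_simplex_parents() {
    awk -F'\t' '
        /^@/ { print; next }
        {
            for (i = 12; i <= NF; i++)
                if ($i == "dx:i:-1") next    # skip this read entirely
            print
        }'
}

# Assumed usage (file names are placeholders):
#   samtools view -h calls.bam | drop_simplex_parents | samtools view -b -o filtered.bam -
```

Running the assembler once on calls.bam and once on filtered.bam would then give the two assemblies to compare.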

Meer9234 commented 11 months ago

Ah thanks! Very clear!

Lucas-Servi commented 9 months ago

Great info here. Where can I find more detailed information about Dorado, such as the "--emit-fastq" parameter?

Great software.

Thanks!

tijyojwad commented 9 months ago

Hi @Lucas-Servi - the help text for dorado will have detailed information about options.

dorado -h
dorado basecaller -h

cement-head commented 9 months ago

@Lucas-Servi After you've demultiplexed, you can convert the BAM files output by dorado into FASTQ files using samtools:

$ samtools bam2fq SAMPLE.bam > SAMPLE.fastq

Program: samtools (Tools for alignments in the SAM format) Version: 1.7 (using htslib 1.7-2)

sklages commented 9 months ago

It's better to carry over any tags that might have been created, e.g. when running modified-base calling (requires samtools >= 1.16 / htslib 1.16):

samtools fastq \
    -T "*" \
    SAMPLE.bam \
> SAMPLE.fastq
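
With the tags carried over, each FASTQ header line should hold the read name followed by tab-separated SAM-style tags, so values such as the mean qscore in the qs tag stay accessible. A rough sketch for pulling them out, assuming four-line FASTQ records and a tag written as `qs:i:` or `qs:f:` (the type letter may depend on the dorado version); the function name `fastq_qscores` is hypothetical.

```shell
# Print "read_id<TAB>mean_qscore" for each FASTQ record whose header carries a qs tag.
# Reads the files given as arguments, or stdin when none are given.
fastq_qscores() {
    awk -F'\t' '
        NR % 4 == 1 {                        # header line of each 4-line record
            id = substr($1, 2)               # strip the leading "@"
            for (i = 2; i <= NF; i++)
                if ($i ~ /^qs:[if]:/) {
                    split($i, parts, ":")    # parts = tag name, type, value
                    print id "\t" parts[3]
                }
        }' "$@"
}

# Assumed usage:
#   fastq_qscores SAMPLE.fastq > SAMPLE.qscores.tsv
```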