nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/
Other
504 stars 60 forks source link

fastq output? #199

Closed JWDebler closed 1 year ago

JWDebler commented 1 year ago

I'm a little confused, I am trying to do duplex calling and was hoping to get something like a dulex.fastq and a simplex.fastq out of it which I can then use for assembly or mapping. Do I have to extract that from the generated bam file? And if so, how? Cheers

Kirk3gaard commented 1 year ago

Hi You can use samtools for the extraction of fastq from bam. e.g. samtools fastq input.bam > output.fastq

To filter duplex and simplex I think you can use the tag mentioned in the release notes (https://github.com/nanoporetech/dorado/releases/tag/v0.3.0). samtools fastq input.bam -f dx:i:1 > duplex.fastq samtools fastq input.bam -f dx:i:0 > simplex.fastq

Best regards Rasmus

tijyojwad commented 1 year ago

yes I think that's the best way to go. there's also the --emit-fastq option but that generates a single fastq files with both simplex and duplex reads, and the duplex read names have a ; in them... the thanks for the suggestion @Kirk3gaard !

JWDebler commented 1 year ago

@Kirk3gaard unfortunately that didn't work. Both commands output everything.

samtools fastq barcode01.duplex.bam  -f dx:i:1 > barcode01.duplex.fastq
[M::bam2fq_mainloop] discarded 0 singletons
[M::bam2fq_mainloop] processed 605666 reads

samtools fastq barcode01.duplex.bam  -f dx:i:0 > barcode01.simplex.fastq
[M::bam2fq_mainloop] discarded 0 singletons
[M::bam2fq_mainloop] processed 605666 reads

This is with the latest samtools 1.17

I had a look at the bam file and the dx tag is certainly in there (and there are dx:i:0 and dx:i:1 tags present):

samtools head -n 1 barcode01.duplex.bam 
@HD VN:1.6  SO:unknown
@PG ID:basecaller   PN:dorado   VN:0.3.0+88df11b+dirty  CL:dorado duplex /opt/dorado/bin/dna_r10.4.1_e8.2_400bps_sup@v4.1.0 pod5/barcode01
eafa64e2-692b-4105-836b-7c62f9338768    4   *   0   0   *   *   0   0   ATGTTACACCTAAACAAGTTACGTATTGCTAAGGTTAACACAAAGACACCGACAACTTTCTTCAGCACCTATCTTCCATCGTTGCCATCAAAATCTCCACTGGGGGGTACAACGACCGGCGGGTCAGGTAGCTGTATCGTCGTCGCATCGGCACGCTTTCGAAGATGCATCTTCAGGAATCTGACATAACTAGCAGAAGGGTGTTTTTGATGGACGAGCGAGTAATAACGAAAGTCAAGGGCAAGGAAGGGCGTAGACGGGAGGAGGACTAGCCCGTGTCGACAAGTCGGCTGATCTTAGCAGCAAGCACGCACAACAAAGGATCCAATGTAAGCTTAAGCAGGTGGATGGGACAGGCTGCAGCAGGGCTGCGGCCTCGCTATGCGTCAGCGTCATAACGTGGGCGGAACCTTCTTCGCTATCCAGAGTGAAGATAGGGCCAGCCAGGTGCTGGAGGGCTTCCCAGCCCAATTGCTTTCTTTCAGCATGCTCTGCGCCACGAGCGGGGTATACCTGTCGCTTGCCAGCACGACAGCTTCGTGCGTACCAACCGTCAGGATCTCTAGAAACGACTGGTCCATGTTAACCGGTGAGGGAACGCAGAGTTTGTGGCCAGTTCCAGGGTGGCTATTGGAGGAGAAACATGTCAGGGAGGACCTGCAGAATCTGGCGCGTTGTGTCCAGGCCAGTCTGAATAGATGTGTGCCAGGGCTGGGGAGGGTTCAGCGATAGAGAGGTGGCTGAGGAAGGAGGGTCGGAGAAGGCGGGCCGGAGAGCGTTCAGCTAGACCCTGAAATGGACTAGTGGCCGCCTCAATCAGCAAGTGCAAGGTCAACGGCTCGTGCTGACAGGGCCCCCGATTGCACGTACGCCTGAGCTATTCGCTGCTTACCCTTCGAGCCGCACAAGCTCAACTCAAACGTATCACCTTTGCTGGTGCTGAAGAAAGTTGTCGGTGTCTTTGTGTTAACCTTAGCAATTGGTG   ##$&&%$$##$$%%&((&&)&'+,,11114::=???>A@@@ABA???@@>;;;;<>>?>=?DBB?>=<;::::;=<;==<;;;;:<<<<@ABFGHD?>>>?A?;885446:==<;<::;;88999<?@>>==<==>?555223338::<=<;;<::<;:1777:>@??<=<<:.--..<==A@?>>????@AAA@??A=:11111210''''33356666999:==>@=<<<<>A@?@<;>;9<;;;=<<?;89:9:=4444=>??=<=;<;:::9:;<=<<<<>?@><;;:;=>>@B>?BBCCCB@@@?<<<;<=>?ED@><<:;<<<=<<<<<=@>>==>??@?>;;=>====>?>??@?>??=<<<;;>=<=??=:7777898889<>==<=>>>@=<:9856:;;;;;>=???==>=>>@BAB@@@@:::::98878:<=AAA@???>>>>=<<<<>@?@<<?>>>>>??<<<=<@@@:CCB@?@>>?>??<<<<;<<;:;::;:<<==>>???=<<==?>====>><<554423333966212,+(())101238898:;==????>>>?>=<<<==?@@AAA>>=>??=;;=<<>>==>>???7.../1///4433=========>?>>=<=>>?AA@@?>,++++65555587899:=>>CB@???<<===<;:<;<=========>@DA<<<<>@B@?>;:;;9=/'''()(((),,,+/11,,,,,8<?====;96730-,++*))(((((*334557776766678979:966777::999>=>BBBFCA@>====<<;;;::<::<<<BD::::;AAA??>?>>=<<;<<<<===>===<;;:;;<=<<>?AA=<<<;=<====@>?=>>>>??>?===<;<;<>=888779989;>=<<<=?@BB@CBB>>>>>/////*(()*+++2;9888:;AB@@?><;:::>=>>?>B843../6:8411263,((('   qs:i:16 du:f:2.62675    ns:i:10507  ts:i:10 mx:i:4  ch:i:1  st:Z:2023-05-10T06:35:08.411+00:00  rn:i:336    fn:Z:FAW43432_pass_barcode01_b53759ce_273d0cef_0.pod5   sm:f:91.0072    sd:f:25.619 sv:Z:quantile   dx:i:0

Any idea? Cheers

Kirk3gaard commented 1 year ago

Might be that the - f should be replaced with - d http://www.htslib.org/doc/samtools-view.html

JWDebler commented 1 year ago

Unfortunately, that is not an option available in the fastq subcommand fastq: invalid option -- 'd'

I also tried samtools view barcode01.duplex.bam -d dx:i:1 > barcode01.duplex.duplex.bam as well as with dx:i:0, but both gave empty files as output.

Can this please just become an option of dorado itself?

tijyojwad commented 1 year ago

Hi @JWDebler - Can you try the following?

samtools view -h -d dx:1 duplex.bam | samtools fastq > duplex.fastq
samtools view -h -d dx:0 duplex.bam | samtools fastq > simplex.fastq
JWDebler commented 1 year ago

OK, @Kirk3gaard was almost correct, it was just the syntax of how to supply the tag that was slightly off. So this works:

samtools view barcode01.duplex.bam -d dx:1 > barcode01.duplex_only.bam

The additional :i was the problem.

And taking it all the way to where I wanted it:

samtools view barcode01.duplex.bam -d dx:1 | samtools fastq | gzip -9 > barcode01.duplex.fastq.gz samtools view barcode01.duplex.bam -d dx:0 | samtools fastq | gzip -9 > barcode01.simplex.fastq.gz

The help of samtools fastq says if you name your output fastq.gz it automatically compresses it, but that isn't working for me, or I was misunderstanding what that meant, therefore the pipe through gzip.

husamia commented 1 year ago

non of the commands above worked for me so I did this and it worked!

to count duplex reads samtools view -@ 5 duplex.bam|grep -i 'dx:i:1'|wc -l 257180 to count non-duplex reads samtools view -@ 5 duplex.bam|grep -i 'dx:i:0'|wc -l 4795637

dividing the duplex/(duplex+non-duplex) my case 5%

Kirk3gaard commented 1 year ago

Seems like simply doing samtools view and include the "-Ofastq" might be able to do the trick?(https://github.com/samtools/samtools/issues/1854)

JWDebler commented 1 year ago

Yep, can confirm, samtools view -Ofastq -d dx:0 barcode01.duplex.bam | gzip -9 > barcode01.simplex.fastq.gz works.

is01 commented 10 months ago

I had a little different but probably related problem, but it was resolved. I tried to convert a aligned modbam to fastq.

my steps:

dorado basecaller -r $model $pod5 --modified-bases 5mC --reference $ref > mod.bam
samtools sort -@16 -o mod_sorted.bam mod.bam
samtools view -h mod_sorted.bam | head -n 1000 | samtools view -bS - > test.bam

I extracted 1,000 rows(970 records) from Dorado aligned modbam and created a smaller bam file for testing. samtools flagstat test.bam gives this output:

970 + 0 in total (QC-passed reads + QC-failed reads)
108 + 0 primary
815 + 0 secondary
47 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
970 + 0 mapped (100.00% : N/A)
108 + 0 primary mapped (100.00% : N/A)

I tried fastq conversion in two ways:

  1. samtools view -Ofastq

    samtools view -Ofastq test.bam > smtls_view.fastq
    wc -l smtls_view.fastq
    620 #  ->155 alignments(=108 primary and 47 supplementary alignments) (no secondary)
  2. samtools fastq

    samtools fastq -T '*' test.bam  > smtls_fastq.fastq
    wc -l smtls_fastq.fastq
    432 #  -> 108 primary alignments (no secondary/supplementary)

Calling default samtools view -Ofastq gives the extra 47 supplementary reads, which are shorter than the primary reads with the same readID. Minimap2 will hard-clip the unaligned bases for supplementary alignments and the SEQ column in the bam file does not include that sequence. So it would be difficult to get back to the original full-length sequences.

There is no need to keep supplementary reads in fastq, so the default samtools fastq or samtools view -F2048 -Ofastq will do.

samtools view -F2048 -Ofastq test.bam > smtls_view_nosupp.fastq
wc -l smtls_view_nosupp.fastq
432 #  -> 108 primary alignments (no secondary/supplementary)

In any case, It would be great if Dorado could output both raw fastq and aligned bam files in one run, like in Guppy.

Thank you in advance, Erina