Closed JWDebler closed 1 year ago
Hi You can use samtools for the extraction of fastq from bam. e.g. samtools fastq input.bam > output.fastq
To filter duplex and simplex I think you can use the tag mentioned in the release notes (https://github.com/nanoporetech/dorado/releases/tag/v0.3.0). samtools fastq input.bam -f dx:i:1 > duplex.fastq samtools fastq input.bam -f dx:i:0 > simplex.fastq
Best regards Rasmus
yes I think that's the best way to go. there's also the --emit-fastq
option but that generates a single fastq files with both simplex and duplex reads, and the duplex read names have a ;
in them... the thanks for the suggestion @Kirk3gaard !
@Kirk3gaard unfortunately that didn't work. Both commands output everything.
samtools fastq barcode01.duplex.bam -f dx:i:1 > barcode01.duplex.fastq
[M::bam2fq_mainloop] discarded 0 singletons
[M::bam2fq_mainloop] processed 605666 reads
samtools fastq barcode01.duplex.bam -f dx:i:0 > barcode01.simplex.fastq
[M::bam2fq_mainloop] discarded 0 singletons
[M::bam2fq_mainloop] processed 605666 reads
This is with the latest samtools 1.17
I had a look at the bam file and the dx tag is certainly in there (and there are dx:i:0 and dx:i:1 tags present):
samtools head -n 1 barcode01.duplex.bam
@HD VN:1.6 SO:unknown
@PG ID:basecaller PN:dorado VN:0.3.0+88df11b+dirty CL:dorado duplex /opt/dorado/bin/dna_r10.4.1_e8.2_400bps_sup@v4.1.0 pod5/barcode01
eafa64e2-692b-4105-836b-7c62f9338768 4 * 0 0 * * 0 0 ATGTTACACCTAAACAAGTTACGTATTGCTAAGGTTAACACAAAGACACCGACAACTTTCTTCAGCACCTATCTTCCATCGTTGCCATCAAAATCTCCACTGGGGGGTACAACGACCGGCGGGTCAGGTAGCTGTATCGTCGTCGCATCGGCACGCTTTCGAAGATGCATCTTCAGGAATCTGACATAACTAGCAGAAGGGTGTTTTTGATGGACGAGCGAGTAATAACGAAAGTCAAGGGCAAGGAAGGGCGTAGACGGGAGGAGGACTAGCCCGTGTCGACAAGTCGGCTGATCTTAGCAGCAAGCACGCACAACAAAGGATCCAATGTAAGCTTAAGCAGGTGGATGGGACAGGCTGCAGCAGGGCTGCGGCCTCGCTATGCGTCAGCGTCATAACGTGGGCGGAACCTTCTTCGCTATCCAGAGTGAAGATAGGGCCAGCCAGGTGCTGGAGGGCTTCCCAGCCCAATTGCTTTCTTTCAGCATGCTCTGCGCCACGAGCGGGGTATACCTGTCGCTTGCCAGCACGACAGCTTCGTGCGTACCAACCGTCAGGATCTCTAGAAACGACTGGTCCATGTTAACCGGTGAGGGAACGCAGAGTTTGTGGCCAGTTCCAGGGTGGCTATTGGAGGAGAAACATGTCAGGGAGGACCTGCAGAATCTGGCGCGTTGTGTCCAGGCCAGTCTGAATAGATGTGTGCCAGGGCTGGGGAGGGTTCAGCGATAGAGAGGTGGCTGAGGAAGGAGGGTCGGAGAAGGCGGGCCGGAGAGCGTTCAGCTAGACCCTGAAATGGACTAGTGGCCGCCTCAATCAGCAAGTGCAAGGTCAACGGCTCGTGCTGACAGGGCCCCCGATTGCACGTACGCCTGAGCTATTCGCTGCTTACCCTTCGAGCCGCACAAGCTCAACTCAAACGTATCACCTTTGCTGGTGCTGAAGAAAGTTGTCGGTGTCTTTGTGTTAACCTTAGCAATTGGTG ##$&&%$$##$$%%&((&&)&'+,,11114::=???>A@@@ABA???@@>;;;;<>>?>=?DBB?>=<;::::;=<;==<;;;;:<<<<@ABFGHD?>>>?A?;885446:==<;<::;;88999<?@>>==<==>?555223338::<=<;;<::<;:1777:>@??<=<<:.--..<==A@?>>????@AAA@??A=:11111210''''33356666999:==>@=<<<<>A@?@<;>;9<;;;=<<?;89:9:=4444=>??=<=;<;:::9:;<=<<<<>?@><;;:;=>>@B>?BBCCCB@@@?<<<;<=>?ED@><<:;<<<=<<<<<=@>>==>??@?>;;=>====>?>??@?>??=<<<;;>=<=??=:7777898889<>==<=>>>@=<:9856:;;;;;>=???==>=>>@BAB@@@@:::::98878:<=AAA@???>>>>=<<<<>@?@<<?>>>>>??<<<=<@@@:CCB@?@>>?>??<<<<;<<;:;::;:<<==>>???=<<==?>====>><<554423333966212,+(())101238898:;==????>>>?>=<<<==?@@AAA>>=>??=;;=<<>>==>>???7.../1///4433=========>?>>=<=>>?AA@@?>,++++65555587899:=>>CB@???<<===<;:<;<=========>@DA<<<<>@B@?>;:;;9=/'''()(((),,,+/11,,,,,8<?====;96730-,++*))(((((*334557776766678979:966777::999>=>BBBFCA@>====<<;;;::<::<<<BD::::;AAA??>?>>=<<;<<<<===>===<;;:;;<=<<>?AA=<<<;=<====@>?=>>>>??>?===<;<;<>=888779989;>=<<<=?@BB@CBB>>>>>/////*(()*+++2;9888:;AB@@?><;:::>=>>?>B843../6:8411263,(((' qs:i:16 du:f:2.62675 ns:i:10507 ts:i:10 mx:i:4 ch:i:1 st:Z:2023-05-10T06:35:08.411+00:00 rn:i:336 fn:Z:FAW43432_pass_barcode01_b53759ce_273d0cef_0.pod5 sm:f:91.0072 sd:f:25.619 sv:Z:quantile dx:i:0
Any idea? Cheers
Might be that the - f should be replaced with - d http://www.htslib.org/doc/samtools-view.html
Unfortunately, that is not an option available in the fastq
subcommand
fastq: invalid option -- 'd'
I also tried samtools view barcode01.duplex.bam -d dx:i:1 > barcode01.duplex.duplex.bam
as well as with dx:i:0
, but both gave empty files as output.
Can this please just become an option of dorado itself?
Hi @JWDebler - Can you try the following?
samtools view -h -d dx:1 duplex.bam | samtools fastq > duplex.fastq
samtools view -h -d dx:0 duplex.bam | samtools fastq > simplex.fastq
OK, @Kirk3gaard was almost correct, it was just the syntax of how to supply the tag that was slightly off. So this works:
samtools view barcode01.duplex.bam -d dx:1 > barcode01.duplex_only.bam
The additional :i
was the problem.
And taking it all the way to where I wanted it:
samtools view barcode01.duplex.bam -d dx:1 | samtools fastq | gzip -9 > barcode01.duplex.fastq.gz
samtools view barcode01.duplex.bam -d dx:0 | samtools fastq | gzip -9 > barcode01.simplex.fastq.gz
The help of samtools fastq
says if you name your output fastq.gz
it automatically compresses it, but that isn't working for me, or I was misunderstanding what that meant, therefore the pipe through gzip.
non of the commands above worked for me so I did this and it worked!
to count duplex reads samtools view -@ 5 duplex.bam|grep -i 'dx:i:1'|wc -l 257180 to count non-duplex reads samtools view -@ 5 duplex.bam|grep -i 'dx:i:0'|wc -l 4795637
dividing the duplex/(duplex+non-duplex) my case 5%
Seems like simply doing samtools view and include the "-Ofastq" might be able to do the trick?(https://github.com/samtools/samtools/issues/1854)
Yep, can confirm,
samtools view -Ofastq -d dx:0 barcode01.duplex.bam | gzip -9 > barcode01.simplex.fastq.gz
works.
I had a little different but probably related problem, but it was resolved. I tried to convert a aligned modbam to fastq.
my steps:
dorado basecaller -r $model $pod5 --modified-bases 5mC --reference $ref > mod.bam
samtools sort -@16 -o mod_sorted.bam mod.bam
samtools view -h mod_sorted.bam | head -n 1000 | samtools view -bS - > test.bam
I extracted 1,000 rows(970 records) from Dorado aligned modbam and created a smaller bam file for testing. samtools flagstat test.bam gives this output:
970 + 0 in total (QC-passed reads + QC-failed reads)
108 + 0 primary
815 + 0 secondary
47 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
970 + 0 mapped (100.00% : N/A)
108 + 0 primary mapped (100.00% : N/A)
I tried fastq conversion in two ways:
samtools view -Ofastq
samtools view -Ofastq test.bam > smtls_view.fastq
wc -l smtls_view.fastq
620 # ->155 alignments(=108 primary and 47 supplementary alignments) (no secondary)
samtools fastq
samtools fastq -T '*' test.bam > smtls_fastq.fastq
wc -l smtls_fastq.fastq
432 # -> 108 primary alignments (no secondary/supplementary)
Calling default samtools view -Ofastq
gives the extra 47 supplementary reads, which are shorter than the primary reads with the same readID.
Minimap2 will hard-clip the unaligned bases for supplementary alignments and the SEQ column in the bam file does not include that sequence. So it would be difficult to get back to the original full-length sequences.
There is no need to keep supplementary reads in fastq, so the default samtools fastq
or samtools view -F2048 -Ofastq
will do.
samtools view -F2048 -Ofastq test.bam > smtls_view_nosupp.fastq
wc -l smtls_view_nosupp.fastq
432 # -> 108 primary alignments (no secondary/supplementary)
In any case, It would be great if Dorado could output both raw fastq and aligned bam files in one run, like in Guppy.
Thank you in advance, Erina
I'm a little confused, I am trying to do duplex calling and was hoping to get something like a
dulex.fastq
and asimplex.fastq
out of it which I can then use for assembly or mapping. Do I have to extract that from the generated bam file? And if so, how? Cheers