hasindu2008 / slow5tools

Slow5tools is a toolkit for converting (FAST5 <-> SLOW5), compressing, viewing, indexing and manipulating data in SLOW5 format.
https://hasindu2008.github.io/slow5tools
MIT License

demux pod5s, correct workflow? #102

Closed jorisbalc closed 8 months ago

jorisbalc commented 8 months ago

Greetings,

I'm trying to demux a pod5 with skipped reads. I merged the pod5, basecalled it using dorado, and demuxed the BAM into FASTQ files. I then convert the merged pod5 into a blow5 with blue-crab:

blue-crab p2s can_pod5_merge.pod5 -o can_pod5_merge.blow5

I extract the read ids from the barcode02.fastq:

awk 'NR % 4 == 1 {sub("^@", "", $0); print}' barcode02.fastq > read_ids.list

Afterwards, I try to extract from the blow5 and get the following:

$ slow5tools get can_pod5_merge.blow5 --list read_ids.list -o barcode02.blow5
[slow5_idx_init::INFO] Index file not found. Creating an index at 'can_pod5_merge.blow5.idx'.
[slow5_idx_get::ERROR] Read ID '41b3832b-a976-4421-96ab-f425dd8ae044' was not found. At src/slow5_idx.c:522
[slow5_get::ERROR] Exiting on error. At src/slow5.c:2524

It manages to extract some reads but stops at this particular read. I'm wondering what could cause a read to get lost. Any help is appreciated!

EDIT: it seems that with every batch of 4096 reads, it extracts fewer and fewer reads. Is there any way to avoid this? Why are the reads missing?

hasindu2008 commented 8 months ago

Hi there, can you please mention which version of Dorado you used and the parameters? I am wondering if this is due to read splitting. To identify where the problem is, could you run the following and tell me the counts?

Could you grab the list of read IDs in the can_pod5_merge.blow5 by using:

slow5tools skim --rid can_pod5_merge.blow5 > blow5.rid.list

Check the count:

wc -l blow5.rid.list

Then get the list of readIDs from the Dorado BAM file as:

samtools view dorado.bam | cut -f1 > bam.rid.list
wc -l  bam.rid.list

Can you tell me the counts you get?

Then we can do the following:

grep -F -f bam.rid.list blow5.rid.list > both_in_bam_and_blow5.list
grep -v -F -f bam.rid.list blow5.rid.list > in_blow5_not_in_bam.list
grep -v -F -f blow5.rid.list bam.rid.list > in_bam_not_inblow5.list

Get the read counts using wc -l for the above lists.
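As a cross-check on the grep commands above, the same three lists can be built with sort and comm, which compare whole lines exactly rather than matching substrings. This is only a sketch: the function name compare_rid_lists and the output directory argument are made up for illustration, and because sort -u drops duplicate IDs, the counts can differ from the grep-based ones if a list contains repeats.

```shell
# Exact, whole-line comparison of two read ID lists using sort and comm.
# compare_rid_lists is a hypothetical helper name, not part of slow5tools.
compare_rid_lists() {
    # $1: read IDs from the BAM, $2: read IDs from the BLOW5, $3: output directory
    mkdir -p "$3"
    # comm needs sorted input; -u also removes duplicate IDs
    LC_ALL=C sort -u "$1" > "$3/bam.sorted"
    LC_ALL=C sort -u "$2" > "$3/blow5.sorted"
    # -12 keeps lines present in both files
    LC_ALL=C comm -12 "$3/bam.sorted" "$3/blow5.sorted" > "$3/both_in_bam_and_blow5.list"
    # -13 keeps lines found only in the second file (the BLOW5 list)
    LC_ALL=C comm -13 "$3/bam.sorted" "$3/blow5.sorted" > "$3/in_blow5_not_in_bam.list"
    # -23 keeps lines found only in the first file (the BAM list)
    LC_ALL=C comm -23 "$3/bam.sorted" "$3/blow5.sorted" > "$3/in_bam_not_inblow5.list"
    wc -l "$3"/*.list
}

# Example call (file names as used in this thread):
# compare_rid_lists bam.rid.list blow5.rid.list rid_compare
```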

Also, could you post a couple of reads from the dorado BAM file as well as a couple of reads from the barcode02.fastq?

jorisbalc commented 8 months ago

Thanks for the reply,

Posting everything in order:

$ dorado -v
0.4.1+6c4c636

Basecalling:

$ dorado basecaller --kit-name SQK-NBD114-24 /home/v313/Dorado/models/dna_r10.4.1_e8.2_400bps_fast@v4.2.0 merge_all/ > calls.bam

Demux and grab barcode02 in this case:

$ dorado demux --output-dir fastq/ --no-classify --emit-fastq calls.bam

Count the reads:

$ wc -l blow5.rid.list 
569170 blow5.rid.list

Checking the bam reads:

$ samtools view calls.bam | cut -f1 > bam.rid.list
$ wc -l bam.rid.list
570778 bam.rid.list

Read comparison:

$ wc -l both_in_bam_and_blow5.list 
567562 both_in_bam_and_blow5.list

$ wc -l in_blow5_not_in_bam.list 
1608 in_blow5_not_in_bam.list

$ wc -l in_bam_not_inblow5.list 
3216 in_bam_not_inblow5.list
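A quick arithmetic check on the five counts above shows that they are internally consistent, and that the BAM-only count is exactly twice the BLOW5-only count. That ratio is what one might expect if each of the 1608 BLOW5-only reads appears in the BAM as two reads with new IDs, though the counts alone do not prove it.

```shell
# Counts copied from the wc -l output above
blow5_total=569170
bam_total=570778
both=567562
only_blow5=1608
only_bam=3216

# Each list should equal the shared reads plus the reads unique to that file
[ $((both + only_blow5)) -eq "$blow5_total" ] && echo "BLOW5 counts add up"
[ $((both + only_bam)) -eq "$bam_total" ] && echo "BAM counts add up"

# Number of BAM-only reads per BLOW5-only read
echo "ratio: $((only_bam / only_blow5)), remainder: $((only_bam % only_blow5))"
# prints "ratio: 2, remainder: 0"
```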

A few short reads from the .bam:

d9f22985-c0ba-4fa7-a380-28aa96f72e1d    4   *   0   0   *   *0  0   AAGCACATCGAGAATAATAGTCCCATCAAGATCGCCTGAGAAGTCACAAAGATCTCATCAATATACATTCGAACTGAGGTACATATGCGCCACTTACATGGTGAGAGCCTAGCTTCATTATGGCGGTGCTGATGGTTTCTGGCAGTGTTGCTAAGGGGGCGACTAGCTGCAGCGGAGACATTGCTGGCTGATGTAGTAACTCCTTTCCAGCACATGGCTGACTGTTGTGGGTGGGGCCGGTGGTCAGGGGCCAGGTGGCGACTGATCCAGGATCCTGGTGTGGTCCTGGCGGGTGCGTGGCCGGCGTTGTTCATGGCCGCCAACGGCTCCGCGGTCAGCTAGTGCCGGTGCCGCGAGCGTGCAGGTTCCGGCGATTCTTCATTGCAGCTCATAAGCAACTTCTTGCTGACAGTGACCTGCGTAACCTTGTGGCGGCCAGCGGGTGAGCACGTGGTGTGTGGTCCGGTGAGTGCGGTGCGTGTACCACAATGCCGTAGCC )**,*($$$))'(1(&%'&%&+022,*(%$24*'')'&$%'/,#"$**234..*-(''%''*'(&)&$$(,,204*'((&%$#%%-,(&'*(&','&##"'-&$%%'*((%##'0&%%&(+)+,*)*-,)*-0,%&03('%%%&&&&()&$%&&&((,+)*()+%$%(&&%//-,+,&$#%*,&'''(@>>=52)'(&)'&$&*...5/-%)'&+')-1%%%&%)'##"&()%&&)&&&$''''$#&('(%*++-+%#%('&(%&'%(()*')#$$(/3,+'''%&%%&,)'''%$#'%"%'%)+(%)(''$%)&)+2'%(*)(&%&0('$&().'$$',%&'',-+&&))+*)%&&+/0,-+,,)*'&&&''%%%%((%%$&-313,+'''###%%$&()%&&)&%*16&%%(($&,,)')'(''(,-/&&,),.,(%%'*./,%&&+*#$%#''($%$%&$$&%$)&&%&'##$%'')%%#%'./**-42-&%&''& BC:Z:SQK-NBD114-24_barcode02    qs:i:6  du:f:1.914  ns:i:9570   ts:i:934    mx:i:3  ch:i:361    st:Z:2023-06-15T11:48:02.001+00:00  rn:i:14190  fn:Z:can_pod5_merge.pod5    sm:f:105.928    sd:f:28.1222    sv:Z:quantile   dx:i:0  RG:Z:654d949156458929eb0d88decfead14747d14885_dna_r10.4.1_e8.2_400bps_fast@v4.2.0_SQK-NBD114-24_barcode02
12d8c732-2f64-4949-9f85-0a51b45afc59    4   *   0   0   *   *0  0   TTCTCAGGCACTTCGCCGAATGGGTTCGATTTCATCAGCGTATAAATAACACTGGCTTCCAATCCGAGTAACTTACCTGCATATAATGGATACCCAAGAGCACGCATAGCAGATCTATTCGTTCAACGTCTGCAGGCTGCAGCATTTCTTCGTACAGGCACCGCCAGGTCGGCAAAATGATCCCGTATCTGATGATACAGCTATTTGACACAATACGGATGAACTGCATCAATTTTCAGCAGTAGGCGTCGCATCTTAAGCCGGCGCAGTCGCTGGTGCACGGTGCGAAGTCGTGGATCGCAGTCTGGCGCCTGATCCTGGAGGTGCTGCAGGTGATTGCGCCTGCTTAACATGGTAGGAGAGGCATTACTTCATGAACTTCAAAATCATGTGCAGCGCT    ;2%$&'%&%+**1/00''%52*+59872101/*+,+.0-..-0$#%($%$#$#%$&-,(%#&')%#&',''$%%$'')'&#$%,,01'&&,%%%-+*+-+)'%$(+*(*+/2/--5//2,$$&&&&*&%#&$'$,*'+,'#%$%''1%%&)%%%&#'),(&%%().)&%"##&&%(&%%&(()+($%,$%$%'(''))&%&($%%+''''+&&(%*'(*+&$+((&(%&#$$%'%%%$$(&%(##%0((%%$###%(()*11)(,&&'+((''*'&&$#"$#&'++(&*$+(&&(''+'&'%%%&'',1,'.,(((1230*)./.*)'+)$$*(%$%'(,,-636333,-.*+%((('')%%$&$&&&%$$**'(%&**+.)'$+-&##$$%&+$+(&'&    BC:Z:SQK-NBD114-24_barcode02    qs:i:6  du:f:1.5268 ns:i:7634   ts:i:898    mx:i:1  ch:i:365    st:Z:2023-06-15T11:47:53.569+00:00  rn:i:17171  fn:Z:can_pod5_merge.pod5    sm:f:97.325 sd:f:25.2987    sv:Z:quantile   dx:i:0  RG:Z:654d949156458929eb0d88decfead14747d14885_dna_r10.4.1_e8.2_400bps_fast@v4.2.0_SQK-NBD114-24_barcode02
bb281324-0a75-4d6b-bb9c-293c09686042    4   *   0   0   *   *0  0   TGATTAAACAATTTGTCTTCGGCGGCGAATGTGAAACGCCGGTGCGTAAGGCAAAGCGCCGGTGATTACCAGCATACGCGGCGCTGCGTGGTTATGGCCACAAGAGTAAAAACGTAGGCAATTGGCGCATCATCCTAATGCGACGCTTGCACGTCTTATCGGCCTACAAAGAGTGCCGGACCCGTAGGCCGGATCACCGCGTTCACGCCGCATCCGGCAATAAGTGCTCCGATGCCTGATGCGACGCTTGCCGCGTCTTATCAGGCCTGCAAAATGTCCCAGGACCGCGGTAGGGCGGATCGCGTTCACGCCGCATCCGAGCAATAAGTTAATGAGCGCGACTATAACCTTGCCGGTGGTTTCGCCAGCACCGGAGTATCCGCCGCTTGTAGCGCTATAGCGACCGTACAGGCGGGCGACGAGCAGCATCGCGCGGCGAACGGGAAGGAGCGAGGGCGGCAGGCAGCGCCACGTAATTATATGCCAGCCCCAGCGTCAGGCGTCGTACCTCCGCAATATGCGCGAGCAGGCGCTACCGGAGGCGACGCCTGCCACCTCCAGGCACCTAATAACTCTCCGCCTTTGGCATTTTGTAGTGACAAATCTGTAGCGGTGCTGAAAGGGCAAATAATCCCGCGCAACGAATAAAGCGAAAATAAGCGACATTGTTCGTGCACACAGAAAGGCGTCAGTGCCAGTACAATTATAAAGTCAGTCACTGCTGCAATGCGCCGTGGTGAATAACGTCCTGAAATCCTGCCACTTAGCGTATTTTTCCCAGCACCATCCCTAGCCCAACTAACATCATAATAAAGGTCATCGCCGTTTCCGAAAAACCGGAAATAAACGTCATGTATGGCTTTACGTGGCTGAACCGGCAAACACACTGCATTGCCAAACATCGTGGCGGCGAAAATTAACCACGGGGCCGGGCTACGCAAAAAGTGAAATTGTTCGCCGCGCAGATTTCCTTCGCCTCGTCGCGAATTGGCACCAAAAATGACCGTGCCATCACCGCAATATTAAAAACAGCGATCGATAAAAAGGTGTAACGCAGCTTGATTCCTGACTTAAATACATTTAAGCGGAATGCCCAGGCCGGATTGGCGACTGTCATCCCGGAAACCATCCCCGCCACGGCGGCGGTGACTTTTCCAGGGTTTGATAATTTTTATAACACGATCGCTCCGACGCCAAAAATACGCATGCGGAAAGCCGGATACCAGCCGACCAATGGCGAGCATCAGGTTAAGACGAAGAGAGCGTGAACATGGCGTTGCCAATCTAACGCACAACGCCACCAGAAACAACAAGATATGTTTAAGTAAGTAGCGGCTCAAAAGAGTGATCATTGGCGCACCAACCACCACCATGCATAATACGAGATCATATACCAGGCGGCGGAATCGAAATTC 
=11A?=BHA))+::0-*+.))*A:<=9:>77A>>?@;668502.23<5359,('22((0.-12'&')$%((')*'&(&),,145686441.1,+-5:<;/.+)),/1+-200/'###','&(7289;<11144.-.*-/<8102655?32/1.107<663(')40(((+05'&(*(*&$(&103351418/+&'%$$%&'.3032978<8(897=?:95464227566:=76+855;+,,.-('(144:7=@988@<<=9<;9;;4AS56459;6+*$$&)54.,520&/$004??5456**)+*+.,-*)-,++2150(%'''))%$%($.0)'(&,,('-401,-.561))/**)*'&()*2<>=;9<=9994:98=>BACA=;6B:2+**'')'21245466+++41342,,-,+-.&%'-%%'+,,)(%*')*)(+,)&&')'&$'($&+188:;<97:<*((,++72176<9622/00=?:93:9:*)(.-('''&()''&&'(*)-,-$%-%%%.+0%,+'')*+(((('(+)(%'%'%&$%&+,,1777F31)*+5313;333*))+)-*20//@=:&%&&('.;9:''('&)*(&)))%'(.-()1554.,//37;:+&')''*+1240/.'&'''.-)---.&&00/,)((('''&&%&*)%$$,3*($(*((+&'#%%(%$)'*662008H::<JC@?=>=A42358:99788>A;<=;612+('+,+(').69769;CAAAJESBEJ=?::<<;/..&%%1('2;;;>;9:--/B@;=DA5::?DCA@@ABB@@6440139::8>8/--)((3.-042.*)/15/-,-0788:,,76'&%./-107//1=:7&%%,%%%61/346+*577>602++%+*%%%0/04497.-.36?@A999326?<;7358A-,,2>FBC86A=6,-+/++/990/.+;;2,(()''/,))),33113<6582./.6;.-15220.31)'''-..%%0).-''&*&#'*)$%4234;;:=?67412:610**-())5++,&&0=?@FG?=:;=;9%((01''(*'&'&%&(20677;:(&'/%#%&001,+*''5*)'%(&'(#%&&(%$#$)746@<>?,-<=?>?7858;D=:>BBBEB;;<;;<:54578<11'''=:85,'/./07891%%1799<577=433AB<>><<;94=;;;(''0('#$%,3<=:00995-+-/)*+6:88+'%%3342200)*''&/)%$%''&(*+(+(2644459;734)*9335/.+*())&$&()(+4223245767;:6?;:556499=:43**+4'%%'%$%'$$$%&&)%$%,/1+.,+*,(%%+/.0.....)(+9<<?769:22;:@=A>99..6<=32222+*22116875((02.0-,54011 BC:Z:SQK-NBD114-24_barcode02    qs:i:10 du:f:3.8034 ns:i:19017  ts:i:1018   mx:i:3  ch:i:363    st:Z:2023-06-15T11:47:54.376+00:00  rn:i:289157 fn:Z:can_pod5_merge.pod5sm:f:105.054    sd:f:29.4775    sv:Z:quantile   dx:i:0  RG:Z:654d949156458929eb0d88decfead14747d14885_dna_r10.4.1_e8.2_400bps_fast@v4.2.0_SQK-NBD114-24_barcode02
d3d13a47-f82c-44ec-863c-c14c9b50489a    4   *   0   0   *   *0  0   TGGTCACTGTCCGAGGATACGCACCCCCCGCACTCATGCCGCGAATAAACTCCTTGTCGGTAATCGGTTTATCGACAGTCACCAGATACTTCTCATGATCATTGCCAGCACGCAGTCTTGTTCACCAGATCGCCGTGATTGGTCAGGAGAATCGACCCCTGGGAGTCTTTATCCAGGCGGCCGATCGGGAACACGCGTTGCTGTGGTTAACGAAATGACGGTGTATCGCGCTCGCCATCTTCGGTGGTTGCTGCAATACAACGGGCTTGTTCAGGGCGATAAGTACCAAATCTTCGGCTTGCCGAGGTTCCAATAAATTGACCATTACTTCACAACGTTGGCCGGGTTTCACCTGATCGCCAATGGTGGCTCGCTTGCCATTAAGGAACACATTGCCTTGCTGATCTAGCGATCCGCTTCGCGGCGTCAGCAAATTCCAGCTCGCTGATGTATTTATTTAAACGGACTGATGAGTTGGGCAGCATAAATTCTCCTGTAAAAAACGGAATATACCGTACCTTTGGGTTGATAAAAAATAGTCATGGCGGACTACTTGTTATTTAAGTGGCTTGAATTCGTTACCTGCTCAAATCACTTTCCAGAAGCGATGTGAGCATTCGTGAGCCATGTCGATAAGTCGTTAATTCGCAGTGGTGACATGCCTCCGTGATTTCTTACTAACCGGAGAAGTAATGGAACTGCTTTTACTCGTAACTCGACACTGCCCGGGTAAAGCCTGGCTGGAACATGCACTGCCGCTCATTGCTGGAACGGTTCCGGGTCGCCGCTCAGGCGGTGTTTATCCCTTTCGCTGGCGTAACGCAGACCTGGGATGATTACAGCAAAAC    8.;<;:8866**+++.+*&##&'((48=B767<;>>?<;:++)*.2)*,)0%&''''$$%453561124512+''(&%$*/13+*-3342511268001,**1*+/-99)))9''&1:7('(.3449;78?9<<:<:;=90//%%'22,/5,('(((BBD:<B=;<555ECHBD;:6/'''<.++%%&CB>=:<443;>@><=C:;8965342745'%%'))'+'(%%%&&%'''34*+4;6013720/-+*'&#%'((#&3.2766@977-,04DB(((*)(*&&'1/67469611075&&7447?=01-)-1006.-,./5((*)&&2/55334.-(&()438C:672+++6100=67::56@ABBA@>)&&+G87:.-.)(%&%*/.--01-.07;886&%&%%%**++,*)&#&0-//,-...+)'.%%+%1)&%$$$$5;RG@?>>666DCADE=;==.-,76313/.,1%%&;:510*(%/.+666>CGGB?CSI<B?547<:778;879:AA?@A-,76844466?>496****21(()(&(%-,+(+/;=:::>.+'(&'&)+()/742349278BH@=(--988889CDD@:55=.;(((67((*)*0)%%&),.4?::32)+''(*)(%%'/29975&&$(&,53$#$*/('(93846--.52*))478/02@;=AD<;;=9;?657B?>=559ED,'($$%')+/0***+&#$$%+'':9)&(*()5/..333511))'*)-,,,2434-++3632)'%.:-**,&''$591,+,0*-+,,,'878///:67101/---,,-+(++,,44/.+-.146%%/17+))+*)$&'##191    BC:Z:SQK-NBD114-24_barcode02    qs:i:10 du:f:2.3804 ns:i:11902  ts:i:1024   mx:i:4  ch:i:349    st:Z:2023-06-15T11:48:00.377+00:00  rn:i:8774   fn:Z:can_pod5_merge.pod5sm:f:95.0257    sd:f:26.9928    sv:Z:quantile   dx:i:0  
RG:Z:654d949156458929eb0d88decfead14747d14885_dna_r10.4.1_e8.2_400bps_fast@v4.2.0_SQK-NBD114-24_barcode02

And a few reads from the .fastq:

@0fb1396a-b875-4400-9f92-84abf01dde3c
CTTCGACGATGAACCGCCGCCGATGGAATGATGACTGCTGTGAGCGGCTTCATCGCCATTATCACGTGACTTGCATACCGTTTTCATGCAGCGGCAGCTTCTTCATTATGAAAACGCATGCCGTTGTCGCGCAGGGAAATCAGCTCATGCCGTCCCTGTGCCAGCAGCAGGACGCTTGCGCGCTCTTCTACGTCGCGGAATTGGAACGGAACACTGTCAATACGCGGTGGCAGCTACGTACCTGCAAGACCCATAATAATGGCGATATAGGTGTGGTGGCCTTACCCGTCAGCGACAGTGACGCCCTGGGACATCAGGCGGTATAAGCGAGTAACGCTATCCAGTAAGCCTTTTGACAGAATCATCGACGAACTGTCCTTCATAGGCCTGCGGTATGTGGGAAAATGGAACCAAATTACCACCTAGACGTGGATCCGCTGATCTGATAATGTCCTGACAAGGTGAACTGACTAGGTGACGAACTAATAACAGCGCATAGTGTAGAGGGGAACGCGGCGACTGGCTTAACTCGCATCGAATTAAACTAATCAAAGCCGCGATTTAACTAATATTGCGATGATTGTATCAAGTGGTGTGTAGATGTAAAGATAAGTCTCGTGGTTTCACACCAATTTGCCAGCGCTTCACCGACGCCTACGGTCGTCCCATACAAAAATACTGTTCGTCCAACGCCATGCCGATGAATCACGCGGCAGTGGTGCCTCCGAGGGTGATGACGACCAGATGTAATGCCTGGGCCAGCGGCATTAAACGCCACCAACTTAGTTCACTGGCGCGATACGGCAGATCAGGCGAGGATAATGCCGACCACTGGGTGTTGCCTGGTGACCAGTGACGCTATCGACGGGCGGCAGCACGCCAATAACTTCGCGGCAGGCGGAAGAGCGGTGCTCGCGACACCTCGGCTTCGCGCAGCGCGGCAGCGATGGCTCCGATGCGCTGCGCGAACTTGAGGAGGTGCCTTTATACCGCCTTGCCGCCGTTTGAAGTTATCGGCGTGCTGCCGGCCCGTCGATCACGTCACTTTGAGCCTTAGGGTAACCCGGTGGCGGCTCCGCCGATCTGCATATCACCGTGAGTGAAGTCTGGCGTGTTCGATCCCGCTCGCCGGCGTTACATTCTGCGGTGGTATCACCTTTGATCTGCCGCCGTGACGTCGCATCAGGGTTGTGGCTGTCCTGGTACGAACAGTCTTTGTATGGGAATGACCGCGGCGACGTCGTCAGCTGGCGGCTGCAAGATTGGTATGAAACCCTAACTGGTACCTGTTACATCTCAAAACGCTACTAGAACAATCGTCCGATCGGCTCGATGGCGGCTTTTGCAACATAATTGATCTGGATGACGCCGTCGCGCCGCGTTCAGCCTTACACTATGCGCTGCTTATAACCGTTCTTTGGAAGTCCGTCACCTTGTAGGACTATTATCATAATTAGTCTGTCGACATGTTTAAGGTGGGGTGACCCTCATCTGCCGGCGTGAGCCTATGAAGGTCGACAACGTCGATCATCTGGTCGAAAAAGCTTACTGGATAGCGTCTCACGTTGCATCAACGTTATCGTTCACTCGCTGAGCGGTAAAGGCCAGCCGCACCGATATCCGCCATACGCTTATGGACAGCTGGGGCAGCCTGCCGCAACGGATAATGGACGTGCTCCCGGTTTATTCGCAACGTAAAGCGCGAAGCGCCTGCTGGGACGGCATGGGAGTGGATTTCCGGCACGACGACGGGATGCGTTTCATAACGGCAACCTACGGTTATGCAAAATCCACGCCTAACGGCAATGAATCATCTACAGCAAAACGCCCGGCGGCGTTTATCATTGGAGAGTATCAACTGCAACTGATCGTGTTAAACTAGCCGATAAGT
+
551*./3((()'20)%$&%$&'%%('(*'&$$#$$&$)&(.%%&%$""%,''3../6&&(33-$++/-+&%''().,-/048;>/,,,0&('+,-+)')+'*%$$$&%&&-/031'%%-'%,,-047%%%+-%''5:3-((''((##((-((%(-.00623)%&,)(+*+(&%%%$('%$')6839:661-'&'%%),-&%''%&.,%*..+(')'%#"#&'$%&%((+**%%%(&$$'$'(2++**-+''3-**01(()+...(('(%'+//28;<7334:7/34811*((,.11(''0+)'$$%#$'()$&'%##%&&'&%$#$$%%$&-.47../%',,+,.--/1..2,,&&*+),0(,*+''*))+3.0,'$'(,-//$$(,/0&%$())(%('.01/+()&$$&'''%%$##%++))4((*(%%##$''#""$#$%$#""$'&&$$%.31140+,54(())0+'''#%''$$$)&(.2/13555,-.))&($$%-(#"#&)7;=75**+./112;02267**--#$&'-%%%0*))54++,46-+*:2&(//(()012447..2,,,5?78:434710'&#$%'-%%%&'&2../...%$+.+*$###(($#%$(*)))++,55044=A43&),0.0%'%++('%$#"',%%$***'*&1850.1)*/5/-,/..,,'%'%%+'(%%'*)&(*,+/*,0/*&&#%$)'(&&''(&%%(%&$%$/,%$#$$$%%%&*(%&4595:;8:;9+-69')(-'''('%#"-1+'(&%%&%#%&$"#%'&%&**,%%%'')''2:6455.((-)'%%/.%##%%%/../3&%'*(+*&&&%$#(*+.&&&)**+)(102221/++($$45%%%.//.+)02)),32230*&%%&#$%(*()).0&&'/20&$##&&'&%'''&&''(3('),&202368,,-))),++)()2%%$(..5498::/0))-+-'&)()'*(&%'(,3438-,,50492$$*65'-'--&%'%%''''014..*+*1;60&(45/--))'%###&&%$$'$$(&'(,'$#&-)''(,2&%'+)&%$%%%'*()(,+('''$##'$##%((%&&%)(,'(&%%'((#$%$(&&1./4/-.&&&2'%(.,)%&$$$#0)&#('&%&%&&&&',.*$$%&&$$'..'+&('##%'%$#$#$#$-)%$%$%&+&$%-1*&'$$#&*'%$%#$(.'(''$%++&($$$()(#%$%##$##&'%$&(11..(''-)($$$%$$#%%%%$%'-((*,&%%('%####%%#$&&%&'-$$)*0)'*,&%'$$'(-))+)&'((%%&%#%%)%%&&###%%$$&(+(&$$$"""#$*%$##&%%&&%####,''--,/48)))()%##$&'*(,'%&''%,,-.''$$##&()&%%(&&,)'&(.'&%&'%%##%10))##'($$%&''*($$#$%'%##$&##$((+&'%*)+('***$#%%%'%$$''%#%&'%&'&#$$&#$%&*/'((%+()('')+'$##'(*(*($(-/-+,'$$-41+*%)1'**)/0/0++%%(,)(('./6521%$$&)'(%%$%$)*(*,&&'''%&%&)())*&&&,&%&..+(((4+*'()(%#'%#"#$%&($'%%&(%#&+%%&&####$$%'((&%$$&%&%$$%$$%(%%##$$(2+&$',&'',/-%%&)'%%%&'&((.3-*&$'()'&%$-)('()&$#'%'$%%'('*-&(+('))%&)**+.,**.0/++.6+**,+-,-.''()(%%#(%+'&&)&&/&*//,,)**+&''%&&+*+%$*)%&*()(,*+'$$(&$#$'-)((&%%&$%%$#$#$$()*,(*'$$###$$#&###%$##&%%%))'''%%$$$'&%'%$#$
@600c63d9-ee9d-4529-99bb-797c013f9baf
TCACTTGGCCTTCAGCAGAAACCTGAAAAACCACACCACCATTTTCTCCGAGTGGTATGCCTGATATCATGAAGGAGCCGGGATGGCGCGGTGGTGGGTTGTACCGCACTGTGGTCGAAACCGGTGAAGTGGTTTATTTTCAAGCCCGCGCTACCGTGCTGGCAACTGGCGAGGGCGGGCTGTTGCTGGTCCCACCACCGACGCCGCGTTTGACACCAGCGACGGTCGGCATGGCTATCCGTGCCGGCGTACCGTGCCGGGTATGGAAATGTGGCAGTTCCACCGACCGGCATTGCCGGTGCGGGCGTACTGGTCACCGAAAGCGTGCCGTGGTGAAGGCGGTTATCTGCTGAACAAACGTGGCGAACGTTTTATAGGAGCGTTATGCGCAAACGCGGGAAGAACCTGGCGGGCCGTGACGTGGTTGCGGCTCATCATGATCGAAATCCGTGAAGGTCGCGGCTGTGATAGTACGTGGGGGCCACGCCCAGAAACTGAAACTCGATCACCTGGATTAAAGAAGTTCTCAAGATCCGTCTGCCGGGTATCCTGGCGAGCTTTACCGTACCTTCGCTCGCGTCGATCCGGTGAAAGAGCCGATTCCGGTTATCCCAACCTGTCACTACATGATGGGCGGTATTCCAACCAAAGTTTACCGGTCAGGCACTGACTGTGAATGAGAAAGGCGAAGATGTGGTTGTTCCGGGACTGTTGCCGTTGGTGAAATCGCTGTCTATCGGTACGCGGCGCTAACCGTATGGGCGGCAACTCGCTGCTGGACC
+
2,$$%5*$%%&'((+(53151056''*:??;7776:=788711;<87'</--:87832*)$$''-++'%%&,,*&(&%'$$()('''*+&)3365008344S?>B@>223>=)$$'+*1552469><<;;54245,,--3(&'*+2<(('10'&'+432,*(&$#&((45+(*-*+-,('%%'&%,-,**'.1''(('''&'65/+*)-,3-.1&&&)('*&''*++,2=5426465<>?>>543-*+,+%&()&%$('(+))('-68<832+*&))*537,-.0/138:?<99:<:0348812:<++&')*++(&&&.,/,-$%/''&'+467::;>?>=?>>>H;:8::**22'%))%'*2:%%8?8667923&'+0,+('(22)0/4#%)),+&&&'+1,-'-8<?@11CG:**,0412--.0-($$#$$%()*-.//)((4==644AB><<::40.)1;H:9654('+&$#%2110/.1/.)))++)'784926C8//)(*6.,+,=)((*3**)*&.*%&*.,)')((.5))(.*++11B@65788('&&'(&',-%%(3.))*)1(138.))*6+++:8<;>@@BIES>:<958662;679=:<9:=//7<?><.8CB@B444@@98FH77642113&&&0.(()('('',/6<99<C?424==<>653)3.'%%'258===;BCD>:3.54710-++()12/4/.,*),3335::77<76&$$%$%$'(0'',3..%%(('**+''(-))&('-DA<<;1028510&.01''',,
@6aafef91-8d14-4d55-a0bc-c6839a8d9fd2
TGTTGAAAGATGGGATCTGCCGGATTCAATCGGCTCTCGCAAGGCATCCGTGTAGCCGTTCAATTCAGGTAGCTCATCAAGAAAAAGCACACCATTATGCGCCAGCGAAATTTCACCGGGCCCTGGAATTGCGCCACCGCCTTCATCGCAGATTAACGATTGCACTGTGATGGTCAGCGGAACGGGCGCTGCCGCCATTGTTTTTGTACTGATTCAGCATTACCAACTTCATATCGCAGCACTCTAAGTGCCTCTTCATTGCTTAAATCTGGCAATCGGCGTGATACCGGCTGGCGGCATTGCTCTTACCTGTTCCGGCGGCCCAATAGGTAAAAGGGTACAGCCACAGCGGATAATTTCCAGTCCTGGCTTTCCTTGTTACCTAACGATAACATCACTGAGATCATCTTGTAGCACCGGATACTGCATAGAAACCCATTTCGGCATTGAGGAGGCATACCTCCAGAAACACAAGAACAACTTGCAGATAATCGCTATAGGCATACCTCCGCCGTTTAATTAGCCCCACTTCATCTTCGTTATCTTCACCGACGATAATTTTTCTGCCCCGACTTAATAGCTTCAGTTGCACTGGAGATTGCACCGGGAACGCCGCCGAGCGCTGTTAAGCGCCAGTTCTCCGACTAATTCATATTCATCTAACTTATTGGCTGTAAGCTGTTCTGAGGCCGCCAGCAACGCAATGGCGATGGGTGAATCATATCGTACCCTCTTGGCAGATGGGCTGAACCAGGTTGATGGTGATTTTTTTCGCCGGATATTCATATCCATTTGATAATGGCGCTGCGCACGCGATGCGCAAGCTTCTTCATCTGGTAAGCCCACCATCGTTAAGCCGGTAAGCCTTTACTGATATGTACCTCAACAGTGATGGGGGCGCATTATTCCAGGGCTGCGCGGTTATGAACAATTGACAGTGACATAAGCCCTCCTCGTCACCATTATGTGCATAAGGATCTCGCTGCTGTAGCCCGCTAATTCGTGAATTTTAGTGGCTCATTCCTGTTTATTTGTGCAAGTGAAGTTCGTGTTCTGGCGGTGGAATGATGCTCGCAAAATAACGACAAAGGATCAACTACAAGGAACAACATAATTCTGAAAATAAATTTTTTCCACTTCACTTATTTATTTTTAAAAAACAACAATTTATATTGAAATTATTAAAACACGTTCATAAAAATCGGCCAAAAAATATCTTGTACTGTACAAAACCTATGGTAACTCTAGGCATTCCTAGAACAAGTGCAAGAAAAAGCAAAATGACAGCCCTTCTACGAGTGATTAGCCTGGTCGTGATTAGCGTGGTGGTGATTATATCCACCGTGCGGGGCTGGCTGGACGAGGAAAGGCTTAAGATCAAGCCTAAGCGACTAGAGCCCGCACCGAAAGGTGCAGATTTTGACCTTAAAAGCATAACCGAGAGCAGACAATGAATAACAGCACAAATTCTGTTTCTCAGTTCAGGAGCGGGGAACTAACTATGAATGGCGCACAGTGGGTAGTGTCAGTGCGTTACGACGGGTATAAACACGTTCGGTTATCCGGGTGGCGCAATTTGTGCCGGTTACGATGCATTGTATGACGGCGGCGTGAGCACTTGCTATGCCGACATGAGCGGGTGCGGCAATGGCGGCTATGGGTTATGCTCGTGCTACCGGCAAAACTGGCATATATACGCCACGTCTCGTCCCGGCGCAACCAACCTGATAACCGGGCTTGCGGACACACTGTTAGATTCAGTCCCTCTTGTTGCCATCACCGGTCAAGTGTCCGCACCGTTTATCGGCACTGACGCATTTCAGGAAGTGGATGTCCTCCGGTGTTGGTGCACCTCTACCAGCACAGCTTTCTGGTGCAGTCGCTGGAAGGTTGCCGCGCATCATGGCTGAAGCATTCGACGTTGCCTGCTCAGGTCGTCCTAAGTCCAGGTTCTGGTCGATATCCCGAAAAGATATCCAGTTAGCCAGCGGTGACCTGAA
AGCGTGGTTCACCACCGTTGAAAACGAAGTGACTTCCCACATGCCGAAGTTGAGCAAGCGCGCCAGATGCTGGCAAAAGCGCAAAAACCGATGCTCTGCGTTGGCGGTGGCGTGGGTTGTCGGCAGTTCCGGCTTTCGTGAATTTCTCGCTGCCACAAAAATGCCTGCCACCTGTACATCTGAAGGTCCGACGCGGTCGAAGCGGTTCTCCGTTACTATCTGGGCATGCTGGGGATGCACGGCACCGAAGCGGCAAACCGCGGTGCAGGAGTGTGACCTGCTGATCGCCGTGGGCACACGTTTTGATGACCGGTAACCGGCAAACTGAACACCTTCGCGCCACACGCCAGTGTTATCCGTATGGATATCGACCCGGCAGAAATGAACAAGCTGCGTCAGGCACATGGCATTACAAGGTGATTTAAATGCTCTGTTACCAGCATTACAGCAGCCGTTAAATCAATGACTGGCAGCAACACTCCGCGCAGCTGCGTGATGAAGAACATCCTGGCGTTCGACCATCCCGGTGACGCTATCTACGCGCCGTTGTTGTAAAACAACTGTAGGGATCGTAAACCTGCAGGATTGCGTCGTGACCACGGTGTGGGGCGGCACCAGATGTGGCTGCGCAGCGCATCGCCCACACTCGCCGGGAAAATTTCATCACCTCCAGCGGTTTAGGGTCCATGGGTTTTGGTTACCAGGCGGCGGTTGGCGCACCAGTCGCGACCGAACGATACCGTTGTCTGTATCTCCGGTGACAGCTCTTCATGATGAATGTGCAAGAGCTGGGCACCGTAAAACGCAAGCAGTTACCGTTGAAAATCG
+
+,,)&1900.-''))'&'%%&''3:;2../0,-2B21&$%&''0/)&'2+(+))%&+++)())-0.+039,+,,,*+*,,3-<3999:22%&&&&'.88;455244/-17459;;>@33<@DSA@6;952(''),,14799D9'&&&,-)))(##$$&())'('*+7://0++6(''-(&'+'*1==***763..-%$%%&&)-852211---.()*+'&(&'+-++/1--.&&&**')))&&)+%"#&&$$%%,.'7:?:31-&))0..%%&$%%$"$%"#%((%%(',-96440(&&,*''$$%)'&#&%&'''/03***A=-+/*'1*..0765/*&%#&%+.)#&&)0+(+2222,++'+)16%%%'0/--,0(())().'&$#%%&'&$##%&(%$&(..-.(('##$&&&+,0031.-*'&$'%%%)('''%'*-'$&0('''&+(&'(%%%#%$%#$%'%,,'$&&(&&)))%$%/-,,./''(*&$#.**&(.///)(&)#$##"##(.-0,,*,-61898762456;3-++.,,,.34)(&%'0168336<==C12054-=5313*&*-0./75025.;9=10189:3++(+-&&'*.33.///*&+,$$'**-+-/+)0(&'23-,44155>9::>9==/-430201545;:>>BD<;:?GABAA<9;?=/./2523;??G6340.-./000/35;<=999+*3;''33*%&'(&&&(-.25+74.DB;9/-$$+.<.%-+0((/:=:;<435557:;;BB:-,,922:C9:92/.)*('&$%(%%%%(*259///897989*+*32+)&&'))*+,)*.%$#"%%)(&$&%+8::;//(&&4//062-783*3**9:99;::;=<<=@>>;::<>@867544))9A77))%&,)%%'**26BBD775:6<?=;=((11)),(((3...957++&'(11;=206*(()5676123413'&'+'(&'''&'&+1::99;>976@B=:>99;>*)),+0.-10&%%*,/9*)*+-4424368;AC<><C?@AFB@=<00&%&,)))*).688:8006-534697+((%%)/*%&$$&'*4;9202%$%*/'($%%#$',,--))%$&(#$,.393/'(61037691*%'(&%)+*.&'(.//2442)(.9;70+08../<</,(44458>95&%(/,.32//%%*+/0,6;>>=)(-255.36634))''(&++21'%%&&27324620345/,))(0&46>///457&&&340,,&'32,/7915,+'&$+.,&%%#$('-.('358677;8687755;<588500./0.+,-248982,,+,,%%%%((59<;62.-&'())-'"#+&*++**+24004803+&%&./'%%'-10)3*'&&(-'$(*&'>51+**,,26754**'&&%&*+''(/52).153))0+,-10+6(14459:112/,-.''(*(()4./2./.%%%%.))%$#$'%)*0*,&&.:@@<;:689?::''*)*)+,-&%#%%,-41$&&,.(%&$/5,--+*&$$(+78,)&'.2((%&(*%&(+((%%#&*(''%'&'%%*-'''335323.'%&)/0*+*+5,,,98.--0./0.2).0056<6E>==CDB43366,+*+'*8622,-/0/1((*88&&&2''=<2+*+22-*+/41--,,.5429735)..&$$&#$$'(/1752723/01((+%%*)'&%'2&&+;@8>@<=>445@>>@/.'(')&)()*5.(%'/.*($&*1;?=))*1226?AA?<:;9:>;;667I=?:B21/120+2700-+,4666654/.*+,3&&+204.018:864../'%%($#%$%)*('&'';(().+*%/0/*2+*;;:=++7II:;;1.)()(*141)-*)*1203/1179'&'B75965/003%%%)((%%%&(--)((,9866012$#%&&&&))0766/015'&%//''.*(1A<<;5/78>::87.--,%&0557876442(3
)(%$*,0574-56>=:23::2@@?954>;<222:73=A;;;??CD97CB@7+**&&()3)('1))++1244.,-45@:72))%+43/085230-,$%#$#+.,6>ABD@??@?::@4/*+((*.'%&)'*-/9@AA'',.-1/..03)**097:000;AB@@;:<?DD?322C:70+#""$$$*(&#$('%&%%$&))&&+,,&&$'%'#$'*+&%'*,'(*,IE632668<S@;:41./,.2)()(&2+'',-*)6)),5,,5<>8;;:=<><?>>?B@AA6<;5//266((++*(&&&'+*++876::117:&%'87).((9115;??666C1+01789>>;?:::0//<599D54)()669>><<21135B<<<;<@BCD:<@@CCCCDAA=>??=?>85..)00%$$&')217?656;<;=?@?>>:<<><9;=3356333203;995@>;?>>??;8:=112<69E?>?))+++.'''63=8>>N@=880//31/..+%'&$56..100.'&+5898554)()677458,,+,+/(''///..021,+)27BA**,80.%%&,+*($$%'*'(.((%&'&+*))*1100323@97*&%)/35:;<)()(''165569,-2A?=67,/0)**20)()5*)&'*&&&4**164430-..38>;==<1979;7763/+*,58()+*)->43>C6'(0++(&')<<<=><<4116;:))())//(-.12644751))--.,,+(*'((-*++/.0.19566$$%3&2)'3<778??BJ@>>?16998560068)'(-228==6-,-9;?999<;<833766;:96'%

Seems like there aren't that many reads missing, but I'm still interested in why.

hasindu2008 commented 8 months ago

This is read splitting; I am more sure now. Can you please take one of the read IDs in in_bam_not_inblow5.list and grep for it:

samtools view calls.bam | grep "readid" 

and see whether the "pi" auxiliary tag, which holds the parent read ID, is present for that read?

If so, we need to locate the parent read IDs for the split reads. For Guppy, the strategy I used is explained under demultiplexing here. Could you check the FASTQ generated by Dorado for those split reads to see if it provides a similar tag?

Otherwise, I can think of a strategy to extract this through the BAM file using some bash commands, which would require us to use the BAM file tags.
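One possible shape for such a BAM-based approach is sketched below. It is not a tested workflow from this thread. It assumes the parent read ID is stored as pi:Z:<id> on split reads only, and that the barcode is the suffix of the BC:Z: tag, as in the BAM records posted above. The function name parent_ids_for_barcode is made up for illustration. For each read assigned to the barcode, it prints the parent ID if a pi tag is present and the read's own ID otherwise, so every printed ID should exist in the BLOW5.

```shell
# Print the ID to look up in the BLOW5 for every read assigned to one barcode.
# parent_ids_for_barcode is a hypothetical helper name.
parent_ids_for_barcode() {
    # $1: barcode name expected at the end of the BC:Z: tag, e.g. barcode02
    # stdin: tab-separated SAM records, e.g. from `samtools view calls.bam`
    awk -F'\t' -v bc="$1" '
    {
        id = $1      # default: unsplit read, so its own ID is in the BLOW5
        keep = 0
        for (i = 12; i <= NF; i++) {     # optional SAM tags start at column 12
            # assumed tag layout: pi:Z:<parent read ID>
            if (index($i, "pi:Z:") == 1) id = substr($i, 6)
            # keep the read if its BC:Z: value ends with the requested barcode
            if (index($i, "BC:Z:") == 1 &&
                substr($i, length($i) - length(bc) + 1) == bc) keep = 1
        }
        if (keep) print id
    }' | LC_ALL=C sort -u    # two child reads share one parent, so deduplicate
}

# Example call (file names as used in this thread):
# samtools view calls.bam | parent_ids_for_barcode barcode02 > read_ids.list
# slow5tools get can_pod5_merge.blow5 --list read_ids.list -o barcode02.blow5
```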

Another solution is to use the buttery-eel wrapper on top of ONT's Dorado basecall server. The Dorado basecall server can be downloaded from the ONT software page (this is the version of Dorado they use in MinKNOW for live basecalling). Then you can run buttery-eel on the BLOW5 directly, like this:

buttery-eel  -g /path/to/ont-dorado/bin --config dna_r10.4.1_e8.2_400bps_5khz_fast.cfg --device 'cuda:all' -i  can_pod5_merge.blow5 -o  reads.fastq --port 5555  --use_tcp --barcode_kits SQK-NBD114-24 --do_read_splitting

This will create a file called barcode_summary.txt which has a proper format like here, and the parent_read_id and barcode_arrangement columns can be used to easily generate the read_ids.list to be fed to slow5tools.
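A minimal sketch of that last step follows. It assumes barcode_summary.txt is tab-separated with a header row, and that the barcode_arrangement value for the reads of interest is exactly the string passed in, such as barcode02. Neither detail is confirmed in this thread, so check the actual file first. The function name ids_from_barcode_summary is made up for illustration.

```shell
# Build a parent read ID list for one barcode from a barcode summary table.
# ids_from_barcode_summary is a hypothetical helper name.
ids_from_barcode_summary() {
    # $1: value to match in the barcode_arrangement column (assumed, e.g. barcode02)
    # stdin: tab-separated summary with a header row (assumed layout)
    awk -F'\t' -v bc="$1" '
    NR == 1 {
        for (i = 1; i <= NF; i++) col[$i] = i    # map column name to position
        if (!("parent_read_id" in col) || !("barcode_arrangement" in col)) {
            print "expected columns not found in header" > "/dev/stderr"
            exit 1
        }
        next
    }
    $(col["barcode_arrangement"]) == bc { print $(col["parent_read_id"]) }
    ' | LC_ALL=C sort -u    # split reads repeat their parent ID, so deduplicate
}

# Example call (file names as used in this thread):
# ids_from_barcode_summary barcode02 < barcode_summary.txt > read_ids.list
# slow5tools get can_pod5_merge.blow5 --list read_ids.list -o barcode02.blow5
```

Looking columns up by header name rather than by position keeps the sketch working if the column order differs from what is assumed here.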

jorisbalc commented 8 months ago

Yes, the pi tag is there. Seems like it was the read splitting. I'll follow the workflow to grab the parent read IDs as described under the demultiplexing strategy. I'll make sure to re-open if anything comes up. Thank you for the great tool!