nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

dorado basecaller making up read IDs #848

Closed JWDebler closed 4 months ago

JWDebler commented 4 months ago

I am currently working on a little script to automate basecalling and simplex / duplex separation etc.

When testing on a small subset of pod5 files I realised that during simplex calling and barcode demultiplexing I ended up with a few lonely reads for some barcodes. After looking for them in the original pod5 files it turns out those read IDs don't exist in them.

I'm a little confused as to where they come from because the pod5 file that is mentioned in the bam entry does not contain anything similar to that read-id.

Btw, this is on dorado 0.7.1-RC1

dorado basecaller sup -r pod5s/ --min-qscore 10 --kit-name SQK-NBD114-24 > all.bam

dorado demux --output-dir demuxed --no-classify all.bam

samtools view -h demuxed/SQK-NBD114-24_barcode05.bam

@HD VN:1.6  SO:unknown
@PG ID:basecaller   PN:dorado   VN:0.7.0+ed2dda1    CL:dorado basecaller sup -r /mnt/sdd/pod5/ --min-qscore 10 --kit-name SQK-NBD114-24 DS:gpu:Tesla T4
@PG ID:demux    PP:basecaller   PN:dorado   VN:0.7.0+ed2dda1
@PG ID:samtools PN:samtools PP:demux    VN:1.20 CL:samtools view -h SQK-NBD114-24_barcode05.bam
@RG ID:982d5e5f57a08fa66e5a06da72c6d8f3d20d1af5_dna_r10.4.1_e8.2_400bps_sup@v5.0.0  PU:PAS00041 PM:yoshas_computer  DT:2024-03-03T01:44:27.408+00:00    PL:ONT  DS:basecall_model=dna_r10.4.1_e8.2_400bps_sup@v5.0.0 runid=982d5e5f57a08fa66e5a06da72c6d8f3d20d1af5 LB:no_sample    SM:no_sample
@RG ID:982d5e5f57a08fa66e5a06da72c6d8f3d20d1af5_dna_r10.4.1_e8.2_400bps_sup@v5.0.0_SQK-NBD114-24_barcode01  PU:PAS00041 PM:yoshas_computer  DT:2024-03-03T01:44:27.408+00:00    PL:ONT  DS:basecall_model=dna_r10.4.1_e8.2_400bps_sup@v5.0.0 runid=982d5e5f57a08fa66e5a06da72c6d8f3d20d1af5 LB:no_sample    SM:no_sample    BC:CACAAAGACACCGACAACTTTCTT
@RG ID:982d5e5f57a08fa66e5a06da72c6d8f3d20d1af5_dna_r10.4.1_e8.2_400bps_sup@v5.0.0_SQK-NBD114-24_barcode02  PU:PAS00041 PM:yoshas_computer  DT:2024-03-03T01:44:27.408+00:00    PL:ONT  DS:basecall_model=dna_r10.4.1_e8.2_400bps_sup@v5.0.0 runid=982d5e5f57a08fa66e5a06da72c6d8f3d20d1af5 LB:no_sample    SM:no_sample    BC:ACAGACGACTACAAACGGAATCGA
@RG ID:982d5e5f57a08fa66e5a06da72c6d8f3d20d1af5_dna_r10.4.1_e8.2_400bps_sup@v5.0.0_SQK-NBD114-24_barcode03  PU:PAS00041 PM:yoshas_computer  DT:2024-03-03T01:44:27.408+00:00    PL:ONT  DS:basecall_model=dna_r10.4.1_e8.2_400bps_sup@v5.0.0 runid=982d5e5f57a08fa66e5a06da72c6d8f3d20d1af5 LB:no_sample    SM:no_sample    BC:CCTGGTAACTGGGACACAAGACTC
@RG ID:982d5e5f57a08fa66e5a06da72c6d8f3d20d1af5_dna_r10.4.1_e8.2_400bps_sup@v5.0.0_SQK-NBD114-24_barcode04  PU:PAS00041 PM:yoshas_computer  DT:2024-03-03T01:44:27.408+00:00    PL:ONT  DS:basecall_model=dna_r10.4.1_e8.2_400bps_sup@v5.0.0 runid=982d5e5f57a08fa66e5a06da72c6d8f3d20d1af5 LB:no_sample    SM:no_sample    BC:TAGGGAAACACGATAGAATCCGAA
@RG ID:982d5e5f57a08fa66e5a06da72c6d8f3d20d1af5_dna_r10.4.1_e8.2_400bps_sup@v5.0.0_SQK-NBD114-24_barcode05  PU:PAS00041 PM:yoshas_computer  DT:2024-03-03T01:44:27.408+00:00    PL:ONT  DS:basecall_model=dna_r10.4.1_e8.2_400bps_sup@v5.0.0 runid=982d5e5f57a08fa66e5a06da72c6d8f3d20d1af5 LB:no_sample    SM:no_sample    BC:AAGGTTACACAAACCCTGGACAAG
@RG ID:982d5e5f57a08fa66e5a06da72c6d8f3d20d1af5_dna_r10.4.1_e8.2_400bps_sup@v5.0.0_SQK-NBD114-24_barcode06  PU:PAS00041 PM:yoshas_computer  DT:2024-03-03T01:44:27.408+00:00    PL:ONT  DS:basecall_model=dna_r10.4.1_e8.2_400bps_sup@v5.0.0 runid=982d5e5f57a08fa66e5a06da72c6d8f3d20d1af5 LB:no_sample    SM:no_sample    BC:GACTACTTTCTGCCTTTGCGAGAA
@RG ID:982d5e5f57a08fa66e5a06da72c6d8f3d20d1af5_dna_r10.4.1_e8.2_400bps_sup@v5.0.0_SQK-NBD114-24_barcode07  PU:PAS00041 PM:yoshas_computer  DT:2024-03-03T01:44:27.408+00:00    PL:ONT  DS:basecall_model=dna_r10.4.1_e8.2_400bps_sup@v5.0.0 runid=982d5e5f57a08fa66e5a06da72c6d8f3d20d1af5 LB:no_sample    SM:no_sample    BC:AAGGATTCATTCCCACGGTAACAC
@RG ID:982d5e5f57a08fa66e5a06da72c6d8f3d20d1af5_dna_r10.4.1_e8.2_400bps_sup@v5.0.0_SQK-NBD114-24_barcode08  PU:PAS00041 PM:yoshas_computer  DT:2024-03-03T01:44:27.408+00:00    PL:ONT  DS:basecall_model=dna_r10.4.1_e8.2_400bps_sup@v5.0.0 runid=982d5e5f57a08fa66e5a06da72c6d8f3d20d1af5 LB:no_sample    SM:no_sample    BC:ACGTAACTTGGTTTGTTCCCTGAA
@RG ID:982d5e5f57a08fa66e5a06da72c6d8f3d20d1af5_dna_r10.4.1_e8.2_400bps_sup@v5.0.0_SQK-NBD114-24_barcode09  PU:PAS00041 PM:yoshas_computer  DT:2024-03-03T01:44:27.408+00:00    PL:ONT  DS:basecall_model=dna_r10.4.1_e8.2_400bps_sup@v5.0.0 runid=982d5e5f57a08fa66e5a06da72c6d8f3d20d1af5 LB:no_sample    SM:no_sample    BC:AACCAAGACTCGCTGTGCCTAGTT
@RG ID:982d5e5f57a08fa66e5a06da72c6d8f3d20d1af5_dna_r10.4.1_e8.2_400bps_sup@v5.0.0_SQK-NBD114-24_barcode10  PU:PAS00041 PM:yoshas_computer  DT:2024-03-03T01:44:27.408+00:00    PL:ONT  DS:basecall_model=dna_r10.4.1_e8.2_400bps_sup@v5.0.0 runid=982d5e5f57a08fa66e5a06da72c6d8f3d20d1af5 LB:no_sample    SM:no_sample    BC:GAGAGGACAAAGGTTTCAACGCTT
@RG ID:982d5e5f57a08fa66e5a06da72c6d8f3d20d1af5_dna_r10.4.1_e8.2_400bps_sup@v5.0.0_SQK-NBD114-24_barcode11  PU:PAS00041 PM:yoshas_computer  DT:2024-03-03T01:44:27.408+00:00    PL:ONT  DS:basecall_model=dna_r10.4.1_e8.2_400bps_sup@v5.0.0 runid=982d5e5f57a08fa66e5a06da72c6d8f3d20d1af5 LB:no_sample    SM:no_sample    BC:TCCATTCCCTCCGATAGATGAAAC
@RG ID:982d5e5f57a08fa66e5a06da72c6d8f3d20d1af5_dna_r10.4.1_e8.2_400bps_sup@v5.0.0_SQK-NBD114-24_barcode12  PU:PAS00041 PM:yoshas_computer  DT:2024-03-03T01:44:27.408+00:00    PL:ONT  DS:basecall_model=dna_r10.4.1_e8.2_400bps_sup@v5.0.0 runid=982d5e5f57a08fa66e5a06da72c6d8f3d20d1af5 LB:no_sample    SM:no_sample    BC:TCCGATTCTGCTTCTTTCTACCTG
@RG ID:982d5e5f57a08fa66e5a06da72c6d8f3d20d1af5_dna_r10.4.1_e8.2_400bps_sup@v5.0.0_SQK-NBD114-24_barcode13  PU:PAS00041 PM:yoshas_computer  DT:2024-03-03T01:44:27.408+00:00    PL:ONT  DS:basecall_model=dna_r10.4.1_e8.2_400bps_sup@v5.0.0 runid=982d5e5f57a08fa66e5a06da72c6d8f3d20d1af5 LB:no_sample    SM:no_sample    BC:AGAACGACTTCCATACTCGTGTGA
@RG ID:982d5e5f57a08fa66e5a06da72c6d8f3d20d1af5_dna_r10.4.1_e8.2_400bps_sup@v5.0.0_SQK-NBD114-24_barcode14  PU:PAS00041 PM:yoshas_computer  DT:2024-03-03T01:44:27.408+00:00    PL:ONT  DS:basecall_model=dna_r10.4.1_e8.2_400bps_sup@v5.0.0 runid=982d5e5f57a08fa66e5a06da72c6d8f3d20d1af5 LB:no_sample    SM:no_sample    BC:AACGAGTCTCTTGGGACCCATAGA
@RG ID:982d5e5f57a08fa66e5a06da72c6d8f3d20d1af5_dna_r10.4.1_e8.2_400bps_sup@v5.0.0_SQK-NBD114-24_barcode15  PU:PAS00041 PM:yoshas_computer  DT:2024-03-03T01:44:27.408+00:00    PL:ONT  DS:basecall_model=dna_r10.4.1_e8.2_400bps_sup@v5.0.0 runid=982d5e5f57a08fa66e5a06da72c6d8f3d20d1af5 LB:no_sample    SM:no_sample    BC:AGGTCTACCTCGCTAACACCACTG
@RG ID:982d5e5f57a08fa66e5a06da72c6d8f3d20d1af5_dna_r10.4.1_e8.2_400bps_sup@v5.0.0_SQK-NBD114-24_barcode16  PU:PAS00041 PM:yoshas_computer  DT:2024-03-03T01:44:27.408+00:00    PL:ONT  DS:basecall_model=dna_r10.4.1_e8.2_400bps_sup@v5.0.0 runid=982d5e5f57a08fa66e5a06da72c6d8f3d20d1af5 LB:no_sample    SM:no_sample    BC:CGTCAACTGACAGTGGTTCGTACT
@RG ID:982d5e5f57a08fa66e5a06da72c6d8f3d20d1af5_dna_r10.4.1_e8.2_400bps_sup@v5.0.0_SQK-NBD114-24_barcode17  PU:PAS00041 PM:yoshas_computer  DT:2024-03-03T01:44:27.408+00:00    PL:ONT  DS:basecall_model=dna_r10.4.1_e8.2_400bps_sup@v5.0.0 runid=982d5e5f57a08fa66e5a06da72c6d8f3d20d1af5 LB:no_sample    SM:no_sample    BC:ACCCTCCAGGAAAGTACCTCTGAT
@RG ID:982d5e5f57a08fa66e5a06da72c6d8f3d20d1af5_dna_r10.4.1_e8.2_400bps_sup@v5.0.0_SQK-NBD114-24_barcode18  PU:PAS00041 PM:yoshas_computer  DT:2024-03-03T01:44:27.408+00:00    PL:ONT  DS:basecall_model=dna_r10.4.1_e8.2_400bps_sup@v5.0.0 runid=982d5e5f57a08fa66e5a06da72c6d8f3d20d1af5 LB:no_sample    SM:no_sample    BC:CCAAACCCAACAACCTAGATAGGC
@RG ID:982d5e5f57a08fa66e5a06da72c6d8f3d20d1af5_dna_r10.4.1_e8.2_400bps_sup@v5.0.0_SQK-NBD114-24_barcode19  PU:PAS00041 PM:yoshas_computer  DT:2024-03-03T01:44:27.408+00:00    PL:ONT  DS:basecall_model=dna_r10.4.1_e8.2_400bps_sup@v5.0.0 runid=982d5e5f57a08fa66e5a06da72c6d8f3d20d1af5 LB:no_sample    SM:no_sample    BC:GTTCCTCGTGCAGTGTCAAGAGAT
@RG ID:982d5e5f57a08fa66e5a06da72c6d8f3d20d1af5_dna_r10.4.1_e8.2_400bps_sup@v5.0.0_SQK-NBD114-24_barcode20  PU:PAS00041 PM:yoshas_computer  DT:2024-03-03T01:44:27.408+00:00    PL:ONT  DS:basecall_model=dna_r10.4.1_e8.2_400bps_sup@v5.0.0 runid=982d5e5f57a08fa66e5a06da72c6d8f3d20d1af5 LB:no_sample    SM:no_sample    BC:TTGCGTCCTGTTACGAGAACTCAT
@RG ID:982d5e5f57a08fa66e5a06da72c6d8f3d20d1af5_dna_r10.4.1_e8.2_400bps_sup@v5.0.0_SQK-NBD114-24_barcode21  PU:PAS00041 PM:yoshas_computer  DT:2024-03-03T01:44:27.408+00:00    PL:ONT  DS:basecall_model=dna_r10.4.1_e8.2_400bps_sup@v5.0.0 runid=982d5e5f57a08fa66e5a06da72c6d8f3d20d1af5 LB:no_sample    SM:no_sample    BC:GAGCCTCTCATTGTCCGTTCTCTA
@RG ID:982d5e5f57a08fa66e5a06da72c6d8f3d20d1af5_dna_r10.4.1_e8.2_400bps_sup@v5.0.0_SQK-NBD114-24_barcode22  PU:PAS00041 PM:yoshas_computer  DT:2024-03-03T01:44:27.408+00:00    PL:ONT  DS:basecall_model=dna_r10.4.1_e8.2_400bps_sup@v5.0.0 runid=982d5e5f57a08fa66e5a06da72c6d8f3d20d1af5 LB:no_sample    SM:no_sample    BC:ACCACTGCCATGTATCAAAGTACG
@RG ID:982d5e5f57a08fa66e5a06da72c6d8f3d20d1af5_dna_r10.4.1_e8.2_400bps_sup@v5.0.0_SQK-NBD114-24_barcode23  PU:PAS00041 PM:yoshas_computer  DT:2024-03-03T01:44:27.408+00:00    PL:ONT  DS:basecall_model=dna_r10.4.1_e8.2_400bps_sup@v5.0.0 runid=982d5e5f57a08fa66e5a06da72c6d8f3d20d1af5 LB:no_sample    SM:no_sample    BC:CTTACTACCCAGTGAACCTCCTCG
@RG ID:982d5e5f57a08fa66e5a06da72c6d8f3d20d1af5_dna_r10.4.1_e8.2_400bps_sup@v5.0.0_SQK-NBD114-24_barcode24  PU:PAS00041 PM:yoshas_computer  DT:2024-03-03T01:44:27.408+00:00    PL:ONT  DS:basecall_model=dna_r10.4.1_e8.2_400bps_sup@v5.0.0 runid=982d5e5f57a08fa66e5a06da72c6d8f3d20d1af5 LB:no_sample    SM:no_sample    BC:GCATAGTTCTGCATGATGGGTTAG
2bd74582-be31-445a-ac93-a89fa5f3cb97    4   *   0   0   *   *   0   0   GCTTTTCGCGGCGTGGTAGGAATGATGCACCGCAGGTCAGCTGGAAAGCGTCAAACTCCTGCTTGGCGCAGGTGTGAGTGTAGAGGATAAGAACTGTGGTGTGTTCTCCCTTTTGACAACGGCAATTCGCGAGGGGCACAGGAGTATTGTGCGCTATCTTCTCGACGAAGCAAATGCGGACCCCAATGCACCGGGTGAACACCTCCCTCTTGTCAAGGCGCTACGCAGCTACGACGGCACCAACACGGAAGTCATTCAGATGTTGTTGAGCAGGGGCGCGGACATTAACAGGATGCACAGGGGGTGGAACTCTGTGCTTCCAGGCGGTAAGAGACCCAGCGACACTCAAGTGCTAAGTCTGCTTGTTAGTCTGGGCGGACCTGTCGACCTGCAGGCTGTCGACGAAACTGGGCGGCCAGTGATTGACATTGTTGGCGATCGTGATTGGGAGGAGGGTCTTGCGCTGCTCTTTCCTAAATCTGCCTCCTCGCGGGCTAGACAATGAGTCCGTTTGGCGTGGTTGTTGTTTTCTCTCATTGCTGTTCTTTTACAGACGATCTTTTAGTTT    JFGC=:)?DCADCB;::01-,,+****()))***;;>:;4334@EJSSSSS978IFHGIHLFDA85000BLEC?A222228A?>@DEGDFEDDCCCSLGGHNDD5..134*??=,?MKLSMISLSSSKSSMSSSSSSSSNKNOSSPMSLIIHRIFB>C=.++*,-;=IKSSJPSSLSSJNKHGGGSSOLKSFBBBSILNKSSMSQKF;PGH66IFIGJSLKJMSNSOSPLSSDCBBBNGGHKKSSSORKSSSILMJNSSSSOQNPLS@@@?@ISSSSSSLNSSMMISSSSSSMJSSQS+++++@@@@EDDCAAAA@@><666@FE><:2+**30)))(()99ADDCFEESSNSSLSSSSSSSSSRSSJMJKKHNPQKRSSNPJGIHISGGLGHHSKHNSRIHHIDDABDBDDGCD>@====AD9667=HIOSSSSSSSSOSSOSRKPJMNSSMILSLSIJSSHGFHPJ=??@4CSISSSLLLLH44SSNMJMSSSSSSSSONKSSSSLSMKOSOLSNSHSLHIEEPICAA=?==-----/6:))))-43(**))*&&'&&'&'('''&    BC:Z:SQK-NBD114-24_barcode05    qs:f:17.1091    du:f:1.6992 ns:i:8496   ts:i:822    mx:i:4  ch:i:1816   st:Z:2024-03-03T02:54:01.624+00:00  rn:i:-1 fn:Z:PAS00041_pass_barcode01_0ff99174_982d5e5f_22.pod5  sm:f:-759.842   sd:f:0.00795814 sv:Z:pa dx:i:0  RG:Z:982d5e5f57a08fa66e5a06da72c6d8f3d20d1af5_dna_r10.4.1_e8.2_400bps_sup@v5.0.0_SQK-NBD114-24_barcode05    pi:Z:7e752bca-6565-4225-9fca-6220a000abcc   sp:i:141444

It lists PAS00041_pass_barcode01_0ff99174_982d5e5f_22.pod5 as the source for this read, however:

pod5 view PAS00041_pass_barcode01_0ff99174_982d5e5f_22.pod5 | grep 2bd74582-be31-445a-ac93-a89fa5f3cb97 comes up empty, and so does pod5 view *.pod5 | grep 2bd74582-be31-445a-ac93-a89fa5f3cb97

This read ID does not exist anywhere in the pod5 files used for this test.

There are a few more like this for other barcodes with 1 read each, but granted, it was a tiny test dataset to start with.

Any idea what's going on? Cheers

malton-ont commented 4 months ago

Hi @JWDebler,

This read is a product of read splitting - dorado automatically generates new read-ids for split reads. You can see that it contains a pi:Z tag, which contains the read-id of the parent read from which this entry was generated.
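
As a rough illustration of the pi:Z lookup described above, here is a minimal Python sketch that pulls the read ID and the parent read ID out of one line of samtools view text output. The function name parent_read_id is made up for this example, and the sample record is the barcode05 read from this issue with the sequence, quality string, and most tags cut down for brevity.

```python
from typing import Optional, Tuple


def parent_read_id(sam_line: str) -> Tuple[str, Optional[str]]:
    """Return (read_id, parent_id) for one SAM alignment line.

    parent_id is the value of the pi:Z tag, or None if the tag is absent.
    """
    fields = sam_line.rstrip("\n").split("\t")
    read_id = fields[0]
    # Optional tags start at column 12 (index 11) and look like TAG:TYPE:VALUE.
    for tag in fields[11:]:
        name, _type, value = tag.split(":", 2)
        if name == "pi":
            return read_id, value
    return read_id, None


# Abbreviated copy of the record from this issue (SEQ/QUAL truncated, most tags dropped).
record = "\t".join([
    "2bd74582-be31-445a-ac93-a89fa5f3cb97", "4", "*", "0", "0", "*", "*", "0", "0",
    "GCTT", "JFGC",
    "BC:Z:SQK-NBD114-24_barcode05",
    "pi:Z:7e752bca-6565-4225-9fca-6220a000abcc",
])

read_id, parent_id = parent_read_id(record)
print(read_id, "was split from", parent_id)
```

For the record above, the parent ID is 7e752bca-6565-4225-9fca-6220a000abcc, so that is the ID to grep for in the pod5 view output rather than the ID in the first column.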

JWDebler commented 4 months ago

OK, so because duplex calling currently cannot do barcode demultiplexing (or read splitting), I am losing the reads that were split during the simplex run, since their read IDs don't exist in the pod5 files. I'll have to come up with a workaround to keep them until duplex can do all of that :-)
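
One possible starting point for such a workaround, sketched in Python under the assumption that the aim is to translate BAM read IDs back to IDs that exist in the pod5 files: group every record by its pi:Z parent when it has one, and by its own ID otherwise. The function name group_split_reads and the example IDs unsplit-read-1 and split-read-a are hypothetical; only the pi:Z tag and the parent ID come from this thread.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_split_reads(sam_lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each pod5 read ID to the BAM read IDs derived from it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        fields = line.rstrip("\n").split("\t")
        read_id = fields[0]
        # A split read carries its parent's ID in pi:Z; a read without
        # that tag is assumed to keep its original pod5 read ID.
        source_id = read_id
        for tag in fields[11:]:
            if tag.startswith("pi:Z:"):
                source_id = tag[len("pi:Z:"):]
                break
        groups[source_id].append(read_id)
    return dict(groups)


# Tiny in-memory example; real input would be the text output of samtools view.
unmapped = ["4", "*", "0", "0", "*", "*", "0", "0", "ACGT", "IIII"]
example = [
    "@HD\tVN:1.6\tSO:unknown",
    "\t".join(["unsplit-read-1"] + unmapped),
    "\t".join(["split-read-a"] + unmapped + ["pi:Z:7e752bca-6565-4225-9fca-6220a000abcc"]),
]
for pod5_id, bam_ids in group_split_reads(example).items():
    print(pod5_id, "->", bam_ids)
```

The keys of the returned dictionary are the IDs to look for in the pod5 files, and the values show which BAM entries each one produced.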