Some reads in ```bam``` file produced with ```dorado``` not found in source ```pod5``` file?

drivenbyentropy commented 9 months ago

Hello,

I am facing a strange issue that I am not able to solve on my own and would appreciate any insight as to what I might be missing here. I have a dataset of reads from a plasmid spiked with other, distinct DNA that was sequenced on a PromethION P2 Solo. Relevant details regarding this run are as follows:

Sample Id: Plasmid
Barcoding Enabled: False
Experiment Duration Set: 4320
Experiment Type: genomic_dna
Sample Rate: 4000
Selected bases per second: 260
Sequencer Position and Type: P2S_00177-2 (p2_solo)
Flow Cell Product Code: FLO-PRO114M
Sequencing Kit sqk-lsk114
Protocol Name sequencing/sequencing_PRO114_DNA_e8_2_260T:FLO-PRO114M:SQK-LSK114:260

I am basecalling and aligning the data from a single pod5 file (no other pod5 files are present in the same folder or subfolder) to produce a sorted bam file with dorado v0.4.2 as follows:

dorado basecaller --reference path/to/plasmid.fa \
             --device cuda:all \
             --emit-moves \
              path/to/models/dna_r10.4.1_e8.2_260bps_sup@v4.1.0 \
             /path/to/pod5/ \
              > plasmid.bam;
samtools sort -o plasmid_sorted.bam plasmid.bam; \
samtools index plasmid_sorted.bam;

However, when trying to create pairs of matching raw reads and bam reads, it appears that not all reads present in the bam file can be found in the pod5:

import pod5  #version 0.3.0
import pysam #version 0.21.0

can_pod5_fh = pod5.Reader("/path/to/pod5/output.pod5")
can_bam_fh = pysam.AlignmentFile("/path/to/plasmid_sorted.bam")

#get all primary mappers in the bam file
bam_read_ids = set([read.query_name for read in can_bam_fh.fetch() if not (read.is_supplementary or read.is_secondary)])
print(f"Total number of primary bam reads {len(bam_read_ids)}")

#get all ids of the pod5 reads
pod5_read_ids = set()
for pod5_read in can_pod5_fh.reads():
    pod5_read_ids.add(str(pod5_read.read_id))
print(f"Obtained a total of {len(pod5_read_ids)} pod5 reads")

#compute the intersection between the reads
intersection_read_ids = bam_read_ids.intersection(pod5_read_ids)
print(f"Intersection size is {len(intersection_read_ids)}")

Total number of primary bam reads 1384732
Obtained a total of 11578864 pod5 reads
Intersection size is 1376053

My expectation would be that the intersection size between the bam ids and the pod5 ids corresponds to the number of bam ids. What could I be missing here that could cause such a scenario? Where could the additional bam ids be coming from? What approach could I take to debug this issue further?

Thank you very much in advance for your advice!

tijyojwad commented 9 months ago

Hi @drivenbyentropy - this is mainly caused by read splitting in dorado. The parent read id of the split read will correspond to the original entry in the pod5. Here's some info on split read specific tags - https://github.com/nanoporetech/dorado/blob/master/documentation/SAM.md#split-read-tags

drivenbyentropy commented 9 months ago

@tijyojwad Thank you very much for your prompt answer, this is exactly what I was missing. I encountered this issue when working with the remora API which appears not to support split read input and calls can_pod5_fh.reads() with missing_ok=False hence failing if any of the bam reads were split. I will open a separate ticket over there for this issue.

Thank you very much again for pointing me towards the solution!

nanoporetech / dorado

Some reads in ```bam``` file produced with ```dorado``` not found in source ```pod5``` file? #476