nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

Dorado creates new reads? #673

Closed mbhall88 closed 6 months ago

mbhall88 commented 6 months ago

Issue Report

Please describe the issue:

I extracted the read IDs from a FASTQ produced by a sup dorado basecalling run (v0.5.0). I then passed those read IDs to pod5 subset, pointing it at the pod5 files I basecalled, and some read IDs from the FASTQ do not exist in the pod5 files. Indeed, when I inspect the pod5 file that a particular FASTQ read comes from (using the fn:Z:<fname> tag), that read ID is not in there. Is this a known thing?

In the runs I am working with there are 2494275/34092978 (7.3%) fastq reads with no associated pod5 read.

Run environment:

tijyojwad commented 6 months ago

Yes, this can happen because of read splitting. The original reads from the pod5 are split into subreads, which have different read IDs. However, each of those records will have a pi:Z tag that points to the original read ID it came from.
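
For anyone who needs to get back to pod5 read IDs from a FASTQ, here is a minimal sketch of the lookup described above. It assumes the FASTQ headers carry SAM-style tags separated by whitespace (as with the fn:Z tag mentioned in the question) and that records are the usual four lines each. The function names `parent_read_id` and `pod5_read_ids` are made up for illustration and are not part of dorado or pod5.

```python
import re

# SAM-style string tag holding the parent read ID, e.g. "pi:Z:<read-id>".
# Assumption: tags follow the read ID in the header, separated by whitespace.
_PI_TAG = re.compile(r"(?:^|\s)pi:Z:(\S+)")


def parent_read_id(header: str) -> str:
    """Return the read ID to look up in the pod5 for one FASTQ header line.

    If a pi:Z tag is present, the record is a split subread and the tag value
    is the original read ID. Otherwise the record's own read ID is returned.
    """
    match = _PI_TAG.search(header)
    if match:
        return match.group(1)
    # No pi:Z tag: the first whitespace-delimited field (minus '@') is the ID.
    return header.lstrip("@").split()[0]


def pod5_read_ids(fastq_lines):
    """Collect the unique pod5 read IDs for an iterable of FASTQ lines.

    Assumes strict four-line FASTQ records (header, sequence, '+', quality).
    """
    ids = set()
    for i, line in enumerate(fastq_lines):
        if i % 4 == 0 and line.strip():
            ids.add(parent_read_id(line.rstrip("\n")))
    return ids
```

Writing the resulting set to a file, one ID per line, gives a read ID list that should match what is actually in the pod5 files. Several subreads from the same parent collapse to a single ID.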

esteinig commented 6 months ago

Could you explain what will cause a Pod5 read to be split?

tijyojwad commented 6 months ago

some detail here - https://github.com/nanoporetech/dorado/blob/release-v0.5.3/documentation/SAM.md#split-read-tags