skovaka / UNCALLED

Raw nanopore signal mapper that enables real-time targeted sequencing
MIT License
520 stars 44 forks source link

Read lengths in .paf files #35

Closed chAwater closed 3 years ago

chAwater commented 3 years ago

Hi,

I found a "weird" result in .paf files

Is that a misplaced property for real-time mode? Where to find the "real" read or event length?


For example:

uncalled map -t 16 E.coli my.fast5 > uncalled_out.paf
head -n 4 uncalled_out.paf

Output:

b84a48f0-9e86-47ef-9d20-38a0bded478e    256     125     256     -    ...
77fe7f8c-32d6-4789-9d62-41ff482cf890    92      66      92      -    ...
eee4b762-25dd-4d4a-8a59-be47065029be    4843    *       *       *    ...
e175c87b-a426-4a3f-8dc1-8e7ab5fdd30d    192     143     192     +    ...
skovaka commented 3 years ago

Whoops, you're right! Awhile ago I made the map PAF output behave similar to realtime after some code refactoring (i.e. only include how much the read was considered), and forgot to update the README to reflect that. To be honest, none of those read length/position estimates are very accurate since they assume the DNA moves at a constant 450bp/sec, while the average is closer to 400. I plan to improve that in the future, but just went with the easiest solution for now.

You can compute what I used to put in that field by multiplying "duration" column by 450 in the "sequencing_summary.txt" file output by basecallers. A better estimate could be based on "template_duration", which trims a bit of un-basecalled signal.

Would it be helpful to you if I made the read length field estimate based on the full read again? It's not too difficult to fix, but I didn't think it was worth complicating the code since it's such a rough estimate.

skovaka commented 3 years ago

I've updated the README to reflect how it actually works, but still open to suggestions.

chAwater commented 3 years ago

Thank you for the quick respond and update in https://github.com/skovaka/UNCALLED/commit/0fc1cab738b305bdb2f2b647312f00baac642d3b . It completely solve my question.

I think the estimation is good enough for now, and I have no idea how to improve this. But I really like the idea of realtime (both in 3rd-gen. sequencing and UNCALLED). So just let it be! :smile: