nanoporetech / remora

Methylation/modified base calling separated from basecalling.
https://nanoporetech.com

data prep #191

Closed Jeff-Field closed 16 hours ago

Jeff-Field commented 1 week ago

I am writing to continue an earlier thread. After a long learning curve, we have successfully used Remora to read mods. However, we have run into a new issue, which I believe was caused by updates to either Remora or Dorado over the last few months. Previously, we had a nicely aligned dataset that came out of Dorado, which we fed into Remora. Recently, though, the data coming out of Dorado has had a lot of unwanted sequences in it. These appear as question marks. I believe these are failed reads or poorly aligned sequences. They disappear from view when we set IGV to use a mapping quality threshold >1. These were not in our original BAM files, but after updates (I am not sure to which software, as it could be MinKNOW or Dorado) we started seeing them in the BAM files. In MinKNOW, it could be an update that stopped our system from doing live basecalling, or one that removed the 200 bp cutoff we previously used. Alternatively, it could be tweaks to the Dorado aligner. I am unsure whether these reads will now reduce the quality of our training in Remora.

My question is: is it necessary to remove low-quality alignments for training? If so, how can I remove them? My preference is to remove them using Dorado, but can they be removed with another program such as samtools or the pod5 tools?

marcus1487 commented 5 days ago

Filtering out spurious mappings can certainly help training a high quality model. Not knowing enough about your training data I can't make specific recommendations. If your training data are close enough to our internal data types I can recommend filtering criteria. If you can expand on the type of data you are using for training I can try to make a more sound recommendation, though I would note that the ultimate goal is a high quality final model and there is no substitute for testing filter thresholds in your setting and validating on a gold standard dataset.
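One way to start testing filter thresholds is to see how many alignments would survive each candidate mapping-quality cutoff before training anything. The sketch below is only illustrative: the function name `retained_by_threshold`, the candidate thresholds, and the example MAPQ values are all made up here and are not recommendations from the Remora developers.

```python
# Minimal sketch: count how many alignments remain at each MAPQ cutoff.
# The thresholds and the example values are hypothetical, for illustration only.

def retained_by_threshold(mapqs, thresholds=(0, 1, 10, 20, 30)):
    """Return {threshold: number of alignments with MAPQ >= threshold}."""
    return {t: sum(1 for q in mapqs if q >= t) for t in thresholds}

# Hypothetical MAPQ values, e.g. taken from column 5 of SAM records.
mapqs = [0, 0, 3, 12, 27, 60, 60]
counts = retained_by_threshold(mapqs)

for threshold, n in counts.items():
    print(f"MAPQ >= {threshold}: {n} alignments retained")
```

A table like this only shows how much data each cutoff costs. Whether a cutoff actually improves the model still has to be checked against a gold standard dataset, as noted above.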

Jeff-Field commented 4 days ago

We have since been able to remove these reads with samtools, but I think there were some changes in Dorado so that unmapped reads or reads with a mapping quality of 0 are no longer removed.
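For readers with the same problem, the kind of filter described here can be sketched in plain Python over SAM text records. This is not the exact command used in this thread, which was not posted. It assumes the goal is to drop unmapped, secondary, and supplementary records plus anything below a minimum MAPQ. The function name `filter_sam_lines` and the example records are hypothetical. The flag bits and column positions come from the SAM format.

```python
# Minimal sketch: filter SAM text lines by FLAG and MAPQ.
# Assumption: we want mapped, primary alignments with MAPQ >= min_mapq.
# Roughly comparable to: samtools view -q 1 -F 0x904 in.bam

UNMAPPED = 0x4         # SAM flag bit: read is unmapped
SECONDARY = 0x100      # SAM flag bit: secondary alignment
SUPPLEMENTARY = 0x800  # SAM flag bit: supplementary alignment


def filter_sam_lines(lines, min_mapq=1,
                     drop_flags=UNMAPPED | SECONDARY | SUPPLEMENTARY):
    """Yield header lines and alignment lines that pass the filter."""
    for line in lines:
        if line.startswith("@"):
            # Header lines are passed through unchanged.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])  # column 2: FLAG
        mapq = int(fields[4])  # column 5: MAPQ
        if flag & drop_flags:
            continue
        if mapq < min_mapq:
            continue
        yield line


# Hypothetical records: one good alignment, one unmapped read, one MAPQ 0 hit.
example = [
    "@HD\tVN:1.6\tSO:unsorted",
    "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\t!!!!",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\t!!!!",
    "read3\t0\tchr1\t200\t0\t4M\t*\t0\t0\tACGT\t!!!!",
]
kept = list(filter_sam_lines(example))
print([line.split("\t")[0] for line in kept])
```

With `min_mapq=1`, this matches the IGV behaviour described earlier in the thread, where the unwanted reads disappear once MAPQ 0 alignments are hidden.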

marcus1487 commented 16 hours ago

I would suggest checking the Dorado changelog or opening an issue with Dorado about any changes on the basecaller side. It sounds like this issue is resolved on the Remora side, so I'll close it, but feel free to re-open it if you have further questions.