Similar database sequences leading to ambiguous mappings

AShaw1802 commented 8 months ago

I think we may be seeing a case in Pakistan where wt1 reads are mapping to reference sequences that are too similar to each other in the database, so they are being assigned as ambiguous mappings (I'm trying to get the raw data now to share). Is there a measure of how similar the reference sequences can be before it becomes an issue? We can screen the current database, but people may add their own sequences in the future. Is there a way that Piranha could cope with similar references?

aineniamh commented 7 months ago

Notes:

Current approach:

In the minimap2 command we use --secondary=no to suppress secondary alignments (aka multiple mappings), but this does not suppress supplementary alignments (aka split or chimeric alignments).
Similar to how RAMPART handles what we call ambiguous mappings, where a read maps to multiple references in the database, we currently filter these out as there is no clear signal for a specific reference.
This works well when dealing with Sabin sequences, as there is a single reference for each category.
Problems arise when dealing with wild-polio and non-polio enteroviruses, where the representation of diversity in the database may be poor and a given sample may be equally diverged from multiple references in the database.
These cases also raise questions about mapping quality thresholds. In some recent WPV1 data we examined, the mapping qualities can be very low because of divergence.
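To make the current behaviour concrete, here is a minimal sketch of filtering ambiguous mappings from PAF output. It is not Piranha's actual code: the function name `split_unique_and_ambiguous` is hypothetical, and it assumes standard PAF columns (query name in column 1, target name in column 6).

```python
from collections import defaultdict


def split_unique_and_ambiguous(paf_lines):
    """Separate reads with one reference hit from reads hitting several.

    Hypothetical helper for illustration only.
    """
    refs_by_read = defaultdict(set)
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # not a complete PAF record
        read_name, ref_name = fields[0], fields[5]
        refs_by_read[read_name].add(ref_name)

    unique = {}
    ambiguous = {}
    for read_name, refs in refs_by_read.items():
        if len(refs) == 1:
            unique[read_name] = next(iter(refs))
        else:
            # More than one reference: no clear signal, so set aside.
            ambiguous[read_name] = sorted(refs)
    return unique, ambiguous
```

With a poorly represented part of the tree, many reads can end up in `ambiguous` and be lost, which is the behaviour described in this issue.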

Plan

We propose to change the mapping parsing steps as follows:

Ambiguous mappings will be included, with the hit that has the longest alignment block used as the hit (this may not equate to the highest mapping quality)
We will note which references are cross-mapping and document/report this, as it suggests this section of the tree is poorly represented
We will also lower the mapping quality threshold, as the current threshold appears to be problematic and is incorrectly eliminating diverged references
With this lower threshold, it may be worth reporting that some of the hits may be lower quality mappings
This also raises the question of whether de novo assembly may be a more appropriate approach for these diverged sequences, although this would likely come with a greater RAM requirement
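The parsing change proposed above could look roughly like the sketch below. This is an illustration under assumptions, not the code merged into Piranha: `assign_by_longest_block` and its `min_mapq` parameter are hypothetical names, and it relies on the standard PAF layout (alignment block length in column 11, mapping quality in column 12).

```python
from collections import Counter, defaultdict
from itertools import combinations


def assign_by_longest_block(paf_lines, min_mapq=0):
    """Pick one reference per read by longest alignment block.

    min_mapq is a hypothetical reporting threshold: hits below it are
    kept but flagged as low quality, rather than discarded.
    """
    hits = defaultdict(list)
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # not a complete PAF record
        read, ref = fields[0], fields[5]
        block_len, mapq = int(fields[10]), int(fields[11])
        hits[read].append((block_len, mapq, ref))

    assignments = {}
    cross_mapping = Counter()
    for read, read_hits in hits.items():
        # max() keeps the first hit seen when block lengths tie.
        block_len, mapq, ref = max(read_hits, key=lambda hit: hit[0])
        assignments[read] = {
            "reference": ref,
            "block_len": block_len,
            "mapq": mapq,
            "low_quality": mapq < min_mapq,
        }
        # Count each pair of references that the same read mapped to.
        refs = sorted({hit[2] for hit in read_hits})
        for pair in combinations(refs, 2):
            cross_mapping[pair] += 1
    return assignments, cross_mapping
```

The `cross_mapping` counter gives the pairs of references to document in the report, and the `low_quality` flag is one possible way to surface hits that only pass because of the lowered threshold.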

aineniamh commented 7 months ago

https://github.com/polio-nanopore/piranha/pull/222

Dev work on this issue continues. More permissive PAF parsing now raises the issue of needing to mask regions that do not have good coverage because of mapping failure.
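As a rough illustration of the masking idea, the sketch below replaces consensus positions with N wherever read depth falls below a cutoff. It is an assumption about how such masking could work, not the implementation in the PR: `mask_low_coverage` and the `min_depth` default are hypothetical, and intervals are taken as 0-based, end-exclusive target coordinates as in PAF columns 8 and 9.

```python
def mask_low_coverage(consensus, intervals, min_depth=10):
    """Mask positions covered by fewer than min_depth alignments with N.

    intervals: iterable of (target_start, target_end) pairs, 0-based and
    end-exclusive. Hypothetical helper for illustration only.
    """
    length = len(consensus)
    # Difference array: +1 where an alignment starts, -1 where it ends.
    delta = [0] * (length + 1)
    for start, end in intervals:
        start = max(0, start)
        end = min(length, end)
        if start >= end:
            continue  # interval falls outside the sequence
        delta[start] += 1
        delta[end] -= 1

    masked = []
    depth = 0
    for position, base in enumerate(consensus):
        depth += delta[position]
        masked.append(base if depth >= min_depth else "N")
    return "".join(masked)
```

For example, `mask_low_coverage("ACGTACGT", [(0, 4), (2, 6)], min_depth=2)` keeps only the two positions covered by both alignments.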

aineniamh commented 1 month ago

This is now resolved on main.

polio-nanopore / piranha

Similar database sequences leading to ambiguous mappings #219
