gui11aume / mmp

MEM mapper prototype

Decide the seeding strategy (MEM or skip) before seeding #5

Closed by gui11aume 5 years ago

gui11aume commented 5 years ago

We can estimate the mapping quality before mapping, so we can switch from the default MEM seeds to the more sensitive skip seeds to maintain the mapping quality above a certain level.

This can be achieved by running the quality() function on the read before anything else happens. We may have to change the logic of the function accordingly.

gui11aume commented 5 years ago

Initial tests revealed that choosing the strategy before seeding is less efficient than mapping first and trying again if the quality is too low. The reasons are as follows:

  1. Estimating the mapping quality from the read alone is less accurate. The read can be "damaged", or it may not even belong to the genome, in which case the estimate is meaningless.
  2. Estimating the mapping quality takes almost as much time as mapping, so doing it twice on every read (before and after mapping) is slow.
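
A rough cost model makes the second point concrete. The symbols are assumptions introduced here for illustration, not measurements from mmp: $m$ is the cost of one mapping pass, $q$ the cost of one quality estimate, and $f$ the fraction of reads that are mapped a second time when mapping first and retrying.

$$
C_{\text{before}} = q + m + q, \qquad C_{\text{retry}} = (m + q) + f\,(m + q)
$$

If $q \approx m$, as observed in the tests, and if a second pass costs about the same as the first (a simplification, since more sensitive seeds may be slower), then

$$
C_{\text{before}} \approx 3m, \qquad C_{\text{retry}} \approx 2m\,(1 + f),
$$

so under these assumptions mapping first is cheaper whenever $f < 1/2$.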

The empirical workflow that was retained is the following:

  1. Map the read with MEM seeds.
  2. Estimate the mapping quality.
  3. If the quality is more than 40 or less than 20, print output and stop.
  4. Otherwise, remap the read with skip-8 seeds.
  5. Estimate the new mapping quality.
  6. Print output and stop.

The reason for not re-mapping the read when the quality is less than 20 is that the target is probably a strongly repeated sequence and changing the seeding method will not improve the result (the mapping quality will remain low). In Drosophila, approximately 5% of the reads are mapped twice, so the time penalty is low. On the other hand, the benefit is also low.
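
The workflow above can be sketched in C as follows. This is a minimal illustration, not the mmp code: `seed_mode_t`, `map_result_t`, `map_fn_t`, `map_with_retry()` and `toy_map_and_score()` are hypothetical names, and the mapper is passed in as a callback that maps the read and returns the estimated quality in one call. Only the thresholds (20 and 40) and the order of the steps come from the description above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical seeding modes (illustrative names, not from mmp). */
typedef enum { SEED_MEM, SEED_SKIP8 } seed_mode_t;

typedef struct {
   seed_mode_t mode;   /* Seeding used for the reported result. */
   double      qual;   /* Estimated mapping quality of that result. */
   int         passes; /* Number of mapping passes performed (1 or 2). */
} map_result_t;

/* Assumed interface: map the read with the given seeds and return the
 * estimated mapping quality of the best hit. */
typedef double (*map_fn_t)(const char *read, seed_mode_t mode);

#define QUAL_HIGH 40.0  /* Above this, the MEM result is good enough. */
#define QUAL_LOW  20.0  /* Below this, the target is probably repeated. */

map_result_t
map_with_retry(const char *read, map_fn_t map_and_score)
{
   assert(read != NULL && map_and_score != NULL);

   /* Steps 1-2: map with MEM seeds and estimate the quality. */
   map_result_t res = { SEED_MEM, 0.0, 1 };
   res.qual = map_and_score(read, SEED_MEM);

   /* Step 3: stop if the quality is high, or so low that changing the
    * seeds is unlikely to help. Both comparisons are strict, so a
    * quality of exactly 20 or 40 is remapped. */
   if (res.qual > QUAL_HIGH || res.qual < QUAL_LOW) return res;

   /* Steps 4-5: remap with skip-8 seeds and re-estimate the quality.
    * The second result is reported as is (no "best of two" logic). */
   res.mode = SEED_SKIP8;
   res.qual = map_and_score(read, SEED_SKIP8);
   res.passes = 2;
   return res;
}

/* Toy stand-in for a real mapper, for demonstration only: the first
 * character of the "read" selects a fixed MEM quality ('H' = 50,
 * 'L' = 10, anything else = 30) and skip-8 seeds add 15 to it. */
double
toy_map_and_score(const char *read, seed_mode_t mode)
{
   double q = (read[0] == 'H') ? 50.0 : (read[0] == 'L') ? 10.0 : 30.0;
   return (mode == SEED_SKIP8) ? q + 15.0 : q;
}
```

Calling `map_with_retry(read, toy_map_and_score)` performs a single pass for reads starting with `H` or `L` and two passes for any other read. Step 6 (printing the output) is left to the caller, which keeps the decision logic separate from I/O.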