iqbal-lab-org / gramtools

Genome inference from a population reference graph
MIT License

Selecting mapping instance(s) for a read which maps to multiple positions #90

Closed ffranr closed 6 years ago

ffranr commented 6 years ago

Quasimapping gives multiple mapping options

A read maps to 7 "positions" within the PRG:

  1. non-variant region 1
  2. non-variant region 2
  3. site 2
  4. site 2 + site 3 + site 4
  5. site 1 (within allele 1 twice or more)
  6. site 5 (completely encapsulated within allele 1 and completely encapsulated within allele 2)
  7. site 6 (completely encapsulated within allele 1, completely encapsulated within allele 2, and not completely encapsulated within allele 3).

A single option is chosen from [1, 7] via uniform random selection.


Handling Selected Option

  1. Move on to the next read without modifying coverage information.
  2. Move on to the next read without modifying coverage information.
  3. Record coverage information for all relevant alleles.
  4. Record coverage information for all relevant alleles.
  5. Uniform random selection on options within allele 1, followed by recording coverage information for all relevant alleles.
  6. Record coverage information for all relevant alleles.
  7. Record coverage information for all relevant alleles.
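
The two-stage selection described above could be sketched roughly as follows. This is only an illustration, not the gramtools implementation: the `OPTIONS` layout, the `(site, allele, start)` triples, the start offsets, the allele numbers used for options 3 and 4, and the name `select_and_record` are all assumptions made for the example.

```python
import random

# Hypothetical layout: each option is a list of placements, and each placement
# is a list of (site, allele, start) triples that the read touches.
# A non-variant region is a placement with no triples.
OPTIONS = [
    [[]],                                        # 1. non-variant region 1
    [[]],                                        # 2. non-variant region 2
    [[(2, 1, 0)]],                               # 3. site 2 (allele is illustrative)
    [[(2, 1, 0), (3, 1, 0), (4, 1, 0)]],         # 4. site 2 + site 3 + site 4
    [[(1, 1, 0)], [(1, 1, 7)]],                  # 5. site 1, two placements in allele 1
    [[(5, 1, 0), (5, 2, 0)]],                    # 6. site 5, alleles 1 and 2
    [[(6, 1, 0), (6, 2, 0), (6, 3, 0)]],         # 7. site 6, alleles 1, 2 and 3
]


def select_and_record(options, coverage, rng=random):
    """Pick one option uniformly, then one placement within it, then record."""
    option = rng.choice(options)       # stage 1: uniform over the options
    placement = rng.choice(option)     # stage 2: uniform over placements in it
    for key in placement:              # empty for non-variant regions
        coverage[key] = coverage.get(key, 0) + 1
    return placement


coverage = {}
chosen = select_and_record(OPTIONS, coverage, random.Random(0))
```

For options with a single placement the second stage is a no-op, so only option 5 actually uses it.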

ffranr commented 6 years ago

@martinghunt @iqbal-lab How does that sound?

martinghunt commented 6 years ago

Fine with me, so long as we're happy that there are two positions in 5 but uniform random selection is on [1,2,3,4,5], which halves the probability of each of the two positions within site 1 allele 1?
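
To make the arithmetic concrete, here is a small sketch comparing the two-stage scheme with a flat draw over every individual position. It assumes exactly two positions in option 5, as in the comment above; the function name `position_probabilities` is made up for the example.

```python
from fractions import Fraction


def position_probabilities(positions_per_option):
    """Per-position probability under two-stage selection vs. a flat draw."""
    n_options = len(positions_per_option)
    n_positions = sum(positions_per_option)
    # Two-stage: 1/n_options for the option, then 1/k within an option of k positions.
    two_stage = [Fraction(1, n_options * k) for k in positions_per_option]
    # Flat: every individual position is equally likely.
    flat = Fraction(1, n_positions)
    return two_stage, flat


two_stage, flat = position_probabilities([1, 1, 1, 1, 2])
print(two_stage[0], two_stage[-1], flat)  # 1/5 1/10 1/6
```

So each position in option 5 is picked with probability 1/10, against 1/5 for the single position in any other option, and against 1/6 if all six positions were drawn from directly.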

iqbal-lab commented 6 years ago

I'm ok with this, but for option 5, shouldn't it read:

Uniform random selection on options within allele 1, followed by Record coverage information for all relevant alleles

iqbal-lab commented 6 years ago

Does that make sense?

ffranr commented 6 years ago

@iqbal-lab OK, updated point 5.

iqbal-lab commented 6 years ago

Cool, just checking I had not misunderstood anything.

ffranr commented 6 years ago

@martinghunt This work should be completed with this commit: https://github.com/iqbal-lab-org/gramtools/commit/47e41d7691954eea17b195e49bbe39729e389f7d

This commit also contributed to correctly implementing this issue: https://github.com/iqbal-lab-org/gramtools/commit/2c776c2a6633fdcaaebce33e47f60f460ffb418b

If there are further changes to be made, let's keep them to this issue.

iqbal-lab commented 6 years ago

To discuss after Easter: I think our handling of item 4 is not right. The right thing would be to choose the site where the mate read maps closest, but that's an enhancement. I'm not sure what the right thing is.

ffranr commented 6 years ago

@martinghunt @iqbal-lab I'm currently handling this issue: "quasimap Assertion `index_end_boundary >= allele_coverage_offset' failed". I understand why the error occurs and I've implemented a partial solution. However, that issue has led me back to this one.

I'm uncertain about how to handle point 6 and point 7 (see first post in this issue: "Quasimapping gives multiple mapping options").

Please help me understand how I should deal with this.

iqbal-lab commented 6 years ago

OK, so suppose we now have

A read maps to SEVEN "positions" within the PRG:

  1. non-variant region 1
  2. non-variant region 2
  3. site 2
  4. site 2 + site 3 + site 4
  5. site 1 (within allele 1 twice or more)
  6. site 5 (completely encapsulated within allele 1 and completely encapsulated within allele 2)
  7. site 6 (completely encapsulated within allele 1, completely encapsulated within allele 2, and not completely encapsulated within allele 3).

A single option is chosen from [1, 7] via uniform random selection. Once you have chosen, deal with it as follows:

For 1..5 this is identical to above

  1. Move on to the next read without modifying coverage information.
  2. Move on to the next read without modifying coverage information.
  3. Record coverage information for all relevant alleles.
  4. Record coverage information for all relevant alleles.
  5. Uniform random selection on options within allele 1, followed by recording coverage information for all relevant alleles.
  6. Record coverage information for all relevant alleles.
  7. Record coverage information for all relevant alleles.

A read lying completely within an allele is treated the same as one partially overlapping an allele: record the per-base coverage and the equivalence classes/partitions as normal.
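
As a rough illustration of why the two cases need no separate code path, the sketch below records a read by the span of allele bases it covers, whether that span is interior to the allele or covers it end to end. Everything here is hypothetical: the `record_read` name, the data layout, the allele lengths, and the reading of "equivalence classes" as one count per set of compatible alleles are assumptions, not the gramtools data structures.

```python
from collections import Counter


def record_read(site, hits, base_cov, group_counts):
    """Record one read at one site.

    hits maps allele id -> (start, end), the half-open span of allele bases
    the read covers. The span may be interior to the allele or run to its edges.
    """
    for allele, (start, end) in hits.items():
        for i in range(start, end):
            base_cov[site][allele][i] += 1
    # One count for the set of alleles this read is compatible with.
    group_counts[site][frozenset(hits)] += 1


# Site 6 with three alleles of illustrative lengths 10, 10 and 4.
base_cov = {6: {1: [0] * 10, 2: [0] * 10, 3: [0] * 4}}
group_counts = {6: Counter()}

# The read sits inside alleles 1 and 2, but spans all of allele 3 and beyond.
record_read(6, {1: (2, 8), 2: (2, 8), 3: (0, 4)}, base_cov, group_counts)
```

The same loop handles both the encapsulated and the overlapping alleles; only the spans differ.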

ffranr commented 6 years ago

@iqbal-lab Thanks! I've updated the first post.

iqbal-lab commented 6 years ago

Sounds good - does this need to be open?