iqbal-lab-org / gramtools

Genome inference from a population reference graph
MIT License
92 stars 15 forks source link

Regression in allele combination count storage #92

Closed iqbal-lab closed 6 years ago

iqbal-lab commented 6 years ago

Martin spotted a regression on Friday night. Current code never records >0 for more than one allele (tested on 138 samples), whereas previous commits on the same samples did.

Allele counts from that old run: {0,1,2}=>1, {1,2} => 6, {1} => 50, {2} => 1

Allele counts from current run 0 => 1, 1=>51, 2=>6

Raising this from slack channel so it's not lost. Martin is at a confernece so hasn';t had time to raise it, just sticking my nose in so it's not lost

martinghunt commented 6 years ago

Specifically, in the file grouped_allele_counts_coverage.json, every value (which is a list) in json ['grouped_allele_counts']['allele_groups'] has length 1.

ffranr commented 6 years ago

This is related to this issue: https://github.com/iqbal-lab-org/gramtools/issues/90 @martinghunt Can you give me a minimal example of this error?

martinghunt commented 6 years ago

Not really minimal, but a real one on the EBI cluster. I'll message you...

martinghunt commented 6 years ago

The example I looked into in detail had equivalence class counts: {0}:1, {1}:51, {2}:6 The code used to produce this on the same data: {0,1,2}:1, {1,2}: 6, {1}: 50, {2}: 1 ie the allele combinations are not being counted.

iqbal-lab commented 6 years ago

@martinghunt - this is on tip of master?

martinghunt commented 6 years ago
    "version_report": {
        "version_number": "0.5.0",
        "last_git_commit_hash": "47e41d7691954eea17b195e49bbe39729e389f7d",
        "current_git_branch": "HEAD",
        "truncated_git_commits": [
            "47e41d7 - Robyn Ffrancon, 24 hours ago : read mapping selection for coverage includes multiple allele encapsulated mappings and non-site mappings",
            "7dee1b7 - Robyn Ffrancon, 24 hours ago : WIP: read mapping to non-site split on SA index; allele encapsulated mappings identified",
            "14f468d - Robyn Ffrancon, 9 days ago : WIP: functionality for identifying read variant site paths when reads quasimap in their entirety to long alleles",
            "549fc5c - Robyn Ffrancon, 9 days ago : added assert to ensure kmer bases are valid for all kmers serialization",
            "97b3a49 - Robyn Ffrancon, 13 days ago : added parameters for setting the maximum thread count"
        ]
    }
ffranr commented 6 years ago

Either issue https://github.com/iqbal-lab-org/gramtools/issues/90 is wrong. Or my implementation of the schema from issue https://github.com/iqbal-lab-org/gramtools/issues/90 is wrong.

I understand that the code produces a different result after implementing issue https://github.com/iqbal-lab-org/gramtools/issues/90.

I would like a minimal example so that we can revise issue https://github.com/iqbal-lab-org/gramtools/issues/90 or fix my implementation.

martinghunt commented 6 years ago

Suppose case 3 in #90. A read maps to one site, but to alleles 4 and 5. The counter for {4,5} should get incremented, but that doesn't look like it's happening, which is a regression from implementing issue #81

iqbal-lab commented 6 years ago

i can't see case 2?

martinghunt commented 6 years ago

@iqbal-lab I edited to say case 3, not 2 (but slack doesn't show that). It's where a read maps to a single site.

ffranr commented 6 years ago

Consider a read which maps to a single site, in order to get the multiple allele counter event (example was for allele group {4,5} above) the read either needs to start or end within that site. Not only that, but the alleles within the site need have some similarity.

Given our new selection schema, I don't expect that to happen often. Do you guys agree?

Is there another case where a read completely overlaps a site and maps through two different alleles?

iqbal-lab commented 6 years ago

@martinghunt 's example is one where multiple alleles agree at the start ( or end) so a single read which mostly lies outside the site, overlaps two alleles at the end

iqbal-lab commented 6 years ago

Martin - could you stick the site and the pileup in?

iqbal-lab commented 6 years ago

We're not talking about a read covering a whole site and a second allele

iqbal-lab commented 6 years ago

"Given our new selection schema, I don't expect that to happen often. Do you guys agree?" Our selection schema does not involve randomly selecting when a single read matches multiple alleles

martinghunt commented 6 years ago

Alleles: TTCAAA,TGGTCAA,TGGTCAAA. See the reads mapping just into the ends, like the one highlighted in red. They match >1 allele. I expect this kind of case to happen frequently.

img_20180309_172448

iqbal-lab commented 6 years ago

Particularly for minos, when combining calls with no nesting, this does happen

iqbal-lab commented 6 years ago

I note that https://github.com/iqbal-lab-org/gramtools/issues/90 does not mention the situation when one read matches one site, hitting multiple alleles, which results in counts of {0,1,2} and {0} and ... etc The intention was that the solution to that issue did not affect these counters - does that match witjh your understanding @rffrancon ?

martinghunt commented 6 years ago

I understood #90 in the handling selected option section "Record coverage information for all relevant alleles" in cases 3-5 to mean counts of eg {0,1,2} .. etc when the read hits more than one allele. ie once the position of the read has been chosen (whether or not there was >1 site from which to choose), the multiple allele counters are updated.

ffranr commented 6 years ago

Thank you for all of the information. As far as I can tell issue https://github.com/iqbal-lab-org/gramtools/issues/90 is still correct.

I've pushed a potential solution to this branch: https://github.com/iqbal-lab-org/gramtools/commits/map_single_site_multiple_alleles

Solution commit: https://github.com/iqbal-lab-org/gramtools/commit/2c776c2a6633fdcaaebce33e47f60f460ffb418b Feel free to review my code.

@martinghunt Please let me know if this solves the issue.

iqbal-lab commented 6 years ago

Yes issue 90 statement still correct imo

martinghunt commented 6 years ago

Thanks @rffrancon looks fixed. I reran the example above using 2c776c2 and looks good.

Allele counts are now: {1}: 50, {1,2}: 6, {0,1,2}: 2

The bug was causing me to incorrectly call homozygous variants as heterozygous. I count hets as incorrect. On that example, I've gone from 4148 true positives to 4317 true positives, so looks good.

iqbal-lab commented 6 years ago

Sorry I have not done code review yet - will doubtless end up with questions, will ping when have done

ffranr commented 6 years ago

Solution implemented: https://github.com/iqbal-lab-org/gramtools/commit/2c776c2a6633fdcaaebce33e47f60f460ffb418b

Martin has given feedback and it seems to be working as intended.