Closed iqbal-lab closed 6 years ago
Specifically, in the file grouped_allele_counts_coverage.json
, every value (which is a list) in json ['grouped_allele_counts']['allele_groups']
has length 1.
This is related to this issue: https://github.com/iqbal-lab-org/gramtools/issues/90 @martinghunt Can you give me a minimal example of this error?
Not really minimal, but a real one on the EBI cluster. I'll message you...
The example I looked into in detail had equivalence class counts:
{0}:1, {1}:51, {2}:6
The code used to produce this on the same data:
{0,1,2}:1, {1,2}: 6, {1}: 50, {2}: 1
ie the allele combinations are not being counted.
@martinghunt - this is on tip of master?
"version_report": {
"version_number": "0.5.0",
"last_git_commit_hash": "47e41d7691954eea17b195e49bbe39729e389f7d",
"current_git_branch": "HEAD",
"truncated_git_commits": [
"47e41d7 - Robyn Ffrancon, 24 hours ago : read mapping selection for coverage includes multiple allele encapsulated mappings and non-site mappings",
"7dee1b7 - Robyn Ffrancon, 24 hours ago : WIP: read mapping to non-site split on SA index; allele encapsulated mappings identified",
"14f468d - Robyn Ffrancon, 9 days ago : WIP: functionality for identifying read variant site paths when reads quasimap in their entirety to long alleles",
"549fc5c - Robyn Ffrancon, 9 days ago : added assert to ensure kmer bases are valid for all kmers serialization",
"97b3a49 - Robyn Ffrancon, 13 days ago : added parameters for setting the maximum thread count"
]
}
Either issue https://github.com/iqbal-lab-org/gramtools/issues/90 is wrong. Or my implementation of the schema from issue https://github.com/iqbal-lab-org/gramtools/issues/90 is wrong.
I understand that the code produces a different result after implementing issue https://github.com/iqbal-lab-org/gramtools/issues/90.
I would like a minimal example so that we can revise issue https://github.com/iqbal-lab-org/gramtools/issues/90 or fix my implementation.
Suppose case 3 in #90. A read maps to one site, but to alleles 4 and 5. The counter for {4,5} should get incremented, but that doesn't look like it's happening, which is a regression from implementing issue #81
i can't see case 2?
@iqbal-lab I edited to say case 3, not 2 (but slack doesn't show that). It's where a read maps to a single site.
Consider a read which maps to a single site, in order to get the multiple allele counter event (example was for allele group {4,5} above) the read either needs to start or end within that site. Not only that, but the alleles within the site need have some similarity.
Given our new selection schema, I don't expect that to happen often. Do you guys agree?
Is there another case where a read completely overlaps a site and maps through two different alleles?
@martinghunt 's example is one where multiple alleles agree at the start ( or end) so a single read which mostly lies outside the site, overlaps two alleles at the end
Martin - could you stick the site and the pileup in?
We're not talking about a read covering a whole site and a second allele
"Given our new selection schema, I don't expect that to happen often. Do you guys agree?" Our selection schema does not involve randomly selecting when a single read matches multiple alleles
Alleles: TTCAAA,TGGTCAA,TGGTCAAA
.
See the reads mapping just into the ends, like the one highlighted in red. They match >1 allele. I expect this kind of case to happen frequently.
Particularly for minos, when combining calls with no nesting, this does happen
I note that https://github.com/iqbal-lab-org/gramtools/issues/90 does not mention the situation when one read matches one site, hitting multiple alleles, which results in counts of {0,1,2} and {0} and ... etc The intention was that the solution to that issue did not affect these counters - does that match witjh your understanding @rffrancon ?
I understood #90 in the handling selected option section "Record coverage information for all relevant alleles" in cases 3-5 to mean counts of eg {0,1,2} .. etc when the read hits more than one allele. ie once the position of the read has been chosen (whether or not there was >1 site from which to choose), the multiple allele counters are updated.
Thank you for all of the information. As far as I can tell issue https://github.com/iqbal-lab-org/gramtools/issues/90 is still correct.
I've pushed a potential solution to this branch: https://github.com/iqbal-lab-org/gramtools/commits/map_single_site_multiple_alleles
Solution commit: https://github.com/iqbal-lab-org/gramtools/commit/2c776c2a6633fdcaaebce33e47f60f460ffb418b Feel free to review my code.
@martinghunt Please let me know if this solves the issue.
Yes issue 90 statement still correct imo
Thanks @rffrancon looks fixed. I reran the example above using 2c776c2 and looks good.
Allele counts are now: {1}: 50, {1,2}: 6, {0,1,2}: 2
The bug was causing me to incorrectly call homozygous variants as heterozygous. I count hets as incorrect. On that example, I've gone from 4148 true positives to 4317 true positives, so looks good.
Sorry I have not done code review yet - will doubtless end up with questions, will ping when have done
Solution implemented: https://github.com/iqbal-lab-org/gramtools/commit/2c776c2a6633fdcaaebce33e47f60f460ffb418b
Martin has given feedback and it seems to be working as intended.
Martin spotted a regression on Friday night. Current code never records >0 for more than one allele (tested on 138 samples), whereas previous commits on the same samples did.
Allele counts from that old run: {0,1,2}=>1, {1,2} => 6, {1} => 50, {2} => 1
Allele counts from current run 0 => 1, 1=>51, 2=>6
Raising this from slack channel so it's not lost. Martin is at a confernece so hasn';t had time to raise it, just sticking my nose in so it's not lost