iqbal-lab-org / gramtools

Genome inference from a population reference graph
MIT License
92 stars 15 forks source link

quasimap: report detailed allele coverage info #81

Closed martinghunt closed 6 years ago

martinghunt commented 6 years ago

Need quasimap to report two things at each variant:

  1. Report the read coverage at every position of every allele. Include multiple mapping, ie if a read maps to two alleles, then increment the read coverage at the relevant positions on both alleles. So the output will be something like a list of ints for each allele (length of each list == length of that allele).

  2. Number of reads mapping to each combination of alleles. Output will be something along the lines of a hash of allele combination -> number of reads. Only include combinations that are actually seen in the reads, meaning do not have keys where values are zero. eg if read0 maps to allele1, read1 maps to allele1, and read2 maps to allele1 and allele2, then the hash would be: {{allele1}: 2, {allele1, allele2}: 1}. In particular, read2 did not contribute to the count of {allele1} or {allele2}, and it is implicit that {allele2}: 0.

iqbal-lab commented 6 years ago

Also, if a read maps to two different sites, don't just add its counts in. Exclude, store to different array, or random assignment

ffranr commented 6 years ago

Output Data Structures

Allele base counts

Output file

../quasimap_output_directory/allele_base_counts.json

JSON

{
    "allele_base_counts": [
        [
            [0, 0, 0],
            [1, 1, 0]
        ],
        [
            [0, 0, 1],
            [2, 2, 0],
            [2, 2, 0, 1, 3]
        ]
    ]
}

Example

import json
json_data = json.loads(...)
allele_base_counts = json_data["allele_base_counts"]

variant_site = allele_base_counts[0]
allele = variant_site[0]
base_count_coverage = allele[0]

Grouped allele counts

Output file

../quasimap_output_directory/grouped_allele_counts.json

JSON

{
    "grouped_allele_counts": {
        "site_counts": [
            {
                "0": 10,
                "1": 3,
                "14": 10
            },
            {
                "3": 30,
                "2": 2,
                "14": 1
            }
        ],
        "allele_groups": {
            "0": [0, 2],
            "1": [0, 2, 3],
            "2": [0, 2, 4],
            "3": [2, 5],
            "14": [7, 8]
        }
    }
}

Example

import json
json_data = json.loads(...)
grouped_allele_counts = json_data["grouped_allele_counts"]

variant_sites = grouped_allele_counts["site_counts"]
allele_groups = grouped_allele_counts["allele_groups"]

site = variant_sites[0]
for allele_group_id, count in site.iter():
    allele_ids = allele_groups[allele_group_id]
ffranr commented 6 years ago

This should be complete. Will leave the issue open for now encase Martin finds any issues.

ffranr commented 6 years ago

@martinghunt Do you agree that the new output data structures are correct? If so I'll close this issue.

martinghunt commented 6 years ago

Looks good. I'll close it now, and can always reopen if I find anything that looks wrong.