iqbal-lab-org / gramtools

Genome inference from a population reference graph
MIT License
92 stars 15 forks source link

Coverage information missing for reads which map exactly within a single allele #89

Closed ffranr closed 6 years ago

ffranr commented 6 years ago

This probably affects both kmers (keys to the kmer index) and reads.

Martin reported this on slack:

I think I’ve found a bug in quasimapping. When there is a long allele, the reads only map to the start of it. It looks like it is only counting reads as mapped if their start position is before the start of the allele. Example in here: /nfs/research1/zi/mhunt/Minos_eval/Clockwork/Pipeline_root/00/00/00/55/55/Pipelines/1/variant_call/0.3.1.ref.21/minos/gramtools.out/quasimap_outputs/1519406982_ksize15/allele_base_coverage.json This is simulated data and there definitely should be coverage across all the long alleles. But instead we get eg this: [39,38,38,38,38,38,37,36,35,34,33,32,32,31,31,30,29,29,29,29,29,29,29,28,26,26,24,22,21,21,20,20,19,19,19,19,18,17,17,17,15,15,15,15,15,15,14,13,12,11,11,10,10,10,10,10,9,9,9,9,7,7,6,6,6,4,3,3,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0] This is why minos can’t call long insertions. Where “long” = longer than the read length, I guess

ffranr commented 6 years ago

This unit tests captures the current functionality (which will be changed with this issue): https://github.com/iqbal-lab-org/gramtools/blob/51c58d6b08ef447229e52b5d4f94001effbd3733/libgramtools/tests/quasimap/test_quasimap.cpp#L236

ffranr commented 6 years ago

This is not as trivial as I first thought. However, I'm getting there.

The non-trivial bit: can't just assign site-allele to an SA interval. The interval may contain indexes which are outside an allele. That is to say, handling reads which map exactly to (at least) two places: entirely within an allele and outside of a site altogether.

The code so far:

SearchStates handle_allele_encapsulated_state(const SearchState &search_state,
                                              const PRG_Info &prg_info) {
    SearchStates new_search_states = {};

    bool cached_state = false;
    SearchState new_search_state = {};

    for (uint64_t sa_index = search_state.sa_interval.first;
         sa_index <= search_state.sa_interval.second;
         ++sa_index) {

        auto read_start_index = prg_info.fm_index[sa_index];
        auto site_marker = prg_info.sites_mask[read_start_index];
        bool read_within_site = site_marker != 0;
        if (not read_within_site) {
            if (cached_state) {
                new_search_states.push_back(new_search_state);
                cached_state = false;
            }
            continue;
        }

        auto allele_id = prg_info.allele_mask[read_start_index];

        if (not cached_state) {
            new_search_state = SearchState {
                    SA_Interval {sa_index, sa_index},
                    VariantSitePath {
                            VariantSite {site_marker, allele_id}
                    }
            };
            cached_state = true;
            continue;
        }

        ++new_search_state.sa_interval.second;
    }
    return new_search_states;
}

SearchStates handle_allele_encapsulated_states(const SearchStates &search_states,
                                               const PRG_Info &prg_info) {
    SearchStates new_search_states = {};

    for (const auto &search_state: search_states) {
        bool has_path = not search_state.variant_site_path.empty();
        if (has_path) {
            new_search_states.emplace_back(search_state);
            continue;
        }

        SearchStates split_seach_states = handle_allele_encapsulated_state(search_state, prg_info);
        for (const auto &split_search_state: split_seach_states)
            new_search_states.emplace_back(split_search_state);
    }
    return new_search_states;
}

The above code is called like this:

new_search_states = handle_allele_encapsulated_states(new_search_states, prg_info);
ffranr commented 6 years ago

@martinghunt Can you please provide an update to this issue given the latest changes (https://github.com/iqbal-lab-org/gramtools/commit/47e41d7691954eea17b195e49bbe39729e389f7d)?

ffranr commented 6 years ago

@martinghunt says this issue can be closed.