Open iqbal-lab opened 6 years ago
Proposal 1
Any chunk that shares too many kmers with lots of other chunks is a repeat where we will never be able to place reads reliably, so mask it out.
Any small set of chunks which share "enough" kmers should be co-analysed.
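A minimal Python sketch of the two rules above, under some loud assumptions: a chunk is represented as a plain list of allele/path sequences, "shares enough kmers" means at least `min_shared` kmers in common, and "lots of other chunks" means more than `max_partners` partners. The function names and thresholds are hypothetical, not anything already in the codebase.

```python
from itertools import combinations


def kmers(seq, k):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def classify_chunks(chunks, k=5, min_shared=3, max_partners=2):
    """Split chunks into 'mask out' and 'co-analyse' (hypothetical helper).

    chunks: dict mapping chunk name -> list of sequences in that chunk
            (an assumed, simplified stand-in for a chunk of the PRG).
    Returns (masked, pairs): the set of chunk names to mask, and the set
    of (name_a, name_b) pairs of unmasked chunks that should be co-analysed.
    """
    # All kmers that occur anywhere in each chunk.
    chunk_kmers = {
        name: set().union(*(kmers(s, k) for s in seqs))
        for name, seqs in chunks.items()
    }
    # Two chunks are "partners" if they share at least min_shared kmers.
    partners = {name: set() for name in chunks}
    for a, b in combinations(sorted(chunks), 2):
        if len(chunk_kmers[a] & chunk_kmers[b]) >= min_shared:
            partners[a].add(b)
            partners[b].add(a)
    # Rule 1: a chunk with too many partners is treated as a repeat.
    masked = {name for name, p in partners.items() if len(p) > max_partners}
    # Rule 2: remaining partner pairs are candidates for co-analysis.
    pairs = {
        (a, b)
        for a in partners
        for b in partners[a]
        if a < b and a not in masked and b not in masked
    }
    return masked, pairs
```

The thresholds would need tuning on real data; the point is only that both decisions fall out of one pairwise kmer-sharing table.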
What's good about proposal 1
What's bad about the proposal
Proposal 2
As proposal 1, except when comparing chunk i and chunk j (assume diploid, but it is easy to modify what I say for other ploidies): sample 100 (or some number) of possible pairs of paths in chunk i, and also in chunk j, calculate the statistic for each pair of these, and then take the average.
Impact on Plasmodium falciparum (key use case for us): will immediately remove the crazy repeat regions where we should not waste time trying to quasimap or call variants. Less wasted time, lower RAM use.
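A rough Python sketch of the sampling idea, with assumptions stated up front: a chunk is modelled as a list of sites, each site being a list of alternative allele strings (invariant stretches are sites with one allele), a path picks one allele per site, and the "statistic" is taken to be the number of shared kmers. All names here are hypothetical, and the statistic could be swapped for whatever Proposal 1 ends up using.

```python
import random


def kmers(seq, k):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def sample_path(chunk, rng):
    """Pick one allele per site uniformly at random and concatenate them."""
    return "".join(rng.choice(alleles) for alleles in chunk)


def mean_shared_kmers(chunk_i, chunk_j, k=5, n_samples=100, ploidy=2, seed=0):
    """Average number of kmers shared between sampled genotypes.

    Each sample draws `ploidy` paths through chunk_i and `ploidy` paths
    through chunk_j (a diploid pair by default), then counts the kmers
    the two sets of paths have in common.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(n_samples):
        kmers_i = set().union(
            *(kmers(sample_path(chunk_i, rng), k) for _p in range(ploidy))
        )
        kmers_j = set().union(
            *(kmers(sample_path(chunk_j, rng), k) for _p in range(ploidy))
        )
        total += len(kmers_i & kmers_j)
    return total / n_samples
```

Uniform sampling of alleles is itself an assumption; weighting by allele frequency would probably be closer to what a real sample looks like.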
Impact on MHC (also key): this is trickier. There are places (of key importance) where there are say 2 sites, a long way apart, each with say 5000 alternate alleles. Now subsets of these alleles are very similar, within each site, and also a smaller subset are very similar between sites.
To be concrete:
Site 1 has:
set 1 of 50 alleles which are very similar to each other
set 2 of 60 alleles which are very similar to each other (but more different from all others)
set 3 of 80 alleles which are very similar to each other (different to others)
Site 2 has:
set 1 of 10 alleles which are almost identical to alleles from Site 1 set 2
set 2 of 50 alleles which are very similar to each other (different to all others)
etc
Now, what should we do in terms of deciding whether site 1 and site 2 should be co-analysed? In theory they share a bunch of alleles. But for any given sample (human, so a pair of genomes), they probably don't have anything in site 1 set 2 or site 2 set 1. And if not, we don't need to analyse them together.
BTW, above I said something like
Impact on Plasmodium falciparum (key use case for us) will immediately remove the crazy repeat regions
I have since got a workaround for the moment, which was to build the Plasmodium PRG just from Cortex calls, and Cortex has very low power to detect mutations in repeat regions. So this is not blocking for the Pf analyses we are planning.
Also, since this issue was raised, Robyn and I had a conversation which resulted in
Proposal 3
As a one-off calculation, sample read-length paths from the PRG, map them back to the PRG, and then mask out regions where the reads from there map to too many places.
..where "mask out" could mean modify the original VCF/whatever and regenerate a better PRG
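A toy Python sketch of Proposal 3, to pin down what "map to too many places" could mean. Assumptions: each region of the PRG is flattened to a single linear sequence, "mapping" is plain exact substring matching (a stand-in for the real quasimapper), and a region is masked when more than `max_bad_fraction` of its sampled reads hit more than `max_hits` places. Every name and threshold here is hypothetical.

```python
import random


def count_occurrences(read, seq):
    """Count exact occurrences of read in seq, including overlapping ones."""
    count, start = 0, seq.find(read)
    while start != -1:
        count += 1
        start = seq.find(read, start + 1)
    return count


def regions_to_mask(regions, read_len, n_reads=50, max_hits=2,
                    max_bad_fraction=0.5, seed=0):
    """Return the names of regions whose simulated reads map ambiguously.

    regions: dict mapping region name -> linear sequence (an assumed,
             simplified stand-in for paths through the PRG).
    """
    rng = random.Random(seed)
    masked = set()
    for name, seq in regions.items():
        if len(seq) < read_len:
            continue  # too short to sample a full-length read from
        bad = 0
        for _ in range(n_reads):
            # Sample a read-length substring from this region.
            start = rng.randrange(len(seq) - read_len + 1)
            read = seq[start:start + read_len]
            # "Map" it back: count exact hits across every region.
            hits = sum(count_occurrences(read, other)
                       for other in regions.values())
            if hits > max_hits:
                bad += 1
        if bad / n_reads > max_bad_fraction:
            masked.add(name)
    return masked
```

Since this is a one-off calculation per PRG, the cost of simulating and mapping reads is paid once, and the resulting mask could then feed back into regenerating the PRG.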
There's a lot going on in this issue. I think that it would be beneficial to spin out certain problems into separate issues. For instance, I think that chunking the PRG as a memory scaling enhancement can be dealt with separately.
From what I understand, the regions that are beneficial to ignore are repeat regions. Surely there are already existing solutions which can identify repeat regions. @iqbal-lab will any of those solutions work for us?
I think we need more clarity on what this is meant to achieve.
I think the onus is on me to do that, so taking ownership
Given a (minimum and) maximum read length, and a PRG, we should be able to:
decide that there are some places we will never be able to draw inference on, so we might as well ignore them (which means do not put them in the kmer-index, nor store allele counts)
decide there are some places that should always be analysed jointly (so that if we chunk the genome, those chunks should be analysed concurrently, as one chunk)
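One way the second decision could be turned into concrete work units is sketched below in Python. It assumes some earlier step has already produced a list of chunk pairs that must be analysed jointly and a set of chunks to ignore; it then merges linked chunks with a small union-find. The function name and input shapes are hypothetical.

```python
def co_analysis_groups(chunks, linked_pairs, masked=frozenset()):
    """Group chunks that must be analysed together (hypothetical helper).

    chunks:       iterable of chunk names
    linked_pairs: iterable of (a, b) pairs flagged for joint analysis
    masked:       chunk names to ignore entirely
    Returns a sorted list of sorted groups; each group is one work unit.
    """
    # Union-find over the chunks that are not masked.
    parent = {c: c for c in chunks if c not in masked}

    def find(c):
        # Follow parent pointers to the root, compressing the path as we go.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for a, b in linked_pairs:
        # Links that touch a masked chunk are dropped.
        if a in parent and b in parent:
            parent[find(a)] = find(b)

    groups = {}
    for c in parent:
        groups.setdefault(find(c), []).append(c)
    return sorted(sorted(g) for g in groups.values())
```

Joint analysis is treated as transitive here (if A links to B and B to C, all three land in one group), which is an assumption and could produce large groups in regions like the MHC.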