iqbal-lab-org / gramtools

Genome inference from a population reference graph
MIT License
93 stars 15 forks source link

Systematic decisions to ignore regions of the PRG #96

Open iqbal-lab opened 6 years ago

iqbal-lab commented 6 years ago

Given a (minimum and) maximum read length, and a PRG,we should be able to

  1. decide that there are some places we will never be able to draw inference on, so we might as well ignore (which means do not put them in the kmer-index, nor store allele counts)

  2. decide there are some places that should always be analysed jointly (so that if we chunk the genome, those chunks should be analysed concurrently(as one chunk)

iqbal-lab commented 6 years ago

Proposal 1

  1. Divide the PRG into chunks of length 10000bp (or whatever). Say this is N chunks
  2. For each chunk calculate the list of kmers contained within (not just on/overlapping alleles)
  3. Calculate an N x N matrix, where the (i,j) entry is the number of kmers in common between chunk i and chunk j

Any chunk that shares too many kmers with lots of other chunks is a repeat where we will never be able to place reads reliably ---> mask out.

Any small set of chunks which share "enough" kmers, should be co-analysed.

What's good about proposal 1

What's bad about the proposal

iqbal-lab commented 6 years ago

Proposal 2

As proposal 1, except when comparing chunk i and chunk j, (assume diploid, but easy to modify what I say for other ploidies) sample 100 (or some number) of possible pairs of paths in chunk i, and also in chunk j, calculate the statistic for each pair of these, and then take the average,

iqbal-lab commented 6 years ago

Impact on Plasmodium falciparum (key use case for us): will immediately remove the crazy repeat regions where we should not waste time trying to quasimap, or variant call. Less wasted time, lower RAM use.

Impact on MHC (also key): this is trickier. There are places (of key importance) where there are say 2 sites, a long way apart, each with say 5000 alternate alleles. Now subsets of these alleles are very similar, within each site, and also a smaller subset are very similar between sites.

To be concrete:

Site 1 has: set 1 of 50 alleles which are very similar to each other set 2 of 60 alleles which are very similar to each other (but more differnet from all others) set 3 of 80 alleles which are very similar to each other (different to others)

Site 2 has: set 1 of 10 alleles which are almost identical to alleles from Site 1 set 2 set 2 of 50 alleles which are very similar to each other (different to all others) etc

Now, what should we do in terms of deciding whether site 1 and site 2 should be co-analysed? In theory they share a bunch of alleles. But for any given sample (human - pair of genomes), probably they dont have anything in site 1-set2 or site2-set 1. And if not, we don't need to analyse them together.

iqbal-lab commented 6 years ago

BTW, above I said something like Impact on Plasmodium falciparum (key use case for us) will immediately remove the crazy repeat regions I have since got a workaround for the moment, which was to build the Plasmodium PRG just from Cortex calls, and Cortex has very low power to detect mutations in repeat regions. So this is not blocking for the Pf analyses we are planning.

iqbal-lab commented 6 years ago

Also, since this issue was raised, Robyn and I had a conversation which resulted in

Proposal 3 As a one off calculation, sample read-length paths from the PRG, and then map them back to the PRG, and then mask out regions where the reads from there map to too many places

iqbal-lab commented 6 years ago

..where "mask out" could mean modify the original VCF/whatever and regenerate a better PRG

ffranr commented 6 years ago

There's a lot going on in this issue. I think that it would be beneficial to spin out certain problems into separate issues. For instance, I think that chunking the PRG as a memory scaling enhancement can be dealt with separately.

From what I understand, the regions that are beneficial to ignore are repeat regions. Surely there are already existing solutions which can identify repeat regions. @iqbal-lab will any of those solutions work for us?

iqbal-lab commented 6 years ago
  1. I agree about chunking for memory reduction - belongs elsewhere
  2. there are methods for finding repeat regions, on linear references. The word "repeat" means many things in this field, annoyingly. Can mean exact copy or very approximate copy, or can mean sequence is repetitive (eg ATGATATAGCCGTCGTCGTCATTATATAC). The reason I like simply sampling read-lengths from the PRG and querying them is that it is a direct measure of what we want to do. I agree there is a scaling problem if you really try to pull out every read-length from the graph, but you don't need to do all of them.

I think we need more clarity on what this is meant to achieve.

I think the onus is on me to do that, so taking ownership