iqbal-lab-org / gramtools

Genome inference from a population reference graph

Proper concurrent querying of alleles #149

Closed bricoletc closed 4 years ago

bricoletc commented 4 years ago

So the gramtools site & allele marker encoding is supposed to allow alleles to be queried concurrently.

Here's the relevant excerpt from Sorina et al.'s paper:

> If the search cursor is next to the end of a site, all alternative alleles from that site need to be queried. Their ends are sorted together in the BWM because of the markers, so they can be queried concurrently by adding the SA-interval of suffixes starting with all numbers marking that site (even and odd).

Yet as it turns out this is not actually done in the code. Here's the relevant snippet: https://github.com/iqbal-lab-org/gramtools/blob/41a1dbd93948d1bca7938c9c286296ae23c239f7/libgramtools/src/search/search.cpp#L358-L384

The `for` loop initialises one size-one SA interval per allele.

The advantage of doing this is that we immediately specify the allele ID attached to each SearchState. The disadvantage is that if you have 5,000 alleles, you will have 5,000 SearchStates, and each time one of those hits another variant site, it seeds one new SearchState per allele, and so on.
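To make the contrast concrete, here is a minimal sketch (not the actual gramtools code; the `SA_Interval` / `SearchState` types and the allele-ID bookkeeping are simplified placeholders). `marker_range` stands for the SA interval covering the suffixes that start with the site's markers, which sort together in the BWM as the excerpt above describes:

```cpp
#include <cstdint>
#include <vector>

struct SA_Interval {
    uint64_t begin;
    uint64_t end;  // inclusive
};

struct SearchState {
    SA_Interval sa_interval;
    int64_t allele_id;  // -1: not yet resolved
};

// Current behaviour (per the linked snippet): one size-one SearchState per
// allele end, each carrying its allele ID up front. With 5,000 alleles this
// seeds 5,000 states at a single site.
std::vector<SearchState> per_allele_states(SA_Interval const& marker_range) {
    std::vector<SearchState> states;
    int64_t allele_id = 1;  // simplification: the real code maps SA index -> allele ID
    for (uint64_t idx = marker_range.begin; idx <= marker_range.end; ++idx)
        states.push_back({{idx, idx}, allele_id++});
    return states;
}

// Proposed behaviour: a single SearchState spanning all allele ends; the
// allele ID is only resolved later, when (and if) it is actually needed.
SearchState concurrent_state(SA_Interval const& marker_range) {
    return {marker_range, -1};
}
```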

bricoletc commented 4 years ago

Pretty sure this leaves a massive memory footprint in the index.

So I have implemented a change where we query all alleles marked with an even marker concurrently.

Meaning that for a site `AG 5C6T6A5 AG` with three alleles, C and T get queried using a single SearchState.

To get the A, I think we need to change the data representation to `AG 5C6T6A6 AG`. Otherwise we have to query that last allele on its own, because it sorts with the beginning of the site (the pair of 5's), and we have no guarantee that it sorts with all the 6's in the suffix array.
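For illustration, a hedged sketch of that re-encoding on an integer-encoded linear PRG (a hypothetical helper; the actual gramtools encoding code may differ). Bases are 1-4 and variant markers are >= 5: the odd marker opens and closes a site, the even marker (odd + 1) separates its alleles. Replacing each site's closing odd marker with the even marker makes the last allele's end sort with the 6's rather than with the 5's:

```cpp
#include <map>
#include <vector>

std::vector<int> end_sites_with_even_marker(std::vector<int> prg) {
    std::map<int, int> times_seen;  // odd site marker -> occurrences so far
    for (auto& symbol : prg) {
        bool is_site_boundary = symbol >= 5 && symbol % 2 == 1;
        if (is_site_boundary && ++times_seen[symbol] == 2)
            symbol += 1;  // second occurrence closes the site: use the even marker
    }
    return prg;
}

// With A=1, C=2, G=3, T=4:
//   AG 5C6T6A5 AG  ->  {1,3, 5,2,6,4,6,1,5, 1,3}
// becomes
//   AG 5C6T6A6 AG  ->  {1,3, 5,2,6,4,6,1,6, 1,3}
```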

bricoletc commented 4 years ago

Here's the result of running this change on the pf3k + DBLMSP VCF of Plasmodium, chromosomes 1 and 2 only: Space_used.pdf

On the x-axis are the files that `build` produces, here using k=8 and the `--all-kmers` flag. The actual index is stored in `kmers`, `kmers_stats`, `sa_intervals` and `paths`. For the latter two we get about a 3-fold size reduction!

Also, in this dataset the FM-index and the various masks are relatively large; but on all chromosomes and at k=11, the four files mentioned above take up 9.8 GB of the 11 GB of build files: `sa_intervals` is 3 GB and `paths` is 5.7 GB.

As for build time, it went down from 107 to 66 seconds.

bricoletc commented 4 years ago

And now for quasimap. I mapped 23.9 million reads from a single Plasmodium sample to the index generated above, using 10 threads.

| Commit | Run time (h) | CPU time (h) | Max RAM (GB) | Avg RAM (GB) |
| --- | --- | --- | --- | --- |
| a6e9094 (before changes) | 31 | 123 | 8.7 | 5.3 |
| 8b02af1 (after changes) | 6.8 | 30.9 | 2.8 | 2.2 |

So that's a 4-fold decrease in run time, and a 2-3-fold decrease in RAM! Also interesting to see that the gap between max RAM and average RAM goes down: the ratio is 1.64 before the changes and 1.27 after. This could be due to less 'blowup' for reads mapping to variant-dense regions.

Sanity check: the coverage generated is exactly identical for `allele_sum_coverage` and for `grouped_allele_counts_coverage.json`. (Note that for the latter a simple text diff is not enough and I had to check each site's dictionary; this could be due to multithreading creating allele groups in a different order.)
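For reference, a minimal sketch of this kind of per-site check (not the actual comparison used here; it assumes the nlohmann/json library and, hypothetically, that the file is a JSON array with one object of group-to-count pairs per site, so that object equality ignores the key ordering that trips up a plain text diff):

```cpp
#include <cstddef>
#include <fstream>
#include <iostream>
#include <string>
#include <nlohmann/json.hpp>

// Returns true if the two grouped-counts files agree site by site.
bool same_grouped_counts(std::string const& path_a, std::string const& path_b) {
    std::ifstream file_a(path_a), file_b(path_b);
    auto sites_a = nlohmann::json::parse(file_a);
    auto sites_b = nlohmann::json::parse(file_b);
    if (sites_a.size() != sites_b.size()) return false;
    for (std::size_t site = 0; site < sites_a.size(); ++site) {
        // JSON object equality compares key/value pairs, not insertion order,
        // so allele groups recorded in a different order still compare equal.
        if (sites_a.at(site) != sites_b.at(site)) {
            std::cerr << "site " << site << " differs\n";
            return false;
        }
    }
    return true;
}
```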

`allele_base_coverage.json` does not have exactly the same output: 37 sites out of 88,538 differ.

iqbal-lab commented 4 years ago

this is great!

iqbal-lab commented 4 years ago

What does this mean for a whole-genome PRG, BTW?

bricoletc commented 4 years ago

@iqbal-lab sorry for the late reply. I've been working on the last part of this comment (https://github.com/iqbal-lab-org/gramtools/issues/149#issuecomment-513816458). I will run with those changes, on the full genome as well, as soon as I'm convinced it works, and report results!

bricoletc commented 4 years ago

I ran 0736327b1404c52e287beb385b6c78f4d930abf9 on the Pf chromosomes 1 & 2 (https://github.com/iqbal-lab-org/gramtools/issues/149#issuecomment-515382899) and the runtime is down to 5.25 hours (CPU time: 24.16 hours). Max and average RAM don't change.

bricoletc commented 4 years ago

Whole genome: we can't build the index for a k-mer size of 10 or 11 on a6e9094 on yoda: it exceeds 116 GB of peak RAM at k=11 (average RAM: 43.7 GB). This is before proper concurrent allele querying.

After the modifications, on 73e4835, we can build k=11 using 38 GB of peak RAM and 12.4 GB of average RAM.

So we get at least a 3-fold RAM reduction.

In general I think the RAM and time reduction due to these changes is proportional to the average number of alleles per site genome-wide.

Good to close now, @iqbal-lab?

iqbal-lab commented 4 years ago

Agreed!