iqbal-lab-org / gramtools

Genome inference from a population reference graph
MIT License
92 stars 15 forks source link

Why the cache? #142

Closed bricoletc closed 5 years ago

bricoletc commented 5 years ago

From a7db53abc75b4ceb2d347337b1d3b6ec52ebe2a8 a SearchState includes a caching structure consisting of a VariantLocus (pair of site marker and allele ID) and a bool marking if cache is populated.

This does not seem necessary as traversed site/allele combinations can be greedily committed to variant_site_path when encountered, rather than cached and committed later (upon base extension)

bricoletc commented 5 years ago

Here is heap profile of https://github.com/iqbal-lab-org/gramtools/commit/a7db53abc75b4ceb2d347337b1d3b6ec52ebe2a8 on TB simulated dataset for build command: a7db53_massif_tbdataset

Here is the heap profile on the same dataset after removing the cache:

cache_check_massif_tbdataset

The red and orange bars indicate SearchState allocation. There is a decrease in memory usage.

bricoletc commented 5 years ago

As a sanity check, quasimap maps the exact same number of reads (585862 out of 1756600 in total) against the indexes built by the respective commits.

Here is the diff output comparing the allele_sum_coverage files produced:

104c104
< 0 0 5 0
---
> 0 0 4 0
629c629
< 0 3
---
> 0 2
934c934
< 15 18
---
> 14 20
2436,2437c2436,2437
< 1 3
< 0 4
---
> 1 4
> 0 5
2460c2460
< 21 14
---
> 21 16
3130c3130
< 0 3
---
> 0 1
3796c3796
< 0 3
---
> 0 1
3798c3798
< 1 1 1
---
> 0 0 2
3968c3968
< 4 3
---
> 2 4
4171c4171
< 0 1
---
> 0 2
4211,4212c4211,4212
< 0 0 25
< 21 1
---
> 0 0 26
> 22 0
4219,4220c4219,4220
< 0 0
< 12 9
---
> 0 2
> 6 11
4225d4224
< 0 4
4227,4228c4226,4228
< 0 4
< 0 4
---
> 0 1
> 0 3
> 0 3
4239c4239
< 17 5
---
> 15 11
4243c4243
< 0 1
---
> 0 0
4245c4245
< 0 2
---
> 0 0
4557,4558c4557,4558
< 3 31 4 6 3
< 24 8
---
> 3 32 4 6 3
> 25 7

The files are close. Identity is not expected due to random selection for multi-mapping reads.

iqbal-lab commented 5 years ago

looks interesting. i'd like to get my head around whether these diffs really are ok. see you next week!

iqbal-lab commented 5 years ago

ah....could be need cache because of this https://github.com/iqbal-lab-org/gramtools/issues/90

bricoletc commented 5 years ago

For the record, here is what happened:

  1. I added a --seed parameter to the gramtools quasimap back-end (c++) to ensure reproducibility when mapping reads across different runs.
  2. Removing the cache and fixing the seed initially did not produce exactly the same result. Digging into this, I found this was not due to caching itself, but to kmer index serialisation being lossy. Specifically, when an indexed kmer has a mapping ending inside a variant site, we lose the information that we were inside the variant side when serialising. This caused duplication of variant site path traversal recording in quasimap.
  3. Now, having removed the cache and made the appropriate checks to deal with 2., we get identical outputs between faab59c and 47a15da (which is before cache removal). This is on mapping 250,000 reads from a Plasmodium cross onto pf3k + dblsmp1/2 prg.

RAM use reduction is ~8% on k=8.