iqbal-lab-org / gramtools

Genome inference from a population reference graph
MIT License

Reducing index size #145

Open bricoletc opened 5 years ago

bricoletc commented 5 years ago

The index made by gramtools build is memory hungry.

Two things:

i) Disk serialisation of the index is significantly smaller than the in-RAM index

E.g. on the TB dataset (yoda: /nfs/leia/research/iqbal/bletcher/Gramtools/profiling_gramtools/simulated_reads_150_30_reference_9/gramtools_runs/gram_k10_04157):

Disk      RAM
0.2 GB    1.5 GB

This is most likely simply due to sdsl::bit_compress being called on each of the paths, the sa_intervals and the kmer_stats (the kmer_stats allow matching up sa_intervals and paths for each instance of a kmer mapping to the graph).

But why not keep them compressed in RAM? The compression seems only to reduce the number of bits used to represent each integer's value: http://algo2.iti.kit.edu/gog/docs/html/namespacesdsl_1_1util.html#ad5528f84e3036b9be3faf43a49f15b76
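To make the width-reduction idea concrete, here is a minimal standalone sketch of fixed-width bit packing. It is not sdsl code and not what gramtools does internally: `PackedVector`, `pack` and `min_bits` are hypothetical names for illustration. It finds the largest value, picks the smallest bit width that can hold it, and stores every element at that width, which is roughly the effect the sdsl documentation describes for bit_compress on an int_vector.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Smallest number of bits able to represent `value` (at least 1).
inline unsigned min_bits(std::uint64_t value) {
    unsigned bits = 1;
    while (value >>= 1) ++bits;
    return bits;
}

// Hypothetical fixed-width packed vector: every element occupies `width` bits.
struct PackedVector {
    unsigned width = 1;
    std::size_t count = 0;
    std::vector<std::uint64_t> words;

    // Random access stays O(1): locate the word and bit offset, then mask.
    std::uint64_t get(std::size_t i) const {
        assert(i < count);
        const std::size_t bit = i * width;
        const std::size_t word = bit / 64;
        const unsigned offset = static_cast<unsigned>(bit % 64);
        std::uint64_t v = words[word] >> offset;
        // An element may straddle two 64-bit words.
        if (offset + width > 64) v |= words[word + 1] << (64 - offset);
        const std::uint64_t mask =
            width == 64 ? ~std::uint64_t{0} : ((std::uint64_t{1} << width) - 1);
        return v & mask;
    }

    std::size_t bytes() const { return words.size() * sizeof(std::uint64_t); }
};

// Pack `values` using the minimal width needed for the largest element.
inline PackedVector pack(const std::vector<std::uint64_t>& values) {
    std::uint64_t max_value = 0;
    for (std::uint64_t v : values) {
        if (v > max_value) max_value = v;
    }
    PackedVector out;
    out.width = min_bits(max_value);
    out.count = values.size();
    out.words.assign((values.size() * out.width + 63) / 64, 0);
    for (std::size_t i = 0; i < values.size(); ++i) {
        const std::size_t bit = i * out.width;
        const std::size_t word = bit / 64;
        const unsigned offset = static_cast<unsigned>(bit % 64);
        out.words[word] |= values[i] << offset;
        if (offset + out.width > 64) {
            out.words[word + 1] |= values[i] >> (64 - offset);
        }
    }
    return out;
}
```

Reads remain constant-time after packing, at the cost of a shift and a mask per access, so keeping such vectors compressed in RAM trades some lookup speed for memory.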

ii) Absolute index size

Most of the memory seems to lie in SearchStates (cf. #142).

For the TB genome of 4 MB, the index takes 1.5 GB in memory.

For the Plasmodium genome of 23 MB, the index takes ~60 GB in memory.

Note that in the latter case, it was ~80 GB before cutting each uint64 in the SearchState struct to uint32.
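As a rough illustration of the uint64 to uint32 change, the sketch below compares two layouts. The field names are hypothetical and this is not the actual gramtools SearchState definition; it only shows how narrowing integer fields halves the size of the fixed part of a struct, and how a checked cast can guard against values that no longer fit.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical layouts for illustration only.
struct WideState {
    std::uint64_t sa_left;
    std::uint64_t sa_right;
    std::uint64_t site;
    std::uint64_t allele;
};

struct NarrowState {
    std::uint32_t sa_left;
    std::uint32_t sa_right;
    std::uint32_t site;
    std::uint32_t allele;
};

static_assert(sizeof(WideState) == 32, "four 64-bit fields");
static_assert(sizeof(NarrowState) == 16, "four 32-bit fields");

// Narrow a 64-bit value, asserting that it fits in 32 bits.
inline std::uint32_t checked_narrow(std::uint64_t v) {
    assert(v <= std::numeric_limits<std::uint32_t>::max());
    return static_cast<std::uint32_t>(v);
}
```

Halving the integer fields can at best halve the memory they occupy. The reported drop from ~80 GB to ~60 GB is smaller than that, which would be consistent with part of the index living in other structures or container overhead, though that is an inference and not something measured here.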

How can we do better?

Ideas:

bricoletc commented 4 years ago

#149 reduced index size by roughly a factor of the average number of alleles per site.