The index made by `gramtools build` is memory hungry. Two things:

i) Disk serialisation of index is significantly smaller than in-RAM

For example, on the TB dataset (yoda: /nfs/leia/research/iqbal/bletcher/Gramtools/profiling_gramtools/simulated_reads_150_30_reference_9/gramtools_runs/gram_k10_04157):

| Disk | RAM |
| --- | --- |
| 0.2GB | 1.5GB |
This is most likely simply due to `sdsl::bit_compress` being called on each of the paths, the sa_intervals and the kmer_stats (which allow matching up sa_intervals and paths for each instance of a kmer mapping to the graph).

But why not keep them compressed in RAM? The compression seems only to reduce the number of bits used to represent each integer's value: http://algo2.iti.kit.edu/gog/docs/html/namespacesdsl_1_1util.html#ad5528f84e3036b9be3faf43a49f15b76
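A minimal sketch of what `sdsl::util::bit_compress` does, using a plain `sdsl::int_vector<>` as a stand-in for one index component (the contents are made up for illustration): it narrows the per-entry bit width to fit the largest stored value, and the narrowed vector still supports random access, so keeping it in this form in RAM looks possible in principle.

```cpp
#include <iostream>
#include <sdsl/int_vector.hpp>
#include <sdsl/util.hpp>

int main() {
    // 1M entries stored at the default 64 bits each, but holding small values.
    sdsl::int_vector<> v(1000000, 0, 64);
    for (size_t i = 0; i < v.size(); ++i) v[i] = i % 1024;  // max value fits in 10 bits

    std::cout << "before bit_compress: " << sdsl::size_in_mega_bytes(v) << " MB\n";

    // Narrow the width from 64 to 10 bits per entry; presumably this is what
    // is applied to the paths / sa_intervals / kmer_stats before serialisation.
    sdsl::util::bit_compress(v);

    std::cout << "after bit_compress:  " << sdsl::size_in_mega_bytes(v) << " MB\n";
    std::cout << "random access still works: v[12345] = " << v[12345] << "\n";
    return 0;
}
```

The caveat would be any component that still needs to grow or take larger values after compression, since `bit_compress` fixes the width to the current maximum.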
ii) Absolute index size
Most of the memory seems to lie in `SearchState`s (cf #142).

For the TB genome of 4MB, we have an index of 1.5 GB in memory.
For the Plasmodium genome of 23MB, we have an index of ~60 GB in memory.
Note in the latter case, it was ~80GB before cutting each uint64 in the `SearchState` struct to uint32 (a small sketch of that effect is below).
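A minimal sketch of the saving, using a hypothetical `SearchState`-like struct; the field names and layout here are assumptions for illustration, not the actual gramtools definition:

```cpp
#include <cstdint>
#include <iostream>

// Hypothetical stand-in for the real struct: three bookkeeping fields.
struct SearchState64 {
    uint64_t sa_interval_start;
    uint64_t sa_interval_end;
    uint64_t path_id;
};

struct SearchState32 {
    uint32_t sa_interval_start;
    uint32_t sa_interval_end;
    uint32_t path_id;
};

int main() {
    std::cout << "uint64 fields: " << sizeof(SearchState64) << " bytes per state\n";  // 24
    std::cout << "uint32 fields: " << sizeof(SearchState32) << " bytes per state\n";  // 12
    // With very many SearchStates held in the index, halving the per-field
    // width gives savings on the scale seen above (~80GB -> ~60GB).
    return 0;
}
```

The same width-trimming idea is what `sdsl::bit_compress` applies to the serialised vectors in point i).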
How can we do better?

Ideas: