iqbal-lab-org / gramtools

Genome inference from a population reference graph
MIT License

Performance of kmers generation routine too dependent on variant site distribution #60

Closed by ffranr 6 years ago

ffranr commented 7 years ago

Generating kmers from a region of the linear PRG involves generating and breaking up genome paths. These paths are generated as Cartesian products of ordered lists of alleles.

If there is a very large number of alleles within a region of the PRG (high density of variant sites) the number of unique genome paths could be extremely large.
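To illustrate the growth, here is a minimal Python sketch, not the gramtools routine itself. It assumes a region is represented as a list of segments, each segment being a list of alternative allele strings (a non-variant stretch is a one-element list). The names `genome_paths`, `count_paths` and `region` are hypothetical.

```python
from itertools import product
from math import prod


def genome_paths(segments):
    """Yield every genome path through the region as a string.

    Each path picks exactly one allele per segment, so the paths are the
    Cartesian product of the per-segment allele lists.
    """
    for choice in product(*segments):
        yield "".join(choice)


def count_paths(segments):
    """Number of genome paths: the product of the allele counts."""
    return prod(len(alleles) for alleles in segments)


# Hypothetical region: two variant sites (2 and 3 alleles) between
# non-variant stretches.
region = [["ACG"], ["T", "G"], ["CC"], ["A", "C", "G"], ["TT"]]
paths = list(genome_paths(region))
```

With only 20 adjacent biallelic sites, `count_paths` already gives 2**20 paths, which is why a dense cluster of variant sites costs far more than the same number of sites spread out.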

The script should be modified so that the size of the regions considered when generating genome paths is minimized.
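One possible way to bound the work, sketched here in Python as an assumption rather than as the approach actually taken in gramtools: instead of building full paths over a region and then cutting them up, start at every position and extend only `k` bases forward, branching into the next segment's alleles when the current allele runs out. The function name `kmers_bounded` and the list-of-allele-lists representation are hypothetical.

```python
def kmers_bounded(segments, k):
    """Collect all k-mers spelled by paths through `segments`.

    `segments` is a list of lists of allele strings. From each start
    position the search extends at most k bases, so it only branches over
    the variant sites that a single k-mer can actually span.
    """
    kmers = set()

    def extend(seg_idx, offset, allele, prefix):
        # Take as many bases as are still needed from the current allele.
        prefix += allele[offset:offset + (k - len(prefix))]
        if len(prefix) == k:
            kmers.add(prefix)
            return
        # Current allele exhausted: branch over the next segment's alleles.
        nxt = seg_idx + 1
        if nxt == len(segments):
            return  # ran off the end of the region before reaching k bases
        for next_allele in segments[nxt]:
            extend(nxt, 0, next_allele, prefix)

    for i, alleles in enumerate(segments):
        for allele in alleles:
            for offset in range(len(allele)):
                extend(i, offset, allele, "")
    return kmers


# Hypothetical region: one biallelic site flanked by non-variant sequence.
small_region = [["AC"], ["G", "T"], ["A"]]
found = kmers_bounded(small_region, 3)
```

The result is the same set of k-mers that cutting up every full path would give, but the branching per start position depends only on the sites within `k` bases, not on the whole region.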

Note: the script performs adequately on the WG dataset but takes much longer on the "prg_24Mb_human_chr21_with_44439_vars_based_on_restricted_ref" (human) dataset, even though the human dataset contains fewer variant sites. Its variant sites must be densely distributed.

ffranr commented 7 years ago

This is also an opportunity to rewrite the routine in C++, and possibly to make it a parallel algorithm.

ffranr commented 6 years ago

Done, but not parallel. Performance after the reimplementation in C++ seems adequate.