iqbal-lab-org / pandora

Pan-genome inference and genotyping with long noisy or short accurate reads
MIT License
109 stars 14 forks

Handling huge indexes with few GBs of RAM #304

Open leoisl opened 1 year ago

leoisl commented 1 year ago

This issue proposes a new pandora feature: handling huge indexes with only a few GB of RAM. Our concrete example is an index with 186k PRGs, most of them linear; the main use case has almost 1M PRGs. For this "small" 186k-PRG example, running pandora compare with reads from 114 samples finds only 13.7k genes (7.3%), and only those end up in the final multisample matrix/vcf. pandora takes 15.6 GB of RAM to run compare in this case, but could probably manage with a fraction of that if it loaded the index only for the relevant 13.7k genes instead of all 186k. RAM usage will be much higher for 1M PRGs, and we want to keep this runnable on a typical user desktop, i.e. at most 13 or 14 GB. For this use case we have a fixed vcf-ref for each PRG, so we could also run pandora compare (or map in this case) per sample and merge the results later. This feature is particularly important when running pandora compare/map on a single sample, since even fewer genes will be loaded.
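To make the idea concrete, here is a minimal Python sketch of loading only the relevant part of an index. It is not pandora's implementation or index format: the JSON record layout, the offset table, and the names `write_index` and `load_subset` are all hypothetical, and it assumes the set of relevant PRGs is already known before the subset is loaded.

```python
import io
import json


def write_index(stream, prg_records):
    """Write each PRG record as one blob; return {prg_name: (offset, length)}.

    Hypothetical layout: one JSON blob per PRG, located via a small offset table.
    """
    offsets = {}
    for name, record in prg_records.items():
        blob = json.dumps(record).encode("utf-8")
        offsets[name] = (stream.tell(), len(blob))
        stream.write(blob)
    return offsets


def load_subset(stream, offsets, wanted):
    """Read and decode only the records for the PRGs named in `wanted`."""
    loaded = {}
    for name in wanted:
        offset, length = offsets[name]
        stream.seek(offset)
        loaded[name] = json.loads(stream.read(length).decode("utf-8"))
    return loaded


# Toy index with 1000 PRGs; an in-memory stream stands in for the index file.
full_index = {f"gene{i}": {"minimizers": [i, i + 1, i + 2]} for i in range(1000)}
stream = io.BytesIO()
offsets = write_index(stream, full_index)

# Only the PRGs assumed to be relevant are decoded into memory.
relevant = ["gene3", "gene42", "gene999"]
subset = load_subset(stream, offsets, relevant)
```

In this sketch only the small offset table has to be held for every PRG, while the per-PRG records are decoded on demand, so memory for the decoded records scales with the number of relevant genes rather than with the size of the whole index.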

leoisl commented 1 year ago

To do this, we need to implement: