computational efficiency of pggb

pangenome / pggb

the pangenome graph builder

https://doi.org/10.1101/2023.04.05.535718

MIT License

computational efficiency of pggb #370

Open yihangs opened 6 months ago

yihangs commented 6 months ago

Hi,

A recent paper, "Comparing methods for constructing and representing human pangenome graphs", reports that pggb cannot construct graphs from 104 human haplotypes because of low computational efficiency. This seems to contradict the results in the paper "A draft human pangenome reference", where pggb is used to construct graphs from around 90 haplotypes, a number very close to 104. Therefore, I am wondering about the computational efficiency of pggb: can it handle hundreds or even thousands of haplotypes? If not, what would be the key bottleneck?

Thanks!

ekg commented 6 months ago

It seems that the cited paper had a misunderstanding about how the variation graph building methods are currently used in the HPRC. PGGB (and minigraph-cactus) are run on each chromosome individually. This allows for high parallelism in graph building. Just throwing all data from all human chromosomes in the HPRC into a single node is likely to take a very long time and produce a result which may be hard to understand. Improving the partitioning process is critical to enabling this kind of use. To minimize bias, we propose a community detection method to partition the graph building process into pieces that each can be processed independently on a cluster. Refining this is the main area of ongoing work with PGGB, as it will lead to automatic and unbiased graph building in any context, not just those where there is a clear partitioning by chromosome (or in humans, most chromosomes, the sex chromosomes, and the acrocentrics).
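To make the partitioning idea concrete, here is a minimal sketch, not the method pggb actually uses. It assumes you already have pairwise mappings between sequences (for example, query/target pairs taken from an all-vs-all alignment) and groups sequences by connected components using union-find. Connected components are a deliberately simpler stand-in for community detection: a single spurious mapping between two chromosomes would merge their groups, which is the kind of case a real community detection method is meant to handle. The function name `partition_sequences` and the sample sequence names are hypothetical.

```python
from collections import defaultdict


def partition_sequences(names, mappings):
    """Group sequence names into connected components of the mapping graph.

    names:    iterable of sequence names (every name used in `mappings`
              is assumed to appear here).
    mappings: iterable of (query, target) pairs, one per pairwise mapping.
    Returns a sorted list of sorted groups; each group could be built
    into a graph independently of the others.
    """
    parent = {name: name for name in names}

    def find(x):
        # Follow parent links to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for query, target in mappings:
        root_q, root_t = find(query), find(target)
        if root_q != root_t:
            parent[root_t] = root_q

    groups = defaultdict(list)
    for name in names:
        groups[find(name)].append(name)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical input: two samples, two chromosomes, mappings only
# between homologous chromosomes.
names = ["HG002#1#chr1", "HG003#1#chr1", "HG002#1#chr2", "HG003#1#chr2"]
mappings = [
    ("HG002#1#chr1", "HG003#1#chr1"),
    ("HG002#1#chr2", "HG003#1#chr2"),
]
partitions = partition_sequences(names, mappings)
for group in partitions:
    print(group)
```

Each resulting group can then be handed to a separate graph-building job, which is where the parallelism across a cluster comes from.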

subwaystation commented 6 months ago

Also, the pggb version used in the paper https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03098-2 was 0.2.0, which was released in Nov 2021. Since then, lots of performance updates have been integrated into pggb. I myself was already able to build a pangenome graph directly from all 90 haplotypes at once (not per chromosome) using https://nf-co.re/pangenome. This pipeline directly mirrors https://github.com/pangenome/pggb/blob/master/partition-before-pggb followed by https://github.com/pangenome/pggb/blob/master/pggb. While I did not evaluate pggb 0.2.0, the current tools of pggb are certainly up to the task(s) executed in the mentioned paper. Even 104 haplotypes would run smoothly.

yihangs commented 5 months ago

Thank you for the reply! I have another pggb-related question, posted here: https://github.com/ekg/seqwish/issues/121. Do you have any thoughts on that?

Thanks!