labgem / PPanGGOLiN

Build a partitioned pangenome graph from microbial genomes
https://ppanggolin.readthedocs.io

Improved Memory Efficiency for `ppanggolin fasta` Command #283

Closed JeanMainguy closed 1 month ago

JeanMainguy commented 2 months ago

The `ppanggolin fasta` command uses more memory than it needs to. It currently loads a large amount of data to build the pangenome object, which makes filtering sequences easy but is costly in both memory and time, especially for large pangenomes with thousands of genomes.

This PR changes the command to read directly from the HDF5 tables and write the sequences on the fly, which avoids holding that data in memory. Intermediate tables are loaded only when they are needed.
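To illustrate the idea of writing sequences on the fly, here is a minimal sketch. It is not the code from this PR: `iter_selected` and `write_fasta` are hypothetical helper names, and the rows are plain `(id, sequence)` tuples standing in for rows read one at a time from an HDF5 table. The point is only that each record is filtered and written as it is read, so memory use does not grow with the number of sequences.

```python
import io
from typing import Iterable, Iterator, Set, TextIO, Tuple

Row = Tuple[str, str]  # (sequence id, sequence); stand-in for one HDF5 table row


def iter_selected(rows: Iterable[Row], wanted: Set[str]) -> Iterator[Row]:
    """Lazily yield only the rows whose id is in `wanted` (hypothetical helper)."""
    for seq_id, seq in rows:
        if seq_id in wanted:
            yield seq_id, seq


def write_fasta(records: Iterable[Row], out: TextIO, width: int = 60) -> int:
    """Write records to `out` one at a time and return how many were written."""
    count = 0
    for seq_id, seq in records:
        out.write(f">{seq_id}\n")
        # Wrap the sequence at `width` characters per line.
        for start in range(0, len(seq), width):
            out.write(seq[start:start + width] + "\n")
        count += 1
    return count


def demo_rows() -> Iterator[Row]:
    """Generator used in place of a real table iterator (assumption for the demo)."""
    yield "g1", "ATGC"
    yield "g2", "GGCC"
    yield "g3", "TTAA"


if __name__ == "__main__":
    buffer = io.StringIO()
    written = write_fasta(iter_selected(demo_rows(), {"g1", "g3"}), buffer)
    print(written)
    print(buffer.getvalue(), end="")
```

Because both helpers work on iterators, only one row is held in memory at a time; the set of wanted ids is the only structure that has to be loaded up front.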

The arguments `--genes`, `--proteins`, `--prot_families`, and `--gene_families` have been optimized. Only `--regions` works the same as before; optimizing it needs more work.

JeanMainguy commented 1 month ago

Benchmark on an E. coli pangenome made of 5k genomes, resulting in a 2.4 GB HDF5 file.

| command | Peak memory, improve_fasta_cmd (GB) | Peak memory, v2.1.2 (GB) | Time, improve_fasta_cmd (min) | Time, v2.1.2 (min) |
| --- | --- | --- | --- | --- |
| `--genes softcore` | 5.2 | 37.4 | 9.0 | 26.2 |
| `--genes core` | 4.8 | 36.3 | 3.4 | 24.1 |
| `--gene_families softcore` | 4.8 | 36.3 | 3.3 | 24.1 |
| `--genes all` | 3.0 | 33.6 | 9.5 | 25.4 |
| `--prot_families core` | 4.8 | 32.9 | 1.4 | 20.6 |
| `--genes persistent` | 3.8 | 32.5 | 7.5 | 25.3 |
| `--genes rgp` | 2.4 | 32.1 | 4.2 | 25.1 |
| `--genes shell` | 2.6 | 31.9 | 5.1 | 24.4 |
| `--genes cloud` | 1.2 | 31.4 | 2.3 | 23.3 |
| `--gene_families cloud` | 1.1 | 31.3 | 2.0 | 23.1 |
| `--gene_families all` | 1.1 | 31.3 | 1.6 | 23.4 |
| `--gene_families shell` | 1.1 | 31.3 | 1.5 | 23.2 |
| `--genes module_0` | 1.1 | 31.3 | 2.1 | 22.5 |
| `--genes module_1` | 1.1 | 31.3 | 1.6 | 23.1 |
| `--gene_families module_1` | 1.1 | 31.3 | 1.4 | 23.1 |
| `--gene_families module_0` | 1.1 | 31.3 | 1.4 | 23.1 |
| `--prot_families rgp` | 1.2 | 28.4 | 0.4 | 21.5 |
| `--prot_families persistent` | 0.5 | 11.1 | 0.1 | 5.2 |
| `--prot_families shell` | 0.5 | 11.1 | 0.1 | 5.2 |
| `--prot_families all` | 0.5 | 11.1 | 0.1 | 5.3 |
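
For readers who want the improvement as a ratio rather than as raw values, a small sketch like the one below could be used. The two rows are copied from the benchmark above; `fold_change` is a hypothetical helper name, and the remaining rows can be added in the same form.

```python
from typing import List, Tuple

# (command, peak GB new, peak GB v2.1.2, minutes new, minutes v2.1.2),
# copied from two rows of the benchmark; add the other rows as needed.
BENCHMARK: List[Tuple[str, float, float, float, float]] = [
    ("--genes softcore", 5.2, 37.4, 9.0, 26.2),
    ("--prot_families all", 0.5, 11.1, 0.1, 5.3),
]


def fold_change(old: float, new: float) -> float:
    """Return how many times smaller `new` is than `old` (hypothetical helper)."""
    return old / new


if __name__ == "__main__":
    for command, mem_new, mem_old, time_new, time_old in BENCHMARK:
        mem_fold = fold_change(mem_old, mem_new)
        time_fold = fold_change(time_old, time_new)
        print(f"{command}: memory x{mem_fold:.1f}, time x{time_fold:.1f}")
```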

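This PR significantly reduces memory usage and speeds up execution: in this benchmark, peak memory drops from 11.1–37.4 GB with v2.1.2 to 0.5–5.2 GB, and run time drops from 5.2–26.2 min to 0.1–9.5 min.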