Closed JeanMainguy closed 1 month ago
Benchmark on an E. coli pangenome made of 5k genomes, resulting in a 2.4 GB HDF5 file.
| Command | Peak memory, improve_fasta_cmd (GB) | Peak memory, v2.1.2 (GB) | Time, improve_fasta_cmd (min) | Time, v2.1.2 (min) |
|---|---|---|---|---|
| `--genes softcore` | 5.2 | 37.4 | 9.0 | 26.2 |
| `--genes core` | 4.8 | 36.3 | 3.4 | 24.1 |
| `--gene_families softcore` | 4.8 | 36.3 | 3.3 | 24.1 |
| `--genes all` | 3.0 | 33.6 | 9.5 | 25.4 |
| `--prot_families core` | 4.8 | 32.9 | 1.4 | 20.6 |
| `--genes persistent` | 3.8 | 32.5 | 7.5 | 25.3 |
| `--genes rgp` | 2.4 | 32.1 | 4.2 | 25.1 |
| `--genes shell` | 2.6 | 31.9 | 5.1 | 24.4 |
| `--genes cloud` | 1.2 | 31.4 | 2.3 | 23.3 |
| `--gene_families cloud` | 1.1 | 31.3 | 2.0 | 23.1 |
| `--gene_families all` | 1.1 | 31.3 | 1.6 | 23.4 |
| `--gene_families shell` | 1.1 | 31.3 | 1.5 | 23.2 |
| `--genes module_0` | 1.1 | 31.3 | 2.1 | 22.5 |
| `--genes module_1` | 1.1 | 31.3 | 1.6 | 23.1 |
| `--gene_families module_1` | 1.1 | 31.3 | 1.4 | 23.1 |
| `--gene_families module_0` | 1.1 | 31.3 | 1.4 | 23.1 |
| `--prot_families rgp` | 1.2 | 28.4 | 0.4 | 21.5 |
| `--prot_families persistent` | 0.5 | 11.1 | 0.1 | 5.2 |
| `--prot_families shell` | 0.5 | 11.1 | 0.1 | 5.2 |
| `--prot_families all` | 0.5 | 11.1 | 0.1 | 5.3 |
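This PR significantly reduces memory usage and speeds up execution: across the benchmarked commands, peak memory drops from 11.1 to 37.4 GB with v2.1.2 down to 0.5 to 5.2 GB, and run time drops from 5.2 to 26.2 min down to 0.1 to 9.5 min.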
The `ppanggolin fasta` command could use less memory. Right now, it loads a lot of data to build the pangenome object, which makes filtering sequences easy but comes with a big memory and time cost, especially for large pangenomes with thousands of genomes.

This PR changes the command to read directly from the HDF5 tables and write the sequences on the fly, reducing the load. Some intermediate tables are only loaded when needed.
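
To illustrate the general idea (this is not the code from this PR), here is a minimal sketch of the streaming pattern: rows are consumed one at a time from an iterator, filtered, and written straight to the output, so no full in-memory object is built. The `GeneRow` layout, the `fake_table` generator, and `write_fasta_streaming` are hypothetical names chosen for the example; the real command reads its rows from the pangenome HDF5 file, which is replaced here by a plain Python generator.

```python
import io
from typing import Iterable, Iterator, NamedTuple, Set, TextIO


class GeneRow(NamedTuple):
    """Hypothetical row layout; the real HDF5 table schema may differ."""
    gene_id: str
    family: str
    sequence: str


def fake_table() -> Iterator[GeneRow]:
    """Stand-in for iterating over an on-disk table, one row at a time."""
    yield GeneRow("g1", "famA", "ATGC")
    yield GeneRow("g2", "famB", "ATGG")
    yield GeneRow("g3", "famA", "ATGAAA")


def write_fasta_streaming(
    rows: Iterable[GeneRow],
    wanted_families: Set[str],
    out: TextIO,
    width: int = 60,
) -> int:
    """Write matching sequences as FASTA while iterating; return the count.

    Only one row is held in memory at a time, so peak memory does not
    grow with the number of rows in the table.
    """
    written = 0
    for row in rows:
        if row.family not in wanted_families:
            continue  # skip rows that do not match the filter
        out.write(f">{row.gene_id}\n")
        # Wrap the sequence at `width` characters per line.
        for start in range(0, len(row.sequence), width):
            out.write(row.sequence[start:start + width] + "\n")
        written += 1
    return written


if __name__ == "__main__":
    buffer = io.StringIO()
    count = write_fasta_streaming(fake_table(), {"famA"}, buffer)
    print(count)
    print(buffer.getvalue(), end="")
```

The set of wanted families plays the role of a small lookup that is built only when a filter needs it, which is the same trade-off as loading intermediate tables on demand.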
The arguments `--genes`, `--proteins`, `--prot_families`, and `--gene_families` have been optimized. Only `--regions` works the same as before, as it needs more work to optimize.