phylo42 / IPK

Inference of phylo-k-mers
MIT License

Skip filtering computations when not requested #7

Closed: blinard-BIOINFO closed this issue 4 months ago

blinard-BIOINFO commented 2 years ago

Currently, even when filtering is not requested, xpas-build still runs a filtering step, which can result in 75% of the computation being useless.

See the example below, where the phylo-k-mer computation took 9 seconds while the useless filtering computation took 40 seconds.

# on lamarck
$ /beegfs/linard/CLAPPAS/clapas_core/bin/lib/xpas/build/xpas-build-aa-pos --merge-branches -k 5 -o 7.0 --ar-binary "$(which raxml-ng)" --refalign /beegfs/linard/CLAPPAS/33208/pruned/3BDA0/Ax/A0_nx0_la.align --reftree /beegfs/linard/CLAPPAS/33208/pruned/3BDA0/Tx/T0_nx0_la.tree.raxml.bestTree.rerooted --ar-dir /beegfs/linard/CLAPPAS/33208/pruned/3BDA0/Dx/A0_nx0_la/AR --workdir /beegfs/linard/CLAPPAS/33208/pruned/3BDA0/Dx/A0_nx0_la/k5_o7.00 --uncompressed

Building database: [stage 1 / 2]:
Calculated 6206257 phylo-k-mers.
Calculation time: 9472

Building database: [stage 2 / 2]:
Kept 279656 / 279656 k-mers (100%) | 3794695 / 3794695 entries (100%).
Filtering time: 40571     <<<<====== !!!!!!!!

Building database: Done.
Built 279656 phylo-k-mers for 279656 different k-mers.
Total time (ms): 50043            

Saving database to: /beegfs/linard/CLAPPAS/33208/pruned/3BDA0/Dx/A0_nx0_la/k5_o7.00/DB_k5_o7.0.rps...
Compression: OFF
Time (ms): 187
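
The share of each stage can be read off the timings in the log above. Here is a minimal sketch, assuming all three reported times are in milliseconds (only the total is labeled as such, but the two stage times add up to it exactly). The `stage_shares` helper is hypothetical and not part of xpas-build.

```python
import re

# Timing lines copied from the log above (assumed to be in milliseconds).
LOG = """\
Calculation time: 9472
Filtering time: 40571
Total time (ms): 50043
"""


def stage_shares(log: str) -> dict:
    """Return each stage's share of the total build time, in percent."""
    pattern = r"^(Calculation|Filtering|Total) time(?: \(ms\))?:\s*(\d+)"
    times = {name: int(value) for name, value in re.findall(pattern, log, flags=re.MULTILINE)}
    total = times.pop("Total")
    return {name: 100.0 * t / total for name, t in times.items()}


shares = stage_shares(LOG)
for name, share in shares.items():
    print(f"{name}: {share:.1f}%")
# Calculation: 18.9%
# Filtering: 81.1%
```

So stage 2 accounts for roughly 81% of the build time in this run, even though it kept 100% of the k-mers.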
nromashchenko commented 2 years ago

There is no filtering happening; all this time is spent merging temporary node databases (see "Kept ... 100% k-mers"). This example is slow because of intensive I/O, which we can reduce by changing the db_builder::_num_batches parameter. I will do this in a future version.
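
As a rough illustration of why a batch count can drive I/O during such a merge, here is a toy model. It is a hypothetical sketch, not the actual IPK code: the functions `batch_of`, `write_node_batches` and `merge_batches`, the modulo-based batching, and the in-memory dictionary standing in for temporary files are all assumptions made for the example.

```python
from collections import defaultdict


def batch_of(kmer_code: int, num_batches: int) -> int:
    """Assign a k-mer to a batch (assumed scheme: simple modulo)."""
    return kmer_code % num_batches


def write_node_batches(node_entries: dict, num_batches: int) -> dict:
    """Split every node's entries into per-batch 'temporary files'.

    Each key (node_id, batch_id) stands in for one temporary file on disk.
    """
    files = defaultdict(list)
    for node_id, entries in node_entries.items():
        for kmer_code, score in entries:
            key = (node_id, batch_of(kmer_code, num_batches))
            files[key].append((kmer_code, node_id, score))
    return dict(files)


def merge_batches(files: dict, num_batches: int):
    """Merge batch by batch, counting how many 'files' had to be read."""
    merged = {}
    reads = 0
    for batch_id in range(num_batches):
        for (node_id, file_batch), entries in files.items():
            if file_batch != batch_id:
                continue
            reads += 1  # one temporary file opened and read
            for kmer_code, node, score in entries:
                merged.setdefault(kmer_code, []).append((node, score))
    return merged, reads


# Toy input: 3 nodes, each scoring the same 8 k-mer codes.
node_entries = {
    node: [(code, -1.0 * (node + 1)) for code in range(8)]
    for node in range(3)
}

merged_1, reads_1 = merge_batches(write_node_batches(node_entries, 1), 1)
merged_4, reads_4 = merge_batches(write_node_batches(node_entries, 4), 4)
print(reads_1, reads_4)  # 3 12
```

In this toy model the merged result is identical for both settings, but the number of temporary files read grows with the batch count, while each batch holds fewer k-mers in memory at once. Whether the real db_builder trades I/O against memory in the same way is an assumption.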

nromashchenko commented 4 months ago

This should not happen anymore with a791b83c and a5fa858cb (unless you enable --on-disk, which I don't advise for CLAPPAS).

I'll release a new version soon. Let me know if you encounter further problems with I/O.