theiagen / public_health_bioinformatics

Bioinformatics workflows for genomic characterization, submission preparation, and genomic epidemiology of pathogens of public health concern.
GNU General Public License v3.0

Refine memory and CPUs for frequent tasks #484

Closed: andrewjpage closed this issue 4 months ago

andrewjpage commented 4 months ago

:bug:

:pencil: Describe the Issue

Many tasks request far more CPU and memory than they actually need, which substantially increases the cost of running them. Focusing on the most frequently run tasks, assess how many resources each one genuinely requires. For CPUs, check that the extra cores are actually used by the application and make a noticeable difference to runtime. For example, if a task spends 3 minutes downloading a database and 2 seconds running an analysis script, that script does not need multiple CPUs. The same applies to memory.

Examples:

- ts_mlst: currently requests 4 CPUs and 8 GB RAM, but only needs 1 CPU and 2 GB RAM.
- snp_sites: currently requests 4 GB RAM, but only needs 2 GB.
- gambit: currently requests 16 GB RAM and 8 CPUs, but only needs 2 GB RAM and 1 CPU.
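For reference, right-sizing one of these would only touch the task's `runtime` block. Below is a minimal WDL sketch, assuming the measured needs quoted above hold; the task name, inputs, and command are illustrative and not the repository's actual ts_mlst definition.

```wdl
version 1.0

# Illustrative sketch only, not the actual ts_mlst task in this repository;
# resource values assume the measurements quoted above hold.
# (docker and disk attributes omitted for brevity)
task ts_mlst_trimmed {
  input {
    File assembly
  }
  command <<<
    # mlst writes its TSV report to stdout
    mlst ~{assembly} > ts_mlst.tsv
  >>>
  output {
    File report = "ts_mlst.tsv"
  }
  runtime {
    cpu: 1          # down from 4; the mlst call makes no real use of extra threads
    memory: "2 GB"  # down from 8 GB; peak memory sits well under 2 GB
  }
}
```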

Systematically check each task, noting how long the core commands take to run and how much RAM they use. You can find the peak RAM of a command by prefixing it with `/usr/bin/time -v` and reading the 'Maximum resident set size (kbytes)' line of its report.
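As a concrete (hypothetical) illustration of that measurement step, the core command can be wrapped with `/usr/bin/time -v` inside a task's command section and the report kept as an output, assuming GNU time is available at `/usr/bin/time` in the task's container:

```wdl
version 1.0

# Hypothetical profiling wrapper; assumes GNU time exists at /usr/bin/time
# inside the container (it is a separate binary, not a shell builtin).
task profile_mlst {
  input {
    File assembly
  }
  command <<<
    # -o writes the time report to its own file, so the
    # "Maximum resident set size (kbytes)" and wall-clock lines
    # can be reviewed without mixing into the tool's stderr.
    /usr/bin/time -v -o time_report.txt mlst ~{assembly} > mlst.tsv
  >>>
  output {
    File report   = "mlst.tsv"
    File time_log = "time_report.txt"
  }
  runtime {
    cpu: 1
    memory: "2 GB"
  }
}
```

The same wrapper applies to any core command; after a few representative runs, the `cpu` and `memory` values in `runtime` can be tuned down to the observed peaks plus a safety margin.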

kapsakcj commented 4 months ago

I've been waiting so long for someone else to suggest this. We could seriously optimize a lot of tasks

andrewjpage commented 4 months ago

Will split this out into separate issues as it's quite large.