khyox / recentrifuge

Recentrifuge: robust comparative analysis and contamination removal for metagenomics
http://www.recentrifuge.org

Add threads options, why and how #54

Open jfouret opened 2 months ago

jfouret commented 2 months ago

Hi,

You could easily add an option to control the number of threads.

A lot of people use an HPC cluster or a job scheduling system (Slurm, Nextflow, AWS Batch, etc.) where one needs to reserve a precise number of threads (e.g. 8), but the jobs ultimately run on machines where the CPU count is higher.

It can sometimes be very tricky to set the number of CPUs to reserve depending on the number of samples, especially when your tool is integrated into an automated workflow.

It appears to me that you could easily add this option, for example:

parser.add_argument(
  '--threads',
  type=int,
  default=os.cpu_count(),
  help='Number of threads to use (default: number of CPU cores)'
)

with:

with mpctx.Pool(processes=min(os.cpu_count(), args.threads,
                              len(input_files))) as pool:

You can also combine the --sequential argument with --threads, switching to sequential mode when threads is equal to 1.
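To make the idea concrete, here is a minimal sketch, not the actual Recentrifuge code. The helper names `resolve_workers` and `run_all` are hypothetical, and `get_context()` only stands in for the tool's own `mpctx`. It caps the pool size at the smallest of the CPU count, the requested threads and the number of inputs, and skips the pool entirely when only one worker is left.

```python
import os
from multiprocessing import get_context


def resolve_workers(threads, n_inputs):
    """Return the number of worker processes to start (always >= 1)."""
    # os.cpu_count() may return None when the count cannot be determined.
    cpus = os.cpu_count() or 1
    return max(1, min(cpus, threads, n_inputs))


def run_all(func, inputs, threads):
    """Apply func to every input, using a pool only when it can help."""
    workers = resolve_workers(threads, len(inputs))
    if workers == 1:
        # Same path a sequential mode would take: no pool is created.
        return [func(item) for item in inputs]
    mpctx = get_context()  # hypothetical stand-in for the tool's mpctx
    with mpctx.Pool(processes=workers) as pool:
        return pool.map(func, inputs)
```

With this shape, `--threads 1` behaves like a sequential run without needing a separate code path.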

Best,

khyox commented 2 months ago

Hi @jfouret, thanks for the suggestion! In the beginning there were typically more cores than samples, but this has changed drastically over time and the opposite situation is now the common one, so it makes sense to add that argument. Combining it with sequential could be a good alternative too. Would you like to send a PR for those changes?

jfouret commented 2 months ago

I can try when I have some time soon. For backward compatibility, let's keep --sequential, giving it priority over --threads, which defaults to os.cpu_count().
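A rough sketch of that priority rule, assuming a standalone parser. `build_parser` and `effective_threads` are hypothetical names used only for illustration, and the real argument definitions in Recentrifuge may differ.

```python
import argparse
import os


def build_parser():
    """Build a hypothetical parser with only the two relevant options."""
    parser = argparse.ArgumentParser(description='illustrative parser')
    parser.add_argument(
        '--sequential',
        action='store_true',
        help='Run without parallel workers (takes priority over --threads)'
    )
    parser.add_argument(
        '--threads',
        type=int,
        default=os.cpu_count() or 1,  # cpu_count() may return None
        help='Number of threads to use (default: number of CPU cores)'
    )
    return parser


def effective_threads(args):
    """Return the thread count to honor, with --sequential winning."""
    if args.sequential:
        return 1
    return max(1, args.threads)  # guard against zero or negative values
```

For example, `effective_threads(build_parser().parse_args(['--sequential', '--threads', '8']))` gives 1, so existing invocations that pass --sequential keep their behavior.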