mschubert / clustermq

R package to send function calls as jobs on LSF, SGE, Slurm, PBS/Torque, or any of these via SSH
https://mschubert.github.io/clustermq/
Apache License 2.0

How to learn about worker usage #205

Closed: wlandau closed this issue 8 months ago

wlandau commented 4 years ago

My team and I are trying to conserve resources on our SGE cluster, and we are wondering if there is a way to tell which clustermq workers are actually occupied at any given time. Can a QSys object provide such information? This would really help us iron out best practices for our large drake-powered simulation pipelines. Usually we just throw 100 workers at the project and hope it's the right number, and it would be great to have empirical evidence to make adjustments.

My apologies if I overlooked something obvious.

cc @cravenlilly.

mschubert commented 4 years ago

What information do you need that the summary does not provide?

[screenshot: the summary line clustermq prints when its workers shut down, reporting the master's wall time and CPU use alongside the workers' average CPU and maximum memory]

This is generated from QSys$summary_stats().

It is called during QSys$cleanup(), but you can also call it yourself (I think only before cleanup). Note that workers send this information only when shutting down, not with each call.
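
For reference, a minimal sketch of where that summary shows up in normal use (the printed numbers here are illustrative, not real output):

library(clustermq)

# When the Q() call finishes and its workers shut down, clustermq prints
# the aggregate summary generated from QSys$summary_stats(), e.g.:
#   Master: [2.1s 5.0% CPU]; Worker: [avg 50.0% CPU, max 250.0 Mb]
res = Q(function(x) x * 2, x = 1:100, n_jobs = 2)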

wlandau commented 4 years ago

It would be extremely helpful to see those results on a worker-by-worker basis (i.e. private$worker_stats). I would love to be able to set a global option that tells clustermq to write worker-specific CPU and memory usage to a flat file (e.g. a CSV) at the end of a pipeline, as sketched below.
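
(A purely hypothetical sketch of what such an option could look like; the option name, file, and columns are made up here for illustration and are not an existing clustermq API:)

# hypothetical: ask clustermq to dump per-worker stats at the end of a run
options(clustermq.worker_stats_file = "worker_stats.csv")
# ...run the pipeline as usual...
# clustermq would then write one row per worker, e.g. with columns like:
#   worker_id, n_calls, cpu_time_s, mem_max_mb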

mschubert commented 4 years ago

I see that this could be useful.

I'll consider including it in the 0.9 refactoring.

wlandau commented 4 years ago

Thanks, Michael!

mschubert commented 8 months ago

This is now implemented as of commit 0aa0e4ab44b69a4a9a0fd85455e93c22550c5677:

w = workers(3)  # start a pool of 3 workers
w$recv()        # receive the next message from a worker
w$info()        # <--- returns a data.frame of all workers with their stats
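
A sketch of how this could cover the per-worker export requested above, assuming w$info() returns one row per worker as described (the exact columns are whatever that data.frame carries):

w = workers(3)
w$recv()
stats = w$info()  # one row per worker, with CPU/memory stats
write.csv(stats, "worker_stats.csv", row.names = FALSE)  # keep for tuning n_jobs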