mle-infrastructure / mle-toolbox

Lightweight Tool to Manage Distributed ML Experiments 🛠
https://mle-infrastructure.github.io/mle_toolbox/toolbox/
MIT License

Better control over resource usage #9

Closed by RobertTLange 3 years ago

RobertTLange commented 3 years ago

Currently it is not obvious how many jobs are being launched (e.g. at each iteration of a grid batch evaluation), since all random seeds for a single experiment config are always launched at once. Effectively, we always get num_evals_per_iter * num_iter_per_batch new experiment launches at the beginning of an episode.

Idea: Ideally this should be solved by setting a cap, total_number_running_jobs, on the number of jobs running at any given point in time. The toolbox would then automatically take care of how many jobs get scheduled at each point.
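
A minimal sketch of what such a cap could look like, assuming jobs are plain Python callables run in local threads. This is not the toolbox's implementation: the function name run_with_job_cap is hypothetical, and only total_number_running_jobs is taken from the proposal above. A new job is started only once a running one finishes, so the cap is never exceeded.

```python
from collections import deque
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_with_job_cap(job_fns, total_number_running_jobs):
    """Run every callable in job_fns with at most the given number in flight.

    Hypothetical helper for illustration only (not part of mle-toolbox).
    Returns (results, peak): results follows the order of job_fns, and peak
    is the largest number of jobs that were running at the same time.
    """
    if total_number_running_jobs < 1:
        raise ValueError("total_number_running_jobs must be >= 1")

    pending = deque(enumerate(job_fns))  # jobs that have not been launched yet
    results = [None] * len(pending)
    running = {}  # maps each in-flight future to its job index
    peak = 0

    with ThreadPoolExecutor(max_workers=total_number_running_jobs) as pool:
        while pending or running:
            # Top up the running set until the cap is reached.
            while pending and len(running) < total_number_running_jobs:
                idx, fn = pending.popleft()
                running[pool.submit(fn)] = idx
            peak = max(peak, len(running))

            # Block until at least one job finishes, then free its slot.
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in done:
                results[running.pop(future)] = future.result()

    return results, peak


if __name__ == "__main__":
    # Ten toy "experiments", never more than three running at once.
    jobs = [lambda i=i: i * i for i in range(10)]
    outputs, peak_running = run_with_job_cap(jobs, total_number_running_jobs=3)
    print(outputs, peak_running)
```

On a cluster, the same loop would presumably submit to the scheduler and poll job status instead of using threads, but the bookkeeping (pending queue, running set, top up to the cap) stays the same.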

This should potentially be addressed together with issue #8.

RobertTLange commented 3 years ago

It would also be nice to have not only an exclude_nodes option in the experiment .yaml, but also a way to restrict the overall usage of each node on the cluster. E.g. say you want to leave 20% of each node (or of specific ones) free: Denis may need a couple of cores on cognition13 for GPU development. Maybe there is a grid engine command using CPU_LOAD?!
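
As a rough illustration of the "leave 20% free" idea, the arithmetic could look like the sketch below. Everything here is an assumption: the helper name schedulable_cores, the reserve_percent parameter, and the core counts are made up for the example (only the node name cognition13 comes from the comment above), and no real grid engine query is performed.

```python
def schedulable_cores(total_cores, used_cores, reserve_percent=20):
    """Return how many more cores may be claimed on a node.

    Hypothetical helper: keeps reserve_percent of the node's cores free
    (rounded up to whole cores) and never returns a negative number.
    """
    if not 0 <= reserve_percent <= 100:
        raise ValueError("reserve_percent must be between 0 and 100")
    # Integer ceiling division avoids floating-point rounding surprises.
    reserved = -(-total_cores * reserve_percent // 100)
    return max(0, total_cores - reserved - used_cores)


if __name__ == "__main__":
    # Made-up load snapshot: node name -> (total cores, cores in use).
    snapshot = {"cognition13": (32, 20), "hypothetical-node": (10, 9)}
    for node, (total, used) in snapshot.items():
        print(node, schedulable_cores(total, used))
```

With a 20% reserve, a 32-core node keeps 7 cores free, so with 20 cores already in use only 5 more can be scheduled on it.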

RobertTLange commented 3 years ago

This has been partially (but sufficiently 😄) addressed with issue #8 (asynchronous scheduling).