mle-infrastructure / mle-toolbox

Lightweight Tool to Manage Distributed ML Experiments 🛠
https://mle-infrastructure.github.io/mle_toolbox/toolbox/
MIT License

Allow user to easily adapt job submission options/template #39

Closed RobertTLange closed 2 years ago

RobertTLange commented 3 years ago

Right now, one limitation of the toolbox is that it cannot easily be customised to other people's resources. Different clusters (SGE or Slurm) may query and assign resources in different ways. For example, the number of GPUs:

https://github.com/RobertTLange/mle-toolbox/blob/b659278184d68a21f9f212e8e93a9193719b3ef0/mle_toolbox/experiment/sge_job_management.py#L72

https://github.com/RobertTLange/mle-toolbox/blob/b659278184d68a21f9f212e8e93a9193719b3ef0/mle_toolbox/experiment/slurm_job_management.py#L69

For this to be open-sourced, we need to make it much more general. The user should simply set the scheduler syntax at initialisation. We could then store it in the mle_config.toml file, so that the toolbox knows how to request resources and the user can update the settings later if the resources change. This can/should be integrated into mle-init and issue #24.
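A rough sketch of what this could look like, not the toolbox's actual implementation: the dict below stands in for a parsed section of mle_config.toml, and every key name, the `render_header` helper, and the SGE resource names (`gpu`, `smp`) are assumptions for illustration, since those names differ from cluster to cluster.

```python
from string import Template

# Hypothetical per-scheduler syntax, as it might look after parsing a section
# of mle_config.toml. Key names and SGE resource names are assumptions; each
# cluster admin may define them differently.
SCHEDULER_SYNTAX = {
    "slurm": {
        "directive": "#SBATCH",
        "gpu": "--gres=gpu:$n",
        "cpu": "--cpus-per-task=$n",
    },
    "sge": {
        "directive": "#$",
        "gpu": "-l gpu=$n",
        "cpu": "-pe smp $n",
    },
}


def render_header(scheduler: str, resources: dict) -> str:
    """Render job-script header lines from the user-configured templates."""
    syntax = SCHEDULER_SYNTAX[scheduler]
    lines = []
    for name, count in resources.items():
        if count <= 0:
            # Nothing requested for this resource, so emit no line.
            continue
        option = Template(syntax[name]).substitute(n=count)
        lines.append(f"{syntax['directive']} {option}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_header("slurm", {"gpu": 2, "cpu": 4}))
    print(render_header("sge", {"gpu": 1, "cpu": 0}))
```

With this layout, adapting to a new cluster would only mean editing the template strings in the config, without touching the job management code.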

It may also be worth looking at how Ray and PyTorch Lightning handle this.

RobertTLange commented 2 years ago

This is already partially addressed in mle-scheduler, as well as in the simplified mle_config.toml.