mle-infrastructure / mle-toolbox

Lightweight Tool to Manage Distributed ML Experiments 🛠
https://mle-infrastructure.github.io/mle_toolbox/toolbox/
MIT License

Automatic "canceled" experiment status detection in protocol db #22

Closed RobertTLange closed 2 years ago

RobertTLange commented 3 years ago

As of now, experiments which were aborted/terminated before completion are listed as running in the status variable of the pickledb instance. Given that the DB doesn't directly receive the info of the termination, it would be good to have some form of check/mechanism that sets the status to aborted or canceled in this case.

One potential way of doing so would be via time constraints of experiments. For example, if the user provides a time limit per job, we could calculate the maximum time for the entire experiment. If this time is exceeded and the status of the experiment is still running, we should change it. This could be checked every time the db instance is loaded. The steps would look as follows:

  1. Start-up experiment with time_per_job: "dd:hh:mm" given in single_job_args.
  2. Calculate total time for experiment as total_experiment_time = num_search_batches * time_per_job.
  3. Add maximum of job runtime to the db as max_completion_time = start_time + total_experiment_time.
  4. Whenever the db is loaded, check if current_time > max_completion_time and status == "running". If so, set status to aborted.
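
A rough sketch of the four steps above, in case it helps the discussion. It uses a plain dict in place of the pickledb entry, and the helper names (parse_time_per_job, compute_max_completion_time, refresh_status) are made up for illustration; none of them exist in the toolbox. It also assumes batches run one after another, so that total time is num_search_batches * time_per_job.

```python
from datetime import datetime, timedelta


def parse_time_per_job(time_per_job: str) -> timedelta:
    """Convert a "dd:hh:mm" string into a timedelta."""
    days, hours, minutes = (int(part) for part in time_per_job.split(":"))
    return timedelta(days=days, hours=hours, minutes=minutes)


def compute_max_completion_time(
    start_time: datetime, time_per_job: str, num_search_batches: int
) -> datetime:
    """Steps 2 and 3: latest time by which the experiment should be done."""
    total_experiment_time = num_search_batches * parse_time_per_job(time_per_job)
    return start_time + total_experiment_time


def refresh_status(entry: dict, current_time: datetime) -> dict:
    """Step 4: mark an overdue 'running' entry as 'aborted' (hypothetical hook on db load)."""
    overdue = current_time > entry["max_completion_time"]
    if entry["status"] == "running" and overdue:
        entry["status"] = "aborted"
    return entry


# Step 1: experiment started with time_per_job "00:02:30" and 4 search batches.
start_time = datetime(2021, 1, 1, 12, 0)
entry = {
    "status": "running",
    "max_completion_time": compute_max_completion_time(start_time, "00:02:30", 4),
}
refresh_status(entry, current_time=datetime(2021, 1, 1, 23, 0))
```

Entries that are already completed are left untouched, since only the combination of "running" and an exceeded deadline triggers the change.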
RobertTLange commented 3 years ago

Ideally, we would also have something similar for the case where an experiment fails due to something being wrong in the code. In this case the status would be set to failed. In total there would be 5 experiment statuses: [running, completed, canceled, aborted, timed_out]. This may be a little harder since we need a bulletproof way of detecting these failures. I am not sure, for example, how this would work out with the while-loop in the hyperparameter search which collects all logs. We may simply get stuck at this point?!
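
One possible way to avoid getting stuck, sketched under heavy assumptions: give the collection loop a deadline, so that jobs which never produce logs lead to a status change instead of an endless wait. The function collect_with_deadline and the is_done callback are hypothetical and only stand in for the actual log-collection loop. Returning "failed" on timeout is also just one choice; "timed_out" might fit better.

```python
import time
from typing import Callable, Iterable


def collect_with_deadline(
    job_ids: Iterable[str],
    is_done: Callable[[str], bool],
    timeout: float = 60.0,
    poll_interval: float = 1.0,
) -> str:
    """Poll until all jobs report done, or give up once the timeout has passed."""
    deadline = time.monotonic() + timeout
    pending = set(job_ids)
    while pending:
        # Drop every job whose logs are available by now.
        pending = {job_id for job_id in pending if not is_done(job_id)}
        if not pending:
            break
        if time.monotonic() >= deadline:
            # Some jobs never finished: report instead of looping forever.
            return "failed"
        time.sleep(poll_interval)
    return "completed"


# Job "b" never finishes, so the loop exits via the deadline.
status = collect_with_deadline(
    ["a", "b"], is_done=lambda job_id: job_id == "a", timeout=0.0, poll_interval=0.0
)
```

The returned status could then be written to the db entry in the same place where completed is set today.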