mle-infrastructure / mle-toolbox

Lightweight Tool to Manage Distributed ML Experiments 🛠
https://mle-infrastructure.github.io/mle_toolbox/toolbox/
MIT License

New `mle abort` subcmd - Clean experiment termination #68

Open RobertTLange opened 3 years ago

RobertTLange commented 3 years ago

I would like to have a subcommand that terminates all jobs associated with an experiment and removes all generated files and any other trace of it. Otherwise one has to manually call `qdel`, `scancel` or `gcloud compute instances delete`. This could be, for example, `mle abort <experiment_id>`, or simply `mle abort` followed by a user Q/A (which checks that the status of the experiment is running). A simple procedure could look as follows:

  1. Print a summary filtered to experiments whose status is running and get the experiment from the cmd args or from the user.
  2. Check that `experiment_id` is in the DB and that its status is running. Repeat the question if not.
  3. Get the job name from `single_job_args.job_name` in the DB.
  4. Delete all jobs whose name starts with `job_name`. How this is done will depend on the resource.
  5. Delete all files in `experiment_dir`.
  6. [Maybe 1. instead] Set the experiment status to aborted in the DB and push it back to GCP.
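
The steps above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the toolbox's actual implementation: the DB is modelled as a plain dict keyed by experiment ID, and `abort_experiment`, `list_jobs` and `cancel_job` are hypothetical names. The two callables stand in for the resource-specific logic (wrapping `qdel`, `scancel` or `gcloud compute instances delete`), and pushing the DB back to GCP is left out.

```python
import shutil
from pathlib import Path
from typing import Callable, Dict, Iterable, List


def abort_experiment(
    db: Dict[str, dict],
    experiment_id: str,
    list_jobs: Callable[[], Iterable[str]],
    cancel_job: Callable[[str], None],
) -> List[str]:
    """Abort a running experiment and return the names of the cancelled jobs.

    Assumed (hypothetical) DB layout per entry: "status",
    "single_job_args" -> "job_name", and "experiment_dir".
    """
    # Step 2: the experiment has to exist and be running.
    entry = db.get(experiment_id)
    if entry is None:
        raise KeyError(f"Unknown experiment: {experiment_id}")
    if entry["status"] != "running":
        raise ValueError(f"Experiment {experiment_id} is not running")

    # Step 3: read the job name prefix from the DB entry.
    prefix = entry["single_job_args"]["job_name"]

    # Step 4: cancel every job whose name starts with the prefix.
    # How jobs are listed and cancelled depends on the resource,
    # so both operations are injected by the caller.
    cancelled = [name for name in list_jobs() if name.startswith(prefix)]
    for name in cancelled:
        cancel_job(name)

    # Step 5: remove the experiment directory and everything in it.
    shutil.rmtree(Path(entry["experiment_dir"]), ignore_errors=True)

    # Step 6: mark the experiment as aborted (syncing to GCP not shown).
    entry["status"] = "aborted"
    return cancelled
```

Injecting the listing and cancelling callables keeps the procedure itself independent of the cluster type, so the same function could serve SGE, Slurm and GCP once the matching wrappers exist.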

Also allow the user to choose between termination via the experiment config `.yaml` and via `experiment_id`.

Note: Give credit to Tudor's Liftoff package.

RobertTLange commented 2 years ago

It would be great to have a keyboard interrupt wrapper that cleans up the protocol/VM instances. Have a look at this thread: https://stackoverflow.com/questions/1187970/how-to-exit-from-python-without-traceback
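
One possible shape for such a wrapper is a decorator that catches `KeyboardInterrupt`, runs a cleanup callback and then exits without a traceback. This is a minimal sketch: `cleanup_on_interrupt` and the `cleanup` callback are hypothetical names, and the callback is assumed to hold whatever tears down the protocol entry and VM instances. The exit code 130 follows the common shell convention for termination by SIGINT.

```python
import functools
import sys
from typing import Callable


def cleanup_on_interrupt(cleanup: Callable[[], None]):
    """Decorator factory: run `cleanup` when the wrapped call is interrupted."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except KeyboardInterrupt:
                # Ctrl-C was pressed: release resources first ...
                cleanup()
                # ... then exit via SystemExit, which prints no traceback.
                sys.exit(130)

        return wrapper

    return decorator
```

An experiment entry point could then be decorated with `@cleanup_on_interrupt(some_cleanup_function)`, so that Ctrl-C triggers the cleanup instead of leaving jobs and files behind.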