I would like to have a subcommand that terminates all jobs associated with an experiment and removes all generated files and any other trace of it. Otherwise one has to manually use `qdel`, `scancel` or `gcloud compute instances delete`. This could for example be `mle abort <experiment_id>`, or simply `mle abort` followed by a user Q/A (checking that the status of the experiment is `running`). A simple procedure could look as follows:
1. Print a summary filtered by status `running` and get the experiment from the command-line arguments or from the user.
2. Check that `experiment_id` is in the DB and that its status is `running`. Repeat the question if not.
3. Get the job name from `single_job_args.job_name` in the DB.
4. Delete all jobs whose names start with `job_name`. This will depend on the resource.
5. Delete all files in `experiment_dir`.
6. [Maybe as step 1 instead] Set the experiment status to `aborted` in the DB and push it back to GCP.
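The procedure above, minus the interactive prompt and the push to GCP, could be sketched roughly as follows. This is only an illustration under assumptions: the DB is modelled as a plain dict keyed by experiment ID, `abort_experiment` and `kill_jobs` are hypothetical names, and the resource-specific deletion (`qdel`, `scancel`, `gcloud compute instances delete`) is left to whatever callable is passed in. The sketch sets the status to `aborted` first, i.e. the "maybe as step 1" variant.

```python
import shutil
from typing import Callable, Dict


def abort_experiment(
    db: Dict[str, dict],              # assumption: DB modelled as a plain dict
    experiment_id: str,
    kill_jobs: Callable[[str], None],  # hypothetical resource-specific job killer
) -> bool:
    """Abort a running experiment; return False if there is nothing to abort."""
    entry = db.get(experiment_id)
    # Only experiments that exist and are currently running can be aborted.
    if entry is None or entry["status"] != "running":
        return False

    # Mark as aborted first so that nothing else launches follow-up jobs.
    entry["status"] = "aborted"

    # Delete all jobs belonging to the experiment (depends on the resource).
    kill_jobs(entry["single_job_args"]["job_name"])

    # Remove all files the experiment generated.
    shutil.rmtree(entry["experiment_dir"], ignore_errors=True)
    return True
```

A `False` return value is the signal for the caller to repeat the question to the user.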
Main problem: Grid search experiments launch new jobs whenever earlier jobs terminate, so killing the running jobs could simply trigger the next batch. How do we circumvent this?
Potential solution: Update the database between grid search batches and check whether the status was set to `aborted`. If not, update the batch counter in the database. If so, stop launching new jobs and break out of the hyperparameter run. This has the added advantage that we can show the current batch iteration in `mle monitor`.
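A minimal sketch of that check between batches, again assuming a dict-like DB. The names `run_grid_search`, `launch_batch` and the `batch_counter` field are hypothetical and only serve the illustration; `launch_batch` is assumed to block until its batch has finished.

```python
from typing import Callable, Dict, List, Sequence


def run_grid_search(
    db: Dict[str, dict],                       # assumption: DB modelled as a dict
    experiment_id: str,
    batches: Sequence[List],                   # one list of job configs per batch
    launch_batch: Callable[[List], None],      # hypothetical blocking launcher
) -> int:
    """Run batches in order; return how many batches were launched."""
    for batch_id, batch in enumerate(batches):
        # Re-read the status before every batch so an abort takes effect here.
        if db[experiment_id]["status"] == "aborted":
            return batch_id

        launch_batch(batch)

        # Record progress so a monitor can display the current batch iteration.
        db[experiment_id]["batch_counter"] = batch_id + 1
    return len(batches)
```

With this structure an abort never interrupts the launcher itself; it only prevents the next batch from starting, while the jobs already running are removed by the abort command.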
Also allow the user to choose between termination via the experiment config `.yaml` and via `experiment_id`.
Note: Give credit to Tudor's Liftoff package.