Open tonyyang-svail opened 5 years ago
Thanks for filing the issue! Yes, the ElasticDL CLI is only responsible for submitting the job to the cluster, and it’s non-blocking/async. It shouldn’t be hard to support a sync call. I’ll take a look at this.
Copying from the discussion at https://github.com/sql-machine-learning/sqlflow/pull/966
I think "an API that can be called to check the job status" is better, since a training job could last weeks and a long-lived connection can time out.
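To make the status-check idea concrete, here is a minimal sketch of a client-side polling loop. It is not an existing ElasticDL API: `wait_for_job`, the `get_status` callable, and the state names in `TERMINAL_STATES` are all hypothetical stand-ins for whatever the status API would end up returning.

```python
import time

# Assumed terminal state names; a real status API may use different ones.
TERMINAL_STATES = {"Succeeded", "Failed"}


def wait_for_job(get_status, poll_interval=30.0, timeout=None,
                 sleep=time.sleep, clock=time.monotonic):
    """Call get_status() repeatedly until it reports a terminal state.

    get_status is a hypothetical zero-argument callable returning a string.
    Each call is a short request, so no long-lived connection is held open.
    """
    deadline = None if timeout is None else clock() + timeout
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if deadline is not None and clock() >= deadline:
            raise TimeoutError(f"job still {status!r} after {timeout} seconds")
        sleep(poll_interval)
```

Because every poll is an independent call, a job that runs for weeks only needs the caller to keep retrying, and a dropped connection costs one poll rather than the whole wait.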
Hi @terrytangyuan, may I ask what is the progress on this issue?
I haven’t got a chance to work on this yet. Some questions:
Currently, the EDL high-level API call is asynchronous, meaning that the command
python -m edl train ...
returns right after the job is submitted to k8s, without waiting for the training job to complete. The only way to check for completion is through kubectl.
This is not ideal because a user may want to start a training job and a subsequent prediction job in a single bash script.
So EDL needs to support submitting the job synchronously.
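As a rough illustration of what a blocking call could do until EDL supports this natively, the sketch below polls a pod's phase through kubectl and returns a shell-style exit code, so that a script can chain a prediction step after training. It assumes the job's outcome is reflected in the phase of a single pod whose name the caller already knows; the function names and the pod name are hypothetical, and this is not how EDL itself is implemented.

```python
import subprocess
import time


def pod_phase(pod_name, namespace="default", run=subprocess.run):
    """Return the pod's Kubernetes phase (Pending, Running, Succeeded, Failed, Unknown)."""
    result = run(
        ["kubectl", "get", "pod", pod_name, "-n", namespace,
         "-o", "jsonpath={.status.phase}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def block_until_finished(pod_name, namespace="default", poll_interval=10.0,
                         run=subprocess.run, sleep=time.sleep):
    """Poll until the pod succeeds or fails; return 0 on success, 1 on failure."""
    while True:
        phase = pod_phase(pod_name, namespace, run)
        if phase == "Succeeded":
            return 0
        if phase == "Failed":
            return 1
        # Pending, Running, or Unknown: keep waiting.
        sleep(poll_interval)
```

A wrapper script could pass the returned value to `sys.exit`, which lets bash stop before the prediction step when training fails.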
Sidenote: this feature is required by SQLFlow since a user is likely to write