sql-machine-learning / elasticdl

Kubernetes-native Deep Learning Framework
https://elasticdl.org
MIT License

Support synchronous calls in the high-level API #1285

Open tonyyang-svail opened 5 years ago

tonyyang-svail commented 5 years ago

Currently, the EDL high-level API call is asynchronous: the command python -m edl train ... returns right after submitting the job to k8s, without waiting for the training job to complete. The only way to check for completion is through kubectl.

This is not ideal because a user may want to start a training job and a subsequent prediction job in a single bash script like

python -m edl train ... # I return immediately
python -m edl pred ...  # oops, I should wait for the training job to complete

So EDL needs to support submitting the job synchronously.
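
As a rough illustration of what a blocking submission could look like from the caller's side, here is a minimal Python sketch that polls until the job reaches a terminal state. It is not EDL's implementation. It assumes the job's progress can be read from the phase of a single Kubernetes pod; the helper names `kubectl_pod_phase` and `wait_for_job`, and whatever pod name is passed in, are hypothetical. The only external command used is `kubectl get pod ... -o jsonpath={.status.phase}`.

```python
import subprocess
import time
from typing import Callable, Optional

# Kubernetes pod phases after which the pod will not run any further.
TERMINAL_PHASES = {"Succeeded", "Failed"}


def kubectl_pod_phase(pod_name: str, namespace: str = "default") -> str:
    """Return the pod's phase as reported by kubectl.

    Hypothetical helper: which pod represents an EDL job is an assumption.
    """
    result = subprocess.run(
        ["kubectl", "get", "pod", pod_name, "-n", namespace,
         "-o", "jsonpath={.status.phase}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def wait_for_job(get_phase: Callable[[], str],
                 poll_seconds: float = 10.0,
                 timeout_seconds: Optional[float] = None) -> str:
    """Block until get_phase() returns a terminal phase, then return it.

    Raises TimeoutError if timeout_seconds elapses first.
    """
    deadline = None
    if timeout_seconds is not None:
        deadline = time.monotonic() + timeout_seconds
    while True:
        phase = get_phase()
        if phase in TERMINAL_PHASES:
            return phase
        if deadline is not None and time.monotonic() >= deadline:
            raise TimeoutError(f"job still in phase {phase!r} at timeout")
        time.sleep(poll_seconds)
```

A caller could then write something like `wait_for_job(lambda: kubectl_pod_phase("my-job-master"))` between the train and predict steps, where `my-job-master` stands in for whatever pod name the job actually uses. Passing the phase lookup as a callable keeps the wait loop independent of kubectl, so it can be exercised without a cluster.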

Sidenote: this feature is required by SQLFlow since a user is likely to write

SELECT ... TRAIN EDL.MODEL ...;
SELECT ... PREDICT EDL.MODEL ...;

terrytangyuan commented 5 years ago

Thanks for filing the issue! Yes, the ElasticDL CLI is only responsible for submitting the job to the cluster, and it’s non-blocking/async. It shouldn’t be hard to support a sync call. I’ll take a look at this.

tonyyang-svail commented 5 years ago

Copying from the discussion at https://github.com/sql-machine-learning/sqlflow/pull/966

I think "an API that can be called to check the job status" is better, since a training job could last weeks while a long-lived connection can time out.
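
To make the status-check idea concrete, here is a small Python sketch of what such a call might return. It is an assumption-laden illustration, not an existing EDL API: `JobStatus`, `job_status_from_phase`, and `is_finished` are hypothetical names, and the sketch assumes job status can be derived from a Kubernetes pod phase (`Pending`, `Running`, `Succeeded`, `Failed`, `Unknown`). Each call is short and stateless, so no connection has to stay open for the lifetime of the job.

```python
from enum import Enum


class JobStatus(Enum):
    """Hypothetical job states a status-check call could report."""
    PENDING = "pending"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    UNKNOWN = "unknown"


# Mapping from Kubernetes pod phases to the hypothetical job states.
_PHASE_TO_STATUS = {
    "Pending": JobStatus.PENDING,
    "Running": JobStatus.RUNNING,
    "Succeeded": JobStatus.SUCCEEDED,
    "Failed": JobStatus.FAILED,
    "Unknown": JobStatus.UNKNOWN,
}


def job_status_from_phase(phase: str) -> JobStatus:
    """Translate a pod phase string into a JobStatus.

    Unrecognized or empty phases are reported as UNKNOWN rather than raising.
    """
    return _PHASE_TO_STATUS.get(phase, JobStatus.UNKNOWN)


def is_finished(status: JobStatus) -> bool:
    """True once the job can make no further progress."""
    return status in (JobStatus.SUCCEEDED, JobStatus.FAILED)
```

A client such as SQLFlow could call the status check periodically and move on to the prediction step only once `is_finished` returns true and the status is `SUCCEEDED`.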

tonyyang-svail commented 5 years ago

Hi @terrytangyuan, may I ask what the progress on this issue is?

terrytangyuan commented 5 years ago

I haven’t got a chance to work on this yet. Some questions: