jlewi commented 4 years ago

Our synchronous training pipeline is currently spawning multiple instances of training rather than the expected 1 model per hour.

The problem appears to be that the code deciding whether to train a model only checks whether a trained model already exists. I don't think it takes into account whether a model is currently being trained. https://github.com/kubeflow/code-intelligence/blob/faeb65757214ac93259f417b81e9e2fedafaebda/Label_Microservice/go/cmd/automl/pkg/automl/automl.go#L101

My conjecture is that the following happens:

1. We launch a Tekton job to train the model.
2. The notebook loads the data into AutoML, which is a blocking operation.
3. The notebook initiates an AutoML training job but doesn't block until training is complete.
   - This is intentional since we want to upload the notebook output and not wait for the AutoML job to complete.

At this point

issue-label-bot[bot] commented 4 years ago

Issue-Label Bot is automatically applying the labels:

| Label | Probability |
| --- | --- |
| kind/bug | 0.63 |


jlewi commented 4 years ago

It looks like we need to also look at the datasets and see if there is a model training in progress.

jlewi commented 4 years ago

An auto PR (182) was created for a model that was trained by manually running the notebook.

Need to verify that a new model is trained automatically and then deployed.

jlewi commented 4 years ago

kubeflow/code-intelligence#184 opened a PR to update to the same model. It doesn't look like a new model got trained.
