kubeflow / examples

A repository to host extended examples and tutorials
Apache License 2.0

Reconcile object detection examples #185

Closed lluunn closed 5 years ago

lluunn commented 6 years ago

We have an object detection example for distributed training, https://github.com/kubeflow/examples/tree/master/object_detection, and one for GPU serving, #154

They currently use different models, but we should combine them into one example.

texasmichelle commented 6 years ago

Both examples focus on different techniques: distributed training and GPU serving. In the case of the example in object_detection, object detection appears to be an arbitrary choice. Using a different dataset or approach would not affect the ability to highlight distributed training. For #154, is that different? Would a different dataset or approach make sense for GPU serving?

I don't see strong justification for an additional E2E example to highlight distributed training, since it could have been added to an existing E2E example with less effort. It doesn't make sense at this point to have two separate object detection examples, so I'm very much in favor of combining. We can highlight both distributed training and GPU serving, but they obviously need to use the same dataset.

As-is, there are a lot of manual steps in object_detection. My preference is to optimize for a single example with clean, straightforward instructions. #154 is closer to that than object_detection. What would it look like if we added distributed training to the model used in #154? Is that a published example in the same repo as the model? If not, let's smooth out the process in object_detection and add GPU serving to that approach.

texasmichelle commented 6 years ago

If we absolutely want to keep them separate, #154 could be filed under Component-focused. If we do that, we should beef up the serving instructions in object_detection since they're pretty sparse.

lluunn commented 6 years ago

In #154, the model is also an arbitrary choice. It just demos well (you can see the boxes on the image in the result) and highlights GPU serving. So on second thought, why make it an E2E example if we want to highlight GPU serving?

And what do we gain from the object detection example, given that the GitHub issue example already has distributed training?

@jlewi @texasmichelle WDYT?

jlewi commented 6 years ago

#145 is the original bug. We need an example that illustrates TF Serving with GPUs.

To illustrate serving with GPUs, we need a model for which using GPUs makes sense, so an image model is an obvious choice.

The GitHub summarization example isn't a good choice. We are currently using Keras to serve it, and it's text data using an RNN, so it's probably not a good fit for illustrating GPUs with TF Serving.

Per #145, this is based on this blog post: https://cloud.google.com/blog/big-data/2017/09/performing-prediction-with-tensorflow-object-detection-models-on-google-cloud-machine-learning-engine

That post also does training, so I think we can get this working with training pretty easily.

Actually it looks like both this example and the example in object_detection are using the same Oxford-IIIT Pets dataset.

So I think we can put these two pieces together to have a complete E2E example of training.

@texasmichelle How about this:

* GitHub Issue Summarization
* Object Detection: use this to

/cc @royxue @ldcastell

royxue commented 6 years ago

I agree with the idea to combine these two parts; it would make this example look like a complete workflow.

Object detection provides detailed steps from creating the PVC to training, but maybe there are just too many YAMLs; it would be better to reduce the number of YAMLs or use ksonnet, as mentioned in #178.

yixinshi commented 6 years ago

I am also working on an example that uses object detection for batch prediction. I don't have a strong opinion about whether we should have training, TF Serving, and batch prediction in the same example workflow in this particular example. In some cases, users might just use existing models, such as those from https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md.

Also note that batch prediction doesn't need the hack to solve the "version" problem mentioned in https://github.com/kubeflow/examples/blob/master/object_detection/export_tf_graph.md.
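For context, the "version" hack referenced in export_tf_graph.md stems from TF Serving expecting a model to live under a numeric version subdirectory, while the object detection exporter writes `saved_model/` directly. A minimal sketch of the restructuring (all paths below are made up for the demo, not taken from the repo):

```shell
# The exporter writes saved_model/ at the top level (stand-in files here).
EXPORT_DIR=/tmp/od_export_demo
mkdir -p "$EXPORT_DIR/saved_model/variables"
touch "$EXPORT_DIR/saved_model/saved_model.pb"   # stand-in for the real graph

# TF Serving expects <model_base>/<numeric_version>/saved_model.pb,
# so copy the export into a version directory (version "1" here).
MODEL_BASE=/tmp/od_model_demo
mkdir -p "$MODEL_BASE/1"
cp -r "$EXPORT_DIR/saved_model/." "$MODEL_BASE/1/"
ls "$MODEL_BASE/1/saved_model.pb"
```

Batch prediction reads the SavedModel directly, which is why it can skip this step.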

ldcastell commented 6 years ago

Agree on having consolidated and cleaner examples. We can easily incorporate GPUs (for training and serving) into the existing object_detection example once all the YAMLs are moved to a ksonnet app/prototype (#178). Same with batch prediction.
P.S. Sorry about all the YAMLs, but since I'm just learning ksonnet, it was faster/easier for me to use YAML.

texasmichelle commented 6 years ago

That all sounds good to me. I updated #175 to reflect the removal of t2t training.

Since it's valuable to have a t2t example, we can replace it with the code we've been using for onstage demos. #191 created for this.

texasmichelle commented 5 years ago

Can this be closed?

jlewi commented 5 years ago

I'm going to close this issue. I reread the thread and looked at the current code and I don't see any immediate work.

IIUC, #154 added a TF Serving example and may not initially have been using the model produced by the training code. But it looks like the instructions now tell users they can use the model they trained: https://github.com/kubeflow/examples/blob/master/object_detection/tf_serving_gpu.md
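Once the trained model is served, a client hits TF Serving's REST predict endpoint. A hedged sketch of building that request (the model name and the exact input signature are assumptions; the object detection SavedModel in the example may expect a different input key):

```python
import base64
import json

# TF Serving's REST predict endpoint has the form:
#   POST http://<host>:8501/v1/models/<model_name>:predict
# Binary tensors (like an encoded image) are passed as base64 strings
# under the reserved "b64" key in the JSON body.
def build_predict_request(image_bytes: bytes) -> str:
    payload = {
        "instances": [
            {"b64": base64.b64encode(image_bytes).decode("utf-8")}
        ]
    }
    return json.dumps(payload)

# Usage (hypothetical): POST the body with any HTTP client, e.g.
#   requests.post(url, data=build_predict_request(open("pet.jpg", "rb").read()))
```

The E2E test in #231 could exercise exactly this path: train, serve, then assert the predict response contains detection boxes.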

I think the next step is to open up separate issues like #231 to add E2E tests to verify we can train a model and then serve it.