Extending training example for model loading and prediction

mkgray commented 5 years ago

The mnist_distributed example currently provides an example for how to train a model with persistence in the checkpoint_dir specified, however this example does not go further to show how to load and use the model for inference/prediction. It would be highly beneficial to extend this example to show how this model can be deployed for prediction purposes.

oliverhu commented 5 years ago

@mkgray do you mean offline inference or online inference?

mkgray commented 5 years ago

@oliverhu ideally online inference. Admittedly I'm not well versed in core Tensorflow itself and I'm working on that; but the biggest issue facing me as a generalized user is a generalized example consisting of two parts:

Part 1: Training and Retraining from Checkpoints Loads data from a cloud provider or points to a distributed set of files, and trains a deep learning model on these files with model saved in the cloud. As added bonus, this file should be capable of being rerun to load the checkpointed model and further train on what had previously been output if run before.

Part 2: Inference Loads the model as stored in the cloud, with data for inference stored in the cloud. Output is a set of predictions on the data also stored in the cloud.

While the current MNIST example provides a large portion of this functionality; for me personally, I've spent the majority of my time investigating how the MonitoredTrainingSession and CheckPointSaverHook objects work in Tensorflow, as I've been trying to find a way of loading and running inference on a saved model on my local machine.

My next step is to tie into Cloud data, which expands on the input side of this example. While MNIST is a great starter example for CNN's and shows basic functionality of the program as applied in TonY, I believe a generalized example of using partitioned cloud data is more true to the usage of TonY. An ideal simple scenario in my mind which might be manageable as an example, is to create a test DataFrame (Fisher Iris should more than suffice), point a DNN to a subset of columns within the DataFrame as features and train through TonY.

It's possible that these issues I'm listing are easily overlooked by someone with in depth Tensorflow experience, as I believe that I can find out how to solve them given enough time spent in Tensorflow, but I strongly believe this package has significant potential and by removing these barriers to entry for the end user it may help drive adoption of the project.

Let me know your thoughts, or point me in the direction of any materials I can read up on if it's easier to integrate than I am thinking, and I've overlooked something.

oliverhu commented 5 years ago

That makes sense. @gogasca is this complicated example something we could provide a codelab in Google Cloud?

For inference, I think your example is an offline scoring, it is possible and I agree we should have a sample of that, we will get an example up for that.

mkgray commented 5 years ago

@oliverhu you're correct about the offline scoring and I mixed up my definitions, my apologies. I meant to refer to offline batch scoring of data, which happened to be stored on cloud. Thank you

gogasca commented 5 years ago

@oliverhu Yes, we can provide an example in how to do this in cloud (Training+Inference) via codelab/Colab notebook.

One of the additions to MNIST sample coming are the following:

When creating a Dataproc cluster, we can write the MNIST model results in GCS (Compared to /tmp). Once your SavedModel is in GCS we can perform inferences using separate infrastructure (GCE or Cloud ML Engine) Here is a sample in how to use a Saved model and predictions in < 5 minutes. https://cloud.google.com/blog/products/ai-machine-learning/running-tensorflow-inference-workloads-at-scale-with-tensorrt-5-and-nvidia-t4-gpus https://github.com/GoogleCloudPlatform/ml-on-gcp/blob/master/dlvm/tools/scripts/setup.sh
Support for new TF version 1.13 (today we release 1.13rc2, so we are couple of weeks of having a GA version), MNIST will get some improvements. (TensorBoard changes, use of tf.data)
@mkgray Feel free to send a PR with other samples to increase our catalog for TonY. Are you looking for Distributed training in specific?

mkgray commented 5 years ago

@gogasca I was under the impression that both training and inference were capable of being handled in Dataproc, but I see you are suggesting that inference be performed on a separate architecture.

If my understanding of your documents is correct, I can see the advantages of hosting a couple complex models across a small set of GPU enabled nodes via REST API call; but my original plan is to distribute the pretrained model across all nodes in Spark for inference as part of a Spark job. Is there any reason this would not work using TonY in its current state, and if TonY were to support what I am looking for, is there any reason you seem to not recommend running inference as part of a Spark job using Dataproc?

erwa commented 5 years ago

Hi @mkgray ,

Just to clarify, you can't launch TonY jobs from Spark. TonY jobs run their own TonY Application Master that orchestrates the job, just as Spark has its own Spark AM.

You could run an inference job using TonY if you write a TensorFlow job that loads and serves the model. If you wanted to do it through Spark, you'd have to use a library (e.g.: https://github.com/databricks/spark-deep-learning) to load the TensorFlow models and integrate with an existing Spark job.

tony-framework / TonY

Extending training example for model loading and prediction #186