redpanda-ai / Meerkat

Used for the Meerkat project

TensorFlow distributed training #666

Open redpanda-ai opened 8 years ago

redpanda-ai commented 8 years ago

TensorFlow v0.8 offers a way to train in parallel. I would like to not only do this but place a web service in front of it.

Here are some useful resources:

  1. Distributed TensorFlow How-to

_Definition of Done_

  1. Clean this up as a generic piece of software where multiple CNN architectures can be passed in.
  2. Write a POC that trains a single model using at least 2 GPU instances working collaboratively.
  3. Write a prototype web service to launch this.
  4. Verify that the results are comparable to what we can build with a single GPU.
  5. Measure the speed increase for multiple instances to see what sort of scaling we discover (it looks nearly linear according to Google's Research Blog).

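For the scaling measurement, a minimal sketch of how the comparison could be reported, assuming we record wall-clock training time per instance count ourselves. The helper name `scaling_report` and its input format are hypothetical and not part of Meerkat or TensorFlow. An efficiency near 1.0 would correspond to the "nearly linear" scaling mentioned above.

```python
def scaling_report(seconds_by_instances):
    """Summarize scaling from measured training times.

    seconds_by_instances: dict mapping instance count -> wall-clock seconds
    for the same training job (hypothetical input, measured by us).
    Returns a dict mapping instance count -> (speedup, efficiency), both
    relative to the smallest instance count that was measured.
    """
    base_instances = min(seconds_by_instances)
    base_seconds = seconds_by_instances[base_instances]

    report = {}
    for instances, seconds in sorted(seconds_by_instances.items()):
        # Speedup: how many times faster than the baseline run.
        speedup = base_seconds / seconds
        # Ideal (perfectly linear) speedup for this instance count.
        ideal = instances / base_instances
        # Efficiency: fraction of the ideal speedup actually achieved.
        report[instances] = (speedup, speedup / ideal)
    return report
```

The input would be filled with our own timings, e.g. one entry for the single-GPU baseline and one per multi-instance run. Comparing result quality against the single-GPU model is a separate check and is not covered by this helper.
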
speakerjohnash commented 8 years ago

+1