tensorflow / tensorflow

An Open Source Machine Learning Framework for Everyone
https://tensorflow.org
Apache License 2.0
186.1k stars 74.27k forks

Isn't distributed TensorFlow inefficient and unfriendly for model training? #8344

Closed jacob1017 closed 7 years ago

jacob1017 commented 7 years ago

Deploying TensorFlow on our cluster is really inefficient, in my opinion. As the official tutorial shows, you have to run your program file on the nodes once for every task you launch.
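To make the per-task launch concrete, here is a minimal sketch that only builds the command lines, one per task. The host names, the ports, and the script name `trainer.py` are placeholders, and the flag names (`--ps_hosts`, `--worker_hosts`, `--job_name`, `--task_index`) are assumed to match the ones your training script parses, as in the distributed tutorial's example.

```python
# Hypothetical cluster layout: host names and ports are placeholders.
CLUSTER = {
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
}


def launch_commands(cluster, script="trainer.py"):
    """Return one (host, command) pair per task in the cluster.

    The script name and flag names are assumptions; adjust them to
    whatever your own training program expects.
    """
    ps_hosts = ",".join(cluster.get("ps", []))
    worker_hosts = ",".join(cluster.get("worker", []))
    commands = []
    for job_name in sorted(cluster):
        for task_index, address in enumerate(cluster[job_name]):
            host = address.split(":")[0]
            command = (
                f"python {script}"
                f" --ps_hosts={ps_hosts}"
                f" --worker_hosts={worker_hosts}"
                f" --job_name={job_name} --task_index={task_index}"
            )
            commands.append((host, command))
    return commands


if __name__ == "__main__":
    # Each printed command has to be started by hand (or by a scheduler)
    # on the corresponding host.
    for host, command in launch_commands(CLUSTER):
        print(f"[{host}] {command}")
```

With one parameter server and two workers this yields three separate commands, each to be started on a different node, which is the manual step I am complaining about.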

Our developers hope TensorFlow will be our Hadoop for deep learning. Launching a job in Hadoop is more convenient: you just execute your job command once.

Maybe I am not using this framework correctly. If you have any good ideas for this, we can discuss a nice solution and make our world beautiful.

drpngx commented 7 years ago

@jhseu maybe there is an ecosystem answer?

@rhaertel80 for CloudML.

jhseu commented 7 years ago

We consider job setup not to be a responsibility of TensorFlow core. It's more suited to other things, like Kubernetes, Mesos, or Spark.

Take a look at TensorFlow on Spark if you already have a Spark cluster: https://github.com/yahoo/TensorFlowOnSpark

Or if you're running Kubernetes, you can use a configuration from https://github.com/tensorflow/ecosystem

jacob1017 commented 7 years ago

OK, this sounds reasonable, thanks @jhseu

rhaertel80 commented 7 years ago

If you're open to using cloud technologies, Google Cloud Machine Learning Engine is a managed service that automatically brings up and takes down nodes for running your TensorFlow jobs. For an example of how simple it can be to launch your distributed TensorFlow job, see the quickstart.