microsoftarchive / BatchAI

Repo for publishing code samples and CLI samples for the Batch AI service

How to run Distributed TensorFlow on CPUs with different nodes assigned to PS and Workers #54

Closed awan-10 closed 6 years ago

awan-10 commented 6 years ago

Thank you for providing the TensorFlow examples. I have two questions.

  1. Is there a way to set the node count for PS and Workers independently?

I am trying to use CPU-based training, so I don't want the parameter servers to share the resources of a worker. Currently, Batch AI assigns PS and workers to the same set of nodes, sized by the workerCount parameter in the job.json file: it uses port :2223 for the PS tasks and port :2222 for the worker tasks, but the node itself is shared (see the sketch after this list). Is there a way to decouple this?

  2. Are the GPU recipes portable to CPU-based runs, or do we need to modify the code? The only change I found was specifying the TensorFlow image in job.json, and things seem to work fine.
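
To illustrate question 1, here is a minimal sketch of the colocated layout I am describing (the node IPs are hypothetical placeholders, not anything Batch AI actually prints; the real node list is provided by the service at runtime):

```python
import tensorflow as tf  # TF 1.x, as used by the Batch AI recipes

# Hypothetical node IPs for illustration only.
hosts = ["10.0.0.4", "10.0.0.5"]

# Every node ends up hosting a worker task on :2222 and a PS task on :2223,
# so each parameter server shares its node's CPU with a worker.
cluster = tf.train.ClusterSpec({
    "worker": ["%s:2222" % h for h in hosts],
    "ps": ["%s:2223" % h for h in hosts],
})

# Each task then starts its own in-process server, e.g. for worker 0:
# server = tf.train.Server(cluster, job_name="worker", task_index=0)
```

What I would like instead is for the "ps" entries to live on separate nodes from the "worker" entries.
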
llidev commented 6 years ago
  1. Currently, we do not support dedicated nodes for PS, but we are thinking about adding this feature in a future release.
  2. Yes, on CPU nodes Batch AI will automatically switch to CPU mode if you are using a non-GPU Docker container.
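
As a small standalone illustration of point 2 (a sketch assuming a TF 1.x script, not code taken from the recipes): with soft placement enabled, ops pinned to a GPU device fall back to the CPU when the container has no GPU, which is part of why the GPU recipes tend to run unchanged once job.json points at a CPU TensorFlow image.

```python
import tensorflow as tf  # TF 1.x

# allow_soft_placement lets ops pinned to a GPU fall back to the CPU
# when no GPU device is available in the container.
config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)

with tf.device("/gpu:0"):
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    b = tf.matmul(a, a)

with tf.Session(config=config) as sess:
    print(sess.run(b))  # executes on CPU in a non-GPU container
```
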
awan-10 commented 6 years ago

@lliimsft - Thank you for this information.