microsoftarchive / BatchAI

Repo for publishing code samples and CLI samples for the Batch AI service

How to run Distributed TensorFlow on CPUs with different nodes assigned to PS and Workers #54

Closed awan-10 closed 6 years ago

awan-10 commented 6 years ago

Thank you for providing the TensorFlow examples. I have two questions.

  1. Is there a way to set the node count for PS and Workers independently?

I am trying to use CPU-based training, so I don't want the parameter servers to share the resources of a worker. Currently, Batch AI assigns PS and workers to the same set of nodes, sized by the workerCount parameter in the job.json file: it uses port :2223 for the PS tasks and port :2222 for the worker tasks, but the node itself is shared (see the sketch after this list). Is there a way to decouple this?

  2. Are the GPU recipes portable to CPU-based runs, or do we need to modify the code? The only change I found was specifying the TensorFlow image in job.json, and things seem to work fine.
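
To illustrate question 1, here is a minimal sketch of the colocated layout I am describing (the node IPs are hypothetical placeholders, not anything Batch AI actually prints; the real node list is provided by the service at runtime):

```python
import tensorflow as tf  # TF 1.x, as used by the Batch AI recipes

# Hypothetical node IPs for illustration only.
hosts = ["10.0.0.4", "10.0.0.5"]

# Every node ends up hosting a worker task on :2222 and a PS task on :2223,
# so each parameter server shares its node's CPU with a worker.
cluster = tf.train.ClusterSpec({
    "worker": ["%s:2222" % h for h in hosts],
    "ps": ["%s:2223" % h for h in hosts],
})

# Each task then starts its own in-process server, e.g. for worker 0:
# server = tf.train.Server(cluster, job_name="worker", task_index=0)
```

What I would like instead is for the "ps" entries to live on separate nodes from the "worker" entries.
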
llidev commented 6 years ago
  1. Currently, we do not support dedicated nodes for PS, but we are thinking about adding this feature in a future release.
  2. Yes, on CPU nodes Batch AI will automatically switch to CPU mode if you are using a non-GPU Docker container.
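
As a small standalone illustration of point 2 (a sketch assuming a TF 1.x script, not code taken from the recipes): with soft placement enabled, ops pinned to a GPU device fall back to the CPU when the container has no GPU, which is part of why the GPU recipes tend to run unchanged once job.json points at a CPU TensorFlow image.

```python
import tensorflow as tf  # TF 1.x

# allow_soft_placement lets ops pinned to a GPU fall back to the CPU
# when no GPU device is available in the container.
config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)

with tf.device("/gpu:0"):
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    b = tf.matmul(a, a)

with tf.Session(config=config) as sess:
    print(sess.run(b))  # executes on CPU in a non-GPU container
```
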
awan-10 commented 6 years ago

@lliimsft - Thank you for this information.