Closed: agitter closed this issue 1 year ago
As far as I can tell, the distributed TensorFlow video assumes you have hostnames and ports available for direct communication between machines. That is not possible in the CHTC pool. It may be possible on AWS or Cooley.
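To make the constraint concrete, here is a minimal sketch of the kind of cluster definition distributed TensorFlow expects (the hostnames and ports are placeholders I made up, not real CHTC machines; the TensorFlow import is omitted since only the shape of the mapping matters here):

```python
# Hypothetical cluster layout for illustration only.
# tf.train.ClusterSpec(cluster_def) would consume exactly this mapping,
# so every process must be able to open a TCP connection to each
# host:port pair listed -- the direct-addressing requirement CHTC
# cannot satisfy.
cluster_def = {
    "ps": ["node0.example.edu:2222"],
    "worker": ["node1.example.edu:2222",
               "node2.example.edu:2222"],
}

def all_addresses(cluster):
    """Flatten the job -> address mapping into one list of endpoints."""
    return [addr for addrs in cluster.values() for addr in addrs]

print(all_addresses(cluster_def))
```

Every entry in that flattened list has to be reachable from every other task, which is why this setup maps cleanly onto AWS or Cooley but not onto the CHTC pool.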
We do not know how to train a CNN on all of the images. One option would be to move all the data to Amazon and use a single multi-GPU instance, but that would likely be cost-prohibitive.
If we are forced to use Cooley or CHTC GPUs, we will need to think about how HTCondor can coordinate the training. HTCondor has a master-worker framework that may be relevant, and CHTC is offering training on it soon.
This caught my attention because distributed TensorFlow also uses a master-worker organization, though there may be many ways to set this up in TensorFlow.
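For reference, a rough sketch of how the roles break down in TensorFlow's master-worker (between-graph replication) setup. This is only an assumption-laden outline using the TF 1.x job/task naming; real code would call `tf.train.Server(cluster, job_name, task_index)` for each task, and no TensorFlow is imported here:

```python
# Hypothetical two-worker cluster; names and ports are placeholders.
cluster = {
    "ps": ["ps0:2222"],                           # parameter server holds shared weights
    "worker": ["worker0:2222", "worker1:2222"],   # workers compute gradients
}

def role(job_name, task_index):
    """Describe what a given task does in between-graph replication."""
    if job_name == "ps":
        return "serve shared model parameters"
    if job_name == "worker" and task_index == 0:
        return "chief worker: coordinate checkpoints and run training steps"
    return "worker: run training steps against the ps"

for job, addrs in cluster.items():
    for i, _ in enumerate(addrs):
        print(job, i, "->", role(job, i))
```

If HTCondor's master-worker framework can launch and track these tasks, it might fill the coordination role that the direct host:port setup otherwise assumes.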