carpedm20 opened 7 years ago
Yeah, I agree, there aren't any good examples of distributed training out there. We use a slightly different configuration for distributed training internally (based on the same code), so I haven't actually run distributed training on the open-source version myself; I just know that it should work.
I'll need to spend a few days writing up a guide for that. I think it's pretty high priority, so I'll try to get to it soon.
Is there any update on when this guide will be ready? Thanks for your effort!
Is this guide available somewhere? Thanks!
After reading some of the code, it's hard to fully understand how distributed training works here. I guess `Experiment` is a wrapper that handles distributed learning, but I'm not sure, because the example scripts don't include a command for distributed training, e.g. using 8 parameter servers as described in the paper (correct me if I'm wrong). Distributed TensorFlow code usually has keywords like `ps` and `worker`, but I can't find those anywhere in this repo. Can you clarify this?
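For reference, this is the kind of explicit `ps`/`worker` setup I'd expect to see. A minimal sketch assuming the plain TF 1.x `tf.train` API, not this repo's actual entry point; hostnames, ports, and flag values are hypothetical:

```python
import tensorflow as tf

# Hypothetical single-machine cluster: one parameter server, one worker.
cluster = tf.train.ClusterSpec({
    "ps": ["localhost:2222"],
    "worker": ["localhost:2223"],
})

# In a real deployment each process is launched with its own role,
# e.g. python train.py --job_name=worker --task_index=0
job_name, task_index = "worker", 0
server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

if job_name == "ps":
    server.join()  # parameter servers just host variables and block
else:
    # replica_device_setter places variables on /job:ps and ops locally.
    with tf.device(tf.train.replica_device_setter(cluster=cluster)):
        global_step = tf.train.get_or_create_global_step()
        x = tf.random_normal([32, 10])  # toy model standing in for the real one
        w = tf.get_variable("w", [10, 1])
        loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))
        train_op = tf.train.GradientDescentOptimizer(0.1).minimize(
            loss, global_step=global_step)

    with tf.train.MonitoredTrainingSession(
            master=server.target,
            is_chief=(task_index == 0)) as sess:
        while not sess.should_stop():
            sess.run(train_op)
```

My understanding is that `Experiment`-style code avoids these explicit strings because the cluster and task role come from the `TF_CONFIG` environment variable instead, which may be why they don't appear in the scripts.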
By the way, there are lots of great snippets here that are usually hard to find in TensorFlow repos. In particular, the use of hooks looks pretty useful for profiling and sampling without hurting training. Thanks for the great work!
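For anyone else reading, here's a minimal sketch of that hook pattern, assuming the TF 1.x `tf.train.SessionRunHook` API. The timing logic is illustrative, not this repo's actual hook:

```python
import time

import tensorflow as tf

class StepTimerHook(tf.train.SessionRunHook):
    """Logs the wall-clock time of every Nth training step."""

    def __init__(self, every_n_steps=100):
        self._every_n = every_n_steps
        self._step = 0
        self._start = None

    def before_run(self, run_context):
        self._start = time.time()
        return None  # no extra fetches requested

    def after_run(self, run_context, run_values):
        self._step += 1
        if self._step % self._every_n == 0:
            tf.logging.info("step %d took %.3fs",
                            self._step, time.time() - self._start)
```

A hook like this can be passed via `hooks=[StepTimerHook()]` to `tf.train.MonitoredTrainingSession`, so profiling runs alongside training without touching the model code at all.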