carpedm20 opened 7 years ago
Yeah, I agree, there aren't any good examples of distributed training out there. We use a slightly different configuration for distributed training internally (based on the same code), so I haven't actually run distributed training on the open-source version myself; I just know that it should work.
I'll need to spend a few days writing up a guide for that. I think it's pretty high priority, so I'll try to get to it soon.
Is there any update on when this guide will be ready? Thanks for your effort!
Is this guide available somewhere? Thanks!
After reading some of the code, it's hard to fully understand how distributed training works here. I guess `Experiment` is a wrapper that handles distributed learning, but I'm not sure, because the example scripts don't include a command for distributed training, e.g. using 8 parameter servers as described in the paper (correct me if I'm wrong). Distributed TensorFlow code usually has keywords like `ps` and `worker`, but I can't find those anywhere in this repo. Can you clarify this?
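For reference, this is the kind of explicit `ps`/`worker` setup I'd expect to see. A minimal sketch assuming the plain TF 1.x `tf.train` API, not this repo's actual entry point; hostnames, ports, and flag values are hypothetical:

```python
import tensorflow as tf

# Hypothetical single-machine cluster: one parameter server, one worker.
cluster = tf.train.ClusterSpec({
    "ps": ["localhost:2222"],
    "worker": ["localhost:2223"],
})

# In a real deployment each process is launched with its own role,
# e.g. python train.py --job_name=worker --task_index=0
job_name, task_index = "worker", 0
server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

if job_name == "ps":
    server.join()  # parameter servers just host variables and block
else:
    # replica_device_setter places variables on /job:ps and ops locally.
    with tf.device(tf.train.replica_device_setter(cluster=cluster)):
        global_step = tf.train.get_or_create_global_step()
        x = tf.random_normal([32, 10])  # toy model standing in for the real one
        w = tf.get_variable("w", [10, 1])
        loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))
        train_op = tf.train.GradientDescentOptimizer(0.1).minimize(
            loss, global_step=global_step)

    with tf.train.MonitoredTrainingSession(
            master=server.target,
            is_chief=(task_index == 0)) as sess:
        while not sess.should_stop():
            sess.run(train_op)
```

My understanding is that `Experiment`-style code avoids these explicit strings because the cluster and task role come from the `TF_CONFIG` environment variable instead, which may be why they don't appear in the scripts.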
By the way, there are lots of great snippets here that are usually hard to find in TensorFlow repos. In particular, the use of hooks looks pretty useful for profiling and sampling without hurting training. Thanks for the great work!
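For anyone else reading, here's a minimal sketch of that hook pattern, assuming the TF 1.x `tf.train.SessionRunHook` API. The timing logic is illustrative, not this repo's actual hook:

```python
import time

import tensorflow as tf

class StepTimerHook(tf.train.SessionRunHook):
    """Logs the wall-clock time of every Nth training step."""

    def __init__(self, every_n_steps=100):
        self._every_n = every_n_steps
        self._step = 0
        self._start = None

    def before_run(self, run_context):
        self._start = time.time()
        return None  # no extra fetches requested

    def after_run(self, run_context, run_values):
        self._step += 1
        if self._step % self._every_n == 0:
            tf.logging.info("step %d took %.3fs",
                            self._step, time.time() - self._start)
```

A hook like this can be passed via `hooks=[StepTimerHook()]` to `tf.train.MonitoredTrainingSession`, so profiling runs alongside training without touching the model code at all.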