DEKHTIARJonathan opened this issue 5 years ago
Thanks for your comments. We have indeed used the TensorFlow benchmarks in our evaluation. We have experimented with both `replicated` and `parameter_server` (on the GPU) options for variable updates, and `nccl` for all-reduce.
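Concretely, those configurations map onto `tf_cnn_benchmarks` flags roughly as follows. This is a sketch, not our exact invocation — batch size and GPU count are placeholders, and the flag names are the ones documented in tensorflow/benchmarks, so please verify them against your checkout:

```shell
# Replicated variables with NCCL all-reduce (placeholder batch size / GPU count):
python tf_cnn_benchmarks.py \
  --model=resnet50 --num_gpus=8 --batch_size=64 \
  --variable_update=replicated --all_reduce_spec=nccl

# Parameter server with variables kept on the GPU instead:
python tf_cnn_benchmarks.py \
  --model=resnet50 --num_gpus=8 --batch_size=64 \
  --variable_update=parameter_server --local_parameter_device=gpu
```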
We will commit our TensorFlow experimental setup in due time. Our docker container should already be sufficient to re-run our ResNet-50 experiments. I will document this shortly too.
Thanks for your prompt answer ;) I will be very happy to report back the results I obtain on 8x Tesla V100. I think the paper can be improved by providing more details on what was tested and how, especially because this system aims to target efficiency and scalability, so this part should be as detailed as possible.
Thanks a lot once again,
All the best
Experiments with ResNet-50 on 8x V100 certainly align with our course of action - I am about to give it a go. I am more than happy to share this setup with you.
Re: paper improvements, besides the variable update and all-reduce strategy used, what else would you consider missing from the experimental setup? Feel free to ping me with additional comments.
I will leave this issue open to inform you as I make progress with your requests.
Thanks for the additional information, much appreciated.
If you want, we can set up a call so that I can launch experiments with your help on DGX-1 and DGX-2 (8x Tesla V100 16GB, 8x Tesla V100 32GB, and 16x Tesla V100 32GB).
Make sure that you use the ImageNet TFRecords format; it allows you to maximise throughput. Trying FP16 is free performance at no accuracy cost on RN50, and it will put more load on your solution (can it keep up?). It would also help to have a simple container that we can build/pull easily, and an exact command to reproduce the same setup as you did.
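For example, a hypothetical `tf_cnn_benchmarks` command combining TFRecords input and FP16 — the data path and batch size are placeholders, and the flag names come from tensorflow/benchmarks, so check them against your checkout:

```shell
# Hypothetical invocation: ImageNet TFRecords input plus FP16 compute.
python tf_cnn_benchmarks.py \
  --model=resnet50 --num_gpus=8 --batch_size=256 \
  --data_name=imagenet --data_dir=/data/imagenet-tfrecords \
  --use_fp16 \
  --variable_update=replicated --all_reduce_spec=nccl
```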
This kind of system can really be interesting; however, you need to compare on all factors:

- average CPU load: TF.Distributed / Horovod / Crossbow
- average RAM used: TF.Distributed / Horovod / Crossbow
- if multi-node, average network speed: TF.Distributed / Horovod / Crossbow

It's quite likely that your approach is much more intense on CPU/RAM, for example, since you launch more threads ;) So it's important to highlight that point, or people may report that they can't reproduce your results because they don't have as much RAM as you.

Metrics: imgs/sec seems the best performance proxy to measure throughput.
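To be explicit about that proxy, imgs/sec is just total images processed divided by wall-clock time. A minimal helper (the function name is mine, not from the paper or the benchmarks):

```python
def images_per_sec(num_steps, batch_size, elapsed_s):
    """Throughput proxy: images processed per wall-clock second."""
    if elapsed_s <= 0:
        raise ValueError("elapsed time must be positive")
    return num_steps * batch_size / elapsed_s

# Example: 100 steps at batch size 256 in 20 s -> 1280.0 imgs/sec
print(images_per_sec(100, 256, 20.0))
```

When comparing systems, the elapsed time should cover steady-state training steps only (excluding warm-up), or the number will flatter whichever system starts fastest.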
I genuinely think this is really interesting, and want to try it as soon as possible. However, the publication felt a little unclear at first read.
Hi,
Very interesting work. I have some remarks:
In your paper, you often say "compared to TensorFlow" — which distribution strategy are you talking about? xring? nccl?
I guess you ran the experiment on vanilla TF; could we see the code you used to collect these numbers? Btw. if I understood correctly, you used an unofficial implementation that is not the best one available; comparing against https://github.com/tensorflow/benchmarks would be a lot more interesting. It offers different tf.distributed strategies + Horovod.
And btw. if we could have a docker container and a script to test your results with Crossbow, that would be interesting ;) => RN50 seems a decent benchmark, as you pointed out ;)
Thanks for your help, and congrats on this interesting project.