lsds / Crossbow

Crossbow: A Multi-GPU Deep Learning System for Training with Small Batch Sizes
Apache License 2.0

Testing & Benchmark #5

Open DEKHTIARJonathan opened 5 years ago

DEKHTIARJonathan commented 5 years ago

Hi,

Very interesting work. I have some remarks:

Thanks for your help, and congrats on this interesting project.

alexandroskoliousis commented 5 years ago

Thanks for your comments. We have indeed used the TensorFlow benchmarks in our evaluation. We have experimented with both replicated and parameter_server (on the GPU) options for variable updates; and nccl for all-reduce.

We will commit our TensorFlow experimental setup in due time. Our docker container should already be sufficient to re-run our ResNet-50 experiments. I will document this shortly too.
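Until that setup is published, the sketch below shows how the two variable-update options mentioned above might map onto a `tf_cnn_benchmarks.py` command line. It is an illustration, not the configuration used in the evaluation: the helper `build_benchmark_command` is hypothetical, the GPU count and batch size are placeholder values, and the flag names are those of the public TensorFlow benchmarks script, which may differ between versions.

```python
import shlex


def build_benchmark_command(variable_update, num_gpus=8, batch_size=32,
                            use_fp16=False, data_dir=None):
    """Assemble a tf_cnn_benchmarks.py command line for ResNet-50.

    Hypothetical helper: num_gpus and batch_size are placeholders,
    not the values used in the Crossbow evaluation.
    """
    if variable_update not in ("replicated", "parameter_server"):
        raise ValueError(f"unsupported variable_update: {variable_update}")

    args = [
        "python", "tf_cnn_benchmarks.py",
        "--model=resnet50",
        f"--num_gpus={num_gpus}",
        f"--batch_size={batch_size}",  # per-GPU batch size
        f"--variable_update={variable_update}",
    ]
    if variable_update == "replicated":
        # Assumption: gradients are aggregated with NCCL all-reduce.
        args.append("--all_reduce_spec=nccl")
    else:
        # Assumption: parameters are kept on the GPU rather than the CPU.
        args.append("--local_parameter_device=gpu")
    if use_fp16:
        args.append("--use_fp16=True")
    if data_dir is not None:
        # Without a data directory the script falls back to synthetic data.
        args += [f"--data_dir={data_dir}", "--data_name=imagenet"]
    return args


if __name__ == "__main__":
    for mode in ("replicated", "parameter_server"):
        print(shlex.join(build_benchmark_command(mode)))
```

Printing the command rather than running it keeps the sketch safe to execute on a machine without TensorFlow or GPUs.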

DEKHTIARJonathan commented 5 years ago

Thanks for your prompt answer ;) I will be very happy to share with you the results I obtain on 8x Tesla V100. I think the paper can be improved by providing more details on what was tested and how, especially because this system targets efficiency and scalability ;) So this part should be as detailed as possible ;)

Thanks a lot once again,

All the best

alexandroskoliousis commented 5 years ago

Experiments with ResNet-50 on 8x V100 certainly align with our course of action - I am about to give it a go. I am more than happy to share this setup with you.

Re: paper improvements, besides the variable update and all-reduce strategy used, what else would you consider missing from the experimental setup? Feel free to ping me with additional comments.

I will leave this issue open to inform you as I make progress with your requests.

DEKHTIARJonathan commented 5 years ago

Thanks for the additional information, much appreciated.

If you want, we can set up a call; that way I can launch experiments with your help on DGX1 & DGX2 (8x Tesla V100 16GB, 8x Tesla V100 32GB, and 16x Tesla V100 32GB).

Make sure that you use ImageNet in TFRecords format; it helps maximise throughput.

Try FP16: it is free performance at no accuracy cost for RN50, and it will put more load on your solution (can it keep up?).

Provide a simple container that we can build or pull easily.

Provide an exact command to reproduce the same setup as yours.


This kind of system can be really interesting; however, you need to compare on all factors:

average CPU load: TF.Distributed/Horovod/Crossbow

average RAM used: TF.Distributed/Horovod/Crossbow

if multi-node, average network speed: TF.Distributed/Horovod/Crossbow

It's quite likely that your approach is much more intensive on CPU/RAM, for example, as you launch more threads ;) So it's important to highlight that point, or people may complain that they can't reproduce your results because they don't have as much RAM as you.

metrics: imgs/sec seems to be the best proxy for measuring throughput.
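As a rough illustration of these measurements, here is a minimal Python sketch that derives imgs/sec from per-step wall-clock times and estimates average CPU utilisation of the training process. The function names, the warm-up handling, and the `run_step` callback are assumptions made for this example; they are not part of Crossbow, TensorFlow, or Horovod.

```python
import time


def images_per_second(step_seconds, global_batch_size, warmup_steps=0):
    """Throughput over the timed steps, ignoring the first warmup_steps."""
    timed = step_seconds[warmup_steps:]
    if not timed:
        raise ValueError("no timed steps left after warm-up")
    total = sum(timed)
    if total <= 0:
        raise ValueError("total step time must be positive")
    return global_batch_size * len(timed) / total


def profile_steps(run_step, num_steps):
    """Call run_step() num_steps times.

    Returns (per-step wall times, average CPU utilisation), where
    utilisation is process CPU time divided by wall time. It can exceed
    1.0 when several threads are busy, and it does not include child
    processes.
    """
    durations = []
    cpu_start = time.process_time()
    wall_start = time.perf_counter()
    for _ in range(num_steps):
        t0 = time.perf_counter()
        run_step()
        durations.append(time.perf_counter() - t0)
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return durations, (cpu / wall if wall > 0 else 0.0)


if __name__ == "__main__":
    # Stand-in for one training step; replace with the real step function.
    durations, cpu_util = profile_steps(lambda: sum(range(100_000)), 20)
    rate = images_per_second(durations, global_batch_size=256, warmup_steps=5)
    print(f"{rate:.1f} imgs/sec, average CPU utilisation {cpu_util:.2f}")
```

Excluding the first few steps avoids counting one-off start-up costs, and reporting the same numbers for each system (TF.Distributed, Horovod, Crossbow) on identical hardware keeps the comparison fair.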

I genuinely think this is really interesting, and want to try it as soon as possible. However, the publication felt a little unclear at first read.