tensorflow / benchmarks

A benchmark framework for Tensorflow
Apache License 2.0

Need for a better focus on details. #27

Open ghost opened 7 years ago

ghost commented 7 years ago

This issue can be taken as a feature request or a request related to documentation. The high-performance benchmarking example is a good effort. However, the code is very fused (it combines the distributed and multi-GPU examples in the same setting!). Moreover, the code is not properly documented, and there is little to no information available on the StagingArea ops and how to use them.

It would be worthwhile to improve the related documentation and the clarity of the code. We are currently working on very high-performance training code but are significantly hindered by these drawbacks.
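As far as we can tell, the core pattern is just a put/get pair wrapped around the compute step, roughly like the minimal sketch below. This is our own guess at basic usage, not the benchmark's actual code; the fake input tensors, batch size, and single-GPU device are placeholders.

import tensorflow as tf
from tensorflow.python.ops import data_flow_ops

# Minimal StagingArea sketch: `put` stages the next batch on the GPU while
# `get` feeds the current step, overlapping host->device copies with compute.
# The random tensors below stand in for a real input pipeline.
images = tf.random_uniform([64, 224, 224, 3])
labels = tf.random_uniform([64], maxval=1000, dtype=tf.int32)

with tf.device('/gpu:0'):
    stage = data_flow_ops.StagingArea(dtypes=[images.dtype, labels.dtype])
    stage_op = stage.put([images, labels])   # copy the next batch onto the GPU
    gpu_images, gpu_labels = stage.get()     # consume a previously staged batch
    loss = tf.reduce_mean(gpu_images)        # placeholder for a real model/loss

config = tf.ConfigProto(allow_soft_placement=True)
with tf.Session(config=config) as sess:
    sess.run(stage_op)                       # warm-up: stage the first batch
    for _ in range(10):
        # Running the step and the next `put` together lets the copy for
        # step N+1 overlap with the compute for step N.
        sess.run([loss, stage_op])

What we cannot find documented is how this interacts with the multi-GPU and distributed paths, which is exactly where the fused code makes things hard to follow.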

ppwwyyxx commented 7 years ago

I have some code in tensorpack that does (steals) the same optimizations as this benchmark, but separates the logic of loading data, defining the model, and updating variables. It might help in understanding what's going on in the benchmark.

For example, the 3 variable-update methods (replicated, parameter_server, distributed_replicated) are implemented as 3 separate tensorpack trainers sharing the same interface (defined in training.py and distributed.py).
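To make the separation concrete, the structure looks roughly like the sketch below; the class and method names are simplified for illustration and are not the actual tensorpack API.

# Illustration only. The input source, the model definition, and the
# variable-update strategy are separate pieces; only the last one differs
# between trainers.
class Trainer(object):
    """Common interface shared by all variable-update strategies."""
    def __init__(self, input_source, model_fn):
        self.input_source = input_source   # e.g. a queue- or StagingArea-backed pipeline
        self.model_fn = model_fn           # builds the graph for one tower

    def train(self, num_steps):
        raise NotImplementedError

class ReplicatedTrainer(Trainer):             # ~ --variable_update=replicated
    def train(self, num_steps):
        pass  # each GPU keeps a copy of the variables; updates are aggregated

class ParameterServerTrainer(Trainer):        # ~ --variable_update=parameter_server
    def train(self, num_steps):
        pass  # variables live on a parameter device (CPU or one GPU)

class DistributedReplicatedTrainer(Trainer):  # ~ --variable_update=distributed_replicated
    def train(self, num_steps):
        pass  # replicated within a worker, synchronized across workers via PS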

tfboyd commented 7 years ago

@ujjwal-researcher I really do completely understand. Sorry for the slow response. @ppwwyyxx's tensorpack is pretty cool; I found myself over there the other day. Unrelated to tensorpack, the go-forward input pipeline will be based on Datasets, which is in contrib and will quickly move to core. The goal is to quickly get Datasets to match the performance of the tf_cnn_benchmarks input pipeline.

I personally use Datasets, although I do not do "cool stuff", and I find it to be a much better interface than queues; performance has generally been better than queues as well. Datasets has one "knob", which is setting num_threads and the buffer size, but I have not found that too hard to tweak.

Here is a full example of training ResNet with CIFAR-10. For ImageNet you would use a TFRecord reader; I do not have an example of that off the top of my head.
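A TFRecord version would look roughly like the sketch below. This is not the benchmark's code; the file pattern, feature names, image size, and tuning values are placeholders, and on older TensorFlow versions the Dataset API lives under tf.contrib.data rather than tf.data.

import tensorflow as tf

# Hypothetical sketch of a TFRecord input pipeline using the Dataset API.
def parse_example(serialized):
    # Assumes each record holds a JPEG-encoded image and an integer label.
    features = tf.parse_single_example(serialized, {
        'image/encoded': tf.FixedLenFeature([], tf.string),
        'image/class/label': tf.FixedLenFeature([], tf.int64),
    })
    image = tf.image.decode_jpeg(features['image/encoded'], channels=3)
    image = tf.image.resize_images(image, [224, 224])
    return image, features['image/class/label']

filenames = tf.gfile.Glob('/path/to/train-*')               # placeholder pattern
dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.shuffle(buffer_size=10000)                # one "buffer size" knob
dataset = dataset.map(parse_example, num_parallel_calls=4)  # the "num_threads" knob
dataset = dataset.batch(64)
dataset = dataset.prefetch(buffer_size=2)                   # keep batches ready ahead of the model
images, labels = dataset.make_one_shot_iterator().get_next()

In practice the only tuning is the num_parallel_calls and buffer_size values above.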

In A/B testing I found Datasets to be faster than queues when moving to multi-GPU, and normally equal or slightly faster with one GPU.

qinglintian commented 7 years ago

@tfboyd I'm a little confused by the info below, which I found on the TensorFlow benchmarks site. I'm putting the quick question here as I believe it is related to "details". It's about the detailed settings for distributed training, and it says:

"To simplify server setup, EC2 instances (p2.8xlarge) running worker servers also ran parameter servers."

What does that mean? How should I run the script? Currently, I'm doing distributed training by invoking the script in the following way:

python tf_cnn_benchmarks.py --model=resnet_50 --variable_update=distributed_replicated --train_dir=some_path --job_name=ps/worker --task_index=0/1 --ps_hosts=ps_ip:port --worker_hosts=worker0_ip:port,worker1_ip:port

But the performance I am getting is far worse than single-GPU training. According to "EC2 instances (p2.8xlarge) running worker servers also ran parameter servers", should I also start another process on each machine?

Thank you.

tfboyd commented 7 years ago

It means I ran the parameter servers on the worker servers rather than having them on separate machines.

My example has a lot of typos that I need to fix (and I am very sorry); our content management system is awful, and it often takes me 30-60 days to make a change. Below is what should have been included on the performance models page as an example for distributed training.

Below is an example of training ResNet-50 on 2 hosts: host_0 (10.0.0.1) and host_1 (10.0.0.2). The example uses synthetic data. To use real data, pass the --data_dir argument. Each host runs two processes: a worker that uses the GPUs, and a parameter server that is pinned to the CPU.

# Run the following commands on host_0 (10.0.0.1):
# Worker process (uses the GPUs):
python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=worker --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=0

# Parameter server process on the same host, pinned to the CPU
# (CUDA_VISIBLE_DEVICES='' keeps it off the GPUs):
CUDA_VISIBLE_DEVICES='' python tf_cnn_benchmarks.py --local_parameter_device=cpu \
--job_name=ps --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=0

# Run the following commands on host_1 (10.0.0.2):
# Worker process (uses the GPUs):
python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=worker --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=1

# Parameter server process on the same host, pinned to the CPU
# (CUDA_VISIBLE_DEVICES='' keeps it off the GPUs):
CUDA_VISIBLE_DEVICES='' python tf_cnn_benchmarks.py --local_parameter_device=cpu \
--job_name=ps --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=1

qinglintian commented 7 years ago

Thank you for your prompt reply!