rioyokotalab / caffe2

Caffe2 is a lightweight, modular, and scalable deep learning framework.
https://caffe2.ai

Tips for Caffe2 ResNet50 Distributed Training #10

Closed Hiroki11x closed 7 years ago

Hiroki11x commented 7 years ago

Run the script below:

#!/bin/bash
for i in {0..3}
do

bsub \
-e error_file.log \
-o output_file.log \
-R rusage[ngpus_shared=4] \
-q excl \
python ${CAFFE2_HOME}/caffe2/python/examples/resnet50_trainer.py \
--train_data /path-to/ILSVRC2012_img_10_categories/ilsvrc12_train_lmdb \
--gpus 0,1,2,3 \
--batch_size 128 \
--num_labels 10 \
--epoch_size 10240 \
--num_epochs 10 \
--num_shards 4 \
--shard_id $i \
--redis_host XXXXXX --redis_port 6379

done

The jobs fail with the following errors:

INFO:resnet50_trainer:Running on GPUs: [0, 1, 2, 3]
INFO:resnet50_trainer:Using epoch size: 10240
Traceback (most recent call last):
File "/path-to/caffe2/caffe2/python/examples/resnet50_trainer.py", line 462, in <module>
main()
File "/path-to/caffe2/caffe2/python/examples/resnet50_trainer.py", line 458, in main
Train(args)
File "/path-to/caffe2/caffe2/python/examples/resnet50_trainer.py", line 236, in Train
prefix=args.run_id,
File "/path-to/caffe2/build/caffe2/python/core.py", line 324, in CreateOperator
operator.arg.add().CopyFrom(utils.MakeArgument(key, value))
File "/path-to/caffe2/build/caffe2/python/utils.py", line 128, in MakeArgument
key, value, type(value)
ValueError: Unknown argument type: key=prefix value=None, value type=<type 'NoneType'>
INFO:resnet50_trainer:Running on GPUs: [0, 1, 2, 3]
INFO:resnet50_trainer:Using epoch size: 10240
Traceback (most recent call last):
File "/path-to/caffe2/caffe2/python/examples/resnet50_trainer.py", line 462, in <module>
main()
File "/path-to/caffe2/caffe2/python/examples/resnet50_trainer.py", line 458, in main
Train(args)
File "/path-to/caffe2/caffe2/python/examples/resnet50_trainer.py", line 236, in Train
prefix=args.run_id,
File "/path-to/caffe2/build/caffe2/python/core.py", line 324, in CreateOperator
operator.arg.add().CopyFrom(utils.MakeArgument(key, value))
File "/path-to/caffe2/build/caffe2/python/utils.py", line 128, in MakeArgument
key, value, type(value)
ValueError: Unknown argument type: key=prefix value=None, value type=<type 'NoneType'>
INFO:resnet50_trainer:Running on GPUs: [0, 1, 2, 3]
INFO:resnet50_trainer:Using epoch size: 10240
Traceback (most recent call last):
File "/path-to/caffe2/caffe2/python/examples/resnet50_trainer.py", line 462, in <module>
main()
File "/path-to/caffe2/caffe2/python/examples/resnet50_trainer.py", line 458, in main
Train(args)
File "/path-to/caffe2/caffe2/python/examples/resnet50_trainer.py", line 236, in Train
prefix=args.run_id,
File "/path-to/caffe2/build/caffe2/python/core.py", line 324, in CreateOperator
operator.arg.add().CopyFrom(utils.MakeArgument(key, value))
File "/path-to/caffe2/build/caffe2/python/utils.py", line 128, in MakeArgument
key, value, type(value)
ValueError: Unknown argument type: key=prefix value=None, value type=<type 'NoneType'>
INFO:resnet50_trainer:Running on GPUs: [0, 1, 2, 3]
INFO:resnet50_trainer:Using epoch size: 10240
Traceback (most recent call last):
File "/path-to/caffe2/caffe2/python/examples/resnet50_trainer.py", line 462, in <module>
main()
File "/path-to/caffe2/caffe2/python/examples/resnet50_trainer.py", line 458, in main
Train(args)
File "/path-to/caffe2/caffe2/python/examples/resnet50_trainer.py", line 236, in Train
prefix=args.run_id,
File "/path-to/caffe2/build/caffe2/python/core.py", line 324, in CreateOperator
operator.arg.add().CopyFrom(utils.MakeArgument(key, value))
File "/path-to/caffe2/build/caffe2/python/utils.py", line 128, in MakeArgument
key, value, type(value)
ValueError: Unknown argument type: key=prefix value=None, value type=<type 'NoneType'>
INFO:resnet50_trainer:Running on GPUs: [0, 1, 2, 3]
INFO:resnet50_trainer:Using epoch size: 10240
INFO:data_parallel_model:Parallelizing model for devices: [0, 1, 2, 3]
INFO:data_parallel_model:Create input and model training operators
INFO:data_parallel_model:Model for GPU : 0
INFO:data_parallel_model:Model for GPU : 1
INFO:data_parallel_model:Model for GPU : 2
INFO:data_parallel_model:Model for GPU : 3
INFO:data_parallel_model:Adding gradient operators
INFO:data_parallel_model:Add gradient all-reduces for SyncSGD
WARNING:data_parallel_model:Distributed computed params all-reduce not implemented yet
INFO:data_parallel_model:Post-iteration operators for updating params
INFO:data_parallel_model:Calling optimizer builder function
INFO:data_parallel_model:Add initial parameter sync
WARNING:data_parallel_model:------- DEPRECATED API, please use data_parallel_model.OptimizeGradientMemory() -----
WARNING:memonger:NOTE: Executing memonger to optimize gradient memory
INFO:memonger:Remapping 111 blobs, using 14 shared
INFO:memonger:Memonger memory optimization took 0.335288047791 secs
WARNING:memonger:NOTE: Executing memonger to optimize gradient memory
INFO:memonger:Remapping 111 blobs, using 14 shared
INFO:memonger:Memonger memory optimization took 0.353416204453 secs
WARNING:memonger:NOTE: Executing memonger to optimize gradient memory
INFO:memonger:Remapping 111 blobs, using 14 shared
INFO:memonger:Memonger memory optimization took 0.340279817581 secs
WARNING:memonger:NOTE: Executing memonger to optimize gradient memory
INFO:memonger:Remapping 111 blobs, using 14 shared
INFO:memonger:Memonger memory optimization took 0.345367908478 secs
E0719 03:57:34.776022 145363 common_world_ops.h:75] Caught store handler timeout exception: [/path-to/caffe2/caffe2/distributed/file_store_handler.cc:132] Wait timeout for name(s): allreduce_0_cw_op/1/0
E0719 03:57:34.777902 145363 net.cc:145] Operator failed: input: "store_handler" output: "allreduce_0_cw" name: "allreduce_0_cw_op" type: "CreateCommonWorld" arg { name: "status_blob" s: "create_allreduce_cw_0_status" } arg { name: "rank" i: 0 } arg { name: "size" i: 4 } device_option { device_type: 1 cuda_gpu_id: 0 } engine: "GLOO"
E0719 03:57:34.778396 145363 workspace.cc:217] Error when running network resnet50_init
Traceback (most recent call last):
File "/path-to/caffe2/caffe2/python/examples/resnet50_trainer.py", line 462, in <module>
main()
File "/path-to/caffe2/caffe2/python/examples/resnet50_trainer.py", line 458, in main
Train(args)
File "/path-to/caffe2/caffe2/python/examples/resnet50_trainer.py", line 350, in Train
workspace.RunNetOnce(train_model.param_init_net)
File "/path-to/caffe2/build/caffe2/python/workspace.py", line 183, in RunNetOnce
StringifyProto(net),
File "/path-to/caffe2/build/caffe2/python/workspace.py", line 175, in CallWithExceptionIntercept
raise ex
RuntimeError: [enforce fail at pybind_state.cc:862] gWorkspace->RunNetOnce(def).
Sender: LSF System <lsfadmin@c460c110.c460cluster.net>
Subject: Job 327128: <python /path-to/caffe2/caffe2/python/examples/resnet50_trainer.py --train_data /path-to/ILSVRC2012_img_10_categories/ilsvrc12_train_lmdb --gpus 0,1,2,3 --batch_size 128 --num_labels 10 --epoch_size 10240 --num_epochs 10 --num_shards 4 --shard_id 0 --redis_host xxx.xxx.xxx.xxx --redis_port 6379> in cluster <gargblsf> Exited

Job <python /path-to/caffe2/caffe2/python/examples/resnet50_trainer.py --train_data /path-to/ILSVRC2012_img_10_categories/ilsvrc12_train_lmdb --gpus 0,1,2,3 --batch_size 128 --num_labels 10 --epoch_size 10240 --num_epochs 10 --num_shards 4 --shard_id 0 --redis_host xxx.xxx.xxx.xxx --redis_port 6379> was submitted from host <c460login01.c460cluster.net> by user <hiroki11> in cluster <gargblsf>.
Job was executed on host(s) <c460c110.c460cluster.net>, in queue <excl>, as user <hiroki11> in cluster <gargblsf>.
</path-to> was used as the home directory.
</path-to/models/train/redis_multi> was used as the working directory.
Started at Results reported on
Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
python /path-to/caffe2/caffe2/python/examples/resnet50_trainer.py --train_data /path-to/ILSVRC2012_img_10_categories/ilsvrc12_train_lmdb --gpus 0,1,2,3 --batch_size 128 --num_labels 10 --epoch_size 10240 --num_epochs 10 --num_shards 4 --shard_id 0 --redis_host xxx.xxx.xxx.xxx --redis_port 6379
------------------------------------------------------------

Exited with exit code 1.

Resource usage summary:

CPU time :                                   0.99 sec.
Max Memory :                                 29 MB
Average Memory :                             29.00 MB
Total Requested Memory :                     -
Delta Memory :                               -
Max Swap :                                   -
Max Processes :                              4
Max Threads :                                5
Run time :                                   4 sec.
Turnaround time :                            5 sec.

The output (if any) follows:

Ignoring @/caffe2/caffe2/contrib/nccl:nccl_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops_gpu as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:file_store_handler_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:redis_store_handler_ops as it is not a valid file.

PS:

Read file <error_file.log> for stderr output of this job.

Sender: LSF System <lsfadmin@c460c055.c460cluster.net>
Subject: Job 327130: <python /path-to/caffe2/caffe2/python/examples/resnet50_trainer.py --train_data /path-to/ILSVRC2012_img_10_categories/ilsvrc12_train_lmdb --gpus 0,1,2,3 --batch_size 128 --num_labels 10 --epoch_size 10240 --num_epochs 10 --num_shards 4 --shard_id 2 --redis_host xxx.xxx.xxx.xxx --redis_port 6379> in cluster <gargblsf> Exited

Job <python /path-to/caffe2/caffe2/python/examples/resnet50_trainer.py --train_data /path-to/ILSVRC2012_img_10_categories/ilsvrc12_train_lmdb --gpus 0,1,2,3 --batch_size 128 --num_labels 10 --epoch_size 10240 --num_epochs 10 --num_shards 4 --shard_id 2 --redis_host xxx.xxx.xxx.xxx --redis_port 6379> was submitted from host <c460login01.c460cluster.net> by user <hiroki11> in cluster <gargblsf>.
Job was executed on host(s) <c460c055.c460cluster.net>, in queue <excl>, as user <hiroki11> in cluster <gargblsf>.
</path-to> was used as the home directory.
</path-to/models/train/redis_multi> was used as the working directory.
Started at Results reported on
Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
python /path-to/caffe2/caffe2/python/examples/resnet50_trainer.py --train_data /path-to/ILSVRC2012_img_10_categories/ilsvrc12_train_lmdb --gpus 0,1,2,3 --batch_size 128 --num_labels 10 --epoch_size 10240 --num_epochs 10 --num_shards 4 --shard_id 2 --redis_host xxx.xxx.xxx.xxx --redis_port 6379
------------------------------------------------------------

Exited with exit code 1.

Resource usage summary:

CPU time :                                   0.99 sec.
Max Memory :                                 29 MB
Average Memory :                             29.00 MB
Total Requested Memory :                     -
Delta Memory :                               -
Max Swap :                                   -
Max Processes :                              4
Max Threads :                                5
Run time :                                   3 sec.
Turnaround time :                            6 sec.

The output (if any) follows:

Ignoring @/caffe2/caffe2/contrib/nccl:nccl_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops_gpu as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:file_store_handler_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:redis_store_handler_ops as it is not a valid file.

PS:

Read file <error_file.log> for stderr output of this job.

Sender: LSF System <lsfadmin@c460c041.c460cluster.net>
Subject: Job 327129: <python /path-to/caffe2/caffe2/python/examples/resnet50_trainer.py --train_data /path-to/ILSVRC2012_img_10_categories/ilsvrc12_train_lmdb --gpus 0,1,2,3 --batch_size 128 --num_labels 10 --epoch_size 10240 --num_epochs 10 --num_shards 4 --shard_id 1 --redis_host xxx.xxx.xxx.xxx --redis_port 6379> in cluster <gargblsf> Exited

Job <python /path-to/caffe2/caffe2/python/examples/resnet50_trainer.py --train_data /path-to/ILSVRC2012_img_10_categories/ilsvrc12_train_lmdb --gpus 0,1,2,3 --batch_size 128 --num_labels 10 --epoch_size 10240 --num_epochs 10 --num_shards 4 --shard_id 1 --redis_host xxx.xxx.xxx.xxx --redis_port 6379> was submitted from host <c460login01.c460cluster.net> by user <hiroki11> in cluster <gargblsf>.
Job was executed on host(s) <c460c041.c460cluster.net>, in queue <excl>, as user <hiroki11> in cluster <gargblsf>.
</path-to> was used as the home directory.
</path-to/models/train/redis_multi> was used as the working directory.
Started at Results reported on
Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
python /path-to/caffe2/caffe2/python/examples/resnet50_trainer.py --train_data /path-to/ILSVRC2012_img_10_categories/ilsvrc12_train_lmdb --gpus 0,1,2,3 --batch_size 128 --num_labels 10 --epoch_size 10240 --num_epochs 10 --num_shards 4 --shard_id 1 --redis_host xxx.xxx.xxx.xxx --redis_port 6379
------------------------------------------------------------

Exited with exit code 1.

Resource usage summary:

CPU time :                                   0.99 sec.
Max Memory :                                 29 MB
Average Memory :                             1.00 MB
Total Requested Memory :                     -
Delta Memory :                               -
Max Swap :                                   -
Max Processes :                              4
Max Threads :                                5
Run time :                                   3 sec.
Turnaround time :                            6 sec.

The output (if any) follows:

Ignoring @/caffe2/caffe2/contrib/nccl:nccl_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops_gpu as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:file_store_handler_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:redis_store_handler_ops as it is not a valid file.

PS:

Read file <error_file.log> for stderr output of this job.

Sender: LSF System <lsfadmin@c460c110.c460cluster.net>
Subject: Job 327131: <python /path-to/caffe2/caffe2/python/examples/resnet50_trainer.py --train_data /path-to/ILSVRC2012_img_10_categories/ilsvrc12_train_lmdb --gpus 0,1,2,3 --batch_size 128 --num_labels 10 --epoch_size 10240 --num_epochs 10 --num_shards 4 --shard_id 3 --redis_host xxx.xxx.xxx.xxx --redis_port 6379> in cluster <gargblsf> Exited

Job <python /path-to/caffe2/caffe2/python/examples/resnet50_trainer.py --train_data /path-to/ILSVRC2012_img_10_categories/ilsvrc12_train_lmdb --gpus 0,1,2,3 --batch_size 128 --num_labels 10 --epoch_size 10240 --num_epochs 10 --num_shards 4 --shard_id 3 --redis_host xxx.xxx.xxx.xxx --redis_port 6379> was submitted from host <c460login01.c460cluster.net> by user <hiroki11> in cluster <gargblsf>.
Job was executed on host(s) <c460c110.c460cluster.net>, in queue <excl>, as user <hiroki11> in cluster <gargblsf>.
</path-to> was used as the home directory.
</path-to/models/train/redis_multi> was used as the working directory.
Started at Results reported on
Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
python /path-to/caffe2/caffe2/python/examples/resnet50_trainer.py --train_data /path-to/ILSVRC2012_img_10_categories/ilsvrc12_train_lmdb --gpus 0,1,2,3 --batch_size 128 --num_labels 10 --epoch_size 10240 --num_epochs 10 --num_shards 4 --shard_id 3 --redis_host xxx.xxx.xxx.xxx --redis_port 6379
------------------------------------------------------------

Exited with exit code 1.

Resource usage summary:

CPU time :                                   0.90 sec.
Max Memory :                                 36 MB
Average Memory :                             36.00 MB
Total Requested Memory :                     -
Delta Memory :                               -
Max Swap :                                   -
Max Processes :                              4
Max Threads :                                5
Run time :                                   2 sec.
Turnaround time :                            9 sec.

The output (if any) follows:

Ignoring @/caffe2/caffe2/contrib/nccl:nccl_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops_gpu as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:file_store_handler_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:redis_store_handler_ops as it is not a valid file.

PS:

Read file <error_file.log> for stderr output of this job.

Sender: LSF System <lsfadmin@c460c143.c460cluster.net>
Subject: Job 327134: <python /path-to/caffe2/caffe2/python/examples/resnet50_trainer.py --train_data /path-to/ILSVRC2012_img_10_categories/ilsvrc12_train_lmdb --gpus 0,1,2,3 --batch_size 128 --num_labels 10 --epoch_size 10240 --num_epochs 10 --num_shards 4> in cluster <gargblsf> Exited

Job <python /path-to/caffe2/caffe2/python/examples/resnet50_trainer.py --train_data /path-to/ILSVRC2012_img_10_categories/ilsvrc12_train_lmdb --gpus 0,1,2,3 --batch_size 128 --num_labels 10 --epoch_size 10240 --num_epochs 10 --num_shards 4> was submitted from host <c460login01.c460cluster.net> by user <hiroki11> in cluster <gargblsf>.
Job was executed on host(s) <c460c143.c460cluster.net>, in queue <excl>, as user <hiroki11> in cluster <gargblsf>.
</path-to> was used as the home directory.
</path-to/models/train/redis_multi> was used as the working directory.
Started at Results reported on
Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
python /path-to/caffe2/caffe2/python/examples/resnet50_trainer.py --train_data /path-to/ILSVRC2012_img_10_categories/ilsvrc12_train_lmdb --gpus 0,1,2,3 --batch_size 128 --num_labels 10 --epoch_size 10240 --num_epochs 10 --num_shards 4
------------------------------------------------------------

Exited with exit code 1.

Resource usage summary:

CPU time :                                   9.58 sec.
Max Memory :                                 325 MB
Average Memory :                             249.67 MB
Total Requested Memory :                     -
Delta Memory :                               -
Max Swap :                                   -
Max Processes :                              4
Max Threads :                                11
Run time :                                   44 sec.
Turnaround time :                            44 sec.

The output (if any) follows:

Ignoring @/caffe2/caffe2/contrib/nccl:nccl_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops_gpu as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:file_store_handler_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:redis_store_handler_ops as it is not a valid file.
Traceback for operator 1069 in network resnet50_init
/path-to/caffe2/build/caffe2/python/data_parallel_model.py:919
/path-to/caffe2/build/caffe2/python/data_parallel_model.py:970
/path-to/caffe2/build/caffe2/python/data_parallel_model.py:983
/path-to/caffe2/build/caffe2/python/data_parallel_model.py:881
/path-to/caffe2/build/caffe2/python/data_parallel_model.py:221
/path-to/caffe2/caffe2/python/examples/resnet50_trainer.py:309
/path-to/caffe2/caffe2/python/examples/resnet50_trainer.py:458
/path-to/caffe2/caffe2/python/examples/resnet50_trainer.py:462

PS:

Read file <error_file.log> for stderr output of this job.
Hiroki11x commented 7 years ago

Just using the file system (instead of Redis) for rendezvous, the training script runs.
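A minimal sketch of such a file-store submission, assuming this version of resnet50_trainer.py exposes a --file_store_path flag and that the path is on a filesystem visible to every shard (the exact command line is not shown in this thread, so treat the flags below as illustrative):

# inside the same "for i in {0..3}" submission loop as above
bsub \
-e error_file.log \
-o output_file.log \
-R rusage[ngpus_shared=4] \
-q excl \
python ${CAFFE2_HOME}/caffe2/python/examples/resnet50_trainer.py \
--train_data /path-to/ILSVRC2012_img_10_categories/ilsvrc12_train_lmdb \
--gpus 0,1,2,3 \
--batch_size 128 \
--num_labels 10 \
--epoch_size 10240 \
--num_epochs 10 \
--num_shards 4 \
--shard_id $i \
--file_store_path /path-to/shared_fs/rendezvous

With that in place, the per-iteration log looks like: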

INFO:resnet50_trainer:Finished iteration 1/1251 of epoch 0 (2.50 images/sec)
INFO:resnet50_trainer:Training loss: 7.48950910568, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 2/1251 of epoch 0 (21.13 images/sec)
INFO:resnet50_trainer:Training loss: 117.005592346, accuracy: 0.09375
INFO:resnet50_trainer:Finished iteration 3/1251 of epoch 0 (8.36 images/sec)
INFO:resnet50_trainer:Training loss: 369.184936523, accuracy: 0.125
INFO:resnet50_trainer:Finished iteration 4/1251 of epoch 0 (22.64 images/sec)

However, it then fails with the following errors:

E0719 04:54:14.501235 156430 net_dag.cc:521] Operator chain failed: input: "allreduce_3_cw" input: "gpu_0/comp_15_spatbn_3_s_grad" input: "gpu_1/comp_15_spatbn_3_s_grad" input: "gpu_2/comp_15_spatbn_3_s_grad" input: "gpu_3/comp_15_spatbn_3_s_grad" output: "gpu_0/comp_15_spatbn_3_s_grad" output: "gpu_1/comp_15_spatbn_3_s_grad" output: "gpu_2/comp_15_spatbn_3_s_grad" output: "gpu_3/comp_15_spatbn_3_s_grad" name: "comp_15_spatbn_3_s_grad" type: "Allreduce" arg { name: "status_blob" s: "allreduce_comp_15_spatbn_3_s_grad_status" } device_option { device_type: 1 cuda_gpu_id: 0 } engine: "GLOO"
E0719 04:54:14.501709 156423 net_dag.cc:521] Operator chain failed: input: "allreduce_2_cw" input: "gpu_0/comp_15_spatbn_3_b_grad" input: "gpu_1/comp_15_spatbn_3_b_grad" input: "gpu_2/comp_15_spatbn_3_b_grad" input: "gpu_3/comp_15_spatbn_3_b_grad" output: "gpu_0/comp_15_spatbn_3_b_grad" output: "gpu_1/comp_15_spatbn_3_b_grad" output: "gpu_2/comp_15_spatbn_3_b_grad" output: "gpu_3/comp_15_spatbn_3_b_grad" name: "comp_15_spatbn_3_b_grad" type: "Allreduce" arg { name: "status_blob" s: "allreduce_comp_15_spatbn_3_b_grad_status" } device_option { device_type: 1 cuda_gpu_id: 0 } engine: "GLOO"

E0719 04:54:14.576787 156425 allreduce_ops.h:46] Caught gloo IO exception: [/home/hiroki11/caffe2/third_party/gloo/gloo/transport/tcp/buffer.cc:76] Read timeout [10.1.4.5]:38630
E0719 04:54:14.577260 156425 net_dag.cc:521] Operator chain failed: input: "allreduce_13_cw" input: "gpu_0/comp_14_conv_3_w_grad" input: "gpu_1/comp_14_conv_3_w_grad" input: "gpu_2/comp_14_conv_3_w_grad" input: "gpu_3/comp_14_conv_3_w_grad" output: "gpu_0/comp_14_conv_3_w_grad" output: "gpu_1/comp_14_conv_3_w_grad" output: "gpu_2/comp_14_conv_3_w_grad" output: "gpu_3/comp_14_conv_3_w_grad" name: "comp_14_conv_3_w_grad" type: "Allreduce" arg { name: "status_blob" s: "allreduce_comp_14_conv_3_w_grad_status" } device_option { device_type: 1 cuda_gpu_id: 0 } engine: "GLOO"
Traceback for operator 1759 in network resnet50
/home/hiroki11/caffe2/build/caffe2/python/data_parallel_model.py:977
/home/hiroki11/caffe2/build/caffe2/python/data_parallel_model.py:983
/home/hiroki11/caffe2/build/caffe2/python/data_parallel_model.py:881
/home/hiroki11/caffe2/build/caffe2/python/data_parallel_model.py:221
/home/hiroki11/caffe2/caffe2/python/examples/resnet50_trainer.py:309
/home/hiroki11/caffe2/caffe2/python/examples/resnet50_trainer.py:458
/home/hiroki11/caffe2/caffe2/python/examples/resnet50_trainer.py:462
Traceback (most recent call last):
  File "/home/hiroki11/caffe2/caffe2/python/examples/resnet50_trainer.py", line 462, in <module>
    main()
  File "/home/hiroki11/caffe2/caffe2/python/examples/resnet50_trainer.py", line 458, in main
    Train(args)
  File "/home/hiroki11/caffe2/caffe2/python/examples/resnet50_trainer.py", line 388, in Train
    explog
  File "/home/hiroki11/caffe2/caffe2/python/examples/resnet50_trainer.py", line 138, in RunEpoch
    workspace.RunNet(train_model.net.Proto().name)
  File "/home/hiroki11/caffe2/build/caffe2/python/workspace.py", line 201, in RunNet
    StringifyNetName(name), num_iter, allow_fail,
  File "/home/hiroki11/caffe2/build/caffe2/python/workspace.py", line 175, in CallWithExceptionIntercept
    raise ex
RuntimeError: [enforce fail at pybind_state.cc:817] success. Error running net resnet50 
Traceback for operator 1756 in network resnet50
/home/hiroki11/caffe2/build/caffe2/python/data_parallel_model.py:977
/home/hiroki11/caffe2/build/caffe2/python/data_parallel_model.py:983
/home/hiroki11/caffe2/build/caffe2/python/data_parallel_model.py:881
/home/hiroki11/caffe2/build/caffe2/python/data_parallel_model.py:221
/home/hiroki11/caffe2/caffe2/python/examples/resnet50_trainer.py:309
/home/hiroki11/caffe2/caffe2/python/examples/resnet50_trainer.py:458
/home/hiroki11/caffe2/caffe2/python/examples/resnet50_trainer.py:462
Traceback (most recent call last):
  File "/home/hiroki11/caffe2/caffe2/python/examples/resnet50_trainer.py", line 462, in <module>
    main()
  File "/home/hiroki11/caffe2/caffe2/python/examples/resnet50_trainer.py", line 458, in main
    Train(args)
  File "/home/hiroki11/caffe2/caffe2/python/examples/resnet50_trainer.py", line 388, in Train
    explog
  File "/home/hiroki11/caffe2/caffe2/python/examples/resnet50_trainer.py", line 138, in RunEpoch
    workspace.RunNet(train_model.net.Proto().name)
  File "/home/hiroki11/caffe2/build/caffe2/python/workspace.py", line 201, in RunNet
    StringifyNetName(name), num_iter, allow_fail,
  File "/home/hiroki11/caffe2/build/caffe2/python/workspace.py", line 175, in CallWithExceptionIntercept
    raise ex
RuntimeError: [enforce fail at pybind_state.cc:817] success. Error running net resnet50 
ERROR:timeout_guard:Call did not finish in time. Timeout:60.0s PID: 153817
ERROR:timeout_guard:Process did not terminate cleanly in 10 s, forcing
Hiroki11x commented 7 years ago

I changed the default gloo TCP timeout from 30 s to 180 s:

-static const std::chrono::seconds kTimeoutDefault = std::chrono::seconds(30);
+static const std::chrono::seconds kTimeoutDefault = std::chrono::seconds(180);

https://github.com/facebookincubator/gloo/blob/7ea9d9af4e82d20c7c6cee5edd3c52f9bcb42821/gloo/transport/tcp/device.cc#L30

After this change, distributed training without Redis succeeded.

Hiroki11x commented 7 years ago

When using Redis, I get:

INFO:memonger:Memonger memory optimization took 0.350960969925 secs
E0721 05:35:01.964723 42033 common_world_ops.h:75] Caught store handler timeout exception: [/home/hiroki11/caffe2/caffe2/distributed/redis_store_handler.cc:110] Wait timeout for name(s): allreduce_0_cw_op/1/0
E0721 05:35:01.966565 42033 net.cc:145] Operator failed: input: "store_handler" output: "allreduce_0_cw" name: "allreduce_0_cw_op" type: "CreateCommonWorld" arg { name: "status_blob" s: "create_allreduce_cw_0_status" } arg { name: "rank" i: 0 } arg { name: "size" i: 4 } device_option { device_type: 1 cuda_gpu_id: 0 } engine: "GLOO"
E0721 05:35:01.967056 42033 workspace.cc:217] Error when running network resnet50_init
Traceback (most recent call last):
  File "resnet50_trainer.py", line 463, in <module>
    main()
  File "resnet50_trainer.py", line 459, in main
    Train(args)
  File "resnet50_trainer.py", line 351, in Train
    workspace.RunNetOnce(train_model.param_init_net)
  File "/home/hiroki11/caffe2/build/caffe2/python/workspace.py", line 183, in RunNetOnce
    StringifyProto(net),
  File "/home/hiroki11/caffe2/build/caffe2/python/workspace.py", line 175, in CallWithExceptionIntercept
    raise ex
RuntimeError: [enforce fail at pybind_state.cc:862] gWorkspace->RunNetOnce(def).  

https://github.com/caffe2/caffe2/blob/master/caffe2/distributed/redis_store_handler.cc#L110

if (timeout != kNoTimeout && elapsed > timeout) {
  STORE_HANDLER_TIMEOUT("Wait timeout for name(s): ", Join(" ", names));
}
Hiroki11x commented 7 years ago

https://github.com/caffe2/caffe2/blob/master/caffe2/python/pybind_state.cc#L862

m.def("run_net_once", [](const py::bytes& net_def) {
    CAFFE_ENFORCE(gWorkspace);
    NetDef def;
    CAFFE_ENFORCE(
        ParseProtobufFromLargeString(net_def.cast<std::string>(), &def));
    py::gil_scoped_release g;
    CAFFE_ENFORCE(gWorkspace->RunNetOnce(def));
    return true;
  });

This function is invoked from the Python side: RunNetOnce calls it through CallWithExceptionIntercept.

https://github.com/caffe2/caffe2/blob/master/caffe2/python/workspace.py#L179

def RunNetOnce(net):
    return CallWithExceptionIntercept(
        C.run_net_once,
        C.Workspace.current._last_failed_op_net_position,
        GetNetName(net),
        StringifyProto(net),
    )

CallWithExceptionIntercept is defined at caffe2/caffe2/python/workspace.py https://github.com/caffe2/caffe2/blob/master/caffe2/python/workspace.py#L164

def CallWithExceptionIntercept(func, op_id_fetcher, net_name, *args, **kwargs):
    try:
        return func(*args, **kwargs)
    except Exception as ex:
        op_id = op_id_fetcher()
        net_tracebacks = operator_tracebacks.get(net_name, None)
        print("Traceback for operator {} in network {}".format(op_id, net_name))
        if net_tracebacks and op_id in net_tracebacks:
            tb = net_tracebacks[op_id]
            for line in tb:
                print(':'.join(map(str, line)))
        raise ex
Hiroki11x commented 7 years ago

https://github.com/caffe2/caffe2/blob/bd69f2a75f05e0b9f645b9703fd4fe072a729377/caffe2/distributed/store_handler.h#L14

class StoreHandler {
 public:
-static constexpr std::chrono::milliseconds kDefaultTimeout =
-      std::chrono::seconds(30);
+static constexpr std::chrono::milliseconds kDefaultTimeout =
+      std::chrono::seconds(180);
  static constexpr std::chrono::milliseconds kNoTimeout =
      std::chrono::milliseconds::zero();

I changed the default timeout from 30 s to 180 s.

Hiroki11x commented 7 years ago
Traceback for operator 1069 in network resnet50_init
/home/hiroki11/caffe2/build/caffe2/python/data_parallel_model.py:919
/home/hiroki11/caffe2/build/caffe2/python/data_parallel_model.py:970
/home/hiroki11/caffe2/build/caffe2/python/data_parallel_model.py:983
/home/hiroki11/caffe2/build/caffe2/python/data_parallel_model.py:881
/home/hiroki11/caffe2/build/caffe2/python/data_parallel_model.py:221
resnet50_trainer.py:310
resnet50_trainer.py:459
resnet50_trainer.py:463
Traceback (most recent call last):
  File "resnet50_trainer.py", line 463, in <module>
    main()
  File "resnet50_trainer.py", line 459, in main
    Train(args)
  File "resnet50_trainer.py", line 351, in Train
    workspace.RunNetOnce(train_model.param_init_net)
  File "/home/hiroki11/caffe2/build/caffe2/python/workspace.py", line 183, in RunNetOnce
    StringifyProto(net),
  File "/home/hiroki11/caffe2/build/caffe2/python/workspace.py", line 175, in CallWithExceptionIntercept
    raise ex
RuntimeError: [enforce fail at redis_store_handler.cc:51] reply->integer == 1. 0 vs 1. Value at allreduce_0_cw_op/0/1 was already set (perhaps you reused a run ID you have used before?) Error from operator: 
input: "store_handler" output: "allreduce_0_cw" name: "allreduce_0_cw_op" type: "CreateCommonWorld" arg { name: "status_blob" s: "create_allreduce_cw_0_status" } arg { name: "rank" i: 0 } arg { name: "size" i: 2 } device_option { device_type: 1 cuda_gpu_id: 0 } engine: "GLOO"
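If keys from an earlier attempt are still sitting in Redis, one way to recover (a suggestion of mine, not taken from this thread) is to pass a --run_id that has never been used against this Redis instance, or to wipe the rendezvous database before resubmitting:

# WARNING: removes every key in this Redis instance; only do this if the
# server is dedicated to Caffe2/Gloo rendezvous.
redis-cli -h XX.XX.XX.XX -p 6379 FLUSHALL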
Hiroki11x commented 7 years ago
#!/bin/bash
for i in {0..3}
do

bsub \
-e error_file.log \
-o output_file.log \
-R rusage[ngpus_shared=4] \
-q excl python resnet50_trainer.py \
--train_data path-to/ilsvrc12_train_lmdb \
--gpus 0,1,2,3 \
--batch_size 128 \
--num_labels 10 \
--epoch_size 10240 \
--num_epochs 10 \
--num_shards 4 \
--shard_id 0 \
--run_id $i \
--redis_host XXXXXX \
--redis_port 6379

done
Traceback (most recent call last):
  File "resnet50_trainer.py", line 463, in <module>
    main()
  File "resnet50_trainer.py", line 459, in main
    Train(args)
  File "resnet50_trainer.py", line 351, in Train
    workspace.RunNetOnce(train_model.param_init_net)
  File "/home/hiroki11/caffe2/build/caffe2/python/workspace.py", line 183, in RunNetOnce
    StringifyProto(net),
  File "/home/hiroki11/caffe2/build/caffe2/python/workspace.py", line 175, in CallWithExceptionIntercept
    raise ex
RuntimeError: [enforce fail at redis_store_handler.cc:51] reply->integer == 1. 0 vs 1. Value at allreduce_0_cw_op/0/1 was already set (perhaps you reused a run ID you have used before?) Error from operator: 
input: "store_handler" output: "allreduce_0_cw" name: "allreduce_0_cw_op" type: "CreateCommonWorld" arg { name: "status_blob" s: "create_allreduce_cw_0_status" } arg { name: "rank" i: 0 } arg { name: "size" i: 4 } device_option { device_type: 1 cuda_gpu_id: 0 } engine: "GLOO"
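Note that in the loop above --run_id varies with $i while --shard_id stays 0, so all four jobs present themselves as shard 0 of different runs. My reading of the two flags (an assumption, the thread does not spell it out) is the opposite: --shard_id should be distinct per node (0..num_shards-1) and --run_id should be a single value shared by every shard of one run, e.g.:

#!/bin/bash
RUN_ID=$(date +%s)   # one identifier shared by all four shards of this run
for i in {0..3}
do

bsub \
-e error_file.log \
-o output_file.log \
-R rusage[ngpus_shared=4] \
-q excl \
python resnet50_trainer.py \
--train_data path-to/ilsvrc12_train_lmdb \
--gpus 0,1,2,3 \
--batch_size 128 \
--num_labels 10 \
--epoch_size 10240 \
--num_epochs 10 \
--num_shards 4 \
--shard_id $i \
--run_id $RUN_ID \
--redis_host XXXXXX \
--redis_port 6379

done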
Hiroki11x commented 7 years ago

Redis server setup: change the following fields in redis.conf:

bind XX.XX.XX.XX
protected-mode no
daemonize yes

Here the bind IP is the Redis server's IP, which is node 0; node 1 is the client. Then I started the server with: redis-server redis.conf

On the client side, if I execute:

redis-cli -h XX.XX.XX.XX -p 6379
> ping
pong

This means the client can connect to the server.

But if I run the gloo benchmark with the following commands:

server:
./benchmark --size 2 --rank 0 --redis-host XX.XX.XX.XX --redis-port 6379 --prefix 1300 --transport ibverbs --elements -1 allreduce_ring_chunk
client:
./benchmark --size 2 --rank 1 --redis-host XX.XX.XX.XX --redis-port 6379 --prefix 1300 --transport ibverbs --elements -1 allreduce_ring_chunk

then I get the same errors as above. However, it works correctly if "--transport tcp" is used.
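For reference, the working TCP variant is the same pair of commands with only the transport swapped (the new --prefix value is just a placeholder of mine; using a fresh prefix per run avoids colliding with rendezvous keys left over from earlier runs):

# server (rank 0)
./benchmark --size 2 --rank 0 --redis-host XX.XX.XX.XX --redis-port 6379 --prefix 1301 --transport tcp --elements -1 allreduce_ring_chunk

# client (rank 1)
./benchmark --size 2 --rank 1 --redis-host XX.XX.XX.XX --redis-port 6379 --prefix 1301 --transport tcp --elements -1 allreduce_ring_chunk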