Closed. Hiroki11x closed this issue 7 years ago.
Just using the file system for rendezvous, the training script starts running:
INFO:resnet50_trainer:Finished iteration 1/1251 of epoch 0 (2.50 images/sec)
INFO:resnet50_trainer:Training loss: 7.48950910568, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 2/1251 of epoch 0 (21.13 images/sec)
INFO:resnet50_trainer:Training loss: 117.005592346, accuracy: 0.09375
INFO:resnet50_trainer:Finished iteration 3/1251 of epoch 0 (8.36 images/sec)
INFO:resnet50_trainer:Training loss: 369.184936523, accuracy: 0.125
INFO:resnet50_trainer:Finished iteration 4/1251 of epoch 0 (22.64 images/sec)
However, it then fails:
E0719 04:54:14.501235 156430 net_dag.cc:521] Operator chain failed: input: "allreduce_3_cw" input: "gpu_0/comp_15_spatbn_3_s_grad" input: "gpu_1/comp_15_spatbn_3_s_grad" input: "gpu_2/comp_15_spatbn_3_s_grad" input: "gpu_3/comp_15_spatbn_3_s_grad" output: "gpu_0/comp_15_spatbn_3_s_grad" output: "gpu_1/comp_15_spatbn_3_s_grad" output: "gpu_2/comp_15_spatbn_3_s_grad" output: "gpu_3/comp_15_spatbn_3_s_grad" name: "comp_15_spatbn_3_s_grad" type: "Allreduce" arg { name: "status_blob" s: "allreduce_comp_15_spatbn_3_s_grad_status" } device_option { device_type: 1 cuda_gpu_id: 0 } engine: "GLOO"
E0719 04:54:14.501709 156423 net_dag.cc:521] Operator chain failed: input: "allreduce_2_cw" input: "gpu_0/comp_15_spatbn_3_b_grad" input: "gpu_1/comp_15_spatbn_3_b_grad" input: "gpu_2/comp_15_spatbn_3_b_grad" input: "gpu_3/comp_15_spatbn_3_b_grad" output: "gpu_0/comp_15_spatbn_3_b_grad" output: "gpu_1/comp_15_spatbn_3_b_grad" output: "gpu_2/comp_15_spatbn_3_b_grad" output: "gpu_3/comp_15_spatbn_3_b_grad" name: "comp_15_spatbn_3_b_grad" type: "Allreduce" arg { name: "status_blob" s: "allreduce_comp_15_spatbn_3_b_grad_status" } device_option { device_type: 1 cuda_gpu_id: 0 } engine: "GLOO"
E0719 04:54:14.576787 156425 allreduce_ops.h:46] Caught gloo IO exception: [/home/hiroki11/caffe2/third_party/gloo/gloo/transport/tcp/buffer.cc:76] Read timeout [10.1.4.5]:38630
E0719 04:54:14.577260 156425 net_dag.cc:521] Operator chain failed: input: "allreduce_13_cw" input: "gpu_0/comp_14_conv_3_w_grad" input: "gpu_1/comp_14_conv_3_w_grad" input: "gpu_2/comp_14_conv_3_w_grad" input: "gpu_3/comp_14_conv_3_w_grad" output: "gpu_0/comp_14_conv_3_w_grad" output: "gpu_1/comp_14_conv_3_w_grad" output: "gpu_2/comp_14_conv_3_w_grad" output: "gpu_3/comp_14_conv_3_w_grad" name: "comp_14_conv_3_w_grad" type: "Allreduce" arg { name: "status_blob" s: "allreduce_comp_14_conv_3_w_grad_status" } device_option { device_type: 1 cuda_gpu_id: 0 } engine: "GLOO"
Traceback for operator 1759 in network resnet50
/home/hiroki11/caffe2/build/caffe2/python/data_parallel_model.py:977
/home/hiroki11/caffe2/build/caffe2/python/data_parallel_model.py:983
/home/hiroki11/caffe2/build/caffe2/python/data_parallel_model.py:881
/home/hiroki11/caffe2/build/caffe2/python/data_parallel_model.py:221
/home/hiroki11/caffe2/caffe2/python/examples/resnet50_trainer.py:309
/home/hiroki11/caffe2/caffe2/python/examples/resnet50_trainer.py:458
/home/hiroki11/caffe2/caffe2/python/examples/resnet50_trainer.py:462
Traceback (most recent call last):
File "/home/hiroki11/caffe2/caffe2/python/examples/resnet50_trainer.py", line 462, in <module>
main()
File "/home/hiroki11/caffe2/caffe2/python/examples/resnet50_trainer.py", line 458, in main
Train(args)
File "/home/hiroki11/caffe2/caffe2/python/examples/resnet50_trainer.py", line 388, in Train
explog
File "/home/hiroki11/caffe2/caffe2/python/examples/resnet50_trainer.py", line 138, in RunEpoch
workspace.RunNet(train_model.net.Proto().name)
File "/home/hiroki11/caffe2/build/caffe2/python/workspace.py", line 201, in RunNet
StringifyNetName(name), num_iter, allow_fail,
File "/home/hiroki11/caffe2/build/caffe2/python/workspace.py", line 175, in CallWithExceptionIntercept
raise ex
RuntimeError: [enforce fail at pybind_state.cc:817] success. Error running net resnet50
Traceback for operator 1756 in network resnet50
/home/hiroki11/caffe2/build/caffe2/python/data_parallel_model.py:977
/home/hiroki11/caffe2/build/caffe2/python/data_parallel_model.py:983
/home/hiroki11/caffe2/build/caffe2/python/data_parallel_model.py:881
/home/hiroki11/caffe2/build/caffe2/python/data_parallel_model.py:221
/home/hiroki11/caffe2/caffe2/python/examples/resnet50_trainer.py:309
/home/hiroki11/caffe2/caffe2/python/examples/resnet50_trainer.py:458
/home/hiroki11/caffe2/caffe2/python/examples/resnet50_trainer.py:462
Traceback (most recent call last):
File "/home/hiroki11/caffe2/caffe2/python/examples/resnet50_trainer.py", line 462, in <module>
main()
File "/home/hiroki11/caffe2/caffe2/python/examples/resnet50_trainer.py", line 458, in main
Train(args)
File "/home/hiroki11/caffe2/caffe2/python/examples/resnet50_trainer.py", line 388, in Train
explog
File "/home/hiroki11/caffe2/caffe2/python/examples/resnet50_trainer.py", line 138, in RunEpoch
workspace.RunNet(train_model.net.Proto().name)
File "/home/hiroki11/caffe2/build/caffe2/python/workspace.py", line 201, in RunNet
StringifyNetName(name), num_iter, allow_fail,
File "/home/hiroki11/caffe2/build/caffe2/python/workspace.py", line 175, in CallWithExceptionIntercept
raise ex
RuntimeError: [enforce fail at pybind_state.cc:817] success. Error running net resnet50
ERROR:timeout_guard:Call did not finish in time. Timeout:60.0s PID: 153817
ERROR:timeout_guard:Process did not terminate cleanly in 10 s, forcing
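The two timeout_guard errors come from caffe2.python.timeout_guard, which the trainer appears to use so that a hung call (for example an Allreduce that never completes) kills the process instead of blocking forever. A minimal sketch of that pattern, assuming the CompleteInTimeOrDie context manager from that module:

import time
from caffe2.python import timeout_guard

# If the body overruns the timeout, timeout_guard logs
# "Call did not finish in time" and then force-terminates the process,
# which matches the two ERROR lines above.
with timeout_guard.CompleteInTimeOrDie(60.0):
    time.sleep(5)  # stand-in for e.g. workspace.RunNet(...)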
To work around the read timeout, I changed the Gloo default timeout (kTimeoutDefault) from 30 to 180 seconds:
-static const std::chrono::seconds kTimeoutDefault = std::chrono::seconds(30);
+static const std::chrono::seconds kTimeoutDefault = std::chrono::seconds(180);
With this change, distributed training without Redis succeeded.
When using Redis, however, it fails during CreateCommonWorld:
INFO:memonger:Memonger memory optimization took 0.350960969925 secs
E0721 05:35:01.964723 42033 common_world_ops.h:75] Caught store handler timeout exception: [/home/hiroki11/caffe2/caffe2/distributed/redis_store_handler.cc:110] Wait timeout for name(s): allreduce_0_cw_op/1/0
E0721 05:35:01.966565 42033 net.cc:145] Operator failed: input: "store_handler" output: "allreduce_0_cw" name: "allreduce_0_cw_op" type: "CreateCommonWorld" arg { name: "status_blob" s: "create_allreduce_cw_0_status" } arg { name: "rank" i: 0 } arg { name: "size" i: 4 } device_option { device_type: 1 cuda_gpu_id: 0 } engine: "GLOO"
E0721 05:35:01.967056 42033 workspace.cc:217] Error when running network resnet50_init
Traceback (most recent call last):
File "resnet50_trainer.py", line 463, in <module>
main()
File "resnet50_trainer.py", line 459, in main
Train(args)
File "resnet50_trainer.py", line 351, in Train
workspace.RunNetOnce(train_model.param_init_net)
File "/home/hiroki11/caffe2/build/caffe2/python/workspace.py", line 183, in RunNetOnce
StringifyProto(net),
File "/home/hiroki11/caffe2/build/caffe2/python/workspace.py", line 175, in CallWithExceptionIntercept
raise ex
RuntimeError: [enforce fail at pybind_state.cc:862] gWorkspace->RunNetOnce(def).
https://github.com/caffe2/caffe2/blob/master/caffe2/distributed/redis_store_handler.cc#L110
if (timeout != kNoTimeout && elapsed > timeout) {
  STORE_HANDLER_TIMEOUT("Wait timeout for name(s): ", Join(" ", names));
}
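This wait throws once the key it is blocking on (here allreduce_0_cw_op/1/0, which another shard should have written) fails to appear within the timeout, i.e. the peer never registered. A quick way to see which rendezvous keys actually made it into Redis, assuming redis-py and that the store handler prefixes its keys with the --run_id passed to the trainer:

import redis

# Host/port are the ones passed as --redis_host/--redis_port.
r = redis.StrictRedis(host="XX.XX.XX.XX", port=6379)
run_id = "0"  # the --run_id for this training run (assumed to be the key prefix)

# Each shard should have written its own "<op>/<rank>/<peer>" entries;
# a missing rank here is why the wait on the other side times out.
for key in r.scan_iter(match=run_id + "*"):
    print(key)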
https://github.com/caffe2/caffe2/blob/master/caffe2/python/pybind_state.cc#L862
m.def("run_net_once", [](const py::bytes& net_def) {
  CAFFE_ENFORCE(gWorkspace);
  NetDef def;
  CAFFE_ENFORCE(
      ParseProtobufFromLargeString(net_def.cast<std::string>(), &def));
  py::gil_scoped_release g;
  CAFFE_ENFORCE(gWorkspace->RunNetOnce(def));
  return true;
});
This run_net_once binding is what RunNetOnce calls; on the Python side, RunNetOnce wraps it with CallWithExceptionIntercept:
https://github.com/caffe2/caffe2/blob/master/caffe2/python/workspace.py#L179
def RunNetOnce(net):
    return CallWithExceptionIntercept(
        C.run_net_once,
        C.Workspace.current._last_failed_op_net_position,
        GetNetName(net),
        StringifyProto(net),
    )
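For reference, a minimal self-contained use of this path, building a one-operator net and running it once through the binding shown above:

from caffe2.python import core, workspace

# Build a tiny net and execute it a single time.
net = core.Net("tiny_example")
net.ConstantFill([], ["x"], shape=[2, 2], value=1.0)
workspace.RunNetOnce(net)
print(workspace.FetchBlob("x"))  # 2x2 array of ones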
CallWithExceptionIntercept is defined in caffe2/python/workspace.py:
https://github.com/caffe2/caffe2/blob/master/caffe2/python/workspace.py#L164
def CallWithExceptionIntercept(func, op_id_fetcher, net_name, *args, **kwargs):
    try:
        return func(*args, **kwargs)
    except Exception as ex:
        op_id = op_id_fetcher()
        net_tracebacks = operator_tracebacks.get(net_name, None)
        print("Traceback for operator {} in network {}".format(op_id, net_name))
        if net_tracebacks and op_id in net_tracebacks:
            tb = net_tracebacks[op_id]
            for line in tb:
                print(':'.join(map(str, line)))
        raise ex
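The intercept only prints the failing operator's index and the traceback of where it was added to the net. To see what that operator actually is (operator 1759 in the log above), the index can be looked up in the net proto directly; this sketch assumes train_model is in scope, as in resnet50_trainer.py:

# Look up the operator that CallWithExceptionIntercept reported.
proto = train_model.net.Proto()
op = proto.op[1759]  # index printed as "Traceback for operator 1759 ..."
print(op.type, op.name)
print("inputs: ", list(op.input))
print("outputs:", list(op.output))
print("device: ", op.device_option)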
class StoreHandler {
 public:
-  static constexpr std::chrono::milliseconds kDefaultTimeout =
-      std::chrono::seconds(30);
+  static constexpr std::chrono::milliseconds kDefaultTimeout =
+      std::chrono::seconds(180);
  static constexpr std::chrono::milliseconds kNoTimeout =
      std::chrono::milliseconds::zero();
I changed kDefaultTimeout from 30 to 180 seconds as shown above, but then hit a different error:
Traceback for operator 1069 in network resnet50_init
/home/hiroki11/caffe2/build/caffe2/python/data_parallel_model.py:919
/home/hiroki11/caffe2/build/caffe2/python/data_parallel_model.py:970
/home/hiroki11/caffe2/build/caffe2/python/data_parallel_model.py:983
/home/hiroki11/caffe2/build/caffe2/python/data_parallel_model.py:881
/home/hiroki11/caffe2/build/caffe2/python/data_parallel_model.py:221
resnet50_trainer.py:310
resnet50_trainer.py:459
resnet50_trainer.py:463
Traceback (most recent call last):
File "resnet50_trainer.py", line 463, in <module>
main()
File "resnet50_trainer.py", line 459, in main
Train(args)
File "resnet50_trainer.py", line 351, in Train
workspace.RunNetOnce(train_model.param_init_net)
File "/home/hiroki11/caffe2/build/caffe2/python/workspace.py", line 183, in RunNetOnce
StringifyProto(net),
File "/home/hiroki11/caffe2/build/caffe2/python/workspace.py", line 175, in CallWithExceptionIntercept
raise ex
RuntimeError: [enforce fail at redis_store_handler.cc:51] reply->integer == 1. 0 vs 1. Value at allreduce_0_cw_op/0/1 was already set (perhaps you reused a run ID you have used before?) Error from operator:
input: "store_handler" output: "allreduce_0_cw" name: "allreduce_0_cw_op" type: "CreateCommonWorld" arg { name: "status_blob" s: "create_allreduce_cw_0_status" } arg { name: "rank" i: 0 } arg { name: "size" i: 2 } device_option { device_type: 1 cuda_gpu_id: 0 } engine: "GLOO"
#!/bin/bash
for i in {0..3}
do
bsub \
-e error_file.log \
-o output_file.log \
-R rusage[ngpus_shared=4] \
-q excl python resnet50_trainer.py \
--train_data path-to/ilsvrc12_train_lmdb \
--gpus 0,1,2,3 \
--batch_size 128 \
--num_labels 10 \
--epoch_size 10240 \
--num_epochs 10 \
--num_shards 4 \
--shard_id 0 \
--run_id $i \
--redis_host XXXXXX \
--redis_port 6379
done
Traceback (most recent call last):
File "resnet50_trainer.py", line 463, in <module>
main()
File "resnet50_trainer.py", line 459, in main
Train(args)
File "resnet50_trainer.py", line 351, in Train
workspace.RunNetOnce(train_model.param_init_net)
File "/home/hiroki11/caffe2/build/caffe2/python/workspace.py", line 183, in RunNetOnce
StringifyProto(net),
File "/home/hiroki11/caffe2/build/caffe2/python/workspace.py", line 175, in CallWithExceptionIntercept
raise ex
RuntimeError: [enforce fail at redis_store_handler.cc:51] reply->integer == 1. 0 vs 1. Value at allreduce_0_cw_op/0/1 was already set (perhaps you reused a run ID you have used before?) Error from operator:
input: "store_handler" output: "allreduce_0_cw" name: "allreduce_0_cw_op" type: "CreateCommonWorld" arg { name: "status_blob" s: "create_allreduce_cw_0_status" } arg { name: "rank" i: 0 } arg { name: "size" i: 4 } device_option { device_type: 1 cuda_gpu_id: 0 } engine: "GLOO"
Redis server setup: change the following fields in redis.conf:
bind XX.XX.XX.XX
protected-mode no
daemonize yes
Here the bind IP is the Redis server's IP, which is node 0; node 1 is the client. Then I started the server with redis-server redis.conf.
On the client side, if I execute:
redis-cli -h XX.XX.XX.XX -p 6379
> ping
pong
The ping succeeds, so the two nodes can connect to each other.
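The same check can be done from Python, assuming the redis-py client is installed:

import redis

r = redis.StrictRedis(host="XX.XX.XX.XX", port=6379)
print(r.ping())  # True if the client node can reach the Redis server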
But when I run the Gloo benchmark with the following commands:
server:
./benchmark --size 2 --rank 0 --redis-host XX.XX.XX.XX --redis-port 6379 --prefix 1300 --transport ibverbs --elements -1 allreduce_ring_chunk
client:
./benchmark --size 2 --rank 1 --redis-host XX.XX.XX.XX --redis-port 6379 --prefix 1300 --transport ibverbs --elements -1 allreduce_ring_chunk
Then I get the error above. It works correctly if --transport tcp is used instead.