rioyokotalab / caffe2

Caffe2 is a lightweight, modular, and scalable deep learning framework.
https://caffe2.ai

Distributed Multinode Training Error #13

Closed Hiroki11x closed 7 years ago

Hiroki11x commented 7 years ago
Rank 0
INFO:resnet50_trainer:Running on GPUs: [0, 1, 2, 3]
INFO:resnet50_trainer:Using epoch size: 1281024
INFO:data_parallel_model:Parallelizing model for devices: [0, 1, 2, 3]
INFO:data_parallel_model:Create input and model training operators
INFO:data_parallel_model:Model for GPU : 0
INFO:data_parallel_model:Model for GPU : 1
INFO:data_parallel_model:Model for GPU : 2
INFO:data_parallel_model:Model for GPU : 3
INFO:data_parallel_model:Adding gradient operators
INFO:data_parallel_model:Add gradient all-reduces for SyncSGD
WARNING:data_parallel_model:Distributed computed params all-reduce not implemented yet
INFO:data_parallel_model:Post-iteration operators for updating params
INFO:data_parallel_model:Calling optimizer builder function
INFO:data_parallel_model:Add initial parameter sync
WARNING:data_parallel_model:------- DEPRECATED API, please use data_parallel_model.OptimizeGradientMemory() ----- 
WARNING:memonger:NOTE: Executing memonger to optimize gradient memory
INFO:memonger:Remapping 111 blobs, using 14 shared
INFO:memonger:Memonger memory optimization took 0.382014989853 secs
WARNING:memonger:NOTE: Executing memonger to optimize gradient memory
INFO:memonger:Remapping 111 blobs, using 14 shared
INFO:memonger:Memonger memory optimization took 0.402288913727 secs
WARNING:memonger:NOTE: Executing memonger to optimize gradient memory
INFO:memonger:Remapping 111 blobs, using 14 shared
INFO:memonger:Memonger memory optimization took 0.385862827301 secs
WARNING:memonger:NOTE: Executing memonger to optimize gradient memory
INFO:memonger:Remapping 111 blobs, using 14 shared
INFO:memonger:Memonger memory optimization took 0.391633987427 secs
E0804 00:42:11.653461 80925 common_world_ops.h:75] Caught store handler timeout exception: [/home/hiroki11/caffe2/caffe2/distributed/file_store_handler.cc:132] Wait timeout for name(s): allreduce_3_cw_op/3/0
E0804 00:42:11.657723 80925 net.cc:145] Operator failed: input: "store_handler" output: "allreduce_3_cw" name: "allreduce_3_cw_op" type: "CreateCommonWorld" arg { name: "status_blob" s: "create_allreduce_cw_3_status" } arg { name: "rank" i: 0 } arg { name: "size" i: 4 } device_option { device_type: 1 cuda_gpu_id: 0 } engine: "GLOO"
E0804 00:42:11.658283 80925 workspace.cc:217] Error when running network resnet50_init
Ignoring @/caffe2/caffe2/contrib/nccl:nccl_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops_gpu as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:file_store_handler_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:redis_store_handler_ops as it is not a valid file.
Traceback for operator 1072 in network resnet50_init
/home/hiroki11/caffe2/build/caffe2/python/data_parallel_model.py:919
/home/hiroki11/caffe2/build/caffe2/python/data_parallel_model.py:970
/home/hiroki11/caffe2/build/caffe2/python/data_parallel_model.py:983
/home/hiroki11/caffe2/build/caffe2/python/data_parallel_model.py:881
/home/hiroki11/caffe2/build/caffe2/python/data_parallel_model.py:221
/home/hiroki11/caffe2/caffe2/python/examples/resnet50_trainer.py:309
/home/hiroki11/caffe2/caffe2/python/examples/resnet50_trainer.py:458
/home/hiroki11/caffe2/caffe2/python/examples/resnet50_trainer.py:462
Traceback (most recent call last):
  File "/home/hiroki11/caffe2/caffe2/python/examples/resnet50_trainer.py", line 462, in <module>
    main()
  File "/home/hiroki11/caffe2/caffe2/python/examples/resnet50_trainer.py", line 458, in main
    Train(args)
  File "/home/hiroki11/caffe2/caffe2/python/examples/resnet50_trainer.py", line 350, in Train
    workspace.RunNetOnce(train_model.param_init_net)
  File "/home/hiroki11/caffe2/build/caffe2/python/workspace.py", line 183, in RunNetOnce
    StringifyProto(net),
  File "/home/hiroki11/caffe2/build/caffe2/python/workspace.py", line 175, in CallWithExceptionIntercept
    raise ex
RuntimeError: [enforce fail at pybind_state.cc:862] gWorkspace->RunNetOnce(def). 
Hiroki11x commented 7 years ago

Could this be caused by not specifying run_id? See https://github.com/caffe2/caffe2/issues/984
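
To illustrate why a missing or mismatched run_id could produce the "Wait timeout for name(s)" error above, here is a minimal toy sketch of a file-based rendezvous. It is not Caffe2's actual implementation: the `ToyFileStore` class, the `rendezvous` helper, the key format, and the `toy_cw_op` name are all hypothetical. It only assumes that the run_id acts as a key prefix in the shared store, so ranks started with different prefixes never see each other's keys.

```python
import os
import tempfile
import time


class ToyFileStore:
    """Toy key/value store on a shared directory (hypothetical, not Caffe2's)."""

    def __init__(self, path, prefix):
        self.path = path
        self.prefix = prefix  # plays the role of run_id (assumption)

    def _file(self, name):
        # Every key is namespaced by the prefix.
        safe = name.replace("/", "_")
        return os.path.join(self.path, "{}_{}".format(self.prefix, safe))

    def set(self, name, value):
        with open(self._file(name), "w") as f:
            f.write(value)

    def wait(self, names, timeout=0.2):
        # Poll until all keys exist or the deadline passes.
        deadline = time.time() + timeout
        while True:
            missing = [n for n in names if not os.path.exists(self._file(n))]
            if not missing:
                return
            if time.time() > deadline:
                raise RuntimeError(
                    "Wait timeout for name(s): " + " ".join(missing))
            time.sleep(0.01)


def rendezvous(stores, op_name="toy_cw_op"):
    # Each rank publishes its own key, then waits for every peer's key.
    size = len(stores)
    for rank, store in enumerate(stores):
        store.set("{}/{}".format(op_name, rank), "addr-of-rank-{}".format(rank))
    for store in stores:
        store.wait(["{}/{}".format(op_name, r) for r in range(size)])


shared_dir = tempfile.mkdtemp()

# Same prefix on both ranks: the rendezvous completes.
rendezvous([ToyFileStore(shared_dir, "run42"),
            ToyFileStore(shared_dir, "run42")])
same_prefix_ok = True

# Different prefixes: each rank waits for keys the other never wrote.
try:
    rendezvous([ToyFileStore(shared_dir, "runA"),
                ToyFileStore(shared_dir, "runB")])
    mismatch_error = None
except RuntimeError as exc:
    mismatch_error = str(exc)

print(mismatch_error)
```

If this model matches what happens here, the things to check would be that every node is launched with the same run_id and that all nodes point at the same shared file store directory. A timeout like the one in the log could also occur simply because the other ranks were never started.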