Traceback (most recent call last):
File "/home/hiroki11/models/train/src/resnet50_trainer.py", line 463, in <module>
main()
File "/home/hiroki11/models/train/src/resnet50_trainer.py", line 459, in main
Train(args)
File "/home/hiroki11/models/train/src/resnet50_trainer.py", line 389, in Train
explog
File "/home/hiroki11/models/train/src/resnet50_trainer.py", line 154, in RunEpoch
data_parallel_model.GetLearningRateBlobNames(train_model)[0]
AttributeError: 'module' object has no attribute 'GetLearningRateBlobNames'
This seems to depend on the Caffe2 version.
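For reference, here is a minimal defensive sketch that guards the lookup, assuming only what the traceback shows: data_parallel_model may or may not expose GetLearningRateBlobNames depending on the Caffe2 checkout. The fallback blob name gpu_0/conv1_w_lr is an assumption borrowed from the Stack Overflow question linked further down.

# Guard against Caffe2 trees that lack data_parallel_model.GetLearningRateBlobNames.
from caffe2.python import data_parallel_model

def get_lr_blob_names(train_model):
    if hasattr(data_parallel_model, 'GetLearningRateBlobNames'):
        # Newer Caffe2: ask the module directly.
        return data_parallel_model.GetLearningRateBlobNames(train_model)
    # Older Caffe2: fall back to a hard-coded per-GPU LR blob name
    # (assumed naming; adjust the parameter/prefix to your model).
    return ['gpu_0/conv1_w_lr']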
Traceback (most recent call last):
File "/home/hiroki11/models/train/src/resnet50_trainer.py", line 463, in <module>
main()
File "/home/hiroki11/models/train/src/resnet50_trainer.py", line 459, in main
Train(args)
File "/home/hiroki11/models/train/src/resnet50_trainer.py", line 351, in Train
workspace.RunNetOnce(train_model.param_init_net)
File "/home/hiroki11/caffe2/build/caffe2/python/workspace.py", line 183, in RunNetOnce
StringifyProto(net),
File "/home/hiroki11/caffe2/build/caffe2/python/workspace.py", line 166, in CallWithExceptionIntercept
return func(*args, **kwargs)
RuntimeError: [enforce fail at /home/hiroki11/caffe2/third_party/gloo/gloo/cuda_collectives_device.h:199] 1 == canAccessPeer. 1 vs 0. GPU 0 does not have peer access to GPU 2
Does this problem depend on the IBM Minsky architecture?
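As a quick sanity check on the topology, here is a minimal sketch that prints the GPU peer-access matrix, assuming workspace.GetCudaPeerAccessPattern() is available in this Caffe2 build.

# Print which GPU pairs report CUDA peer (P2P) access.
from caffe2.python import workspace

pattern = workspace.GetCudaPeerAccessPattern()  # N x N boolean matrix
print(pattern)
# If pattern[0][2] is False, GPU 0 cannot reach GPU 2 directly, which matches
# the enforce failure above; on a two-socket Minsky node, GPUs attached to
# different CPUs typically lack direct P2P access.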
BS=32
NUM_GPUS=2
NUM_SHARDS=2
BS_PER_NODE=`expr $BS '*' $NUM_GPUS`
TRAIN_DATA=/path-to/ilsvrc12_train_lmdb
python \
/home/hiroki11/models/train/src/resnet50_trainer.py \
--train_data $TRAIN_DATA \
--num_gpus $NUM_GPUS \
--num_shards $NUM_SHARDS \
--shard_id $1 \
--file_store_path $RENDEZVOUS \
--image_size 224 \
--batch_size $BS_PER_NODE \
--epoch_size 1281167 \
--num_epochs 2 \
--run_id 1081 \
--base_learning_rate 1.0 \
--weight_decay 0.0001
If I change NUM_GPUS (number of GPUs per node) from 2 to 4, the training job runs:
INFO:resnet50_trainer:Finished iteration 1428/10009 of epoch 1 (168.60 images/sec)
INFO:resnet50_trainer:Training loss: 0.512093663216, accuracy: 0.8125
INFO:resnet50_trainer:Finished iteration 1429/10009 of epoch 1 (168.86 images/sec)
INFO:resnet50_trainer:Training loss: 0.663730919361, accuracy: 0.75
It seems that gloo/cuda_collectives_device.h did not exist until recently (it was not there last week).
Should I go beyond Caffe2 and open an issue, or send a PR, against Gloo?
It is caused by a Gloo update:
[before] https://github.com/facebookincubator/gloo/tree/7ea9d9af4e82d20c7c6cee5edd3c52f9bcb42821
[after] https://github.com/facebookincubator/gloo/tree/530878247b04c423fd35477208f68e70b8126e2d
Same issue: https://github.com/caffe2/caffe2/issues/918
# batch_size = 32 per GPU * 4 GPUs = 128
python resnet50_trainer.py \
--train_data ilsvrc12_train_lmdb \
--num_gpus 4 \
--num_shards 1 \
--file_store_path log/hoge \
--image_size 224 \
--batch_size 128 \
--epoch_size 1281167 \
--num_epochs 2 \
--base_learning_rate 1.0 \
--weight_decay 0.0001
Single-node, 4-GPU training was working.
$ cd $CAFFE2_HOME
$ git reset --hard a31338b1cd98befcd3668168c533c6c107e7460
HEAD is now at a31338b Support a build script for Tizen target
$ rm -rf build
$ make clean
$ cd third_party/gloo
$ git reset --hard 7ea9d9af4e82d20c7c6cee5edd3c52f9bcb42821
HEAD is now at 7ea9d9a Fix build when included by another project; take 2
$ cd $CAFFE2_HOME
$ vim third_party/gloo/gloo/transport/tcp/device.cc
- static const std::chrono::seconds kTimeoutDefault = std::chrono::seconds(30);
+ static const std::chrono::seconds kTimeoutDefault = std::chrono::seconds(180);
$ vim caffe2/distributed/store_handler.h
-static constexpr std::chrono::milliseconds kDefaultTimeout =
- std::chrono::seconds(30);
+static constexpr std::chrono::milliseconds kDefaultTimeout =
+ std::chrono::seconds(180);
$ ./cudnn_update.sh
$ mkdir build && cd build
$ CMAKE_PREFIX_PATH=/path-to/protobuf-3.2.0:/path-to/cuda_8_opencv-2.4.13:/path-to/snappy_1.1.4 cmake .. \
-DBLAS=Eigen \
-DUSE_CUDA=ON \
-DUSE_ROCKSDB=OFF \
-DUSE_GLOO=ON \
-DUSE_REDIS=ON \
-DUSE_OPENCV=ON \
-DUSE_GFLAGS=OFF \
-DCUDNN_INCLUDE_DIR=/path-to/cuda/include \
-DCUDNN_LIBRARY=/path-to/cuda/lib/libcudnn.so \
-DCMAKE_INSTALL_PREFIX=/path-to/caffe2/local \
-DMPI_C_COMPILER=/path-to/openmpi-2.0.1/xl/bin/mpicc \
-DMPI_CXX_COMPILER=/path-to/openmpi-2.0.1/xl/bin/mpicxx
$ make -j 128 install
Build & install succeeded.
Training with 4 GPUs per node on 2 nodes now works. However, it is not stable:
INFO:resnet50_trainer:Finished iteration 980/5004 of epoch 0 (141.87 images/sec)
INFO:resnet50_trainer:Training loss: 1.72050333023, accuracy: 0.34375
INFO:resnet50_trainer:Finished iteration 981/5004 of epoch 0 (191.96 images/sec)
INFO:resnet50_trainer:Training loss: 1.78093731403, accuracy: 0.15625
INFO:resnet50_trainer:Finished iteration 982/5004 of epoch 0 (230.61 images/sec)
INFO:resnet50_trainer:Training loss: 1.73189878464, accuracy: 0.15625
INFO:resnet50_trainer:Finished iteration 983/5004 of epoch 0 (217.86 images/sec)
INFO:resnet50_trainer:Training loss: 1.74525475502, accuracy: 0.1875
INFO:resnet50_trainer:Finished iteration 984/5004 of epoch 0 (128.55 images/sec)
INFO:resnet50_trainer:Training loss: 1.71017813683, accuracy: 0.34375
INFO:resnet50_trainer:Finished iteration 985/5004 of epoch 0 (167.27 images/sec)
I found this issue:
https://stackoverflow.com/questions/45299351/caffe2-obtain-learning-rate-cant-find-blob-gpu-0-conv1-w-lr
I think it is caused by a version mismatch between resnet50_trainer.py and the installed Caffe2.
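In case patching the trainer is easier than rebuilding, here is a minimal sketch that reads learning-rate blobs straight from the workspace instead of going through GetLearningRateBlobNames; the gpu_<i>/<param>_lr naming is an assumption based on the Stack Overflow question above.

# Locate and read learning-rate blobs without data_parallel_model.GetLearningRateBlobNames.
from caffe2.python import workspace

def fetch_learning_rates():
    # Assumed naming: per-parameter LR blobs end with '_lr' (e.g. gpu_0/conv1_w_lr);
    # scan the workspace rather than hard-coding one parameter name.
    lr_blobs = [b for b in workspace.Blobs() if b.endswith('_lr')]
    return {b: workspace.FetchBlob(b) for b in lr_blobs}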
Same issue:
https://github.com/caffe2/caffe2/issues/616#issuecomment-321679877