rioyokotalab / caffe2

Caffe2 is a lightweight, modular, and scalable deep learning framework.
https://caffe2.ai

Gloo update problem caused by topology of IBM Power System S822LC for High Performance Computing ("Minsky") #14

Closed Hiroki11x closed 6 years ago

Hiroki11x commented 6 years ago
INFO:resnet50_trainer:Finished iteration 2501/2502 of epoch 0 (79.03 images/sec)
INFO:resnet50_trainer:Training loss: 0.432902753353, accuracy: 0.875
INFO:resnet50_trainer:Finished iteration 2502/2502 of epoch 0 (79.26 images/sec)
INFO:resnet50_trainer:Training loss: 0.462416082621, accuracy: 0.8125
Traceback (most recent call last):
  File "/home/hiroki11/caffe2/caffe2/python/examples/resnet50_trainer.py", line 462, in <module>
    main()
  File "/home/hiroki11/caffe2/caffe2/python/examples/resnet50_trainer.py", line 458, in main
    Train(args)
  File "/home/hiroki11/caffe2/caffe2/python/examples/resnet50_trainer.py", line 388, in Train
    explog
  File "/home/hiroki11/caffe2/caffe2/python/examples/resnet50_trainer.py", line 156, in RunEpoch
    learning_rate = workspace.FetchBlob(prefix + '/conv1_w_lr')
  File "/home/hiroki11/caffe2/build/caffe2/python/workspace.py", line 323, in FetchBlob
    return C.fetch_blob(StringifyBlobName(name))
RuntimeError: [enforce fail at pybind_state.cc:152] ws->HasBlob(name). Can't find blob: gpu_0/conv1_w_lr

I found this issue

https://stackoverflow.com/questions/45299351/caffe2-obtain-learning-rate-cant-find-blob-gpu-0-conv1-w-lr

I think it is caused by a difference in the Caffe2 (resnet50_trainer.py) version.

Same issue:

https://github.com/caffe2/caffe2/issues/616#issuecomment-321679877
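
As a stopgap, the failing line in RunEpoch could be guarded so a blob-name mismatch does not kill the run. A minimal sketch (a hypothetical patch, not the upstream fix; prefix is the trainer's local variable, "gpu_0" in the traceback above):

from caffe2.python import workspace

# Defensive variant of resnet50_trainer.py line 156: only fetch the
# learning-rate blob if it exists under the name this script expects.
blob_name = prefix + '/conv1_w_lr'
if workspace.HasBlob(blob_name):
    learning_rate = workspace.FetchBlob(blob_name)
else:
    learning_rate = float('nan')  # blob naming differs in newer Caffe2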

Hiroki11x commented 6 years ago
Traceback (most recent call last):
  File "/home/hiroki11/models/train/src/resnet50_trainer.py", line 463, in <module>
    main()
  File "/home/hiroki11/models/train/src/resnet50_trainer.py", line 459, in main
    Train(args)
  File "/home/hiroki11/models/train/src/resnet50_trainer.py", line 389, in Train
    explog
  File "/home/hiroki11/models/train/src/resnet50_trainer.py", line 154, in RunEpoch
    data_parallel_model.GetLearningRateBlobNames(train_model)[0]
AttributeError: 'module' object has no attribute 'GetLearningRateBlobNames'

It seems to depend on the Caffe2 version.
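
One way to make the script tolerate both Caffe2 revisions is to guard the call. A minimal sketch (train_model is the trainer's model helper, assumed in scope as in RunEpoch; the fallback name is the one from the first traceback):

from caffe2.python import data_parallel_model

# Fall back to the old hard-coded blob name when the installed
# data_parallel_model predates GetLearningRateBlobNames.
if hasattr(data_parallel_model, 'GetLearningRateBlobNames'):
    lr_blob = data_parallel_model.GetLearningRateBlobNames(train_model)[0]
else:
    lr_blob = 'gpu_0/conv1_w_lr'  # naming assumed by the older trainer script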

Hiroki11x commented 6 years ago
Traceback (most recent call last):
  File "/home/hiroki11/models/train/src/resnet50_trainer.py", line 463, in <module>
    main()
  File "/home/hiroki11/models/train/src/resnet50_trainer.py", line 459, in main
    Train(args)
  File "/home/hiroki11/models/train/src/resnet50_trainer.py", line 351, in Train
    workspace.RunNetOnce(train_model.param_init_net)
  File "/home/hiroki11/caffe2/build/caffe2/python/workspace.py", line 183, in RunNetOnce
    StringifyProto(net),
  File "/home/hiroki11/caffe2/build/caffe2/python/workspace.py", line 166, in CallWithExceptionIntercept
    return func(*args, **kwargs)
RuntimeError: [enforce fail at /home/hiroki11/caffe2/third_party/gloo/gloo/cuda_collectives_device.h:199] 1 == canAccessPeer. 1 vs 0. GPU 0 does not have peer access to GPU 2

Does this problem depend on the IBM Minsky architecture?

[screenshot: 2017-08-08 at 10:19:44]
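
To confirm the topology on the Minsky node, the peer-access matrix that data_parallel_model consults can be printed from Python. A minimal sketch (assumes GetCudaPeerAccessPattern is available in this Caffe2 revision):

import numpy as np
from caffe2.python import workspace

# Print the GPU peer-access matrix. On Minsky the expected pattern is two
# NVLink islands, {0,1} and {2,3}, with no peer access across the two
# POWER8 sockets -- matching the "GPU 0 does not have peer access to GPU 2"
# failure above.
print(np.asarray(workspace.GetCudaPeerAccessPattern()).astype(int))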
Hiroki11x commented 6 years ago
BS=32
NUM_GPUS=2
NUM_SHARDS=2
BS_PER_NODE=`expr $BS '*' $NUM_GPUS`
TRAIN_DATA=/path-to/ilsvrc12_train_lmdb

python \
          /home/hiroki11/models/train/src/resnet50_trainer.py  \
          --train_data $TRAIN_DATA \
          --num_gpus $NUM_GPUS \
          --num_shards $NUM_SHARDS \
          --shard_id $1 \
          --file_store_path $RENDEZVOUS \
          --image_size 224 \
          --batch_size $BS_PER_NODE \
          --epoch_size 1281167 \
          --num_epochs 2 \
          --run_id 1081 \
          --base_learning_rate 1.0 \
          --weight_decay 0.0001

If I change NUM_GPUS (the number of GPUs per node) from 2 to 4, the training job runs:

INFO:resnet50_trainer:Finished iteration 1428/10009 of epoch 1 (168.60 images/sec)
INFO:resnet50_trainer:Training loss: 0.512093663216, accuracy: 0.8125
INFO:resnet50_trainer:Finished iteration 1429/10009 of epoch 1 (168.86 images/sec)
INFO:resnet50_trainer:Training loss: 0.663730919361, accuracy: 0.75
Hiroki11x commented 6 years ago

It seems that gloo/cuda_collectives_device.h did not exist until recently (it was added last week). Rather than Caffe2, should I open an issue against Gloo, or send a PR there?

https://github.com/caffe2/caffe2/commit/3a31edb1bfcae07dff2194f044382a197584fe95#diff-15230e5dad94ffaefa7b526c0c1e89e5

Hiroki11x commented 6 years ago

It is caused by the Gloo update brought in by https://github.com/caffe2/caffe2/pull/1023:

[before] https://github.com/facebookincubator/gloo/tree/7ea9d9af4e82d20c7c6cee5edd3c52f9bcb42821
[after] https://github.com/facebookincubator/gloo/tree/530878247b04c423fd35477208f68e70b8126e2d

Hiroki11x commented 6 years ago

Same issue: https://github.com/caffe2/caffe2/issues/918

Hiroki11x commented 6 years ago
python   resnet50_trainer.py  \
--train_data ilsvrc12_train_lmdb \
--num_gpus 4  \
--num_shards 1  \
--file_store_path log/hoge \
--image_size 224   \
--batch_size 128   \
--epoch_size 1281167  \
--num_epochs 2 \
--base_learning_rate 1.0 \
--weight_decay 0.0001

Training on 1 node with 4 GPUs (batch_size 128 = 32 per GPU * 4 GPUs) was working.

Hiroki11x commented 6 years ago
$ cd $CAFFE2_HOME
$ git reset --hard a31338b1cd98befcd3668168c533c6c107e7460
HEAD is now at a31338b Support a build script for Tizen target
$ rm -rf build
$ make clean
$ cd third_party/gloo
$ git reset --hard 7ea9d9af4e82d20c7c6cee5edd3c52f9bcb42821
HEAD is now at 7ea9d9a Fix build when included by another project; take 2
$ cd $CAFFE2_HOME
$ vim third_party/gloo/gloo/transport/tcp/device.cc

- static const std::chrono::seconds kTimeoutDefault = std::chrono::seconds(30);
+ static const std::chrono::seconds kTimeoutDefault = std::chrono::seconds(180); 

$ vim caffe2/distributed/store_handler.h 

-static constexpr std::chrono::milliseconds kDefaultTimeout =
-      std::chrono::seconds(30);
+static constexpr std::chrono::milliseconds kDefaultTimeout =
+      std::chrono::seconds(180);

$ ./cudnn_update.sh
$ mkdir build && cd build
$ CMAKE_PREFIX_PATH=/path-to/protobuf-3.2.0:/path-to/cuda_8_opencv-2.4.13:/path-to/snappy_1.1.4 cmake .. \
-DBLAS=Eigen \
-DUSE_CUDA=ON \
-DUSE_ROCKSDB=OFF \
-DUSE_GLOO=ON \
-DUSE_REDIS=ON \
-DUSE_OPENCV=ON \
-DUSE_GFLAGS=OFF \
-DCUDNN_INCLUDE_DIR=/path-to/cuda/include \
-DCUDNN_LIBRARY=/path-to/cuda/lib/libcudnn.so \
-DCMAKE_INSTALL_PREFIX=/path-to/caffe2/local \
-DMPI_C_COMPILER=/path-to/openmpi-2.0.1/xl/bin/mpicc \
-DMPI_CXX_COMPILER=/path-to/openmpi-2.0.1/xl/bin/mpicxx
$ make -j 128 install

Build and install succeeded.
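
A quick sanity check of the rebuilt install (a minimal sketch; assumes the freshly installed caffe2 package is the one picked up on PYTHONPATH):

from caffe2.python import workspace

# Confirm the rebuilt Caffe2 was compiled with GPU support and sees the GPUs.
print('has_gpu_support:', workspace.has_gpu_support)
print('NumCudaDevices :', workspace.NumCudaDevices())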

Hiroki11x commented 6 years ago

Training with 4 GPUs per node across 2 nodes worked. However, it is not stable:

INFO:resnet50_trainer:Finished iteration 980/5004 of epoch 0 (141.87 images/sec)
INFO:resnet50_trainer:Training loss: 1.72050333023, accuracy: 0.34375
INFO:resnet50_trainer:Finished iteration 981/5004 of epoch 0 (191.96 images/sec)
INFO:resnet50_trainer:Training loss: 1.78093731403, accuracy: 0.15625
INFO:resnet50_trainer:Finished iteration 982/5004 of epoch 0 (230.61 images/sec)
INFO:resnet50_trainer:Training loss: 1.73189878464, accuracy: 0.15625
INFO:resnet50_trainer:Finished iteration 983/5004 of epoch 0 (217.86 images/sec)
INFO:resnet50_trainer:Training loss: 1.74525475502, accuracy: 0.1875
INFO:resnet50_trainer:Finished iteration 984/5004 of epoch 0 (128.55 images/sec)
INFO:resnet50_trainer:Training loss: 1.71017813683, accuracy: 0.34375
INFO:resnet50_trainer:Finished iteration 985/5004 of epoch 0 (167.27 images/sec)