I have a similar issue when trying to run the mpi-caffe CIFAR10 example:
$ cd /opt/mpi-caffe
$ ./data/cifar10/get_cifar10.sh
$ ./examples/cifar10/create_cifar10.sh
[OK] Running Caffe example:
$ ./examples/cifar10/train_quick.sh
Running mpi-caffe example:
$ mpirun -np 3 caffe train --solver=examples/cifar10-mpi/cifar10_mpi_solver.prototxt --gpu=0,1,2
I0819 11:11:32.198472 48673 caffe.cpp:217] Using GPUs 0, 1, 2
I0819 11:11:32.198756 48674 caffe.cpp:217] Using GPUs 0, 1, 2
I0819 11:11:32.198726 48675 caffe.cpp:217] Using GPUs 0, 1, 2
I0819 11:11:38.527840 48674 caffe.cpp:222] GPU 0: Tesla K80
I0819 11:11:38.529520 48675 caffe.cpp:222] GPU 0: Tesla K80
I0819 11:11:38.529727 48674 caffe.cpp:222] GPU 1: Tesla K80
I0819 11:11:38.529996 48673 caffe.cpp:222] GPU 0: Tesla K80
I0819 11:11:38.531278 48675 caffe.cpp:222] GPU 1: Tesla K80
I0819 11:11:38.531546 48674 caffe.cpp:222] GPU 2: Tesla K80
I0819 11:11:38.531757 48673 caffe.cpp:222] GPU 1: Tesla K80
I0819 11:11:38.532991 48675 caffe.cpp:222] GPU 2: Tesla K80
I0819 11:11:38.533526 48673 caffe.cpp:222] GPU 2: Tesla K80
I0819 11:11:39.190536 48675 solver.cpp:48] Initializing solver from parameters:
test_iter: 100
test_interval: 500
base_lr: 0.001
display: 100
max_iter: 4000
lr_policy: "fixed"
momentum: 0.9
weight_decay: 0.004
snapshot: 4000
snapshot_prefix: "examples/cifar10-mpi/cifar10_mpi"
solver_mode: GPU
device_id: 0
net: "examples/cifar10-mpi/cifar10_mpi_train_test.prototxt"
train_state {
level: 0
stage: ""
}
snapshot_format: HDF5
I0819 11:11:39.191018 48675 solver.cpp:91] Creating training net from net file: examples/cifar10-mpi/cifar10_mpi_train_test.prototxt
[libprotobuf ERROR google/protobuf/text_format.cc:245] Error parsing text-format caffe.NetParameter: 9:13: Message type "caffe.NetStateRule" has no field named "mpi_rank".
F0819 11:11:39.191181 48675 upgrade_proto.cpp:79] Check failed: ReadProtoFromTextFile(param_file, param) Failed to parse NetParameter file: examples/cifar10-mpi/cifar10_mpi_train_test.prototxt
*** Check failure stack trace: ***
I0819 11:11:39.191946 48674 solver.cpp:48] Initializing solver from parameters:
test_iter: 100
test_interval: 500
base_lr: 0.001
display: 100
max_iter: 4000
lr_policy: "fixed"
momentum: 0.9
weight_decay: 0.004
snapshot: 4000
snapshot_prefix: "examples/cifar10-mpi/cifar10_mpi"
solver_mode: GPU
device_id: 0
net: "examples/cifar10-mpi/cifar10_mpi_train_test.prototxt"
train_state {
level: 0
stage: ""
}
snapshot_format: HDF5
@ 0x7fd571472e6d (unknown)
I0819 11:11:39.192312 48674 solver.cpp:91] Creating training net from net file: examples/cifar10-mpi/cifar10_mpi_train_test.prototxt
[libprotobuf ERROR google/protobuf/text_format.cc:245] Error parsing text-format caffe.NetParameter: 9:13: Message type "caffe.NetStateRule" has no field named "mpi_rank".
@ 0x7fd571474ced (unknown)
F0819 11:11:39.192720 48674 upgrade_proto.cpp:79] Check failed: ReadProtoFromTextFile(param_file, param) Failed to parse NetParameter file: examples/cifar10-mpi/cifar10_mpi_train_test.prototxt
@ 0x7fd571472a5c (unknown)
*** Check failure stack trace: ***
@ 0x7fd57147563e (unknown)
@ 0x7f545b044e6d (unknown)
I0819 11:11:39.193928 48673 solver.cpp:48] Initializing solver from parameters:
test_iter: 100
test_interval: 500
base_lr: 0.001
display: 100
max_iter: 4000
lr_policy: "fixed"
momentum: 0.9
weight_decay: 0.004
snapshot: 4000
snapshot_prefix: "examples/cifar10-mpi/cifar10_mpi"
solver_mode: GPU
device_id: 0
net: "examples/cifar10-mpi/cifar10_mpi_train_test.prototxt"
train_state {
level: 0
stage: ""
}
snapshot_format: HDF5
@ 0x7f545b046ced (unknown)
I0819 11:11:39.194243 48673 solver.cpp:91] Creating training net from net file: examples/cifar10-mpi/cifar10_mpi_train_test.prototxt
@ 0x7fd576d4ce0e caffe::ReadNetParamsFromTextFileOrDie()
[libprotobuf ERROR google/protobuf/text_format.cc:245] Error parsing text-format caffe.NetParameter: 9:13: Message type "caffe.NetStateRule" has no field named "mpi_rank".
@ 0x7f545b044a5c (unknown)
F0819 11:11:39.194380 48673 upgrade_proto.cpp:79] Check failed: ReadProtoFromTextFile(param_file, param) Failed to parse NetParameter file: examples/cifar10-mpi/cifar10_mpi_train_test.prototxt
*** Check failure stack trace: ***
@ 0x7fd576d080ab caffe::Solver<>::InitTrainNet()
@ 0x7f545b04763e (unknown)
@ 0x7f37fa5fce6d (unknown)
@ 0x7fd576d0917c caffe::Solver<>::Init()
@ 0x7f546091ee0e caffe::ReadNetParamsFromTextFileOrDie()
@ 0x7f37fa5feced (unknown)
@ 0x7fd576d094aa caffe::Solver<>::Solver()
@ 0x7f37fa5fca5c (unknown)
@ 0x7f54608da0ab caffe::Solver<>::InitTrainNet()
@ 0x7f37fa5ff63e (unknown)
@ 0x7fd576d21113 caffe::Creator_SGDSolver<>()
@ 0x40f4ae caffe::SolverRegistry<>::CreateSolver()
@ 0x408512 train()
@ 0x405f9c main
@ 0x7f54608db17c caffe::Solver<>::Init()
@ 0x7fd56803fb15 __libc_start_main
@ 0x7f37ffed6e0e caffe::ReadNetParamsFromTextFileOrDie()
@ 0x40680d (unknown)
@ 0x7f54608db4aa caffe::Solver<>::Solver()
@ 0x7f37ffe920ab caffe::Solver<>::InitTrainNet()
@ 0x7f54608f3113 caffe::Creator_SGDSolver<>()
@ 0x40f4ae caffe::SolverRegistry<>::CreateSolver()
@ 0x408512 train()
@ 0x405f9c main
@ 0x7f37ffe9317c caffe::Solver<>::Init()
@ 0x7f5451c11b15 __libc_start_main
@ 0x40680d (unknown)
@ 0x7f37ffe934aa caffe::Solver<>::Solver()
@ 0x7f37ffeab113 caffe::Creator_SGDSolver<>()
@ 0x40f4ae caffe::SolverRegistry<>::CreateSolver()
@ 0x408512 train()
@ 0x405f9c main
@ 0x7f37f11c9b15 __libc_start_main
@ 0x40680d (unknown)
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= EXIT CODE: 6
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Aborted (signal 6)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
$
Thanks!
It looks like you are running the wrong caffe executable. I made a fresh install of mpi-caffe and ran the example. The first few lines of output are shown below (including the MPI init):
$ mpirun -np 3 ./build/tools/caffe train --solver=examples/cifar10-mpi/cifar10_mpi_solver.prototxt -gpu=0,1,1
I0819 13:01:28.883777 15578 caffe.cpp:435] Initialized MPI environment. Process rank 0 with 3 processes participating.
I0819 13:01:28.883803 15579 caffe.cpp:435] Initialized MPI environment. Process rank 1 with 3 processes participating.
I0819 13:01:28.883860 15580 caffe.cpp:435] Initialized MPI environment. Process rank 2 with 3 processes participating.
I0819 13:01:28.885848 15578 caffe.cpp:180] Using GPUs 0
I0819 13:01:28.885978 15580 caffe.cpp:180] Using GPUs 1
I0819 13:01:28.885964 15579 caffe.cpp:180] Using GPUs 1
Running the same command using a standard caffe install:
I0819 13:02:26.531957 15666 caffe.cpp:185] Using GPUs 0, 1, 1
I0819 13:02:26.535498 15665 caffe.cpp:185] Using GPUs 0, 1, 1
I0819 13:02:26.535899 15667 caffe.cpp:185] Using GPUs 0, 1, 1
I0819 13:02:26.554450 15666 caffe.cpp:190] GPU 0: Tesla K40c
I0819 13:02:26.555506 15666 caffe.cpp:190] GPU 1: Tesla K40c
I0819 13:02:26.556382 15666 caffe.cpp:190] GPU 1: Tesla K40c
I0819 13:02:26.561825 15665 caffe.cpp:190] GPU 0: Tesla K40c
I0819 13:02:26.570034 15667 caffe.cpp:190] GPU 0: Tesla K40c
I0819 13:02:26.571880 15665 caffe.cpp:190] GPU 1: Tesla K40c
I0819 13:02:26.572087 15667 caffe.cpp:190] GPU 1: Tesla K40c
I0819 13:02:26.573571 15665 caffe.cpp:190] GPU 1: Tesla K40c
I0819 13:02:26.581254 15667 caffe.cpp:190] GPU 1: Tesla K40c
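A quick way to see which binary a bare `caffe` resolves to, and whether it looks like an MPI build, is sketched below. The function name `check_mpi_linked` is hypothetical, and the check assumes that an mpi-caffe build dynamically links a library whose name contains `libmpi`; a statically linked build would not show up this way.

```shell
# Hypothetical helper: succeeds if the given binary lists an MPI library
# among its dynamically linked libraries (assumption: the name contains "libmpi").
check_mpi_linked() {
    ldd "$1" 2>/dev/null | grep -qi 'libmpi'
}

# Resolve the binary that a bare `caffe` command would run.
caffe_bin="$(command -v caffe || true)"

if [ -z "$caffe_bin" ]; then
    echo "caffe not found on PATH"
elif check_mpi_linked "$caffe_bin"; then
    echo "$caffe_bin appears to link against MPI"
else
    echo "$caffe_bin does not appear to link against MPI"
fi
```

If the binary on the PATH is the standard install, calling the mpi-caffe build by its explicit path (`./build/tools/caffe` from the mpi-caffe directory, as in the command above) avoids the ambiguity. The "Initialized MPI environment" lines at the top of the log are another sign that the MPI build is the one running.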
Can you check if this solves your issue? If not, I can look into it more.
You're right, thanks!
I am currently trying to install mpi-caffe. I was able to install the original version of caffe, but compiling the MPI package fails. Here's the error:
./include/caffe/layers/mpi_base_layer.hpp(24): error: class "caffe::LayerParameter" has no member "mpi_param"
      this->comm = (MPI_Comm)param.mpi_param().comm_id();
                                   ^
          detected during instantiation of "caffe::MPIBroadcastLayer::MPIBroadcastLayer(const caffe::LayerParameter &) [with Dtype=float]" at line 121 of "src/caffe/layers/mpi_broadcast_layer.cpp"
./include/caffe/layers/mpi_base_layer.hpp(25): error: class "caffe::LayerParameter" has no member "mpi_param"
      this->group = (MPI_Group)param.mpi_param().group_id();
                                     ^
          detected during instantiation of "caffe::MPIBroadcastLayer::MPIBroadcastLayer(const caffe::LayerParameter &) [with Dtype=float]" at line 121 of "src/caffe/layers/mpi_broadcast_layer.cpp"
./include/caffe/layers/mpi_base_layer.hpp(36): error: class "caffe::LayerParameter" has no member "mpi_param"
      int old_src = param.mpi_param().root();
                          ^
          detected during instantiation of "caffe::MPIBroadcastLayer::MPIBroadcastLayer(const caffe::LayerParameter &) [with Dtype=float]" at line 121 of "src/caffe/layers/mpi_broadcast_layer.cpp"
compilation aborted for src/caffe/layers/mpi_broadcast_layer.cpp (code 2)
I have not yet been able to figure out what the problem is, and I would greatly appreciate your input.
I will also add that I am running this on a Cray computer, using MPICH, which is Cray's standard MPI, and the standard compiler wrapper CC.
Thanks