steflee / mpi-caffe

mpi-caffe
http://homes.soic.indiana.edu/steflee/mpi-caffe.html

compiler error: class has no member mpi_param #1

Closed. michelwandermaas closed this issue 7 years ago

michelwandermaas commented 8 years ago

I am currently trying to install mpi-caffe. Although I was able to install the original version of Caffe, compiling the MPI package fails. Here's the error:

./include/caffe/layers/mpi_base_layer.hpp(24): error: class "caffe::LayerParameter" has no member "mpi_param"
    this->comm = (MPI_Comm)param.mpi_param().comm_id();
                                 ^
    detected during instantiation of "caffe::MPIBroadcastLayer::MPIBroadcastLayer(const caffe::LayerParameter &) [with Dtype=float]" at line 121 of "src/caffe/layers/mpi_broadcast_layer.cpp"

./include/caffe/layers/mpi_base_layer.hpp(25): error: class "caffe::LayerParameter" has no member "mpi_param"
    this->group = (MPI_Group)param.mpi_param().group_id();
                                   ^
    detected during instantiation of "caffe::MPIBroadcastLayer::MPIBroadcastLayer(const caffe::LayerParameter &) [with Dtype=float]" at line 121 of "src/caffe/layers/mpi_broadcast_layer.cpp"

./include/caffe/layers/mpi_base_layer.hpp(36): error: class "caffe::LayerParameter" has no member "mpi_param"
    int old_src = param.mpi_param().root();
                        ^
    detected during instantiation of "caffe::MPIBroadcastLayer::MPIBroadcastLayer(const caffe::LayerParameter &) [with Dtype=float]" at line 121 of "src/caffe/layers/mpi_broadcast_layer.cpp"

compilation aborted for src/caffe/layers/mpi_broadcast_layer.cpp (code 2)

I have not yet been able to figure out what the problem is, and I would greatly appreciate your input on it.

I will also add that I am running this on a Cray computer, using MPICH, which is Cray's standard MPI, and the standard compiler wrapper CC.

Thanks
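Editorial note: `caffe::LayerParameter` is a class that protoc generates from `caffe.proto`, so one way to narrow down errors like the ones above is to check whether both the `.proto` file and the generated header the compiler sees mention `mpi_param`. The sketch below only greps for the name. The two paths are assumptions based on a typical Makefile build of Caffe and may differ in your tree, and the helper `mentions_mpi_param` is a hypothetical name, not part of mpi-caffe.

```shell
#!/bin/sh
# Sketch: report whether a file mentions the mpi_param field.
# The helper name and the paths below are illustrative assumptions.

# Print "yes" if file $1 exists and mentions mpi_param, "no" otherwise.
mentions_mpi_param() {
    if [ -f "$1" ] && grep -q "mpi_param" "$1"; then
        echo yes
    else
        echo no
    fi
}

# Assumed locations for a Makefile build; adjust to your checkout.
for f in src/caffe/proto/caffe.proto build/src/caffe/proto/caffe.pb.h; do
    echo "$f: $(mentions_mpi_param "$f")"
done
```

If the `.proto` reports yes but the header reports no, the header may be stale or may come from a different Caffe tree (for example, an include path from the original Caffe install being searched first). This is only a diagnostic idea; the thread does not confirm the cause.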

tschaffter commented 7 years ago

I have a similar issue when trying to run the mpi-caffe CIFAR10 example:

$ cd /opt/mpi-caffe
$ ./data/cifar10/get_cifar10.sh
$ ./examples/cifar10/create_cifar10.sh

[OK] Running the Caffe example:

$ ./examples/cifar10/train_quick.sh

Running the mpi-caffe example:

$ mpirun -np 3 caffe train --solver=examples/cifar10-mpi/cifar10_mpi_solver.prototxt --gpu=0,1,2
I0819 11:11:32.198472 48673 caffe.cpp:217] Using GPUs 0, 1, 2
I0819 11:11:32.198756 48674 caffe.cpp:217] Using GPUs 0, 1, 2
I0819 11:11:32.198726 48675 caffe.cpp:217] Using GPUs 0, 1, 2
I0819 11:11:38.527840 48674 caffe.cpp:222] GPU 0: Tesla K80
I0819 11:11:38.529520 48675 caffe.cpp:222] GPU 0: Tesla K80
I0819 11:11:38.529727 48674 caffe.cpp:222] GPU 1: Tesla K80
I0819 11:11:38.529996 48673 caffe.cpp:222] GPU 0: Tesla K80
I0819 11:11:38.531278 48675 caffe.cpp:222] GPU 1: Tesla K80
I0819 11:11:38.531546 48674 caffe.cpp:222] GPU 2: Tesla K80
I0819 11:11:38.531757 48673 caffe.cpp:222] GPU 1: Tesla K80
I0819 11:11:38.532991 48675 caffe.cpp:222] GPU 2: Tesla K80
I0819 11:11:38.533526 48673 caffe.cpp:222] GPU 2: Tesla K80
I0819 11:11:39.190536 48675 solver.cpp:48] Initializing solver from parameters: 
test_iter: 100
test_interval: 500
base_lr: 0.001
display: 100
max_iter: 4000
lr_policy: "fixed"
momentum: 0.9
weight_decay: 0.004
snapshot: 4000
snapshot_prefix: "examples/cifar10-mpi/cifar10_mpi"
solver_mode: GPU
device_id: 0
net: "examples/cifar10-mpi/cifar10_mpi_train_test.prototxt"
train_state {
  level: 0
  stage: ""
}
snapshot_format: HDF5
I0819 11:11:39.191018 48675 solver.cpp:91] Creating training net from net file: examples/cifar10-mpi/cifar10_mpi_train_test.prototxt
[libprotobuf ERROR google/protobuf/text_format.cc:245] Error parsing text-format caffe.NetParameter: 9:13: Message type "caffe.NetStateRule" has no field named "mpi_rank".
F0819 11:11:39.191181 48675 upgrade_proto.cpp:79] Check failed: ReadProtoFromTextFile(param_file, param) Failed to parse NetParameter file: examples/cifar10-mpi/cifar10_mpi_train_test.prototxt
*** Check failure stack trace: ***
I0819 11:11:39.191946 48674 solver.cpp:48] Initializing solver from parameters: 
test_iter: 100
test_interval: 500
base_lr: 0.001
display: 100
max_iter: 4000
lr_policy: "fixed"
momentum: 0.9
weight_decay: 0.004
snapshot: 4000
snapshot_prefix: "examples/cifar10-mpi/cifar10_mpi"
solver_mode: GPU
device_id: 0
net: "examples/cifar10-mpi/cifar10_mpi_train_test.prototxt"
train_state {
  level: 0
  stage: ""
}
snapshot_format: HDF5
    @     0x7fd571472e6d  (unknown)
I0819 11:11:39.192312 48674 solver.cpp:91] Creating training net from net file: examples/cifar10-mpi/cifar10_mpi_train_test.prototxt
[libprotobuf ERROR google/protobuf/text_format.cc:245] Error parsing text-format caffe.NetParameter: 9:13: Message type "caffe.NetStateRule" has no field named "mpi_rank".
    @     0x7fd571474ced  (unknown)
F0819 11:11:39.192720 48674 upgrade_proto.cpp:79] Check failed: ReadProtoFromTextFile(param_file, param) Failed to parse NetParameter file: examples/cifar10-mpi/cifar10_mpi_train_test.prototxt
    @     0x7fd571472a5c  (unknown)
*** Check failure stack trace: ***
    @     0x7fd57147563e  (unknown)
    @     0x7f545b044e6d  (unknown)
I0819 11:11:39.193928 48673 solver.cpp:48] Initializing solver from parameters: 
test_iter: 100
test_interval: 500
base_lr: 0.001
display: 100
max_iter: 4000
lr_policy: "fixed"
momentum: 0.9
weight_decay: 0.004
snapshot: 4000
snapshot_prefix: "examples/cifar10-mpi/cifar10_mpi"
solver_mode: GPU
device_id: 0
net: "examples/cifar10-mpi/cifar10_mpi_train_test.prototxt"
train_state {
  level: 0
  stage: ""
}
snapshot_format: HDF5
    @     0x7f545b046ced  (unknown)
I0819 11:11:39.194243 48673 solver.cpp:91] Creating training net from net file: examples/cifar10-mpi/cifar10_mpi_train_test.prototxt
    @     0x7fd576d4ce0e  caffe::ReadNetParamsFromTextFileOrDie()
[libprotobuf ERROR google/protobuf/text_format.cc:245] Error parsing text-format caffe.NetParameter: 9:13: Message type "caffe.NetStateRule" has no field named "mpi_rank".
    @     0x7f545b044a5c  (unknown)
F0819 11:11:39.194380 48673 upgrade_proto.cpp:79] Check failed: ReadProtoFromTextFile(param_file, param) Failed to parse NetParameter file: examples/cifar10-mpi/cifar10_mpi_train_test.prototxt
*** Check failure stack trace: ***
    @     0x7fd576d080ab  caffe::Solver<>::InitTrainNet()
    @     0x7f545b04763e  (unknown)
    @     0x7f37fa5fce6d  (unknown)
    @     0x7fd576d0917c  caffe::Solver<>::Init()
    @     0x7f546091ee0e  caffe::ReadNetParamsFromTextFileOrDie()
    @     0x7f37fa5feced  (unknown)
    @     0x7fd576d094aa  caffe::Solver<>::Solver()
    @     0x7f37fa5fca5c  (unknown)
    @     0x7f54608da0ab  caffe::Solver<>::InitTrainNet()
    @     0x7f37fa5ff63e  (unknown)
    @     0x7fd576d21113  caffe::Creator_SGDSolver<>()
    @           0x40f4ae  caffe::SolverRegistry<>::CreateSolver()
    @           0x408512  train()
    @           0x405f9c  main
    @     0x7f54608db17c  caffe::Solver<>::Init()
    @     0x7fd56803fb15  __libc_start_main
    @     0x7f37ffed6e0e  caffe::ReadNetParamsFromTextFileOrDie()
    @           0x40680d  (unknown)
    @     0x7f54608db4aa  caffe::Solver<>::Solver()
    @     0x7f37ffe920ab  caffe::Solver<>::InitTrainNet()
    @     0x7f54608f3113  caffe::Creator_SGDSolver<>()
    @           0x40f4ae  caffe::SolverRegistry<>::CreateSolver()
    @           0x408512  train()
    @           0x405f9c  main
    @     0x7f37ffe9317c  caffe::Solver<>::Init()
    @     0x7f5451c11b15  __libc_start_main
    @           0x40680d  (unknown)
    @     0x7f37ffe934aa  caffe::Solver<>::Solver()
    @     0x7f37ffeab113  caffe::Creator_SGDSolver<>()
    @           0x40f4ae  caffe::SolverRegistry<>::CreateSolver()
    @           0x408512  train()
    @           0x405f9c  main
    @     0x7f37f11c9b15  __libc_start_main
    @           0x40680d  (unknown)

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 6
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Aborted (signal 6)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
$

Thanks!

steflee commented 7 years ago

It looks like you are running the wrong caffe executable. I made a fresh install of mpi-caffe and ran the example. The first few lines of output are shown below (including the MPI init):

$ mpirun -np 3 ./build/tools/caffe train --solver=examples/cifar10-mpi/cifar10_mpi_solver.prototxt -gpu=0,1,1
I0819 13:01:28.883777 15578 caffe.cpp:435] Initialized MPI environment. Process rank 0 with 3 processes participating.
I0819 13:01:28.883803 15579 caffe.cpp:435] Initialized MPI environment. Process rank 1 with 3 processes participating.
I0819 13:01:28.883860 15580 caffe.cpp:435] Initialized MPI environment. Process rank 2 with 3 processes participating.
I0819 13:01:28.885848 15578 caffe.cpp:180] Using GPUs 0
I0819 13:01:28.885978 15580 caffe.cpp:180] Using GPUs 1
I0819 13:01:28.885964 15579 caffe.cpp:180] Using GPUs 1

Running the same command using a standard caffe install:

I0819 13:02:26.531957 15666 caffe.cpp:185] Using GPUs 0, 1, 1
I0819 13:02:26.535498 15665 caffe.cpp:185] Using GPUs 0, 1, 1
I0819 13:02:26.535899 15667 caffe.cpp:185] Using GPUs 0, 1, 1
I0819 13:02:26.554450 15666 caffe.cpp:190] GPU 0: Tesla K40c
I0819 13:02:26.555506 15666 caffe.cpp:190] GPU 1: Tesla K40c
I0819 13:02:26.556382 15666 caffe.cpp:190] GPU 1: Tesla K40c
I0819 13:02:26.561825 15665 caffe.cpp:190] GPU 0: Tesla K40c
I0819 13:02:26.570034 15667 caffe.cpp:190] GPU 0: Tesla K40c
I0819 13:02:26.571880 15665 caffe.cpp:190] GPU 1: Tesla K40c
I0819 13:02:26.572087 15667 caffe.cpp:190] GPU 1: Tesla K40c
I0819 13:02:26.573571 15665 caffe.cpp:190] GPU 1: Tesla K40c
I0819 13:02:26.581254 15667 caffe.cpp:190] GPU 1: Tesla K40c

Can you check if this solves your issue? If not, I can look into it more.
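Editorial note: the two outputs above differ in one visible way, which is that the mpi-caffe binary logs "Initialized MPI environment" at startup and the standard one does not. A minimal sketch of a check built on that observation follows. It assumes the run's log has been saved to a file; the function names `resolved_caffe` and `log_is_mpi_caffe` are hypothetical helpers, not part of either project.

```shell
#!/bin/sh
# Sketch: tell an mpi-caffe run from a standard Caffe run by its startup log.
# Assumption: mpi-caffe prints "Initialized MPI environment" at startup,
# as in the output quoted in this thread; standard Caffe does not.

# Print the path the shell would run for a bare `caffe`, or a note if none.
resolved_caffe() {
    command -v caffe || echo "caffe not found on PATH"
}

# Exit status 0 if the log file $1 contains the MPI init message.
log_is_mpi_caffe() {
    grep -q "Initialized MPI environment" "$1"
}

echo "bare 'caffe' resolves to: $(resolved_caffe)"
```

If the bare `caffe` resolves to something other than `./build/tools/caffe` inside the mpi-caffe checkout, invoking the binary by its explicit path, as in the `mpirun` command above, avoids the ambiguity. With a saved log (here assumed to be `run.log`), `log_is_mpi_caffe run.log && echo "mpi-caffe binary"` reports which one ran.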

tschaffter commented 7 years ago

You're right, thanks!