[Caffe2] Unable to use MPI rendezvous in Caffe2

Issue description

Unable to use MPI rendezvous in Caffe2.

I understand that this information may not be sufficient to diagnose the problem. Please tell me which further steps I should run to gather more information about the situation.

I am grateful for your help.

Code example

Details: For reproducibility, I am using a container built from the following Dockerfile:

FROM nvidia/cuda:8.0-cudnn7-devel-ubuntu16.04
LABEL maintainer="aaronmarkham@fb.com"

# caffe2 install with gpu support

RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    cmake \
    git \
    libgflags-dev \
    libgoogle-glog-dev \
    libgtest-dev \
    libiomp-dev \
    libleveldb-dev \
    liblmdb-dev \
    libopencv-dev \
    libprotobuf-dev \
    libsnappy-dev \
    protobuf-compiler \
    python-dev \
    python-numpy \
    python-pip \
    python-pydot \
    python-setuptools \
    python-scipy \
    wget \
    && rm -rf /var/lib/apt/lists/*

RUN wget -q http://www.mpich.org/static/downloads/3.1.4/mpich-3.1.4.tar.gz \
    && tar xf mpich-3.1.4.tar.gz \
    && cd mpich-3.1.4 \
    && ./configure --disable-fortran --enable-fast=all,O3 --prefix=/usr \
    && make -j$(nproc) \
    && make install \
    && ldconfig \
    && cd .. \
    && rm -rf mpich-3.1.4 \
    && rm mpich-3.1.4.tar.gz

RUN pip install --no-cache-dir --upgrade pip==9.0.3 setuptools wheel
RUN pip install --no-cache-dir \
    flask \
    future \
    graphviz \
    hypothesis \
    jupyter \
    matplotlib \
    numpy \
    protobuf \
    pydot \
    python-nvd3 \
    pyyaml \
    requests \
    scikit-image \
    scipy \
    setuptools \
    six \
    tornado

########## INSTALLATION STEPS ###################
RUN git clone --branch master --recursive https://github.com/pytorch/pytorch.git
RUN cd pytorch && mkdir build && cd build \
    && cmake .. \
    -DCUDA_ARCH_NAME=Manual \
    -DCUDA_ARCH_BIN="35 52 60 61" \
    -DCUDA_ARCH_PTX="61" \
    -DUSE_NNPACK=OFF \
    -DUSE_ROCKSDB=OFF \
    && make -j"$(nproc)" install \
    && ldconfig \
    && make clean \
    && cd .. \
    && rm -rf build

ENV PYTHONPATH /usr/local

The command:

srun -N 4 -n 4 -C gpu \
shifter run --mpi load/library/caffe2_container_diff \
python resnet50_trainer.py \
--train_data=$SCRATCH/caffe2_notebooks/tutorial_data/resnet_trainer/imagenet_cars_boats_train \
--test_data=$SCRATCH/caffe2_notebooks/tutorial_data/resnet_trainer/imagenet_cars_boats_val \
--db_type=lmdb \
--num_shards=4 \
--num_gpu=1 \
--num_labels=2 \
--batch_size=2 \
--epoch_size=150 \
--num_epochs=2 \
--distributed_transport ibverbs \
--distributed_interface mlx5_0

The output/error:

srun: job 9059937 queued and waiting for resources
srun: job 9059937 has been allocated resources
E0816 14:14:20.081552  7042 init_intrinsics_check.cc:43] CPU feature avx is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
E0816 14:14:20.081637  7042 init_intrinsics_check.cc:43] CPU feature avx2 is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
E0816 14:14:20.081642  7042 init_intrinsics_check.cc:43] CPU feature fma is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
E0816 14:14:20.083420  6442 init_intrinsics_check.cc:43] CPU feature avx is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
E0816 14:14:20.083504  6442 init_intrinsics_check.cc:43] CPU feature avx2 is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
E0816 14:14:20.083509  6442 init_intrinsics_check.cc:43] CPU feature fma is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
INFO:resnet50_trainer:Running on GPUs: [0]
INFO:resnet50_trainer:Using epoch size: 144
INFO:resnet50_trainer:Running on GPUs: [0]
INFO:resnet50_trainer:Using epoch size: 144
E0816 14:14:20.087043  5987 init_intrinsics_check.cc:43] CPU feature avx is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
E0816 14:14:20.087126  5987 init_intrinsics_check.cc:43] CPU feature avx2 is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
E0816 14:14:20.087131  5987 init_intrinsics_check.cc:43] CPU feature fma is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
INFO:resnet50_trainer:Running on GPUs: [0]
INFO:resnet50_trainer:Using epoch size: 144
INFO:data_parallel_model:Parallelizing model for devices: [0]
INFO:data_parallel_model:Create input and model training operators
INFO:data_parallel_model:Model for GPU : 0
INFO:data_parallel_model:Parallelizing model for devices: [0]
INFO:data_parallel_model:Create input and model training operators
INFO:data_parallel_model:Model for GPU : 0
E0816 14:14:20.102372 11086 init_intrinsics_check.cc:43] CPU feature avx is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
E0816 14:14:20.102452 11086 init_intrinsics_check.cc:43] CPU feature avx2 is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
E0816 14:14:20.102457 11086 init_intrinsics_check.cc:43] CPU feature fma is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
INFO:data_parallel_model:Parallelizing model for devices: [0]
INFO:data_parallel_model:Create input and model training operators
INFO:data_parallel_model:Model for GPU : 0
INFO:resnet50_trainer:Running on GPUs: [0]
INFO:resnet50_trainer:Using epoch size: 144
INFO:data_parallel_model:Parallelizing model for devices: [0]
INFO:data_parallel_model:Create input and model training operators
INFO:data_parallel_model:Model for GPU : 0
INFO:data_parallel_model:Adding gradient operators
INFO:data_parallel_model:Adding gradient operators
INFO:data_parallel_model:Adding gradient operators
INFO:data_parallel_model:Adding gradient operators
INFO:data_parallel_model:Add gradient all-reduces for SyncSGD
WARNING:data_parallel_model:Distributed broadcast of computed params is not implemented yet
INFO:data_parallel_model:Add gradient all-reduces for SyncSGD
INFO:data_parallel_model:Add gradient all-reduces for SyncSGD
WARNING:data_parallel_model:Distributed broadcast of computed params is not implemented yet
WARNING:data_parallel_model:Distributed broadcast of computed params is not implemented yet
INFO:data_parallel_model:Add gradient all-reduces for SyncSGD
WARNING:data_parallel_model:Distributed broadcast of computed params is not implemented yet
INFO:data_parallel_model:Post-iteration operators for updating params
INFO:data_parallel_model:Calling optimizer builder function
INFO:data_parallel_model:Post-iteration operators for updating params
INFO:data_parallel_model:Post-iteration operators for updating params
INFO:data_parallel_model:Calling optimizer builder function
INFO:data_parallel_model:Calling optimizer builder function
INFO:data_parallel_model:Post-iteration operators for updating params
INFO:data_parallel_model:Calling optimizer builder function
INFO:data_parallel_model:Add initial parameter sync
INFO:data_parallel_model:Add initial parameter sync
INFO:data_parallel_model:Add initial parameter sync
INFO:data_parallel_model:Add initial parameter sync
INFO:data_parallel_model:Creating barrier net
INFO:data_parallel_model:Creating barrier net
INFO:data_parallel_model:Creating barrier net
*** Aborted at 1534428860 (unix time) try "date -d @1534428860" if you are using GNU date ***
INFO:data_parallel_model:Creating barrier net
*** Aborted at 1534428860 (unix time) try "date -d @1534428860" if you are using GNU date ***
*** Aborted at 1534428860 (unix time) try "date -d @1534428860" if you are using GNU date ***
PC: @     0x2aaab0afb108 caffe2::ConvPoolOpBase<>::TensorInferenceForConv()
PC: @     0x2aaab0afb108 caffe2::ConvPoolOpBase<>::TensorInferenceForConv()
PC: @     0x2aaab0afb108 caffe2::ConvPoolOpBase<>::TensorInferenceForConv()
*** SIGSEGV (@0x8) received by PID 5987 (TID 0x2aaaaaae5480) from PID 8; stack trace: ***
    @     0x2aaaaace4390 (unknown)
    @     0x2aaab0afb108 caffe2::ConvPoolOpBase<>::TensorInferenceForConv()
*** SIGSEGV (@0x8) received by PID 7042 (TID 0x2aaaaaae5480) from PID 8; stack trace: ***
    @     0x2aaaaace4390 (unknown)
    @     0x2aaab0afb108 caffe2::ConvPoolOpBase<>::TensorInferenceForConv()
*** Aborted at 1534428860 (unix time) try "date -d @1534428860" if you are using GNU date ***
*** SIGSEGV (@0x8) received by PID 6442 (TID 0x2aaaaaae5480) from PID 8; stack trace: ***
    @     0x2aaaaace4390 (unknown)
    @     0x2aaab0afb108 caffe2::ConvPoolOpBase<>::TensorInferenceForConv()
    @     0x2aaab0af78d3 std::_Function_handler<>::_M_invoke()
PC: @     0x2aaab0afb108 caffe2::ConvPoolOpBase<>::TensorInferenceForConv()
    @     0x2aaab0af78d3 std::_Function_handler<>::_M_invoke()
    @     0x2aaab09e8094 caffe2::InferBlobShapesAndTypes()
    @     0x2aaab09e9659 caffe2::InferBlobShapesAndTypesFromMap()
    @     0x2aaab0af78d3 std::_Function_handler<>::_M_invoke()
    @     0x2aaab09e8094 caffe2::InferBlobShapesAndTypes()
    @     0x2aaab032588e _ZZN8pybind1112cpp_function10initializeIZN6caffe26python16addGlobalMethodsERNS_6moduleEEUlRKSt6vectorINS_5bytesESaIS7_EESt3mapINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES6_IlSaIlEESt4lessISI_ESaISt4pairIKSI_SK_EEEE36_S7_JSB_SR_EJNS_4nameENS_5scopeENS_7siblingEEEEvOT_PFT0_DpT1_EDpRKT2_ENUlRNS_6detail13function_callEE1_4_FUNES19_
    @     0x2aaab09e9659 caffe2::InferBlobShapesAndTypesFromMap()
    @     0x2aaab035273e pybind11::cpp_function::dispatcher()
    @           0x4bc3fa PyEval_EvalFrameEx
    @           0x4b9ab6 PyEval_EvalCodeEx
    @           0x4c16e7 PyEval_EvalFrameEx
    @           0x4b9ab6 PyEval_EvalCodeEx
    @           0x4c1e6f PyEval_EvalFrameEx
    @     0x2aaab032588e _ZZN8pybind1112cpp_function10initializeIZN6caffe26python16addGlobalMethodsERNS_6moduleEEUlRKSt6vectorINS_5bytesESaIS7_EESt3mapINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES6_IlSaIlEESt4lessISI_ESaISt4pairIKSI_SK_EEEE36_S7_JSB_SR_EJNS_4nameENS_5scopeENS_7siblingEEEEvOT_PFT0_DpT1_EDpRKT2_ENUlRNS_6detail13function_callEE1_4_FUNES19_
    @           0x4b9ab6 PyEval_EvalCodeEx
    @           0x4c1e6f PyEval_EvalFrameEx
    @           0x4b9ab6 PyEval_EvalCodeEx
    @           0x4c1e6f PyEval_EvalFrameEx
    @           0x4b9ab6 PyEval_EvalCodeEx
    @     0x2aaab09e8094 caffe2::InferBlobShapesAndTypes()
    @           0x4eb30f (unknown)
    @           0x4e5422 PyRun_FileExFlags
    @           0x4e3cd6 PyRun_SimpleFileExFlags
    @           0x493ae2 Py_Main
    @     0x2aaaaaf10830 __libc_start_main
    @           0x4933e9 _start
    @     0x2aaab09e9659 caffe2::InferBlobShapesAndTypesFromMap()
*** SIGSEGV (@0x8) received by PID 11086 (TID 0x2aaaaaae5480) from PID 8; stack trace: ***
    @     0x2aaab035273e pybind11::cpp_function::dispatcher()
    @                0x0 (unknown)
    @           0x4bc3fa PyEval_EvalFrameEx
    @           0x4b9ab6 PyEval_EvalCodeEx
    @           0x4c16e7 PyEval_EvalFrameEx
    @           0x4b9ab6 PyEval_EvalCodeEx
    @           0x4c1e6f PyEval_EvalFrameEx
    @           0x4b9ab6 PyEval_EvalCodeEx
    @     0x2aaab032588e _ZZN8pybind1112cpp_function10initializeIZN6caffe26python16addGlobalMethodsERNS_6moduleEEUlRKSt6vectorINS_5bytesESaIS7_EESt3mapINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES6_IlSaIlEESt4lessISI_ESaISt4pairIKSI_SK_EEEE36_S7_JSB_SR_EJNS_4nameENS_5scopeENS_7siblingEEEEvOT_PFT0_DpT1_EDpRKT2_ENUlRNS_6detail13function_callEE1_4_FUNES19_
    @     0x2aaaaace4390 (unknown)
    @           0x4c1e6f PyEval_EvalFrameEx
    @           0x4b9ab6 PyEval_EvalCodeEx
    @           0x4c1e6f PyEval_EvalFrameEx
    @           0x4b9ab6 PyEval_EvalCodeEx
    @           0x4eb30f (unknown)
    @           0x4e5422 PyRun_FileExFlags
    @           0x4e3cd6 PyRun_SimpleFileExFlags
    @           0x493ae2 Py_Main
    @     0x2aaaaaf10830 __libc_start_main
    @           0x4933e9 _start
    @     0x2aaab0afb108 caffe2::ConvPoolOpBase<>::TensorInferenceForConv()
    @                0x0 (unknown)
    @     0x2aaab035273e pybind11::cpp_function::dispatcher()
    @           0x4bc3fa PyEval_EvalFrameEx
    @           0x4b9ab6 PyEval_EvalCodeEx
    @           0x4c16e7 PyEval_EvalFrameEx
    @           0x4b9ab6 PyEval_EvalCodeEx
    @           0x4c1e6f PyEval_EvalFrameEx
    @           0x4b9ab6 PyEval_EvalCodeEx
    @           0x4c1e6f PyEval_EvalFrameEx
    @           0x4b9ab6 PyEval_EvalCodeEx
    @           0x4c1e6f PyEval_EvalFrameEx
    @           0x4b9ab6 PyEval_EvalCodeEx
    @           0x4eb30f (unknown)
    @           0x4e5422 PyRun_FileExFlags
    @           0x4e3cd6 PyRun_SimpleFileExFlags
    @           0x493ae2 Py_Main
    @     0x2aaaaaf10830 __libc_start_main
    @           0x4933e9 _start
    @                0x0 (unknown)
    @     0x2aaab0af78d3 std::_Function_handler<>::_M_invoke()
    @     0x2aaab09e8094 caffe2::InferBlobShapesAndTypes()
    @     0x2aaab09e9659 caffe2::InferBlobShapesAndTypesFromMap()
    @     0x2aaab032588e _ZZN8pybind1112cpp_function10initializeIZN6caffe26python16addGlobalMethodsERNS_6moduleEEUlRKSt6vectorINS_5bytesESaIS7_EESt3mapINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES6_IlSaIlEESt4lessISI_ESaISt4pairIKSI_SK_EEEE36_S7_JSB_SR_EJNS_4nameENS_5scopeENS_7siblingEEEEvOT_PFT0_DpT1_EDpRKT2_ENUlRNS_6detail13function_callEE1_4_FUNES19_
    @     0x2aaab035273e pybind11::cpp_function::dispatcher()
    @           0x4bc3fa PyEval_EvalFrameEx
    @           0x4b9ab6 PyEval_EvalCodeEx
    @           0x4c16e7 PyEval_EvalFrameEx
    @           0x4b9ab6 PyEval_EvalCodeEx
    @           0x4c1e6f PyEval_EvalFrameEx
    @           0x4b9ab6 PyEval_EvalCodeEx
    @           0x4c1e6f PyEval_EvalFrameEx
    @           0x4b9ab6 PyEval_EvalCodeEx
    @           0x4c1e6f PyEval_EvalFrameEx
    @           0x4b9ab6 PyEval_EvalCodeEx
    @           0x4eb30f (unknown)
    @           0x4e5422 PyRun_FileExFlags
    @           0x4e3cd6 PyRun_SimpleFileExFlags
    @           0x493ae2 Py_Main
    @     0x2aaaaaf10830 __libc_start_main
    @           0x4933e9 _start
    @                0x0 (unknown)
srun: error: nid06499: task 2: Segmentation fault
srun: Terminating job step 9059937.0
srun: error: nid06497: task 0: Segmentation fault
srun: error: nid06498: task 1: Segmentation fault
srun: error: nid06500: task 3: Segmentation fault

System Info

Caffe2:
How you installed Caffe2 (conda, pip, source): Modified Dockerfile mentioned above
CUDA/cuDNN version: 8.0/7.0
GPU models and configuration: Cray XC40/XC50 supercomputer, uses SLURM!

Two more points are worth mentioning. First, I have applied all the changes from a diff by @pietern, including the changes to the Gloo header files, but that does not resolve the problem.

Second, I have changed the code in resnet50_trainer.py from:

num_shards = args.num_shards
shard_id = args.shard_id

interfaces = args.distributed_interfaces.split(",")

# Rendezvous using MPI when run with mpirun
if os.getenv("OMPI_COMM_WORLD_SIZE") is not None:
    num_shards = int(os.getenv("OMPI_COMM_WORLD_SIZE", 1))
    shard_id = int(os.getenv("OMPI_COMM_WORLD_RANK", 0))

to:

num_shards = args.num_shards
shard_id = args.shard_id

interfaces = args.distributed_interfaces.split(",")

# Rendezvous using MPI when run with mpirun
#if os.getenv("OMPI_COMM_WORLD_SIZE") is not None:
if True:
    #num_shards = int(os.getenv("OMPI_COMM_WORLD_SIZE", 1))
    #shard_id = int(os.getenv("OMPI_COMM_WORLD_RANK", 0))
    shard_id = int(os.getenv("SLURM_PROCID", 0))
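
A less hard-coded variant of this rank lookup might look like the sketch below. It is only an illustration: shard_from_env is a hypothetical helper that does not exist in resnet50_trainer.py, and it assumes that srun exports SLURM_PROCID and SLURM_NTASKS and that Open MPI's mpirun exports the OMPI_COMM_WORLD_* variables. It falls back to the command-line values when neither launcher is detected.

```python
import os


def shard_from_env(default_id, default_count, env=None):
    """Return (shard_id, num_shards) taken from launcher environment variables.

    Hypothetical helper: falls back to the given defaults (for example
    args.shard_id and args.num_shards) when no known launcher is detected.
    """
    env = os.environ if env is None else env
    # (rank variable, world-size variable) pairs, checked in this order.
    candidates = [
        ("OMPI_COMM_WORLD_RANK", "OMPI_COMM_WORLD_SIZE"),  # Open MPI mpirun
        ("SLURM_PROCID", "SLURM_NTASKS"),                  # SLURM srun
    ]
    for rank_var, size_var in candidates:
        if rank_var in env and size_var in env:
            return int(env[rank_var]), int(env[size_var])
    return default_id, default_count
```

In the trainer this could be called as shard_id, num_shards = shard_from_env(args.shard_id, args.num_shards), so that --num_shards no longer has to match the srun task count by hand.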

Before I replaced the OMPI_COMM_WORLD_* check with SLURM_PROCID, the error was:

E0817 11:20:18.277760 29415 init_intrinsics_check.cc:43] CPU feature avx is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
E0817 11:20:18.277838 29415 init_intrinsics_check.cc:43] CPU feature avx2 is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
E0817 11:20:18.277843 29415 init_intrinsics_check.cc:43] CPU feature fma is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
INFO:resnet50_trainer:Running on GPUs: [0]
E0817 11:20:18.278333 15794 init_intrinsics_check.cc:43] CPU feature avx is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
E0817 11:20:18.278412 15794 init_intrinsics_check.cc:43] CPU feature avx2 is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
E0817 11:20:18.278416 15794 init_intrinsics_check.cc:43] CPU feature fma is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
INFO:resnet50_trainer:Running on GPUs: [0]
INFO:resnet50_trainer:Using epoch size: 1000
INFO:resnet50_trainer:Using epoch size: 1000
INFO:data_parallel_model:Parallelizing model for devices: [0]
INFO:data_parallel_model:Create input and model training operators
INFO:data_parallel_model:Model for GPU : 0
INFO:data_parallel_model:Parallelizing model for devices: [0]
INFO:data_parallel_model:Create input and model training operators
INFO:data_parallel_model:Model for GPU : 0
INFO:data_parallel_model:Adding gradient operators
INFO:data_parallel_model:Adding gradient operators
E0817 11:20:18.365449 16188 init_intrinsics_check.cc:43] CPU feature avx is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
E0817 11:20:18.365527 16188 init_intrinsics_check.cc:43] CPU feature avx2 is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
E0817 11:20:18.365532 16188 init_intrinsics_check.cc:43] CPU feature fma is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
INFO:resnet50_trainer:Running on GPUs: [0]
INFO:resnet50_trainer:Using epoch size: 1000
E0817 11:20:18.377269 14509 init_intrinsics_check.cc:43] CPU feature avx is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
E0817 11:20:18.377346 14509 init_intrinsics_check.cc:43] CPU feature avx2 is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
E0817 11:20:18.377351 14509 init_intrinsics_check.cc:43] CPU feature fma is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
INFO:resnet50_trainer:Running on GPUs: [0]
INFO:resnet50_trainer:Using epoch size: 1000
INFO:data_parallel_model:Parallelizing model for devices: [0]
INFO:data_parallel_model:Create input and model training operators
INFO:data_parallel_model:Model for GPU : 0
INFO:data_parallel_model:Parallelizing model for devices: [0]
INFO:data_parallel_model:Create input and model training operators
INFO:data_parallel_model:Model for GPU : 0
INFO:data_parallel_model:Add gradient all-reduces for SyncSGD
INFO:data_parallel_model:Add gradient all-reduces for SyncSGD
WARNING:data_parallel_model:Distributed broadcast of computed params is not implemented yet
WARNING:data_parallel_model:Distributed broadcast of computed params is not implemented yet
INFO:data_parallel_model:Post-iteration operators for updating params
INFO:data_parallel_model:Post-iteration operators for updating params
INFO:data_parallel_model:Calling optimizer builder function
INFO:data_parallel_model:Calling optimizer builder function
INFO:data_parallel_model:Adding gradient operators
INFO:data_parallel_model:Adding gradient operators
INFO:data_parallel_model:Add gradient all-reduces for SyncSGD
WARNING:data_parallel_model:Distributed broadcast of computed params is not implemented yet
INFO:data_parallel_model:Add gradient all-reduces for SyncSGD
WARNING:data_parallel_model:Distributed broadcast of computed params is not implemented yet
INFO:data_parallel_model:Add initial parameter sync
INFO:data_parallel_model:Post-iteration operators for updating params
INFO:data_parallel_model:Calling optimizer builder function
INFO:data_parallel_model:Add initial parameter sync
INFO:data_parallel_model:Post-iteration operators for updating params
INFO:data_parallel_model:Calling optimizer builder function
INFO:data_parallel_model:Creating barrier net
INFO:data_parallel_model:Creating barrier net
E0817 11:20:18.563855 15794 operator.cc:496] Shape inference error: [enforce fail at conv_pool_op_base.h:626] in_size + *pad_head + *pad_tail >= dkernel. 2 vs 3
E0817 11:20:18.564237 15794 operator.cc:497] Operator: input: "gpu_0/conv1_spatbn_relu" output: "gpu_0/pool1" name: "" type: "MaxPool" arg { name: "order" s: "NCHW" } arg { name: "kernel" i: 3 } arg { name: "stride" i: 2 } arg { name: "ws_nbytes_limit" i: 67108864 } arg { name: "cudnn_exhaustive_search" i: 1 } device_option { device_type: 1 cuda_gpu_id: 0 } engine: "CUDNN"
E0817 11:20:18.564251 15794 operator.cc:498] Returning empty results.
WARNING:memonger:NOTE: Executing memonger to optimize gradient memory
E0817 11:20:18.567664 29415 operator.cc:496] Shape inference error: [enforce fail at conv_pool_op_base.h:626] in_size + *pad_head + *pad_tail >= dkernel. 2 vs 3
E0817 11:20:18.568056 29415 operator.cc:497] Operator: input: "gpu_0/conv1_spatbn_relu" output: "gpu_0/pool1" name: "" type: "MaxPool" arg { name: "order" s: "NCHW" } arg { name: "kernel" i: 3 } arg { name: "stride" i: 2 } arg { name: "ws_nbytes_limit" i: 67108864 } arg { name: "cudnn_exhaustive_search" i: 1 } device_option { device_type: 1 cuda_gpu_id: 0 } engine: "CUDNN"
E0817 11:20:18.568071 29415 operator.cc:498] Returning empty results.
WARNING:memonger:NOTE: Executing memonger to optimize gradient memory
INFO:memonger:Memonger memory optimization took 0.0135490894318 secs
INFO:memonger:Memonger memory optimization took 0.0135049819946 secs
INFO:data_parallel_model:Add initial parameter sync
INFO:data_parallel_model:Add initial parameter sync
INFO:data_parallel_model:Creating barrier net
INFO:data_parallel_model:Creating barrier net
E0817 11:20:18.654834 16188 operator.cc:496] Shape inference error: [enforce fail at conv_pool_op_base.h:626] in_size + *pad_head + *pad_tail >= dkernel. 2 vs 3
E0817 11:20:18.655194 16188 operator.cc:497] Operator: input: "gpu_0/conv1_spatbn_relu" output: "gpu_0/pool1" name: "" type: "MaxPool" arg { name: "order" s: "NCHW" } arg { name: "kernel" i: 3 } arg { name: "stride" i: 2 } arg { name: "ws_nbytes_limit" i: 67108864 } arg { name: "cudnn_exhaustive_search" i: 1 } device_option { device_type: 1 cuda_gpu_id: 0 } engine: "CUDNN"
E0817 11:20:18.655208 16188 operator.cc:498] Returning empty results.
WARNING:memonger:NOTE: Executing memonger to optimize gradient memory
E0817 11:20:18.663950 14509 operator.cc:496] Shape inference error: [enforce fail at conv_pool_op_base.h:626] in_size + *pad_head + *pad_tail >= dkernel. 2 vs 3
E0817 11:20:18.664294 14509 operator.cc:497] Operator: input: "gpu_0/conv1_spatbn_relu" output: "gpu_0/pool1" name: "" type: "MaxPool" arg { name: "order" s: "NCHW" } arg { name: "kernel" i: 3 } arg { name: "stride" i: 2 } arg { name: "ws_nbytes_limit" i: 67108864 } arg { name: "cudnn_exhaustive_search" i: 1 } device_option { device_type: 1 cuda_gpu_id: 0 } engine: "CUDNN"
E0817 11:20:18.664320 14509 operator.cc:498] Returning empty results.
WARNING:memonger:NOTE: Executing memonger to optimize gradient memory
INFO:memonger:Memonger memory optimization took 0.0132520198822 secs
INFO:memonger:Memonger memory optimization took 0.0132851600647 secs
E0817 11:20:51.136189 15794 common_world_ops.h:110] Caught store handler timeout exception: [/pytorch/caffe2/distributed/file_store_handler.cc:154] Wait timeout for name(s): allreduce_0_cw_op/1
WARNING:caffe2.python.workspace:Original python traceback for operator `268` in network `resnet50_init` in exception above (most recent call last):
WARNING:caffe2.python.workspace:  File "resnet50_trainer.py", line 608, in <module>
WARNING:caffe2.python.workspace:  File "resnet50_trainer.py", line 604, in main
WARNING:caffe2.python.workspace:  File "resnet50_trainer.py", line 439, in Train
WARNING:caffe2.python.workspace:  File "/usr/local/lib/python2.7/dist-packages/caffe2/python/data_parallel_model.py", line 296, in Parallelize
WARNING:caffe2.python.workspace:  File "/usr/local/lib/python2.7/dist-packages/caffe2/python/data_parallel_model.py", line 1215, in _AllReduceBlobs
WARNING:caffe2.python.workspace:  File "/usr/local/lib/python2.7/dist-packages/caffe2/python/data_parallel_model.py", line 1362, in _AllReduceBlobsDistributed
WARNING:caffe2.python.workspace:  File "/usr/local/lib/python2.7/dist-packages/caffe2/python/data_parallel_model.py", line 1346, in allreduce
WARNING:caffe2.python.workspace:  File "/usr/local/lib/python2.7/dist-packages/caffe2/python/data_parallel_model.py", line 1296, in get_control_and_context
WARNING:caffe2.python.workspace:  File "/usr/local/lib/python2.7/dist-packages/caffe2/python/data_parallel_model.py", line 1814, in _CreateOrCloneCommonWorld
Ignoring @/caffe2/caffe2/contrib/nccl:nccl_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops_gpu as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:file_store_handler_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:redis_store_handler_ops as it is not a valid file.
Traceback (most recent call last):
  File "resnet50_trainer.py", line 608, in <module>
    main()
  File "resnet50_trainer.py", line 604, in main
    Train(args)
  File "resnet50_trainer.py", line 444, in Train
    workspace.RunNetOnce(train_model.param_init_net)
  File "/usr/local/lib/python2.7/dist-packages/caffe2/python/workspace.py", line 201, in RunNetOnce
    StringifyProto(net),
  File "/usr/local/lib/python2.7/dist-packages/caffe2/python/workspace.py", line 180, in CallWithExceptionIntercept
    return func(*args, **kwargs)
RuntimeError: [/pytorch/caffe2/distributed/file_store_handler.cc:154] Wait timeout for name(s): allreduce_0_cw_op/1
E0817 11:20:51.183814 29415 common_world_ops.h:110] Caught store handler timeout exception: [/pytorch/caffe2/distributed/file_store_handler.cc:154] Wait timeout for name(s): allreduce_0_cw_op/1
WARNING:caffe2.python.workspace:Original python traceback for operator `268` in network `resnet50_init` in exception above (most recent call last):
WARNING:caffe2.python.workspace:  File "resnet50_trainer.py", line 608, in <module>
WARNING:caffe2.python.workspace:  File "resnet50_trainer.py", line 604, in main
WARNING:caffe2.python.workspace:  File "resnet50_trainer.py", line 439, in Train
WARNING:caffe2.python.workspace:  File "/usr/local/lib/python2.7/dist-packages/caffe2/python/data_parallel_model.py", line 296, in Parallelize
WARNING:caffe2.python.workspace:  File "/usr/local/lib/python2.7/dist-packages/caffe2/python/data_parallel_model.py", line 1215, in _AllReduceBlobs
WARNING:caffe2.python.workspace:  File "/usr/local/lib/python2.7/dist-packages/caffe2/python/data_parallel_model.py", line 1362, in _AllReduceBlobsDistributed
WARNING:caffe2.python.workspace:  File "/usr/local/lib/python2.7/dist-packages/caffe2/python/data_parallel_model.py", line 1346, in allreduce
WARNING:caffe2.python.workspace:  File "/usr/local/lib/python2.7/dist-packages/caffe2/python/data_parallel_model.py", line 1296, in get_control_and_context
WARNING:caffe2.python.workspace:  File "/usr/local/lib/python2.7/dist-packages/caffe2/python/data_parallel_model.py", line 1814, in _CreateOrCloneCommonWorld
Ignoring @/caffe2/caffe2/contrib/nccl:nccl_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops_gpu as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:file_store_handler_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:redis_store_handler_ops as it is not a valid file.
Traceback (most recent call last):
  File "resnet50_trainer.py", line 608, in <module>
    main()
  File "resnet50_trainer.py", line 604, in main
    Train(args)
  File "resnet50_trainer.py", line 444, in Train
    workspace.RunNetOnce(train_model.param_init_net)
  File "/usr/local/lib/python2.7/dist-packages/caffe2/python/workspace.py", line 201, in RunNetOnce
    StringifyProto(net),
  File "/usr/local/lib/python2.7/dist-packages/caffe2/python/workspace.py", line 180, in CallWithExceptionIntercept
    return func(*args, **kwargs)
RuntimeError: [/pytorch/caffe2/distributed/file_store_handler.cc:154] Wait timeout for name(s): allreduce_0_cw_op/1
E0817 11:20:51.454591 14509 common_world_ops.h:110] Caught store handler timeout exception: [/pytorch/caffe2/distributed/file_store_handler.cc:154] Wait timeout for name(s): allreduce_0_cw_op/1
WARNING:caffe2.python.workspace:Original python traceback for operator `268` in network `resnet50_init` in exception above (most recent call last):
WARNING:caffe2.python.workspace:  File "resnet50_trainer.py", line 608, in <module>
WARNING:caffe2.python.workspace:  File "resnet50_trainer.py", line 604, in main
WARNING:caffe2.python.workspace:  File "resnet50_trainer.py", line 439, in Train
WARNING:caffe2.python.workspace:  File "/usr/local/lib/python2.7/dist-packages/caffe2/python/data_parallel_model.py", line 296, in Parallelize
WARNING:caffe2.python.workspace:  File "/usr/local/lib/python2.7/dist-packages/caffe2/python/data_parallel_model.py", line 1215, in _AllReduceBlobs
WARNING:caffe2.python.workspace:  File "/usr/local/lib/python2.7/dist-packages/caffe2/python/data_parallel_model.py", line 1362, in _AllReduceBlobsDistributed
WARNING:caffe2.python.workspace:  File "/usr/local/lib/python2.7/dist-packages/caffe2/python/data_parallel_model.py", line 1346, in allreduce
WARNING:caffe2.python.workspace:  File "/usr/local/lib/python2.7/dist-packages/caffe2/python/data_parallel_model.py", line 1296, in get_control_and_context
WARNING:caffe2.python.workspace:  File "/usr/local/lib/python2.7/dist-packages/caffe2/python/data_parallel_model.py", line 1814, in _CreateOrCloneCommonWorld
Ignoring @/caffe2/caffe2/contrib/nccl:nccl_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops_gpu as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:file_store_handler_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:redis_store_handler_ops as it is not a valid file.
Traceback (most recent call last):
  File "resnet50_trainer.py", line 608, in <module>
    main()
  File "resnet50_trainer.py", line 604, in main
    Train(args)
  File "resnet50_trainer.py", line 444, in Train
E0817 11:20:51.455612 16188 common_world_ops.h:110] Caught store handler timeout exception: [/pytorch/caffe2/distributed/file_store_handler.cc:154] Wait timeout for name(s): allreduce_0_cw_op/1
    workspace.RunNetOnce(train_model.param_init_net)
  File "/usr/local/lib/python2.7/dist-packages/caffe2/python/workspace.py", line 201, in RunNetOnce
    StringifyProto(net),
  File "/usr/local/lib/python2.7/dist-packages/caffe2/python/workspace.py", line 180, in CallWithExceptionIntercept
    return func(*args, **kwargs)
RuntimeError: [/pytorch/caffe2/distributed/file_store_handler.cc:154] Wait timeout for name(s): allreduce_0_cw_op/1
WARNING:caffe2.python.workspace:Original python traceback for operator `268` in network `resnet50_init` in exception above (most recent call last):
WARNING:caffe2.python.workspace:  File "resnet50_trainer.py", line 608, in <module>
WARNING:caffe2.python.workspace:  File "resnet50_trainer.py", line 604, in main
WARNING:caffe2.python.workspace:  File "resnet50_trainer.py", line 439, in Train
WARNING:caffe2.python.workspace:  File "/usr/local/lib/python2.7/dist-packages/caffe2/python/data_parallel_model.py", line 296, in Parallelize
WARNING:caffe2.python.workspace:  File "/usr/local/lib/python2.7/dist-packages/caffe2/python/data_parallel_model.py", line 1215, in _AllReduceBlobs
WARNING:caffe2.python.workspace:  File "/usr/local/lib/python2.7/dist-packages/caffe2/python/data_parallel_model.py", line 1362, in _AllReduceBlobsDistributed
WARNING:caffe2.python.workspace:  File "/usr/local/lib/python2.7/dist-packages/caffe2/python/data_parallel_model.py", line 1346, in allreduce
WARNING:caffe2.python.workspace:  File "/usr/local/lib/python2.7/dist-packages/caffe2/python/data_parallel_model.py", line 1296, in get_control_and_context
WARNING:caffe2.python.workspace:  File "/usr/local/lib/python2.7/dist-packages/caffe2/python/data_parallel_model.py", line 1814, in _CreateOrCloneCommonWorld
Ignoring @/caffe2/caffe2/contrib/nccl:nccl_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops_gpu as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:file_store_handler_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:redis_store_handler_ops as it is not a valid file.
Traceback (most recent call last):
  File "resnet50_trainer.py", line 608, in <module>
    main()
  File "resnet50_trainer.py", line 604, in main
    Train(args)
  File "resnet50_trainer.py", line 444, in Train
    workspace.RunNetOnce(train_model.param_init_net)
  File "/usr/local/lib/python2.7/dist-packages/caffe2/python/workspace.py", line 201, in RunNetOnce
    StringifyProto(net),
  File "/usr/local/lib/python2.7/dist-packages/caffe2/python/workspace.py", line 180, in CallWithExceptionIntercept
    return func(*args, **kwargs)
RuntimeError: [/pytorch/caffe2/distributed/file_store_handler.cc:154] Wait timeout for name(s): allreduce_0_cw_op/1
srun: error: nid02966: task 3: Exited with exit code 1
srun: Terminating job step 9072007.0
srun: error: nid02963: task 0: Exited with exit code 1
srun: error: nid02965: task 2: Exited with exit code 1
srun: error: nid02964: task 1: Exited with exit code 1
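
My reading of the "Wait timeout for name(s): allreduce_0_cw_op/1" message is an assumption, since I have not confirmed it in the Caffe2 source. I think each shard publishes a key named after its own shard_id in the shared store and then waits for the keys of the other shards. If all four processes kept the same shard_id because the OMPI_COMM_WORLD_* variables are not set under srun, nobody would ever publish the key ending in /1. The toy model below only illustrates that reasoning. ToyFileStore, publish, wait_for_peers and simulate are hypothetical names and this is not the real file_store_handler.

```python
import os
import tempfile
import time


class ToyFileStore(object):
    """Toy key/value store backed by files in one shared directory."""

    def __init__(self, path):
        self.path = path

    def _file_for(self, key):
        # Flatten "/" so a key such as "cw/1" maps to a single file name.
        return os.path.join(self.path, key.replace("/", "_"))

    def set(self, key, value):
        with open(self._file_for(key), "w") as f:
            f.write(value)

    def wait(self, keys, timeout_s):
        # Poll until every key exists or the deadline passes.
        deadline = time.time() + timeout_s
        while True:
            missing = [k for k in keys if not os.path.exists(self._file_for(k))]
            if not missing:
                return
            if time.time() >= deadline:
                raise RuntimeError("Wait timeout for name(s): " + " ".join(missing))
            time.sleep(0.01)


def publish(store, prefix, shard_id):
    # Each process announces itself under "<prefix>/<shard_id>".
    store.set("%s/%d" % (prefix, shard_id), "address-of-%d" % shard_id)


def wait_for_peers(store, prefix, shard_id, num_shards, timeout_s):
    # Each process waits for the keys of all other shard ids.
    peers = ["%s/%d" % (prefix, r) for r in range(num_shards) if r != shard_id]
    store.wait(peers, timeout_s)


def simulate(shard_ids):
    """Return None if rendezvous completes, else the timeout message."""
    store = ToyFileStore(tempfile.mkdtemp())
    for sid in shard_ids:
        publish(store, "cw", sid)
    try:
        for sid in shard_ids:
            wait_for_peers(store, "cw", sid, len(shard_ids), timeout_s=0.05)
    except RuntimeError as err:
        return str(err)
    return None
```

With distinct ids, simulate([0, 1, 2, 3]) completes. With simulate([0, 0, 0, 0]) the wait times out on the keys of shards 1 to 3, which would match all four tasks failing on the same key name.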
