[Caffe2] Unable to use MPI rendezvous in Caffe2 #10582

Open gyani91 opened 6 years ago

gyani91 commented 6 years ago

Issue description

Unable to use MPI rendezvous in Caffe2.

I understand that this information may not be sufficient to diagnose the problem. Please tell me which additional steps I should perform to gather more information about the situation.

I am grateful for your help.

Code example

Details: For reproducibility, I am using a container built from the following Dockerfile:

FROM nvidia/cuda:8.0-cudnn7-devel-ubuntu16.04
LABEL maintainer=""

# caffe2 install with gpu support

RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    cmake \
    git \
    libgflags-dev \
    libgoogle-glog-dev \
    libgtest-dev \
    libiomp-dev \
    libleveldb-dev \
    liblmdb-dev \
    libopencv-dev \
    libprotobuf-dev \
    libsnappy-dev \
    protobuf-compiler \
    python-dev \
    python-numpy \
    python-pip \
    python-pydot \
    python-setuptools \
    python-scipy \
    wget \
    && rm -rf /var/lib/apt/lists/*

RUN wget -q \
    && tar xf mpich-3.1.4.tar.gz \
    && cd mpich-3.1.4 \
    && ./configure --disable-fortran --enable-fast=all,O3 --prefix=/usr \
    && make -j$(nproc) \
    && make install \
    && ldconfig \
    && cd .. \
    && rm -rf mpich-3.1.4 \
    && rm mpich-3.1.4.tar.gz

RUN pip install --no-cache-dir --upgrade pip==9.0.3 setuptools wheel
RUN pip install --no-cache-dir \
    flask \
    future \
    graphviz \
    hypothesis \
    jupyter \
    matplotlib \
    numpy \
    protobuf \
    pydot \
    python-nvd3 \
    pyyaml \
    requests \
    scikit-image \
    scipy \
    setuptools \
    six \

########## INSTALLATION STEPS ###################
RUN git clone --branch master --recursive
RUN cd pytorch && mkdir build && cd build \
    && cmake .. \
    -DCUDA_ARCH_NAME=Manual \
    -DCUDA_ARCH_BIN="35 52 60 61" \
    -DCUDA_ARCH_PTX="61" \
    && make -j"$(nproc)" install \
    && ldconfig \
    && make clean \
    && cd .. \
    && rm -rf build


The command:

srun -N 4 -n 4 -C gpu \
shifter run --mpi load/library/caffe2_container_diff \
python \
--train_data=$SCRATCH/caffe2_notebooks/tutorial_data/resnet_trainer/imagenet_cars_boats_train \
--test_data=$SCRATCH/caffe2_notebooks/tutorial_data/resnet_trainer/imagenet_cars_boats_val \
--db_type=lmdb \
--num_shards=4 \
--num_gpu=1 \
--num_labels=2 \
--batch_size=2 \
--epoch_size=150 \
--num_epochs=2 \
--distributed_transport ibverbs \
--distributed_interface mlx5_0

The output/error:

srun: job 9059937 queued and waiting for resources
srun: job 9059937 has been allocated resources
E0816 14:14:20.081552  7042] CPU feature avx is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
E0816 14:14:20.081637  7042] CPU feature avx2 is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
E0816 14:14:20.081642  7042] CPU feature fma is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
E0816 14:14:20.083420  6442] CPU feature avx is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
E0816 14:14:20.083504  6442] CPU feature avx2 is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
E0816 14:14:20.083509  6442] CPU feature fma is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
INFO:resnet50_trainer:Running on GPUs: [0]
INFO:resnet50_trainer:Using epoch size: 144
INFO:resnet50_trainer:Running on GPUs: [0]
INFO:resnet50_trainer:Using epoch size: 144
E0816 14:14:20.087043  5987] CPU feature avx is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
E0816 14:14:20.087126  5987] CPU feature avx2 is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
E0816 14:14:20.087131  5987] CPU feature fma is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
INFO:resnet50_trainer:Running on GPUs: [0]
INFO:resnet50_trainer:Using epoch size: 144
INFO:data_parallel_model:Parallelizing model for devices: [0]
INFO:data_parallel_model:Create input and model training operators
INFO:data_parallel_model:Model for GPU : 0
INFO:data_parallel_model:Parallelizing model for devices: [0]
INFO:data_parallel_model:Create input and model training operators
INFO:data_parallel_model:Model for GPU : 0
E0816 14:14:20.102372 11086] CPU feature avx is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
E0816 14:14:20.102452 11086] CPU feature avx2 is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
E0816 14:14:20.102457 11086] CPU feature fma is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
INFO:data_parallel_model:Parallelizing model for devices: [0]
INFO:data_parallel_model:Create input and model training operators
INFO:data_parallel_model:Model for GPU : 0
INFO:resnet50_trainer:Running on GPUs: [0]
INFO:resnet50_trainer:Using epoch size: 144
INFO:data_parallel_model:Parallelizing model for devices: [0]
INFO:data_parallel_model:Create input and model training operators
INFO:data_parallel_model:Model for GPU : 0
INFO:data_parallel_model:Adding gradient operators
INFO:data_parallel_model:Adding gradient operators
INFO:data_parallel_model:Adding gradient operators
INFO:data_parallel_model:Adding gradient operators
INFO:data_parallel_model:Add gradient all-reduces for SyncSGD
WARNING:data_parallel_model:Distributed broadcast of computed params is not implemented yet
INFO:data_parallel_model:Add gradient all-reduces for SyncSGD
INFO:data_parallel_model:Add gradient all-reduces for SyncSGD
WARNING:data_parallel_model:Distributed broadcast of computed params is not implemented yet
WARNING:data_parallel_model:Distributed broadcast of computed params is not implemented yet
INFO:data_parallel_model:Add gradient all-reduces for SyncSGD
WARNING:data_parallel_model:Distributed broadcast of computed params is not implemented yet
INFO:data_parallel_model:Post-iteration operators for updating params
INFO:data_parallel_model:Calling optimizer builder function
INFO:data_parallel_model:Post-iteration operators for updating params
INFO:data_parallel_model:Post-iteration operators for updating params
INFO:data_parallel_model:Calling optimizer builder function
INFO:data_parallel_model:Calling optimizer builder function
INFO:data_parallel_model:Post-iteration operators for updating params
INFO:data_parallel_model:Calling optimizer builder function
INFO:data_parallel_model:Add initial parameter sync
INFO:data_parallel_model:Add initial parameter sync
INFO:data_parallel_model:Add initial parameter sync
INFO:data_parallel_model:Add initial parameter sync
INFO:data_parallel_model:Creating barrier net
INFO:data_parallel_model:Creating barrier net
INFO:data_parallel_model:Creating barrier net
*** Aborted at 1534428860 (unix time) try "date -d @1534428860" if you are using GNU date ***
INFO:data_parallel_model:Creating barrier net
*** Aborted at 1534428860 (unix time) try "date -d @1534428860" if you are using GNU date ***
*** Aborted at 1534428860 (unix time) try "date -d @1534428860" if you are using GNU date ***
srun: error: nid06499: task 2: Segmentation fault
srun: Terminating job step 9059937.0
srun: error: nid06497: task 0: Segmentation fault
srun: error: nid06498: task 1: Segmentation fault
srun: error: nid06500: task 3: Segmentation fault

System Info

gyani91 commented 6 years ago

@pietern @apaszke I would really appreciate your help. Thanks a lot guys.

teng-li commented 6 years ago

We are writing a new distributed backend for Caffe2 and PyTorch. We can make this one of the init_methods.

gyani91 commented 6 years ago

Two more points are worth mentioning. First, I have applied all the changes from a diff by @pietern, including the changes to the Gloo header files, but that does not resolve the problem.

Second, I have changed the code from:

num_shards = args.num_shards
shard_id = args.shard_id

interfaces = args.distributed_interfaces.split(",")

# Rendezvous using MPI when run with mpirun
if os.getenv("OMPI_COMM_WORLD_SIZE") is not None:
    num_shards = int(os.getenv("OMPI_COMM_WORLD_SIZE", 1))
    shard_id = int(os.getenv("OMPI_COMM_WORLD_RANK", 0))

to:


num_shards = args.num_shards
shard_id = args.shard_id

interfaces = args.distributed_interfaces.split(",")

# Rendezvous using MPI when run with mpirun
#if os.getenv("OMPI_COMM_WORLD_SIZE") is not None:
if True:
    #num_shards = int(os.getenv("OMPI_COMM_WORLD_SIZE", 1))
    #shard_id = int(os.getenv("OMPI_COMM_WORLD_RANK", 0))
    shard_id = int(os.getenv("SLURM_PROCID", 0))
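
For readers hitting the same launcher mismatch, the edit above could be made less ad hoc with a small helper that checks both launchers instead of hard-coding `if True:`. This is only a sketch, not part of the trainer script: the function name `detect_rank_and_size` and its parameters are hypothetical, and it assumes that Open MPI exports `OMPI_COMM_WORLD_RANK` and `OMPI_COMM_WORLD_SIZE` (as the original snippet expects) and that `srun` exports `SLURM_PROCID` and `SLURM_NTASKS`.

```python
import os


def detect_rank_and_size(default_rank=0, default_size=1, environ=None):
    """Return (shard_id, num_shards) based on launcher environment variables.

    Hypothetical helper: falls back to the given defaults when neither
    launcher's variables are present.
    """
    env = os.environ if environ is None else environ

    # Open MPI (mpirun): the variables used in the original snippet.
    if "OMPI_COMM_WORLD_SIZE" in env:
        rank = int(env.get("OMPI_COMM_WORLD_RANK", default_rank))
        size = int(env["OMPI_COMM_WORLD_SIZE"])
        return rank, size

    # Slurm (srun): assumed to export SLURM_PROCID and SLURM_NTASKS.
    if "SLURM_PROCID" in env:
        rank = int(env["SLURM_PROCID"])
        size = int(env.get("SLURM_NTASKS", default_size))
        return rank, size

    # Neither launcher detected: keep the command-line values.
    return default_rank, default_size
```

In the trainer this might be called as `shard_id, num_shards = detect_rank_and_size(args.shard_id, args.num_shards)`, so the command-line values remain the fallback when the script is started without either launcher.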

Before the above change, the error was:

