triton-inference-server / fastertransformer_backend

BSD 3-Clause "New" or "Revised" License

Pipeline parallelism does not work for FasterTransformer BERT Triton Backend. #33

Closed PKUFlyingPig closed 2 years ago

PKUFlyingPig commented 2 years ago

Description

main branch, Docker version 20.10.17, Tesla T4 GPU

Reproduced Steps

Preparation: I followed the steps in the bert_guide document to set up the Docker container and converted the model with --infer_tensor_para_size=1. I also tested it successfully with ${WORKSPACE}/tools/bert/identity_test.py using T=1, P=1.
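For reference, the conversion step from the bert_guide looks roughly like the sketch below; the script path and flags are assumptions based on the FasterTransformer repository and may differ in your checkout:

python3 FasterTransformer/examples/pytorch/bert/utils/huggingface_bert_convert.py \
        -in_file ./bert-base-uncased/ \
        -saved_dir ${WORKSPACE}/all_models/bert/fastertransformer/1/ \
        -infer_tensor_para_size 1 \
        -weight_data_type fp16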

Step 1: Set pipeline_para_size=2 and model_checkpoint_path=./all_models/bert/fastertransformer/1/1-gpu/ in config.pbtxt.
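Concretely, these are the two fields changed in config.pbtxt (the full file is posted further down in this thread):

parameters {
  key: "pipeline_para_size"
  value: {
    string_value: "2"
  }
}
parameters {
  key: "model_checkpoint_path"
  value: {
    string_value: "./all_models/bert/fastertransformer/1/1-gpu/"
  }
}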

Step 2: Launch the container

export WORKSPACE="/home/ubuntu/efs/fastertransformer_backend/"
export TRITON_DOCKER_IMAGE="triton_with_ft:22.07"
export NAME="triton_with_ft"

docker run -it --rm \
           --net=host \
           --gpus=all \
           -v ${WORKSPACE}:${WORKSPACE} \
           -w ${WORKSPACE} \
           -e WORKSPACE=${WORKSPACE} \
           --name ${NAME} \
           ${TRITON_DOCKER_IMAGE} bash

Step 3: Launch the Triton server in the container

CUDA_VISIBLE_DEVICES=0,1 mpirun -n 1 --allow-run-as-root /opt/tritonserver/bin/tritonserver  --model-repository=${WORKSPACE}/all_models/bert/
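(With tensor_para_size=1 and pipeline_para_size=2 the model spans tensor_para_size x pipeline_para_size = 2 GPUs, hence the two visible devices. The server still runs as a single MPI rank; the backend drives each GPU from its own worker thread, which shows up as the ThreadForward 0/1 entries in the later logs.)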

Then the following error occurs:

I0819 02:57:50.658966 5836 libfastertransformer.cc:218] Instance group type: KIND_CPU count: 1
I0819 02:57:50.658975 5836 libfastertransformer.cc:248] Sequence Batching: disabled
I0819 02:57:50.660078 5836 libfastertransformer.cc:420] Before Loading Weights:
after allocation    : free: 14.37 GB, total: 14.56 GB, used:  0.19 GB
I0819 02:57:51.427084 5836 libfastertransformer.cc:430] After Loading Weights:
after allocation    : free: 14.13 GB, total: 14.56 GB, used:  0.43 GB
W0819 02:57:51.427218 5836 libfastertransformer.cc:478] skipping model configuration auto-complete for 'fastertransformer': not supported for fastertransformer backend
W0819 02:57:51.431853 5836 libfastertransformer.cc:651] Model name bert
W0819 02:57:51.431868 5836 libfastertransformer.cc:661] Use COUPLED (classic) API.
W0819 02:57:51.431876 5836 libfastertransformer.cc:756] Get input name: input_hidden_state, type: TYPE_FP16, shape: [-1, -1]
W0819 02:57:51.431881 5836 libfastertransformer.cc:756] Get input name: sequence_lengths, type: TYPE_INT32, shape: [1]
W0819 02:57:51.431903 5836 libfastertransformer.cc:798] Get output name: output_hidden_state, type: TYPE_FP16, shape: [-1, -1]
[ip-172-31-56-220:05836] *** Process received signal ***
[ip-172-31-56-220:5836 :0:5944] Caught signal 7 (Bus error: nonexistent physical address)
[ip-172-31-56-220:05836] Signal: Bus error (7)
[ip-172-31-56-220:05836] Signal code: Non-existant physical address (2)
[ip-172-31-56-220:05836] Failing at address: 0x7f843b6cf000
[ip-172-31-56-220:05836] [ 0] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7f848694d420]
[ip-172-31-56-220:05836] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x18bb41)[0x7f8485483b41]
[ip-172-31-56-220:05836] [ 2] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x6929c)[0x7f83c555729c]
[ip-172-31-56-220:05836] [ 3] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x6b9ae)[0x7f83c55599ae]
[ip-172-31-56-220:05836] [ 4] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x50853)[0x7f83c553e853]
[ip-172-31-56-220:05836] [ 5] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x417b4)[0x7f83c552f7b4]
[ip-172-31-56-220:05836] [ 6] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x42c4d)[0x7f83c5530c4d]
[ip-172-31-56-220:05836] [ 7] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x58b37)[0x7f83c5546b37]
[ip-172-31-56-220:05836] [ 8] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7f8486941609]
[ip-172-31-56-220:05836] [ 9] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7f8485417133]
[ip-172-31-56-220:05836] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node ip-172-31-56-220 exited on signal 7 (Bus error).
--------------------------------------------------------------------------
byshiue commented 2 years ago

From the error message, it is a bus error, which might be caused by environment settings. Can you try running https://github.com/NVIDIA/nccl-tests first?
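For reference, a minimal way to build and run them (a sketch; the make invocation assumes CUDA is on the default path, and the flags mirror the all_reduce_perf run discussed later in this thread):

git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make
# all-reduce across 2 GPUs driven by a single thread in one process
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2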

PKUFlyingPig commented 2 years ago

I found that I had not installed the NCCL library. After installing it and passing the nccl-tests examples, step 3 finishes successfully, i.e., the Triton server launches in the container. However, when I try to run identity_test.py, the bus error occurs again.

python3 ${WORKSPACE}/tools/bert/identity_test.py \
        --hf_ckpt_path ./bert-base-uncased/ \
        --num_runs 100 \
        --inference_data_type fp16

The error log:

I0819 03:55:09.531068 1298 grpc_server.cc:4608] Started GRPCInferenceService at 0.0.0.0:8001
I0819 03:55:09.531286 1298 http_server.cc:3312] Started HTTPService at 0.0.0.0:8000
I0819 03:55:09.573996 1298 http_server.cc:178] Started Metrics Service at 0.0.0.0:8002
W0819 03:59:45.127153 1298 libfastertransformer.cc:1647] model fastertransformer, instance fastertransformer_0, executing 1 requests
W0819 03:59:45.127181 1298 libfastertransformer.cc:812] TRITONBACKEND_ModelExecute: Running fastertransformer_0 with 1 requests
W0819 03:59:45.127190 1298 libfastertransformer.cc:886] get total batch_size = 8
W0819 03:59:45.127199 1298 libfastertransformer.cc:1296] get input count = 2
W0819 03:59:45.127484 1298 libfastertransformer.cc:1368] collect name: input_hidden_state size: 393216 bytes
W0819 03:59:45.127498 1298 libfastertransformer.cc:1368] collect name: sequence_lengths size: 32 bytes
W0819 03:59:45.127503 1298 libfastertransformer.cc:1379] the data is in CPU
W0819 03:59:45.127508 1298 libfastertransformer.cc:1386] the data is in CPU
W0819 03:59:45.127531 1298 libfastertransformer.cc:1244] before ThreadForward 0
W0819 03:59:45.127593 1298 libfastertransformer.cc:1252] after ThreadForward 0
W0819 03:59:45.127600 1298 libfastertransformer.cc:1244] before ThreadForward 1
W0819 03:59:45.127639 1298 libfastertransformer.cc:1252] after ThreadForward 1
I0819 03:59:45.127637 1298 libfastertransformer.cc:1090] Start to forward
I0819 03:59:45.127680 1298 libfastertransformer.cc:1090] Start to forward
[ip-172-31-56-220:1298 :0:1588] Caught signal 7 (Bus error: nonexistent physical address)
[ip-172-31-56-220:01298] *** Process received signal ***
[ip-172-31-56-220:01298] Signal: Bus error (7)
[ip-172-31-56-220:01298] Signal code: Non-existant physical address (2)
[ip-172-31-56-220:01298] Failing at address: 0x7f07a2ff4000
[ip-172-31-56-220:01298] [ 0] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7f07cc9c1420]
[ip-172-31-56-220:01298] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x18bb41)[0x7f07cb4f7b41]
[ip-172-31-56-220:01298] [ 2] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x6929c)[0x7f070d55729c]
[ip-172-31-56-220:01298] [ 3] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x6bb62)[0x7f070d559b62]
[ip-172-31-56-220:01298] [ 4] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x50b48)[0x7f070d53eb48]
[ip-172-31-56-220:01298] [ 5] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x58c88)[0x7f070d546c88]
[ip-172-31-56-220:01298] [ 6] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7f07cc9b5609]
[ip-172-31-56-220:01298] [ 7] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7f07cb48b133]
[ip-172-31-56-220:01298] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node ip-172-31-56-220 exited on signal 7 (Bus error).

Since I launched the container with --net=host, I ran identity_test.py on my host machine; I do not know if this has anything to do with the error above. I also ran the nccl-tests in the container, and they passed the example tests.
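(With --net=host the container shares the host network namespace, so the endpoints from the startup log above, HTTP 8000, GRPC 8001, and metrics 8002, are bound directly on the host; running the client from the host machine is a normal setup in that configuration.)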

byshiue commented 2 years ago

Which nccl-tests did you run? Can you run the nccl-tests with multiple threads?

PKUFlyingPig commented 2 years ago

[screenshot: output of the two nccl-tests example runs]

PKUFlyingPig commented 2 years ago

I ran those two examples. Sorry, I'm not familiar with NCCL; what do you mean by "run the nccl-tests under multi-thread"? Should I set -t?

byshiue commented 2 years ago

Try this one

./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1 -t 2
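(In nccl-tests, -t sets the number of threads per process and -g the number of GPUs per thread, so -g 1 -t 2 exercises both GPUs from two threads inside a single process; that is closer to how the FT backend drives NCCL than a single-threaded -g 2 run.)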
PKUFlyingPig commented 2 years ago

It seems OK. [screenshot: all_reduce_perf -g 1 -t 2 run completing successfully]

PKUFlyingPig commented 2 years ago

Here is my config.pbtxt:


# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
#  * Redistributions of source code must retain the above copyright
#    notice, this list of conditions and the following disclaimer.
#  * Redistributions in binary form must reproduce the above copyright
#    notice, this list of conditions and the following disclaimer in the
#    documentation and/or other materials provided with the distribution.
#  * Neither the name of NVIDIA CORPORATION nor the names of its
#    contributors may be used to endorse or promote products derived
#    from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

name: "fastertransformer"
backend: "fastertransformer"
default_model_filename: "bert"
max_batch_size: 1024
input [
  {
    name: "input_hidden_state"
    data_type: TYPE_FP16
    dims: [ -1, -1 ]
  },
  {
    name: "sequence_lengths"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
  }
]
output [
  {
    name: "output_hidden_state"
    data_type: TYPE_FP16
    dims: [ -1, -1 ]
  }
]
instance_group [
  {
    count: 1
    kind : KIND_CPU
  }
]
parameters {
  key: "tensor_para_size"
  value: {
    string_value: "1"
  }
}
parameters {
  key: "pipeline_para_size"
  value: {
    string_value: "2"
  }
}
parameters {
  key: "data_type"
  value: {
    string_value: "fp16"
  }
}
parameters {
  key: "enable_custom_all_reduce"
  value: {
    string_value: "0"
  }
}
parameters {
  key: "model_type"
  value: {
    string_value: "bert"
  }
}
parameters {
  key: "model_checkpoint_path"
  value: {
    string_value: "./all_models/bert/fastertransformer/1/1-gpu/"
  }
}
parameters {
  key: "int8_mode"
  value: {
    string_value: "0"
  }
}
parameters {
  key: "is_sparse"
  value: {
    string_value: "0"
  }
}
parameters {
  key: "is_remove_padding"
  value: {
    string_value: "1"
  }
}

PKUFlyingPig commented 2 years ago

In the output log, I also found a warning: [screenshot: warning from server.cc]. I failed to find server.cc, so I do not know what happened exactly. Does this relate to the error? I also ran `nvidia-smi topo -m`; here is the output: [screenshot: GPU topology matrix]

PerkzZheng commented 2 years ago

@PKUFlyingPig try `--shm-size=1g --ulimit memlock=-1` to increase the shared memory for containers.
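Applied to the docker run command from Step 2, that would look like:

docker run -it --rm \
           --net=host \
           --gpus=all \
           --shm-size=1g \
           --ulimit memlock=-1 \
           -v ${WORKSPACE}:${WORKSPACE} \
           -w ${WORKSPACE} \
           -e WORKSPACE=${WORKSPACE} \
           --name ${NAME} \
           ${TRITON_DOCKER_IMAGE} bash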

PKUFlyingPig commented 2 years ago

Thanks, this solved my problem. One last question about pipeline parallelism: does it automatically pipeline the client's requests? If I call client.infer() multiple times sequentially, does it automatically pipeline all of these inference requests?

byshiue commented 2 years ago

No, pipeline parallelism only works request by request. It only splits one request into multiple micro-batches.

PerkzZheng commented 2 years ago

In a way, yes: dynamic batching can batch all of them into one request, and then inside FT we split it into micro-batches to pipeline them better. It works similarly to what you want.
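For reference, Triton's dynamic batcher is enabled with a stanza in config.pbtxt; a minimal sketch (the preferred sizes and queue delay here are illustrative values, not recommendations):

dynamic_batching {
  preferred_batch_size: [ 8, 16 ]
  max_queue_delay_microseconds: 100
}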

PKUFlyingPig commented 2 years ago

I see, thanks for your patient answers ~~