Closed: PKUFlyingPig closed this issue 2 years ago.
From the error message, it is a bus error, which might be caused by environment settings.
Can you try running https://github.com/NVIDIA/nccl-tests first?
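A minimal sketch for building and running nccl-tests, assuming CUDA and NCCL are installed in their default locations (adjust -g to your GPU count):

# Build the tests and run a basic all-reduce sanity check.
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2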
I found that I had not installed the NCCL library. After installing it and passing the nccl-tests examples, step 3 finishes successfully, i.e., the Triton server is launched in the container. However, when I tried to run identity_test.py, the bus error occurred again.
python3 ${WORKSPACE}/tools/bert/identity_test.py \
--hf_ckpt_path ./bert-base-uncased/ \
--num_runs 100 \
--inference_data_type fp16
The error log:
I0819 03:55:09.531068 1298 grpc_server.cc:4608] Started GRPCInferenceService at 0.0.0.0:8001
I0819 03:55:09.531286 1298 http_server.cc:3312] Started HTTPService at 0.0.0.0:8000
I0819 03:55:09.573996 1298 http_server.cc:178] Started Metrics Service at 0.0.0.0:8002
W0819 03:59:45.127153 1298 libfastertransformer.cc:1647] model fastertransformer, instance fastertransformer_0, executing 1 requests
W0819 03:59:45.127181 1298 libfastertransformer.cc:812] TRITONBACKEND_ModelExecute: Running fastertransformer_0 with 1 requests
W0819 03:59:45.127190 1298 libfastertransformer.cc:886] get total batch_size = 8
W0819 03:59:45.127199 1298 libfastertransformer.cc:1296] get input count = 2
W0819 03:59:45.127484 1298 libfastertransformer.cc:1368] collect name: input_hidden_state size: 393216 bytes
W0819 03:59:45.127498 1298 libfastertransformer.cc:1368] collect name: sequence_lengths size: 32 bytes
W0819 03:59:45.127503 1298 libfastertransformer.cc:1379] the data is in CPU
W0819 03:59:45.127508 1298 libfastertransformer.cc:1386] the data is in CPU
W0819 03:59:45.127531 1298 libfastertransformer.cc:1244] before ThreadForward 0
W0819 03:59:45.127593 1298 libfastertransformer.cc:1252] after ThreadForward 0
W0819 03:59:45.127600 1298 libfastertransformer.cc:1244] before ThreadForward 1
W0819 03:59:45.127639 1298 libfastertransformer.cc:1252] after ThreadForward 1
I0819 03:59:45.127637 1298 libfastertransformer.cc:1090] Start to forward
I0819 03:59:45.127680 1298 libfastertransformer.cc:1090] Start to forward
[ip-172-31-56-220:1298 :0:1588] Caught signal 7 (Bus error: nonexistent physical address)
[ip-172-31-56-220:01298] *** Process received signal ***
[ip-172-31-56-220:01298] Signal: Bus error (7)
[ip-172-31-56-220:01298] Signal code: Non-existant physical address (2)
[ip-172-31-56-220:01298] Failing at address: 0x7f07a2ff4000
[ip-172-31-56-220:01298] [ 0] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7f07cc9c1420]
[ip-172-31-56-220:01298] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x18bb41)[0x7f07cb4f7b41]
[ip-172-31-56-220:01298] [ 2] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x6929c)[0x7f070d55729c]
[ip-172-31-56-220:01298] [ 3] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x6bb62)[0x7f070d559b62]
[ip-172-31-56-220:01298] [ 4] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x50b48)[0x7f070d53eb48]
[ip-172-31-56-220:01298] [ 5] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x58c88)[0x7f070d546c88]
[ip-172-31-56-220:01298] [ 6] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7f07cc9b5609]
[ip-172-31-56-220:01298] [ 7] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7f07cb48b133]
[ip-172-31-56-220:01298] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node ip-172-31-56-220 exited on signal 7 (Bus error).
Since I launch the container with --net=host, I ran identity_test.py on my host machine; I do not know if this has anything to do with the error above. I also ran the nccl-tests in the container, and they passed the example tests.
Which nccl-tests did you run? Can you run the nccl-tests in multi-thread mode?
I ran these two examples. Sorry, I'm not familiar with NCCL. What do you mean by "run the nccl-tests in multi-thread mode"? Should I set -t?
Try this one
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1 -t 2
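(For reference, in nccl-tests -b and -e set the minimum and maximum message sizes, -f the multiplication factor between sizes, -g the number of GPUs per thread, and -t the number of threads per process; -t 2 -g 1 therefore exercises two communication threads, each driving one GPU.)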
It seems OK.
Here is my config.pbtxt:
# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
# * Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
# * Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
# * Neither the name of NVIDIA CORPORATION nor the names of its
# contributors may be used to endorse or promote products derived
# from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS "AS IS" AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
name: "fastertransformer"
backend: "fastertransformer"
default_model_filename: "bert"
max_batch_size: 1024
input [
  {
    name: "input_hidden_state"
    data_type: TYPE_FP16
    dims: [ -1, -1 ]
  },
  {
    name: "sequence_lengths"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
  }
]
output [
  {
    name: "output_hidden_state"
    data_type: TYPE_FP16
    dims: [ -1, -1 ]
  }
]
instance_group [
  {
    count: 1
    kind: KIND_CPU
  }
]
parameters {
  key: "tensor_para_size"
  value: {
    string_value: "1"
  }
}
parameters {
  key: "pipeline_para_size"
  value: {
    string_value: "2"
  }
}
parameters {
  key: "data_type"
  value: {
    string_value: "fp16"
  }
}
parameters {
  key: "enable_custom_all_reduce"
  value: {
    string_value: "0"
  }
}
parameters {
  key: "model_type"
  value: {
    string_value: "bert"
  }
}
parameters {
  key: "model_checkpoint_path"
  value: {
    string_value: "./all_models/bert/fastertransformer/1/1-gpu/"
  }
}
parameters {
  key: "int8_mode"
  value: {
    string_value: "0"
  }
}
parameters {
  key: "is_sparse"
  value: {
    string_value: "0"
  }
}
parameters {
  key: "is_remove_padding"
  value: {
    string_value: "1"
  }
}
In the output log, I also found a warning. I failed to find server.cc, so I do not know what exactly happened. Does this relate to the error? I also ran `nvidia-smi topo -m`, and here is the output:
@PKUFlyingPig try --shm-size=1g --ulimit memlock=-1 to increase the shared memory for the container.
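For example, a docker run line with those flags added might look like this sketch (the image name and mount here are placeholders, not the exact ones from the guide):

# Illustrative container launch; replace <triton_ft_backend_image> with your image.
docker run -it --rm --gpus all --net=host \
    --shm-size=1g --ulimit memlock=-1 \
    -v $(pwd):/workspace \
    <triton_ft_backend_image> bash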
Thanks, this solved my problem. One last question about pipeline parallelism: does it automatically pipeline the client's requests? If I call client.infer() multiple times sequentially, does it automatically pipeline all these inference requests?
No, pipeline parallelism works request by request: it only splits one request into multiple micro-batches. However, dynamic batching can merge all of your requests into one, and inside FT we then split that into micro-batches to pipeline them better. That works similarly to what you want.
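For reference, Triton's dynamic batcher is enabled in config.pbtxt; a minimal sketch, with an illustrative (not recommended) queue delay:

# Merge requests arriving within the queue-delay window into one batch
# before they reach the FasterTransformer backend.
dynamic_batching {
  max_queue_delay_microseconds: 1000
}

Note that strictly sequential, blocking client.infer() calls keep only one request in flight at a time, so there is nothing for the batcher to merge; the requests have to be issued concurrently (e.g., from multiple threads or via the async client API) for dynamic batching to group them.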
I see, thanks for your patient answers ~~
Description
Reproduced Steps
Preparation: I followed the steps in the bert_guide document to set up the Docker container and convert the model with --infer_tensor_para_size=1. I also tested it with ${WORKSPACE}/tools/bert/identity_test.py using T=1, P=1 successfully.
Step 1: Set pipeline_para_size=2 and model_checkpoint_path=./all_models/bert/fastertransformer/1/1-gpu/ in config.pbtxt.
Step 2: Launch the container.
Step 3: Launch the Triton server in the container.
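For reference, a sketch of the step-3 launch, assuming the binary path and model repository layout from the fastertransformer_backend guide (with pipeline_para_size=2, two GPUs must be visible):

# Illustrative launch command; paths may differ in your setup.
CUDA_VISIBLE_DEVICES=0,1 /opt/tritonserver/bin/tritonserver \
    --model-repository=./all_models/bert/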
Then the error occurs as shown in the log at the top of this thread.