triton-inference-server / fastertransformer_backend

BSD 3-Clause "New" or "Revised" License

server crashes when traffic is a little bit high #77

Open rahuan opened 1 year ago

rahuan commented 1 year ago

Description

main branch, V100

The deployed Docker pods crash and restart every few minutes. The service seems stable when QPS is low.
Below is the error log from just before a pod crashes, obtained with: kubectl logs <pod-name> --all-containers -p
 0# 0x000055F47681BC19 in tritonserver
 1# 0x00007F17168FE090 in /usr/lib/x86_64-linux-gnu/libc.so.6
 2# abort in /usr/lib/x86_64-linux-gnu/libc.so.6
 3# 0x00007F1716CB7911 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 4# 0x00007F1716CC338C in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 5# 0x00007F1716CC33F7 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 6# 0x00007F1716CC36A9 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 7# void fastertransformer::check<cudaError>(cudaError, char const*, char const*, int) in /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so
 8# BertTritonModelInstance<__half>::forward(std::shared_ptr<std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, triton::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, triton::Tensor> > > >) in /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so
 9# 0x00007F170C34EE0A in /opt/tritonserver/backends/fastertransformer/libtriton_fastertransformer.so
10# 0x00007F1716CEFDE4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
11# 0x00007F1717F04609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0
12# clone in /usr/lib/x86_64-linux-gnu/libc.so.6

I also sometimes observe the following error response on the Triton client:
[StatusCode.INTERNAL] pinned buffer: failed to perform CUDA copy: invalid argument

The client error may be related to the server pod crash.
Thanks so much if this could get fixed.
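For reference, a minimal sketch of how this INTERNAL status surfaces on the client side, assuming the Python gRPC client and the input names that appear in the verbose logs later in this thread (the model name, URL, shapes, and dtypes are guesses, not from the report):

import numpy as np
import tritonclient.grpc as grpcclient
from tritonclient.utils import InferenceServerException

client = grpcclient.InferenceServerClient(url="localhost:8001")

# Assumed shapes: [batch, seq, hidden] hidden states, [batch, 1] int32 lengths.
hidden = np.zeros((1, 32, 768), dtype=np.float16)
seq_len = np.array([[32]], dtype=np.int32)

inputs = [
    grpcclient.InferInput("input_hidden_state", list(hidden.shape), "FP16"),
    grpcclient.InferInput("sequence_lengths", list(seq_len.shape), "INT32"),
]
inputs[0].set_data_from_numpy(hidden)
inputs[1].set_data_from_numpy(seq_len)

try:
    client.infer(model_name="fastertransformer", inputs=inputs)
except InferenceServerException as e:
    # e.status() is e.g. "StatusCode.INTERNAL"; e.message() carries the
    # "pinned buffer: failed to perform CUDA copy: invalid argument" text.
    print(e.status(), e.message())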

Reproduced Steps

git clone https://github.com/triton-inference-server/fastertransformer_backend.git
cd fastertransformer_backend
export WORKSPACE=$(pwd)
export CONTAINER_VERSION=22.07
export TRITON_DOCKER_IMAGE=triton_with_ft:${CONTAINER_VERSION}
docker build --rm \
    --build-arg TRITON_VERSION=${CONTAINER_VERSION} \
    -t ${TRITON_DOCKER_IMAGE} \
    -f docker/Dockerfile \
    .
docker tag ${TRITON_DOCKER_IMAGE} <github_or_gitlab/repo_name/image_name>:${CONTAINER_VERSION}
docker push <github_or_gitlab/repo_name/image_name>:${CONTAINER_VERSION}

Upload the FT model folders and deploy the Triton service with Docker, then test it using the Triton client API.
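To approximate the traffic level that triggers the crash, a concurrent driver along these lines could be used. This is a hedged sketch, not from the report: the model name, URL, input names, and shapes are assumptions, and make_inputs() is a stand-in for the real request payload.

from concurrent.futures import ThreadPoolExecutor

import numpy as np
import tritonclient.grpc as grpcclient
from tritonclient.utils import InferenceServerException

def make_inputs():
    # Fixed-shape dummy batch; real traffic uses variable lengths (see below).
    hidden = np.zeros((10, 64, 768), dtype=np.float16)
    seq_len = np.full((10, 1), 64, dtype=np.int32)
    i0 = grpcclient.InferInput("input_hidden_state", list(hidden.shape), "FP16")
    i1 = grpcclient.InferInput("sequence_lengths", list(seq_len.shape), "INT32")
    i0.set_data_from_numpy(hidden)
    i1.set_data_from_numpy(seq_len)
    return [i0, i1]

def one_request(i: int) -> str:
    # A fresh client per request keeps the sketch simple and avoids sharing
    # one gRPC client across threads.
    client = grpcclient.InferenceServerClient(url="localhost:8001")
    try:
        client.infer(model_name="fastertransformer", inputs=make_inputs())
        return "ok"
    except InferenceServerException as e:
        return f"req {i}: {e.message()}"

with ThreadPoolExecutor(max_workers=32) as pool:
    for outcome in pool.map(one_request, range(1000)):
        if outcome != "ok":
            print(outcome)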
rahuan commented 1 year ago

BTW, my model is BERT, any hints?

PerkzZheng commented 1 year ago

@rahuan can you try running it with the latest Triton server (rebuild the image if you are not using the latest one) and enable verbose logging with --log-verbose 1?

Also, try without k8s first to see if it is reproducible.

rahuan commented 1 year ago

@rahuan can you try running it with the latest Triton server (rebuild the image if you are not using the latest one) and enable verbose logging with --log-verbose 1?

Also, try without k8s first to see if it is reproducible.

I built the image from the latest fastertransformer_backend code as of Nov. 24th, and I ran with --log-verbose=2. Let me sync the latest code now and rebuild.

rahuan commented 1 year ago

I just synced the latest fastertransformer_backend code; it now fails even faster, at a very low QPS. Below are the errors:

I1212 06:38:03.990948 1 libfastertransformer.cc:1022] get total batch_size = 1
I1212 06:38:03.990967 1 libfastertransformer.cc:1433] get input count = 2
I1212 06:38:03.991087 1 libfastertransformer.cc:1672] collect name: input_hidden_state size: 353280 bytes
I1212 06:38:03.991104 1 libfastertransformer.cc:1672] collect name: sequence_lengths size: 40 bytes
I1212 06:38:03.991113 1 libfastertransformer.cc:1683] the data is in CPU
I1212 06:38:03.991121 1 libfastertransformer.cc:1690] the data is in CPU
I1212 06:38:03.991145 1 libfastertransformer.cc:1380] before ThreadForward 0
I1212 06:38:03.991202 1 libfastertransformer.cc:1388] after ThreadForward 0
I1212 06:38:03.991222 1 libfastertransformer.cc:1226] Start to forward
terminate called after throwing an instance of 'std::runtime_error'
  what():  [FT][ERROR] Assertion fail: /workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/layers/attention_layers/FusedAttentionLayer.cu:178

Signal (6) received.
 0# 0x000055C30FAAFC19 in tritonserver
 1# 0x00007FB022A0E090 in /usr/lib/x86_64-linux-gnu/libc.so.6
 2# gsignal in /usr/lib/x86_64-linux-gnu/libc.so.6
 3# abort in /usr/lib/x86_64-linux-gnu/libc.so.6
 4# 0x00007FB022DC7911 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 5# 0x00007FB022DD338C in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 6# 0x00007FB022DD33F7 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 7# 0x00007FB022DD36A9 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 8# fastertransformer::myAssert(bool, char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) in /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so
 9# fastertransformer::FusedAttentionLayer<__half>::forward(fastertransformer::TensorMap*, fastertransformer::TensorMap*, fastertransformer::AttentionWeight<__half> const*) in /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so
10# fastertransformer::Bert<__half>::forward(fastertransformer::TensorMap*, fastertransformer::TensorMap*, fastertransformer::BertWeight<__half> const*) in /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so
11# BertTritonModelInstance<__half>::forward(std::shared_ptr<std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, triton::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, triton::Tensor> > > >) in /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so
12# 0x00007FB0184521F7 in /opt/tritonserver/backends/fastertransformer/libtriton_fastertransformer.so
13# 0x00007FB022DFFDE4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
14# 0x00007FB024014609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0
15# clone in /usr/lib/x86_64-linux-gnu/libc.so.6

PerkzZheng commented 1 year ago

What seq length are you using?

rahuan commented 1 year ago

What seq length are you using?

Batch size is 10 or 20; seq length differs for each sentence in a batch, averaging about 50~60, but the max seq length is limited to 128.
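For context, a sketch of how such a variable-length batch might be laid out; the shapes and dtypes are assumptions consistent with the verbose log above (FP16 hidden states, INT32 lengths, padding to the longest sentence in the batch), not taken from the report:

import numpy as np

rng = np.random.default_rng(0)
batch = 10
hidden_size = 768                                             # from the model details below
lengths = rng.integers(20, 129, size=batch).astype(np.int32)  # true per-sentence lengths, max 128
max_len = int(lengths.max())                                  # pad target, never above 128

# Zero-padded hidden states: [batch, max_len, hidden_size].
hidden = np.zeros((batch, max_len, hidden_size), dtype=np.float16)
for b, n in enumerate(lengths):
    hidden[b, :n, :] = rng.standard_normal((n, hidden_size), dtype=np.float32).astype(np.float16)

# The "sequence_lengths" input carries the true (unpadded) lengths.
seq_len = lengths.reshape(batch, 1)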

PerkzZheng commented 1 year ago

Thanks. Can you also share the head size you are using? That will be helpful for us to reproduce.

rahuan commented 1 year ago

The model settings are the same as bert-base-chinese: layer num is 12, head num is 12, and hidden size is 768 = 64*12. Thanks! BTW, the data_type is fp16 and is_remove_padding is set to 1.

rahuan commented 1 year ago

@PerkzZheng, may I ask if there are any findings on this issue?

PerkzZheng commented 1 year ago

@rahuan sorry for the late response. You can print out the value of s before this line here. You might have changed how s is given, and we cannot find the corresponding kernel.
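For context on what "can not find the corresponding kernel" means: the fused attention path only ships pre-built kernels for specific (GPU arch, head size, sequence length) combinations, and the assertion at FusedAttentionLayer.cu:178 fires when the dispatch finds none for the current s. Below is a toy illustration of that constraint only; the list of supported lengths is an assumption for illustration, not taken from the FT source:

# Toy illustration (not FT's actual dispatch logic): fused kernels exist only
# for a fixed set of sequence-length buckets; anything outside fails.
ASSUMED_FUSED_SEQ_LENS = (64, 96, 128, 192, 256, 384)  # assumption; per-arch in reality

def pick_fused_kernel(s: int):
    """Return the smallest kernel bucket covering s, or None if there is none."""
    for cap in ASSUMED_FUSED_SEQ_LENS:
        if s <= cap:
            return cap
    return None  # <- the situation the FT assertion guards against

for s in (23, 128, 130, 512):
    print(s, "->", pick_fused_kernel(s))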