triton-inference-server / fastertransformer_backend


[ERROR] Does not find the section encoder with name relative_attention_num_buckets_or_max_pos_seq_len #74

Closed 520jefferson closed 1 year ago

520jefferson commented 1 year ago

Description

branch: fastertransformer_backend-release-v1.2.1_tag/
Triton with FT container version: 22.07
GPU: V100
model: Hugging Face t5-base

Reproduced Steps

After building the Docker image, I took the original Hugging Face t5-base model and converted it:

1. Convert the model:
python ./build/fastertransformer_backend/build/_deps/repo-ft-src/examples/pytorch/t5/utils/t5_ckpt_convert.py -o /workspace/build/fastertransformer_backend/all_models/t5/fastertransformer/1 -i /FT5/t5-base/ -infer_gpu_num 1

2. Start the model:
export CUDA_VISIBLE_DEVICES=6
/workspace/build/fastertransformer_backend/all_models/t5/fastertransformer# mpirun -n 1 --allow-run-as-root /opt/tritonserver/bin/tritonserver --model-repository=/workspace/build/fastertransformer_backend/all_models/t5

Then I hit this error:
I1115 06:40:44.582619 7292 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7fbbd4000000' with size 268435456
I1115 06:40:44.583982 7292 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I1115 06:40:44.595437 7292 model_repository_manager.cc:1206] loading: fastertransformer:1
I1115 06:40:44.685154 7292 libfastertransformer.cc:1478] TRITONBACKEND_Initialize: fastertransformer
I1115 06:40:44.685175 7292 libfastertransformer.cc:1488] Triton TRITONBACKEND API version: 1.10
I1115 06:40:44.685180 7292 libfastertransformer.cc:1494] 'fastertransformer' TRITONBACKEND API version: 1.10
I1115 06:40:44.685213 7292 libfastertransformer.cc:1526] TRITONBACKEND_ModelInitialize: fastertransformer (version 1)
I1115 06:40:44.686569 7292 libfastertransformer.cc:218] Instance group type: KIND_CPU count: 1
I1115 06:40:44.686588 7292 libfastertransformer.cc:248] Sequence Batching: disabled
[ERROR] Does not find the section encoder with name relative_attention_num_buckets_or_max_pos_seq_len. 
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[35452,1],0]
  Exit code:    255
--------------------------------------------------------------------------

If I set relative_attention_num_buckets_or_max_pos_seq_len = 32 in config.ini, I then hit this error:
[ERROR] Does not find the section encoder with name weight_data_type.
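(For context on where this message comes from: the backend reads the config.ini written by the converter next to the weights and requires certain keys in the [encoder]/[decoder] sections, so a checkpoint converted with the wrong script fails on the first required key it cannot find. A minimal sketch of the lookup pattern, using the inih-style INIReader that FT bundles; the helper and the key handling here are illustrative, not FT's exact code:)

    #include <cstdio>
    #include <cstdlib>
    #include <string>
    #include "INIReader.h"  // ini parser bundled in FasterTransformer's 3rdparty/

    // Illustrative helper (not FT's exact code): require an integer key in a
    // section of config.ini, aborting with the error style seen above.
    static long require_int(const INIReader& reader,
                            const std::string& section, const std::string& name)
    {
        const long kMissing = -314159265;  // sentinel meaning "key not found"
        long value = reader.GetInteger(section, name, kMissing);
        if (value == kMissing) {
            std::fprintf(stderr, "[ERROR] Does not find the section %s with name %s.\n",
                         section.c_str(), name.c_str());
            std::exit(255);
        }
        return value;
    }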
byshiue commented 1 year ago

Can you post the config.ini?

520jefferson commented 1 year ago

The config.ini is as follows. Now I am trying to convert the model with (/workspace# python3 ./build/fastertransformer_backend/build/_deps/repo-ft-src/examples/pytorch/t5/utils/huggingface_t5_ckpt_convert.py -in_file /wlj/545000 -saved_dir /workspace/build/fastertransformer_backend/all_models/t5_545000/1 -inference_tensor_para_size 1) instead. Maybe I shouldn't have used t5_ckpt_convert.py to convert the model?

[encoder]
vocab_size = 33795
d_model = 768
d_kv = 64
d_ff = 3072
num_layers = 12
num_decoder_layers = 12
num_heads = 12
relative_attention_num_buckets = 32
relative_attention_max_distance = 128
dropout_rate = 0.1
layer_norm_epsilon = 1e-06
initializer_factor = 1.0
feed_forward_proj = relu
use_cache = False
dense_act_fn = relu
is_gated_act = False
return_dict = True
output_hidden_states = False
output_attentions = False
torchscript = False
torch_dtype = float32
use_bfloat16 = False
tf_legacy_loss = False
pruned_heads = {}
tie_word_embeddings = True
is_encoder_decoder = False
is_decoder = False
cross_attention_hidden_size = None
add_cross_attention = False
tie_encoder_decoder = False
max_length = 20
min_length = 0
do_sample = False
early_stopping = False
num_beams = 1
num_beam_groups = 1
diversity_penalty = 0.0
temperature = 1.0
top_k = 50
top_p = 1.0
typical_p = 1.0
repetition_penalty = 1.0
length_penalty = 1.0
no_repeat_ngram_size = 0
encoder_no_repeat_ngram_size = 0
bad_words_ids = None
num_return_sequences = 1
chunk_size_feed_forward = 0
output_scores = False
return_dict_in_generate = False
forced_bos_token_id = None
forced_eos_token_id = None
remove_invalid_values = False
exponential_decay_length_penalty = None
suppress_tokens = None
begin_suppress_tokens = None
architectures = ['T5ForConditionalGeneration']
finetuning_task = None
id2label = {0: 'LABEL_0', 1: 'LABEL_1'}
label2id = {'LABEL_0': 0, 'LABEL_1': 1}
tokenizer_class = None
prefix = None
bos_token_id = None
pad_token_id = 3
eos_token_id = 33606
sep_token_id = None
decoder_start_token_id = 33605
task_specific_params = None
problem_type = None
_name_or_path = /wlj/545000/
transformers_version = 4.24.0
model_type = t5
n_positions = 512
output_past = True

[decoder]
vocab_size = 33795
d_model = 768
d_kv = 64
d_ff = 3072
num_layers = 12
num_decoder_layers = 12
num_heads = 12
relative_attention_num_buckets = 32
relative_attention_max_distance = 128
dropout_rate = 0.1
layer_norm_epsilon = 1e-06
initializer_factor = 1.0
feed_forward_proj = relu
use_cache = True
dense_act_fn = relu
is_gated_act = False
return_dict = True
output_hidden_states = False
output_attentions = False
torchscript = False
torch_dtype = float32
use_bfloat16 = False
tf_legacy_loss = False
pruned_heads = {}
tie_word_embeddings = True
is_encoder_decoder = False
is_decoder = True
cross_attention_hidden_size = None
add_cross_attention = False
tie_encoder_decoder = False
max_length = 20
min_length = 0
do_sample = False
early_stopping = False
num_beams = 1
num_beam_groups = 1
diversity_penalty = 0.0
temperature = 1.0
top_k = 50
top_p = 1.0
typical_p = 1.0
repetition_penalty = 1.0
length_penalty = 1.0
no_repeat_ngram_size = 0
encoder_no_repeat_ngram_size = 0
bad_words_ids = None
num_return_sequences = 1
chunk_size_feed_forward = 0
output_scores = False
return_dict_in_generate = False
forced_bos_token_id = None
forced_eos_token_id = None
remove_invalid_values = False
exponential_decay_length_penalty = None
suppress_tokens = None
begin_suppress_tokens = None
architectures = ['T5ForConditionalGeneration']
finetuning_task = None
id2label = {0: 'LABEL_0', 1: 'LABEL_1'}
label2id = {'LABEL_0': 0, 'LABEL_1': 1}
tokenizer_class = None
prefix = None
bos_token_id = None
pad_token_id = 3
eos_token_id = 33606
sep_token_id = None
decoder_start_token_id = 33605
task_specific_params = None
problem_type = None
_name_or_path = /wlj/545000/
transformers_version = 4.24.0
model_type = t5
n_positions = 512
output_past = True

byshiue commented 1 year ago

If your model is from HF, you should use huggingface_t5_ckpt_convert.py.

520jefferson commented 1 year ago

The V100 machine's CUDA version is 11.1, but the Docker image's CUDA version is 11.7 (22.07 container).

After converting the model with huggingface_t5_ckpt_convert.py, I started the server and hit the error below. It seems CUDA 11.7 is not compatible with CUDA 11.1. Should I change CUDA 11.7 to 11.1 inside the docker and then rebuild, like https://github.com/triton-inference-server/fastertransformer_backend#rebuilding-fastertransformer-backend-optional? Or should I rebuild FT with another container whose CUDA version is not newer than 11.1?

CUDA_VISIBLE_DEVICES=0 mpirun -n 1 --allow-run-as-root /opt/tritonserver/bin/tritonserver --model-repository=/workspace/build/fastertransformer_backend/all_models/t5_545000

I1115 07:33:28.755888 8867 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7fee7e000000' with size 268435456
I1115 07:33:28.757518 8867 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I1115 07:33:28.768517 8867 model_repository_manager.cc:1206] loading: fastertransformer:1
I1115 07:33:28.866421 8867 libfastertransformer.cc:1478] TRITONBACKEND_Initialize: fastertransformer
I1115 07:33:28.866448 8867 libfastertransformer.cc:1488] Triton TRITONBACKEND API version: 1.10
I1115 07:33:28.866453 8867 libfastertransformer.cc:1494] 'fastertransformer' TRITONBACKEND API version: 1.10
I1115 07:33:28.866490 8867 libfastertransformer.cc:1526] TRITONBACKEND_ModelInitialize: fastertransformer (version 1)
I1115 07:33:28.867923 8867 libfastertransformer.cc:218] Instance group type: KIND_CPU count: 1
I1115 07:33:28.867942 8867 libfastertransformer.cc:248] Sequence Batching: disabled
I1115 07:33:28.868163 8867 libfastertransformer.cc:420] Before Loading Weights: after allocation : free: 11.92 GB, total: 31.75 GB, used: 19.83 GB
I1115 07:33:30.377086 8867 libfastertransformer.cc:430] After Loading Weights: after allocation : free: 11.24 GB, total: 31.75 GB, used: 20.51 GB
W1115 07:33:30.377193 8867 libfastertransformer.cc:478] skipping model configuration auto-complete for 'fastertransformer': not supported for fastertransformer backend
I1115 07:33:30.379198 8867 libfastertransformer.cc:451] Before Loading Model: after allocation : free: 11.24 GB, total: 31.75 GB, used: 20.51 GB
terminate called after throwing an instance of 'std::runtime_error'
  what():  [FT][ERROR] CUDA runtime error: API call is not supported in the installed CUDA driver /workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/utils/allocator.h:157
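(The failing call at allocator.h:157 is consistent with the stream-ordered allocator, cudaMallocAsync, which needs CUDA >= 11.2 support from the driver as well as the runtime; compare the FT warning that appears later in this thread. A quick way to see the runtime/driver mismatch on the machine, as a hedged sketch:)

    #include <cstdio>
    #include <cuda_runtime_api.h>

    // Prints the CUDA versions seen by the driver and by the runtime. With a
    // 455.xx driver (CUDA 11.1) inside a 22.07 container (CUDA 11.7) the two
    // numbers disagree, and 11.2-only APIs such as cudaMallocAsync fail.
    int main()
    {
        int driver_version = 0, runtime_version = 0;
        cudaDriverGetVersion(&driver_version);    // e.g. 11010 for CUDA 11.1
        cudaRuntimeGetVersion(&runtime_version);  // e.g. 11070 for CUDA 11.7
        std::printf("driver CUDA %d, runtime CUDA %d\n", driver_version, runtime_version);
        return 0;
    }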

byshiue commented 1 year ago

I don't understand the meaning of "The v100 cuda version is 11.1".

You can recompile FT with CUDA 11.1 in the docker. But I am not sure whether you can then use CUDA 11.1 to launch the server, because you are not recompiling the Triton server itself.

520jefferson commented 1 year ago

"The v100 cuda version is 11.1" means the i use the nvida-smi, the i can see the cuda version is 11.1: +-----------------------------------------------------------------------------+ | NVIDIA-SMI 455.32.00 Driver Version: 455.32.00 CUDA Version: 11.1 | |-------------------------------+----------------------+----------------------+

Thanks a lot, I will rebuild everything with another container version.

520jefferson commented 1 year ago

When I build with the 20.12 container, I hit this:

fatal: unable to access 'http://github.com/triton-inference-server/core.git/': GnuTLS recv error (-110): The TLS connection was non-properly terminated.
Cloning into 'repo-core-src'...
fatal: unable to access 'http://github.com/triton-inference-server/core.git/': Failed to connect to github.com port 443: Connection timed out

The git access is not okay. Is there an alternative?

byshiue commented 1 year ago

For question about triton server, please ask in https://github.com/triton-inference-server/server.

520jefferson commented 1 year ago

I rebuilt the docker on the v100 machine with container version 20.12 like this, but I hit an error I have never seen before.

--------------------------run.sh-------------------------------------------
cd fastertransformer_backend-release-v1.2.1_tag
export WORKSPACE=$(pwd)
export CONTAINER_VERSION=20.12
export TRITON_DOCKER_IMAGE=triton_with_ft:${CONTAINER_VERSION}

# prepare docker image
docker build --rm \
    --build-arg TRITON_VERSION=${CONTAINER_VERSION} \
    -t ${TRITON_DOCKER_IMAGE} \
    -f docker/Dockerfile \
    .


[ 65%] Linking CXX static library ../../../../../../lib/libBertINT8.a
[ 65%] Built target BertINT8
Scanning dependencies of target bert_int8_example
[ 66%] Building CXX object _deps/repo-ft-build/examples/cpp/bert_int8/CMakeFiles/bert_int8_example.dir/bert_int8_example.cc.o
In file included from /workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/models/vit_int8/ViTINT8.h:26,
                 from /workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/models/vit_int8/ViTINT8.cc:17:
/workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/utils/conv2d.h: In function 'void fastertransformer::conv2d(T*, const T*, const T*, int, int, int, int, int, int, int, cudnnContext&)':
/workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/utils/conv2d.h:50:20: error: 'CUDNN_DATA_BFLOAT16' was not declared in this scope; did you mean 'CUDNN_DATA_FLOAT'?
   50 |         dataType = CUDNN_DATA_BFLOAT16;
      |                    ^~~~~~~~~~~~~~~~~~~
      |                    CUDNN_DATA_FLOAT
[ 66%] Linking CUDA executable ../../../../bin/test_penalty_kernels
[ 66%] Built target test_penalty_kernels
[ 66%] Linking CUDA executable ../../../../bin/test_gpt_kernels
[ 67%] Linking CUDA executable ../../../../bin/test_logprob_kernels
make[2]: *** [_deps/repo-ft-build/src/fastertransformer/models/vit_int8/CMakeFiles/ViTINT8.dir/build.make:82: _deps/repo-ft-build/src/fastertransformer/models/vit_int8/CMakeFiles/ViTINT8.dir/ViTINT8.cc.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:5877: _deps/repo-ft-build/src/fastertransformer/models/vit_int8/CMakeFiles/ViTINT8.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....
[ 67%] Built target test_gpt_kernels
...
/workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/kernels/sampling_topp_kernels.h:110:8: note: 'struct fastertransformer::segmented_topp_impl::TopKPerSegmentContext' declared here
  110 | struct TopKPerSegmentContext {
      |        ^~~~~~~~~~~~~~~~~~~~~
[ 72%] Linking CUDA device code CMakeFiles/sampling_topp_kernels.dir/cmake_device_link.o
[ 72%] Linking CUDA static library ../../../../../lib/libsampling_topp_kernels.a
[ 72%] Built target sampling_topp_kernels
[ 72%] Linking CUDA device code CMakeFiles/online_softmax_beamsearch_kernels.dir/cmake_device_link.o
[ 73%] Linking CUDA static library ../../../../../lib/libonline_softmax_beamsearch_kernels.a
[ 73%] Built target online_softmax_beamsearch_kernels
make: *** [Makefile:149: all] Error 2
The command '/bin/sh -c mkdir build -p && cd build && cmake -D CMAKE_EXPORT_COMPILE_COMMANDS=1 -D CMAKE_BUILD_TYPE=Release -D CMAKE_INSTALL_PREFIX=/opt/tritonserver -D TRITON_COMMON_REPO_TAG="r${NVIDIA_TRITON_SERVER_VERSION}" -D TRITON_CORE_REPO_TAG="r${NVIDIA_TRITON_SERVER_VERSION}" -D TRITON_BACKEND_REPO_TAG="r${NVIDIA_TRITON_SERVER_VERSION}" .. && make -j"$(grep -c ^processor /proc/cpuinfo)" install' returned a non-zero code: 2

The full log is in this file: https://drive.google.com/file/d/1rkUGmC9AG1_AlxFNtukF8vDD8SIkCMkI/view?usp=sharing

byshiue commented 1 year ago

The docker image is too old to have bfloat16 in cuDNN. You can find a flag in the FT cmake file to disable the bfloat16 compilation.

520jefferson commented 1 year ago

I removed the following config from the cmake file and rebuilt, then hit the error below. I am using FasterTransformer v5.1.1 with fastertransformer_backend-release-v1.2.1_tag; maybe they don't match?


if(${CUDA_VERSION_MAJOR} VERSION_GREATER_EQUAL "11")
  add_definitions("-DENABLE_BF16")
  message("CUDA_VERSION ${CUDA_VERSION_MAJOR} is greater or equal than 11, enable -DENABLE_BF16 flag")
endif()
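(Removing that block drops the -DENABLE_BF16 define, so bf16-only code paths such as the failing conv2d.h branch are never compiled. Roughly, as a simplified sketch of the guarded pattern rather than FT's exact source:)

    // Fragment, simplified: the bf16 branch only exists when ENABLE_BF16 is
    // defined, so without the flag the CUDNN_DATA_BFLOAT16 symbol (missing
    // from the cuDNN in the 20.12 container) is never referenced.
    cudnnDataType_t dataType = CUDNN_DATA_FLOAT;
    #ifdef ENABLE_BF16
    if (std::is_same<T, __nv_bfloat16>::value) {
        dataType = CUDNN_DATA_BFLOAT16;
    }
    #endif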

[ 98%] Building CXX object _deps/repo-ft-build/examples/cpp/gptj/CMakeFiles/gptj_triton_example.dir/gptj_triton_example.cc.o
[ 98%] Built target ParallelGptTritonBackend
Scanning dependencies of target transformer-shared
Scanning dependencies of target transformer-static
[ 98%] Linking CUDA device code CMakeFiles/transformer-shared.dir/cmake_device_link.o
[ 98%] Linking CUDA device code CMakeFiles/transformer-static.dir/cmake_device_link.o
Scanning dependencies of target multi_gpu_gpt_triton_example
[ 98%] Building CXX object _deps/repo-ft-build/examples/cpp/multi_gpu_gpt/CMakeFiles/multi_gpu_gpt_triton_example.dir/multi_gpu_gpt_triton_example.cc.o
[ 99%] Linking CXX executable ../../../../../bin/multi_gpu_gpt_interactive_example
[ 99%] Built target gptj_example
[ 99%] Built target gptneox_example
[ 99%] Built target multi_gpu_gpt_example
[ 99%] Linking CXX executable ../../../../../bin/multi_gpu_gpt_async_example
[ 99%] Linking CXX static library ../../lib/libtransformer-static.a
[ 99%] Linking CXX shared library ../../lib/libtransformer-shared.so
[ 99%] Built target multi_gpu_gpt_interactive_example
[ 99%] Built target multi_gpu_gpt_async_example
[ 99%] Built target transformer-shared
Scanning dependencies of target triton-fastertransformer-backend
[ 99%] Building CXX object CMakeFiles/triton-fastertransformer-backend.dir/src/libfastertransformer.cc.o
[ 99%] Built target transformer-static
/workspace/build/fastertransformer_backend/src/libfastertransformer.cc: In constructor 'triton::backend::fastertransformer_backend::ModelState::ModelState(TRITONBACKEND_Model*)':
/workspace/build/fastertransformer_backend/src/libfastertransformer.cc:153:38: error: no matching function for call to 'triton::backend::BackendModel::BackendModel(TRITONBACKEND_Model*&, bool)'
  153 |     : BackendModel(triton_model, true)
      |                                      ^
In file included from /workspace/build/fastertransformer_backend/src/libfastertransformer.cc:45:
/workspace/build/fastertransformer_backend/build/_deps/repo-backend-src/include/triton/backend/backend_model.h:43:3: note: candidate: 'triton::backend::BackendModel::BackendModel(TRITONBACKEND_Model*)'
   43 |   BackendModel(TRITONBACKEND_Model* triton_model);
      |   ^~~~~~~~~~~~
/workspace/build/fastertransformer_backend/build/_deps/repo-backend-src/include/triton/backend/backend_model.h:43:3: note: candidate expects 1 argument, 2 provided
[ 99%] Linking CXX executable ../../../../../bin/gptneox_triton_example
[ 99%] Built target gptneox_triton_example
[100%] Linking CXX executable ../../../../../bin/gptj_triton_example
[100%] Built target gptj_triton_example
[100%] Linking CXX executable ../../../../../bin/multi_gpu_gpt_triton_example
[100%] Built target multi_gpu_gpt_triton_example
make[2]: *** [CMakeFiles/triton-fastertransformer-backend.dir/build.make:82: CMakeFiles/triton-fastertransformer-backend.dir/src/libfastertransformer.cc.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:1449: CMakeFiles/triton-fastertransformer-backend.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....
/workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/utils/nccl_utils.h: In function 'bool test_context_sharing(const string&, const string&) [with T = float]':
/workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/utils/nccl_utils.h:72:144: warning: 'pipeline_para.fastertransformer::NcclParam::nccl_uid_' may be used uninitialized in this function [-Wmaybe-uninitialized]
   72 |     NcclParam(NcclParam const& param):
      |                                                                                                                                                ^
/workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/utils/nccl_utils.h:72:144: warning: 'tensor_para.fastertransformer::NcclParam::nccl_uid_' may be used uninitialized in this function [-Wmaybe-uninitialized]
   72 |     NcclParam(NcclParam const& param):
      |                                                                                                                                                ^
[100%] Linking CXX executable ../../../../bin/test_context_decoder_layer
[100%] Built target test_context_decoder_layer
[100%] Linking CXX executable ../../../../bin/test_sampling
[100%] Built target test_sampling
make: *** [Makefile:149: all] Error 2
The command '/bin/sh -c mkdir build -p && cd build && cmake -D CMAKE_EXPORT_COMPILE_COMMANDS=1 -D CMAKE_BUILD_TYPE=Release -D CMAKE_INSTALL_PREFIX=/opt/tritonserver -D TRITON_COMMON_REPO_TAG="r${NVIDIA_TRITON_SERVER_VERSION}" -D TRITON_CORE_REPO_TAG="r${NVIDIA_TRITON_SERVER_VERSION}" -D TRITON_BACKEND_REPO_TAG="r${NVIDIA_TRITON_SERVER_VERSION}" .. && make -j"$(grep -c ^processor /proc/cpuinfo)" install' returned a non-zero code: 2

the whole logs:https://drive.google.com/file/d/1B1_vRKtnJc_O1HTlH2BF6aA4G2ON9gu8/view?usp=sharing

byshiue commented 1 year ago

Need to change

BackendModel(triton_model, true)

at https://github.com/triton-inference-server/fastertransformer_backend/blob/main/src/libfastertransformer.cc#L153 to

BackendModel(triton_model)
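(The compile log above shows why: the r20.12-era backend API declares only the one-argument constructor, hence "candidate expects 1 argument, 2 provided". The patched constructor in src/libfastertransformer.cc then reads roughly:)

    // src/libfastertransformer.cc, built against the r20.12 backend API
    ModelState::ModelState(TRITONBACKEND_Model* triton_model)
        : BackendModel(triton_model)  // was: BackendModel(triton_model, true)
    {
        // ... body unchanged ...
    }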
520jefferson commented 1 year ago

I can rebuild successfully now.

1. But when I use the config https://github.com/triton-inference-server/fastertransformer_backend/blob/main/all_models/t5/fastertransformer/config.pbtxt to start the server, I hit the error below:

CUDA_VISIBLE_DEVICES=0 mpirun -n 1 --allow-run-as-root /opt/tritonserver/bin/tritonserver --model-repository=/workspace/build/fastertransformer_backend/all_models/t5_545000

I1116 08:58:32.709152 752 metrics.cc:221] Collecting metrics for GPU 0: Tesla V100-SXM2-32GB
I1116 08:58:32.999338 752 libtorch.cc:945] TRITONBACKEND_Initialize: pytorch
I1116 08:58:32.999361 752 libtorch.cc:955] Triton TRITONBACKEND API version: 1.0
I1116 08:58:32.999366 752 libtorch.cc:961] 'pytorch' TRITONBACKEND API version: 1.0
2022-11-16 08:58:33.213690: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
I1116 08:58:33.256948 752 tensorflow.cc:1877] TRITONBACKEND_Initialize: tensorflow
I1116 08:58:33.256971 752 tensorflow.cc:1887] Triton TRITONBACKEND API version: 1.0
I1116 08:58:33.256976 752 tensorflow.cc:1893] 'tensorflow' TRITONBACKEND API version: 1.0
I1116 08:58:33.256980 752 tensorflow.cc:1917] backend configuration: {}
I1116 08:58:33.258658 752 onnxruntime.cc:1715] TRITONBACKEND_Initialize: onnxruntime
I1116 08:58:33.258677 752 onnxruntime.cc:1725] Triton TRITONBACKEND API version: 1.0
I1116 08:58:33.258682 752 onnxruntime.cc:1731] 'onnxruntime' TRITONBACKEND API version: 1.0
I1116 08:58:33.530428 752 pinned_memory_manager.cc:199] Pinned memory pool is created at '0x7fd95c000000' with size 268435456
I1116 08:58:33.531686 752 cuda_memory_manager.cc:103] CUDA memory pool is created on device 0 with size 67108864
[libprotobuf ERROR /tmp/tritonbuild/tritonserver/build/grpc-repo/src/grpc/third_party/protobuf/src/google/protobuf/text_format.cc:317] Error parsing text-format inference.ModelConfig: 49:13: Message type "inference.ModelInput" has no field named "optional".
E1116 08:58:33.535182 752 model_repository_manager.cc:1682] failed to read text proto from /workspace/build/fastertransformer_backend/all_models/t5_545000/fastertransformer/config.pbtxt
I1116 08:58:33.535260 752 server.cc:490]


2. When I use an old config https://drive.google.com/file/d/1gFfC2MDKdyXLflufQQI0BgT51RgO08ag/view?usp=sharing I hit another error below:

I1116 09:06:16.007486 786 metrics.cc:221] Collecting metrics for GPU 0: Tesla V100-SXM2-32GB
I1116 09:06:16.299886 786 libtorch.cc:945] TRITONBACKEND_Initialize: pytorch
I1116 09:06:16.299913 786 libtorch.cc:955] Triton TRITONBACKEND API version: 1.0
I1116 09:06:16.299918 786 libtorch.cc:961] 'pytorch' TRITONBACKEND API version: 1.0
2022-11-16 09:06:16.518536: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
I1116 09:06:16.563957 786 tensorflow.cc:1877] TRITONBACKEND_Initialize: tensorflow
I1116 09:06:16.563982 786 tensorflow.cc:1887] Triton TRITONBACKEND API version: 1.0
I1116 09:06:16.563987 786 tensorflow.cc:1893] 'tensorflow' TRITONBACKEND API version: 1.0
I1116 09:06:16.563991 786 tensorflow.cc:1917] backend configuration: {}
I1116 09:06:16.565688 786 onnxruntime.cc:1715] TRITONBACKEND_Initialize: onnxruntime
I1116 09:06:16.565710 786 onnxruntime.cc:1725] Triton TRITONBACKEND API version: 1.0
I1116 09:06:16.565715 786 onnxruntime.cc:1731] 'onnxruntime' TRITONBACKEND API version: 1.0
I1116 09:06:16.838932 786 pinned_memory_manager.cc:199] Pinned memory pool is created at '0x7fbb24000000' with size 268435456
I1116 09:06:16.840197 786 cuda_memory_manager.cc:103] CUDA memory pool is created on device 0 with size 67108864
I1116 09:06:16.845078 786 model_repository_manager.cc:787] loading: fastertransformer:1
I1116 09:06:16.988110 786 libfastertransformer.cc:1479] TRITONBACKEND_Initialize: fastertransformer
I1116 09:06:16.988127 786 libfastertransformer.cc:1489] Triton TRITONBACKEND API version: 1.0
I1116 09:06:16.988131 786 libfastertransformer.cc:1495] 'fastertransformer' TRITONBACKEND API version: 1.0
I1116 09:06:16.988839 786 libfastertransformer.cc:1527] TRITONBACKEND_ModelInitialize: fastertransformer (version 1)
I1116 09:06:16.989918 786 libfastertransformer.cc:219] Instance group type: KIND_CPU count: 1
I1116 09:06:16.991597 786 libfastertransformer.cc:249] Sequence Batching: disabled
E1116 09:06:16.991613 786 libfastertransformer.cc:374] Invalid configuration argument 'data_type':
I1116 09:06:16.991617 786 libfastertransformer.cc:421] Before Loading Weights: after allocation : free: 30.43 GB, total: 31.75 GB, used: 1.32 GB
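(The "Invalid configuration argument 'data_type'" line suggests this older backend reads a data_type entry from the parameters section of config.pbtxt; an illustrative entry, with the value depending on the checkpoint, would look like:)

    parameters {
      key: "data_type"
      value: { string_value: "fp32" }
    }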

byshiue commented 1 year ago

Old Triton does not support optional inputs. You need to remove them from the config file.
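(For example, an input entry in config.pbtxt marked optional would need that field deleted for a 20.12-era server; the input name and shape here are illustrative:)

    input [
      {
        name: "bad_words_list"
        data_type: TYPE_INT32
        dims: [ 2, -1 ]
        optional: true  # remove this line for old Triton releases
      }
    ]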

520jefferson commented 1 year ago

I can start the model after dropping the optional fields. Then I tested it and hit the errors below.

The server error (CUDA_VISIBLE_DEVICES=0 mpirun -n 1 --allow-run-as-root /opt/tritonserver/bin/tritonserver --model-repository=/workspace/build/fastertransformer_backend/all_models/t5_545000):

I1116 09:18:02.879483 1062 metrics.cc:221] Collecting metrics for GPU 0: Tesla V100-SXM2-32GB
I1116 09:18:03.167786 1062 libtorch.cc:945] TRITONBACKEND_Initialize: pytorch
I1116 09:18:03.167808 1062 libtorch.cc:955] Triton TRITONBACKEND API version: 1.0
I1116 09:18:03.167814 1062 libtorch.cc:961] 'pytorch' TRITONBACKEND API version: 1.0
2022-11-16 09:18:03.381970: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
I1116 09:18:03.426128 1062 tensorflow.cc:1877] TRITONBACKEND_Initialize: tensorflow
I1116 09:18:03.426152 1062 tensorflow.cc:1887] Triton TRITONBACKEND API version: 1.0
I1116 09:18:03.426158 1062 tensorflow.cc:1893] 'tensorflow' TRITONBACKEND API version: 1.0
I1116 09:18:03.426162 1062 tensorflow.cc:1917] backend configuration: {}
I1116 09:18:03.427825 1062 onnxruntime.cc:1715] TRITONBACKEND_Initialize: onnxruntime
I1116 09:18:03.427845 1062 onnxruntime.cc:1725] Triton TRITONBACKEND API version: 1.0
I1116 09:18:03.427850 1062 onnxruntime.cc:1731] 'onnxruntime' TRITONBACKEND API version: 1.0
I1116 09:18:03.701320 1062 pinned_memory_manager.cc:199] Pinned memory pool is created at '0x7f3e98000000' with size 268435456
I1116 09:18:03.702576 1062 cuda_memory_manager.cc:103] CUDA memory pool is created on device 0 with size 67108864
I1116 09:18:03.707453 1062 model_repository_manager.cc:787] loading: fastertransformer:1
I1116 09:18:03.851516 1062 libfastertransformer.cc:1479] TRITONBACKEND_Initialize: fastertransformer
I1116 09:18:03.851533 1062 libfastertransformer.cc:1489] Triton TRITONBACKEND API version: 1.0
I1116 09:18:03.851537 1062 libfastertransformer.cc:1495] 'fastertransformer' TRITONBACKEND API version: 1.0
I1116 09:18:03.852231 1062 libfastertransformer.cc:1527] TRITONBACKEND_ModelInitialize: fastertransformer (version 1)
I1116 09:18:03.853446 1062 libfastertransformer.cc:219] Instance group type: KIND_CPU count: 1
I1116 09:18:03.855113 1062 libfastertransformer.cc:249] Sequence Batching: disabled
I1116 09:18:03.855365 1062 libfastertransformer.cc:421] Before Loading Weights: after allocation : free: 30.43 GB, total: 31.75 GB, used: 1.32 GB
I1116 09:18:08.910025 1062 libfastertransformer.cc:431] After Loading Weights: after allocation : free: 28.91 GB, total: 31.75 GB, used: 2.84 GB
I1116 09:18:08.910276 1062 libfastertransformer.cc:452] Before Loading Model: after allocation : free: 28.91 GB, total: 31.75 GB, used: 2.84 GB
[FT][WARNING] Async cudaMalloc/Free is not supported before CUDA 11.2. Using Sync cudaMalloc/Free. Note this may lead to hang with NCCL kernels launched in parallel; if so, try NCCL_LAUNCH_MODE=GROUP
[WARNING] gemm_config.in is not found; using default GEMM algo
I1116 09:18:09.367633 1062 libfastertransformer.cc:466] After Loading Model: after allocation : free: 28.76 GB, total: 31.75 GB, used: 2.99 GB
I1116 09:18:09.367784 1062 libfastertransformer.cc:713] Model instance is created on GPU [ 0 ]
I1116 09:18:09.367802 1062 libfastertransformer.cc:1591] TRITONBACKEND_ModelInstanceInitialize: fastertransformer_0 (count 1) (instance_id 0)
I1116 09:18:09.367935 1062 model_repository_manager.cc:960] successfully loaded 'fastertransformer' version 1
I1116 09:18:09.368035 1062 server.cc:490]
+-------------------+-----------------------------------------------------------------------------+------+
| Backend | Config | Path |
+-------------------+-----------------------------------------------------------------------------+------+
| pytorch | /opt/tritonserver/backends/pytorch/libtriton_pytorch.so | {} |
| onnxruntime | /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so | {} |
| tensorflow | /opt/tritonserver/backends/tensorflow1/libtriton_tensorflow1.so | {} |
| fastertransformer | /opt/tritonserver/backends/fastertransformer/libtriton_fastertransformer.so | {} |
+-------------------+-----------------------------------------------------------------------------+------+

I1116 09:18:09.368075 1062 server.cc:533]
+-------------------+---------+--------+
| Model             | Version | Status |
+-------------------+---------+--------+
| fastertransformer | 1       | READY  |
+-------------------+---------+--------+

I1116 09:18:09.368151 1062 tritonserver.cc:1620]
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.6.0 |
| server_extensions | classification sequence model_repository schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data statistics |
| model_repository_path[0] | /workspace/build/fastertransformer_backend/all_models/t5_545000 |
| model_control_mode | MODE_NONE |
| strict_model_config | 1 |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

I1116 09:18:09.369396 1062 grpc_server.cc:3979] Started GRPCInferenceService at 0.0.0.0:8001
I1116 09:18:09.369643 1062 http_server.cc:2717] Started HTTPService at 0.0.0.0:8000
I1116 09:18:09.449811 1062 http_server.cc:2736] Started Metrics Service at 0.0.0.0:8002
I1116 09:18:34.172028 1062 libfastertransformer.cc:1091] Start to forward
terminate called after throwing an instance of 'std::runtime_error'
  what():  [FT][ERROR] Assertion fail: /workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/models/t5/T5Decoding.cc:431


T5Decoding.cc:431:
FT_CHECK(decoding_weights->post_decoder_embedding.bias != nullptr);


The test error (python3 tools/t5_utils/t5_end_to_end_test.py --batch_size 32):
/usr/local/lib/python3.8/dist-packages/transformers/models/t5/tokenization_t5.py:164: FutureWarning: This tokenizer was incorrectly instantiated with a model max length of 512 which will be corrected in Transformers v5. For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with truncation is True.

_queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "tools/t5_utils/t5_end_to_end_test.py", line 211, in <module>
    translate(vars(args))
  File "tools/t5_utils/t5_end_to_end_test.py", line 150, in translate
    result = client.infer(model_name, inputs)
  File "/usr/local/lib/python3.8/dist-packages/tritonclient/http/__init__.py", line 1482, in infer
    response = self._post(request_uri=request_uri,
  File "/usr/local/lib/python3.8/dist-packages/tritonclient/http/__init__.py", line 309, in _post
    response = self._client_stub.post(request_uri=request_uri,
  File "/usr/local/lib/python3.8/dist-packages/geventhttpclient/client.py", line 272, in post
    return self.request(METHOD_POST, request_uri, body=body, headers=headers)
  File "/usr/local/lib/python3.8/dist-packages/geventhttpclient/client.py", line 226, in request
    sock = self._connection_pool.get_socket()
  File "/usr/local/lib/python3.8/dist-packages/geventhttpclient/connectionpool.py", line 166, in get_socket
    return self._create_socket()
  File "/usr/local/lib/python3.8/dist-packages/geventhttpclient/connectionpool.py", line 127, in _create_socket
    raise first_error
  File "/usr/local/lib/python3.8/dist-packages/geventhttpclient/connectionpool.py", line 114, in _create_socket
    sock = self._connect_socket(sock, sock_info[-1])
  File "/usr/local/lib/python3.8/dist-packages/geventhttpclient/connectionpool.py", line 136, in _connect_socket
    sock.connect(address)
  File "/usr/local/lib/python3.8/dist-packages/gevent/_socketcommon.py", line 607, in connect
    raise _SocketError(err, strerror(err))
ConnectionRefusedError: [Errno 111] Connection refused

byshiue commented 1 year ago

You can change

        FT_CHECK(decoding_weights->post_decoder_embedding.bias != nullptr);
        cudaD2Dcpy(padded_post_decoder_embedding_bias_, decoding_weights->post_decoder_embedding.bias, vocab_size_);

to

        if (decoding_weights->post_decoder_embedding.bias != nullptr) {
            cudaD2Dcpy(padded_post_decoder_embedding_bias_,
                        decoding_weights->post_decoder_embedding.bias,
                        vocab_size_,
                        stream_);
        }
520jefferson commented 1 year ago

Step 1: I modified the code as you said and rebuilt it, then hit the error below (it seems the number of parameters passed to cudaD2Dcpy is more than it takes; based on FasterTransformer v5.1.1):

[ 90%] Building CXX object _deps/repo-ft-build/src/fastertransformer/models/multi_gpu_gpt/CMakeFiles/ParallelGpt.dir/ParallelGpt.cc.o
/workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/models/t5/T5Decoding.cc: In instantiation of 'void fastertransformer::T5Decoding<T>::forward(std::unordered_map<std::__cxx11::basic_string<char>, fastertransformer::Tensor>*, const std::unordered_map<std::__cxx11::basic_string<char>, fastertransformer::Tensor>*, const fastertransformer::T5DecodingWeight<T>*) [with T = float]':
/workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/models/t5/T5Decoding.cc:899:16: required from here
/workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/models/t5/T5Decoding.cc:434:23: error: no matching function for call to 'cudaD2Dcpy(float*&, const float* const&, const size_t&, CUstream_st*&)'
  434 |             cudaD2Dcpy(padded_post_decoder_embedding_bias_,
      |             ~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  435 |                        decoding_weights->post_decoder_embedding.bias,
      |                        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  436 |                        vocab_size_,
      |                        ~~~~~~~~~~~~
  437 |                        stream_);
      |                        ~~~~~~~~
In file included from /workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/layers/FfnLayer.h:24,
                 from /workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/layers/TensorParallelGeluFfnLayer.h:19,
                 from /workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/models/t5/T5Decoder.h:24,
                 from /workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/models/t5/T5Decoding.h:24,
                 from /workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/models/t5/T5Decoding.cc:17:
/workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/utils/memory_utils.h:42:6: note: candidate: 'template<class T> void fastertransformer::cudaD2Dcpy(T*, const T*, int)'
   42 | void cudaD2Dcpy(T* tgt, const T* src, const int size);
      |      ^~~~~~~~~~
/workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/utils/memory_utils.h:42:6: note: template argument deduction/substitution failed:
/workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/models/t5/T5Decoding.cc:434:23: note: candidate expects 3 arguments, 4 provided
/workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/models/t5/T5Decoding.cc: In instantiation of 'void fastertransformer::T5Decoding<T>::forward(std::unordered_map<std::__cxx11::basic_string<char>, fastertransformer::Tensor>*, const std::unordered_map<std::__cxx11::basic_string<char>, fastertransformer::Tensor>*, const fastertransformer::T5DecodingWeight<T>*) [with T = half]':
/workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/models/t5/T5Decoding.cc:900:16: required from here
/workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/models/t5/T5Decoding.cc:434:23: error: no matching function for call to 'cudaD2Dcpy(half*&, const __half* const&, const size_t&, CUstream_st*&)'
In file included from /workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/layers/FfnLayer.h:24,
                 from /workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/layers/TensorParallelGeluFfnLayer.h:19,
                 from /workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/models/t5/T5Decoder.h:24,
                 from /workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/models/t5/T5Decoding.h:24,
                 from /workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/models/t5/T5Decoding.cc:17:
/workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/utils/memory_utils.h:42:6: note: candidate: 'template<class T> void fastertransformer::cudaD2Dcpy(T*, const T*, int)'
   42 | void cudaD2Dcpy(T* tgt, const T* src, const int size);
      |      ^~~~~~~~~~
/workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/utils/memory_utils.h:42:6: note: template argument deduction/substitution failed:
/workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/models/t5/T5Decoding.cc:434:23: note: candidate expects 3 arguments, 4 provided
make[2]: *** [_deps/repo-ft-build/src/fastertransformer/models/t5/CMakeFiles/T5Decoding.dir/build.make:82: _deps/repo-ft-build/src/fastertransformer/models/t5/CMakeFiles/T5Decoding.dir/T5Decoding.cc.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [CMakeFiles/Makefile2:4778: _deps/repo-ft-build/src/fastertransformer/models/t5/CMakeFiles/T5Decoding.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....


Step 2: Then I replaced

cudaD2Dcpy(padded_post_decoder_embedding_bias_, decoding_weights->post_decoder_embedding.bias, vocab_size_, stream_);

with

check_cuda_error(cudaMemcpy(padded_post_decoder_embedding_bias_, decoding_weights->post_decoder_embedding.bias, vocab_size_, stream_));

Then I hit the error below. What should I do?


/workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/models/t5/T5Decoding.cc:901:16: required from here
/workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/models/t5/T5Decoding.cc:434:138: error: cannot convert 'cudaStream_t' {aka 'CUstream_st*'} to 'cudaMemcpyKind'
  434 |         check_cuda_error(cudaMemcpy(padded_post_decoder_embedding_bias_, decoding_weights->post_decoder_embedding.bias, vocab_size_, stream_));
      |                                                                                                                                      ^~~~~~~
      |                                                                                                                                      |
      |                                                                                                                                      cudaStream_t {aka CUstream_st*}
/workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/utils/cuda_utils.h:117:38: note: in definition of macro 'check_cuda_error'
  117 | #define check_cuda_error(val) check((val), #val, __FILE__, __LINE__)
      |                                      ^~~
In file included from /usr/local/cuda/include/channel_descriptor.h:61,
                 from /usr/local/cuda/include/cuda_runtime.h:95,
                 from /workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/kernels/decoding_kernels.h:20,
                 from /workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/models/t5/T5Decoding.h:22,
                 from /workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/models/t5/T5Decoding.cc:17:
/usr/local/cuda/include/cuda_runtime_api.h:5781:112: note: initializing argument 4 of 'cudaError_t cudaMemcpy(void*, const void*, size_t, cudaMemcpyKind)'
 5781 | extern __host__ cudaError_t CUDARTAPI cudaMemcpy(void *dst, const void *src, size_t count, enum cudaMemcpyKind kind);
      |                                                                                            ~~~~~~~~~~~~~~~~~~~~^~~~
make[2]: *** [_deps/repo-ft-build/src/fastertransformer/models/t5/CMakeFiles/T5Decoding.dir/build.make:82: _deps/repo-ft-build/src/fastertransformer/models/t5/CMakeFiles/T5Decoding.dir/T5Decoding.cc.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [CMakeFiles/Makefile2:4778: _deps/repo-ft-build/src/fastertransformer/models/t5/CMakeFiles/T5Decoding.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....
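(For reference: cudaMemcpy's fourth parameter is a cudaMemcpyKind, not a stream, and its third is a byte count rather than an element count, so a direct cudaMemcpy version of this copy would have to look roughly like the following sketch:)

    // cudaMemcpy takes (dst, src, byte_count, kind); unlike FT's cudaD2Dcpy,
    // the size is in bytes and there is no stream argument.
    check_cuda_error(cudaMemcpy(padded_post_decoder_embedding_bias_,
                                decoding_weights->post_decoder_embedding.bias,
                                vocab_size_ * sizeof(T),
                                cudaMemcpyDeviceToDevice));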


Step 3: I just deleted the fourth parameter "stream_" from step 1's change, as you told me initially, and then it built successfully. I don't know whether that is okay. Then I started the server and tested it. Maybe the tokenizer is not right, but I got a BLEU score of 0.00, which is weird. I will test it with the t5-base model instead of my own model.
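(Putting the steps together, the final form of the patch in T5Decoding.cc, using the three-argument cudaD2Dcpy shown in the compiler's candidate note above, is roughly:)

    // T5Decoding.cc (FT v5.1.1): guard the copy instead of asserting, and
    // call the 3-argument cudaD2Dcpy(T* tgt, const T* src, int size).
    if (decoding_weights->post_decoder_embedding.bias != nullptr) {
        cudaD2Dcpy(padded_post_decoder_embedding_bias_,
                   decoding_weights->post_decoder_embedding.bias,
                   vocab_size_);
    }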

Server start log:

I1117 06:12:35.962853 258 server.cc:533]
+-------------------+---------+--------+
| Model             | Version | Status |
+-------------------+---------+--------+
| fastertransformer | 1       | READY  |
+-------------------+---------+--------+

Test script and log:

/workspace/fastertransformer_backend# python3 tools/t5_utils/t5_end_to_end_test.py --batch_size 32

...
set request
get request
bleu score: 0.00
bleu counts: [0, 0, 0, 0]
bleu totals: [0, 0, 0, 0]
bleu precisions: [0.0, 0.0, 0.0, 0.0]
bleu sys_len: 0; ref_len: 61287
[INFO] ft_triton translates 94 batches taking 34.75 sec to translate 0 tokens, BLEU score: 0.00, 0 tokens/sec.

Then I used the t5-small model (git lfs clone https://huggingface.co/t5-small), and the BLEU score looks right:

set request
get request
bleu score: 25.36
bleu counts: [35698, 18410, 10667, 6417]
bleu totals: [62022, 59018, 56014, 53010]
bleu precisions: [57.556995904678985, 31.19387305567793, 19.043453422358695, 12.105263157894736]
bleu sys_len: 62022; ref_len: 61287
[INFO] ft_triton translates 94 batches taking 9.06 sec to translate 62022 tokens, BLEU score: 25.36, 6847 tokens/sec.

Thanks!