triton-inference-server / fastertransformer_backend


[ERROR] Does not find the section encoder with name relative_attention_num_buckets_or_max_pos_seq_len #74

Closed 520jefferson closed 1 year ago

520jefferson commented 1 year ago

Description

branch: fastertransformer_backend-release-v1.2.1_tag/
Triton with FT container version: 22.07
GPU: V100
model: Hugging Face t5-base

Reproduced Steps

After building the Docker image, I took the original Hugging Face t5-base model and converted it:

1. Convert the model:
python ./build/fastertransformer_backend/build/_deps/repo-ft-src/examples/pytorch/t5/utils/t5_ckpt_convert.py -o /workspace/build/fastertransformer_backend/all_models/t5/fastertransformer/1 -i /FT5/t5-base/ -infer_gpu_num 1

2. Start the model:
export CUDA_VISIBLE_DEVICES=6
/workspace/build/fastertransformer_backend/all_models/t5/fastertransformer# mpirun -n 1 --allow-run-as-root /opt/tritonserver/bin/tritonserver --model-repository=/workspace/build/fastertransformer_backend/all_models/t5

Then I hit this error:
I1115 06:40:44.582619 7292 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7fbbd4000000' with size 268435456
I1115 06:40:44.583982 7292 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I1115 06:40:44.595437 7292 model_repository_manager.cc:1206] loading: fastertransformer:1
I1115 06:40:44.685154 7292 libfastertransformer.cc:1478] TRITONBACKEND_Initialize: fastertransformer
I1115 06:40:44.685175 7292 libfastertransformer.cc:1488] Triton TRITONBACKEND API version: 1.10
I1115 06:40:44.685180 7292 libfastertransformer.cc:1494] 'fastertransformer' TRITONBACKEND API version: 1.10
I1115 06:40:44.685213 7292 libfastertransformer.cc:1526] TRITONBACKEND_ModelInitialize: fastertransformer (version 1)
I1115 06:40:44.686569 7292 libfastertransformer.cc:218] Instance group type: KIND_CPU count: 1
I1115 06:40:44.686588 7292 libfastertransformer.cc:248] Sequence Batching: disabled
[ERROR] Does not find the section encoder with name relative_attention_num_buckets_or_max_pos_seq_len. 
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[35452,1],0]
  Exit code:    255
--------------------------------------------------------------------------

If I set relative_attention_num_buckets_or_max_pos_seq_len = 32 in config.ini, I then hit this error:
[ERROR] Does not find the section encoder with name weight_data_type.
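(For context on where this message comes from: the backend reads the config.ini written by the converter next to the weights and requires certain keys in the [encoder]/[decoder] sections, so a checkpoint converted with the wrong script fails on the first required key it cannot find. A minimal sketch of the lookup pattern, using the inih-style INIReader that FT bundles; the helper and the key handling here are illustrative, not FT's exact code:)

    #include <cstdio>
    #include <cstdlib>
    #include <string>
    #include "INIReader.h"  // ini parser bundled in FasterTransformer's 3rdparty/

    // Illustrative helper (not FT's exact code): require an integer key in a
    // section of config.ini, aborting with the error style seen above.
    static long require_int(const INIReader& reader,
                            const std::string& section, const std::string& name)
    {
        const long kMissing = -314159265;  // sentinel meaning "key not found"
        long value = reader.GetInteger(section, name, kMissing);
        if (value == kMissing) {
            std::fprintf(stderr, "[ERROR] Does not find the section %s with name %s.\n",
                         section.c_str(), name.c_str());
            std::exit(255);
        }
        return value;
    }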
byshiue commented 1 year ago

Can you post the config.ini?

520jefferson commented 1 year ago

The config.ini is as follows. Now I am trying to convert the model with (/workspace# python3 ./build/fastertransformer_backend/build/_deps/repo-ft-src/examples/pytorch/t5/utils/huggingface_t5_ckpt_convert.py -in_file /wlj/545000 -saved_dir /workspace/build/fastertransformer_backend/all_models/t5_545000/1 -inference_tensor_para_size 1) instead. Maybe I shouldn't have used t5_ckpt_convert.py to convert the model?

[encoder]
vocab_size = 33795
d_model = 768
d_kv = 64
d_ff = 3072
num_layers = 12
num_decoder_layers = 12
num_heads = 12
relative_attention_num_buckets = 32
relative_attention_max_distance = 128
dropout_rate = 0.1
layer_norm_epsilon = 1e-06
initializer_factor = 1.0
feed_forward_proj = relu
use_cache = False
dense_act_fn = relu
is_gated_act = False
return_dict = True
output_hidden_states = False
output_attentions = False
torchscript = False
torch_dtype = float32
use_bfloat16 = False
tf_legacy_loss = False
pruned_heads = {}
tie_word_embeddings = True
is_encoder_decoder = False
is_decoder = False
cross_attention_hidden_size = None
add_cross_attention = False
tie_encoder_decoder = False
max_length = 20
min_length = 0
do_sample = False
early_stopping = False
num_beams = 1
num_beam_groups = 1
diversity_penalty = 0.0
temperature = 1.0
top_k = 50
top_p = 1.0
typical_p = 1.0
repetition_penalty = 1.0
length_penalty = 1.0
no_repeat_ngram_size = 0
encoder_no_repeat_ngram_size = 0
bad_words_ids = None
num_return_sequences = 1
chunk_size_feed_forward = 0
output_scores = False
return_dict_in_generate = False
forced_bos_token_id = None
forced_eos_token_id = None
remove_invalid_values = False
exponential_decay_length_penalty = None
suppress_tokens = None
begin_suppress_tokens = None
architectures = ['T5ForConditionalGeneration']
finetuning_task = None
id2label = {0: 'LABEL_0', 1: 'LABEL_1'}
label2id = {'LABEL_0': 0, 'LABEL_1': 1}
tokenizer_class = None
prefix = None
bos_token_id = None
pad_token_id = 3
eos_token_id = 33606
sep_token_id = None
decoder_start_token_id = 33605
task_specific_params = None
problem_type = None
_name_or_path = /wlj/545000/
transformers_version = 4.24.0
model_type = t5
n_positions = 512
output_past = True

[decoder]
vocab_size = 33795
d_model = 768
d_kv = 64
d_ff = 3072
num_layers = 12
num_decoder_layers = 12
num_heads = 12
relative_attention_num_buckets = 32
relative_attention_max_distance = 128
dropout_rate = 0.1
layer_norm_epsilon = 1e-06
initializer_factor = 1.0
feed_forward_proj = relu
use_cache = True
dense_act_fn = relu
is_gated_act = False
return_dict = True
output_hidden_states = False
output_attentions = False
torchscript = False
torch_dtype = float32
use_bfloat16 = False
tf_legacy_loss = False
pruned_heads = {}
tie_word_embeddings = True
is_encoder_decoder = False
is_decoder = True
cross_attention_hidden_size = None
add_cross_attention = False
tie_encoder_decoder = False
max_length = 20
min_length = 0
do_sample = False
early_stopping = False
num_beams = 1
num_beam_groups = 1
diversity_penalty = 0.0
temperature = 1.0
top_k = 50
top_p = 1.0
typical_p = 1.0
repetition_penalty = 1.0
length_penalty = 1.0
no_repeat_ngram_size = 0
encoder_no_repeat_ngram_size = 0
bad_words_ids = None
num_return_sequences = 1
chunk_size_feed_forward = 0
output_scores = False
return_dict_in_generate = False
forced_bos_token_id = None
forced_eos_token_id = None
remove_invalid_values = False
exponential_decay_length_penalty = None
suppress_tokens = None
begin_suppress_tokens = None
architectures = ['T5ForConditionalGeneration']
finetuning_task = None
id2label = {0: 'LABEL_0', 1: 'LABEL_1'}
label2id = {'LABEL_0': 0, 'LABEL_1': 1}
tokenizer_class = None
prefix = None
bos_token_id = None
pad_token_id = 3
eos_token_id = 33606
sep_token_id = None
decoder_start_token_id = 33605
task_specific_params = None
problem_type = None
_name_or_path = /wlj/545000/
transformers_version = 4.24.0
model_type = t5
n_positions = 512
output_past = True

byshiue commented 1 year ago

If your model is from HF, you should use huggingface_t5_ckpt_convert.py.

520jefferson commented 1 year ago

The V100 machine's CUDA version is 11.1, but the Docker image's CUDA version is 11.7 (22.07 container).

After converting the model with huggingface_t5_ckpt_convert.py, I started the server and hit the error below. It seems CUDA 11.7 is not compatible with CUDA 11.1. Should I change CUDA 11.7 to 11.1 inside the docker and then rebuild, like https://github.com/triton-inference-server/fastertransformer_backend#rebuilding-fastertransformer-backend-optional? Or should I rebuild FT with another container whose CUDA version is not newer than 11.1?

CUDA_VISIBLE_DEVICES=0 mpirun -n 1 --allow-run-as-root /opt/tritonserver/bin/tritonserver --model-repository=/workspace/build/fastertransformer_backend/all_models/t5_545000

I1115 07:33:28.755888 8867 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7fee7e000000' with size 268435456
I1115 07:33:28.757518 8867 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I1115 07:33:28.768517 8867 model_repository_manager.cc:1206] loading: fastertransformer:1
I1115 07:33:28.866421 8867 libfastertransformer.cc:1478] TRITONBACKEND_Initialize: fastertransformer
I1115 07:33:28.866448 8867 libfastertransformer.cc:1488] Triton TRITONBACKEND API version: 1.10
I1115 07:33:28.866453 8867 libfastertransformer.cc:1494] 'fastertransformer' TRITONBACKEND API version: 1.10
I1115 07:33:28.866490 8867 libfastertransformer.cc:1526] TRITONBACKEND_ModelInitialize: fastertransformer (version 1)
I1115 07:33:28.867923 8867 libfastertransformer.cc:218] Instance group type: KIND_CPU count: 1
I1115 07:33:28.867942 8867 libfastertransformer.cc:248] Sequence Batching: disabled
I1115 07:33:28.868163 8867 libfastertransformer.cc:420] Before Loading Weights: after allocation : free: 11.92 GB, total: 31.75 GB, used: 19.83 GB
I1115 07:33:30.377086 8867 libfastertransformer.cc:430] After Loading Weights: after allocation : free: 11.24 GB, total: 31.75 GB, used: 20.51 GB
W1115 07:33:30.377193 8867 libfastertransformer.cc:478] skipping model configuration auto-complete for 'fastertransformer': not supported for fastertransformer backend
I1115 07:33:30.379198 8867 libfastertransformer.cc:451] Before Loading Model: after allocation : free: 11.24 GB, total: 31.75 GB, used: 20.51 GB
terminate called after throwing an instance of 'std::runtime_error'
  what():  [FT][ERROR] CUDA runtime error: API call is not supported in the installed CUDA driver /workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/utils/allocator.h:157
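(The failing call at allocator.h:157 is consistent with the stream-ordered allocator, cudaMallocAsync, which needs CUDA >= 11.2 support from the driver as well as the runtime; compare the FT warning that appears later in this thread. A quick way to see the runtime/driver mismatch on the machine, as a hedged sketch:)

    #include <cstdio>
    #include <cuda_runtime_api.h>

    // Prints the CUDA versions seen by the driver and by the runtime. With a
    // 455.xx driver (CUDA 11.1) inside a 22.07 container (CUDA 11.7) the two
    // numbers disagree, and 11.2-only APIs such as cudaMallocAsync fail.
    int main()
    {
        int driver_version = 0, runtime_version = 0;
        cudaDriverGetVersion(&driver_version);    // e.g. 11010 for CUDA 11.1
        cudaRuntimeGetVersion(&runtime_version);  // e.g. 11070 for CUDA 11.7
        std::printf("driver CUDA %d, runtime CUDA %d\n", driver_version, runtime_version);
        return 0;
    }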

byshiue commented 1 year ago

I don't understand the meaning of "The v100 cuda version is 11.1".

You can recompile FT with CUDA 11.1 in the docker. But I am not sure whether you can then use CUDA 11.1 to launch the server, because you are not recompiling the Triton server itself.

520jefferson commented 1 year ago

"The v100 cuda version is 11.1" means the i use the nvida-smi, the i can see the cuda version is 11.1: +-----------------------------------------------------------------------------+ | NVIDIA-SMI 455.32.00 Driver Version: 455.32.00 CUDA Version: 11.1 | |-------------------------------+----------------------+----------------------+

Thanks a lot, I will rebuild everything with another container version.

520jefferson commented 1 year ago

When I build with the 20.12 container, I hit this:

fatal: unable to access 'http://github.com/triton-inference-server/core.git/': GnuTLS recv error (-110): The TLS connection was non-properly terminated.
Cloning into 'repo-core-src'...
fatal: unable to access 'http://github.com/triton-inference-server/core.git/': Failed to connect to github.com port 443: Connection timed out

The git access is not okay. Is there an alternative?

byshiue commented 1 year ago

For question about triton server, please ask in https://github.com/triton-inference-server/server.

520jefferson commented 1 year ago

I rebuilt the docker on the v100 machine with container version 20.12 like this, but I hit an error I have never seen before.

--------------------------run.sh-------------------------------------------
cd fastertransformer_backend-release-v1.2.1_tag
export WORKSPACE=$(pwd)
export CONTAINER_VERSION=20.12
export TRITON_DOCKER_IMAGE=triton_with_ft:${CONTAINER_VERSION}

# prepare docker image
docker build --rm \
    --build-arg TRITON_VERSION=${CONTAINER_VERSION} \
    -t ${TRITON_DOCKER_IMAGE} \
    -f docker/Dockerfile \
    .


[ 65%] Linking CXX static library ../../../../../../lib/libBertINT8.a
[ 65%] Built target BertINT8
Scanning dependencies of target bert_int8_example
[ 66%] Building CXX object _deps/repo-ft-build/examples/cpp/bert_int8/CMakeFiles/bert_int8_example.dir/bert_int8_example.cc.o
In file included from /workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/models/vit_int8/ViTINT8.h:26,
                 from /workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/models/vit_int8/ViTINT8.cc:17:
/workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/utils/conv2d.h: In function 'void fastertransformer::conv2d(T*, const T*, const T*, int, int, int, int, int, int, int, cudnnContext&)':
/workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/utils/conv2d.h:50:20: error: 'CUDNN_DATA_BFLOAT16' was not declared in this scope; did you mean 'CUDNN_DATA_FLOAT'?
   50 |         dataType = CUDNN_DATA_BFLOAT16;
      |                    ^~~~~~~~~~~~~~~~~~~
      |                    CUDNN_DATA_FLOAT
[ 66%] Linking CUDA executable ../../../../bin/test_penalty_kernels
[ 66%] Built target test_penalty_kernels
[ 66%] Linking CUDA executable ../../../../bin/test_gpt_kernels
[ 67%] Linking CUDA executable ../../../../bin/test_logprob_kernels
make[2]: *** [_deps/repo-ft-build/src/fastertransformer/models/vit_int8/CMakeFiles/ViTINT8.dir/build.make:82: _deps/repo-ft-build/src/fastertransformer/models/vit_int8/CMakeFiles/ViTINT8.dir/ViTINT8.cc.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:5877: _deps/repo-ft-build/src/fastertransformer/models/vit_int8/CMakeFiles/ViTINT8.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....
[ 67%] Built target test_gpt_kernels
...
/workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/kernels/sampling_topp_kernels.h:110:8: note: 'struct fastertransformer::segmented_topp_impl::TopKPerSegmentContext' declared here
  110 | struct TopKPerSegmentContext {
      |        ^~~~~~~~~~~~~~~~~~~~~
[ 72%] Linking CUDA device code CMakeFiles/sampling_topp_kernels.dir/cmake_device_link.o
[ 72%] Linking CUDA static library ../../../../../lib/libsampling_topp_kernels.a
[ 72%] Built target sampling_topp_kernels
[ 72%] Linking CUDA device code CMakeFiles/online_softmax_beamsearch_kernels.dir/cmake_device_link.o
[ 73%] Linking CUDA static library ../../../../../lib/libonline_softmax_beamsearch_kernels.a
[ 73%] Built target online_softmax_beamsearch_kernels
make: *** [Makefile:149: all] Error 2
The command '/bin/sh -c mkdir build -p && cd build && cmake -D CMAKE_EXPORT_COMPILE_COMMANDS=1 -D CMAKE_BUILD_TYPE=Release -D CMAKE_INSTALL_PREFIX=/opt/tritonserver -D TRITON_COMMON_REPO_TAG="r${NVIDIA_TRITON_SERVER_VERSION}" -D TRITON_CORE_REPO_TAG="r${NVIDIA_TRITON_SERVER_VERSION}" -D TRITON_BACKEND_REPO_TAG="r${NVIDIA_TRITON_SERVER_VERSION}" .. && make -j"$(grep -c ^processor /proc/cpuinfo)" install' returned a non-zero code: 2

The full log is in this file: https://drive.google.com/file/d/1rkUGmC9AG1_AlxFNtukF8vDD8SIkCMkI/view?usp=sharing

byshiue commented 1 year ago

The docker image is too old to have bfloat16 in cuDNN. You can find a flag in the FT cmake file to disable the bfloat16 compilation.

520jefferson commented 1 year ago

I removed the following config from the cmake file and rebuilt, then hit the error below. I am using FasterTransformer v5.1.1 with fastertransformer_backend-release-v1.2.1_tag; maybe they don't match?


if(${CUDA_VERSION_MAJOR} VERSION_GREATER_EQUAL "11")
  add_definitions("-DENABLE_BF16")
  message("CUDA_VERSION ${CUDA_VERSION_MAJOR} is greater or equal than 11, enable -DENABLE_BF16 flag")
endif()
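(Removing that block drops the -DENABLE_BF16 define, so bf16-only code paths such as the failing conv2d.h branch are never compiled. Roughly, as a simplified sketch of the guarded pattern rather than FT's exact source:)

    // Fragment, simplified: the bf16 branch only exists when ENABLE_BF16 is
    // defined, so without the flag the CUDNN_DATA_BFLOAT16 symbol (missing
    // from the cuDNN in the 20.12 container) is never referenced.
    cudnnDataType_t dataType = CUDNN_DATA_FLOAT;
    #ifdef ENABLE_BF16
    if (std::is_same<T, __nv_bfloat16>::value) {
        dataType = CUDNN_DATA_BFLOAT16;
    }
    #endif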

[ 98%] Building CXX object _deps/repo-ft-build/examples/cpp/gptj/CMakeFiles/gptj_triton_example.dir/gptj_triton_example.cc.o
[ 98%] Built target ParallelGptTritonBackend
Scanning dependencies of target transformer-shared
Scanning dependencies of target transformer-static
[ 98%] Linking CUDA device code CMakeFiles/transformer-shared.dir/cmake_device_link.o
[ 98%] Linking CUDA device code CMakeFiles/transformer-static.dir/cmake_device_link.o
Scanning dependencies of target multi_gpu_gpt_triton_example
[ 98%] Building CXX object _deps/repo-ft-build/examples/cpp/multi_gpu_gpt/CMakeFiles/multi_gpu_gpt_triton_example.dir/multi_gpu_gpt_triton_example.cc.o
[ 99%] Linking CXX executable ../../../../../bin/multi_gpu_gpt_interactive_example
[ 99%] Built target gptj_example
[ 99%] Built target gptneox_example
[ 99%] Built target multi_gpu_gpt_example
[ 99%] Linking CXX executable ../../../../../bin/multi_gpu_gpt_async_example
[ 99%] Linking CXX static library ../../lib/libtransformer-static.a
[ 99%] Linking CXX shared library ../../lib/libtransformer-shared.so
[ 99%] Built target multi_gpu_gpt_interactive_example
[ 99%] Built target multi_gpu_gpt_async_example
[ 99%] Built target transformer-shared
Scanning dependencies of target triton-fastertransformer-backend
[ 99%] Building CXX object CMakeFiles/triton-fastertransformer-backend.dir/src/libfastertransformer.cc.o
[ 99%] Built target transformer-static
/workspace/build/fastertransformer_backend/src/libfastertransformer.cc: In constructor 'triton::backend::fastertransformer_backend::ModelState::ModelState(TRITONBACKEND_Model*)':
/workspace/build/fastertransformer_backend/src/libfastertransformer.cc:153:38: error: no matching function for call to 'triton::backend::BackendModel::BackendModel(TRITONBACKEND_Model*&, bool)'
  153 |     : BackendModel(triton_model, true)
      |                                      ^
In file included from /workspace/build/fastertransformer_backend/src/libfastertransformer.cc:45:
/workspace/build/fastertransformer_backend/build/_deps/repo-backend-src/include/triton/backend/backend_model.h:43:3: note: candidate: 'triton::backend::BackendModel::BackendModel(TRITONBACKEND_Model*)'
   43 |   BackendModel(TRITONBACKEND_Model* triton_model);
      |   ^~~~~~~~~~~~
/workspace/build/fastertransformer_backend/build/_deps/repo-backend-src/include/triton/backend/backend_model.h:43:3: note: candidate expects 1 argument, 2 provided
[ 99%] Linking CXX executable ../../../../../bin/gptneox_triton_example
[ 99%] Built target gptneox_triton_example
[100%] Linking CXX executable ../../../../../bin/gptj_triton_example
[100%] Built target gptj_triton_example
[100%] Linking CXX executable ../../../../../bin/multi_gpu_gpt_triton_example
[100%] Built target multi_gpu_gpt_triton_example
make[2]: *** [CMakeFiles/triton-fastertransformer-backend.dir/build.make:82: CMakeFiles/triton-fastertransformer-backend.dir/src/libfastertransformer.cc.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:1449: CMakeFiles/triton-fastertransformer-backend.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....
/workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/utils/nccl_utils.h: In function 'bool test_context_sharing(const string&, const string&) [with T = float]':
/workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/utils/nccl_utils.h:72:144: warning: 'pipeline_para.fastertransformer::NcclParam::nccl_uid_' may be used uninitialized in this function [-Wmaybe-uninitialized]
   72 |     NcclParam(NcclParam const& param):
      |                                                                                                                                                ^
/workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/utils/nccl_utils.h:72:144: warning: 'tensor_para.fastertransformer::NcclParam::nccl_uid_' may be used uninitialized in this function [-Wmaybe-uninitialized]
   72 |     NcclParam(NcclParam const& param):
      |                                                                                                                                                ^
[100%] Linking CXX executable ../../../../bin/test_context_decoder_layer
[100%] Built target test_context_decoder_layer
[100%] Linking CXX executable ../../../../bin/test_sampling
[100%] Built target test_sampling
make: *** [Makefile:149: all] Error 2
The command '/bin/sh -c mkdir build -p && cd build && cmake -D CMAKE_EXPORT_COMPILE_COMMANDS=1 -D CMAKE_BUILD_TYPE=Release -D CMAKE_INSTALL_PREFIX=/opt/tritonserver -D TRITON_COMMON_REPO_TAG="r${NVIDIA_TRITON_SERVER_VERSION}" -D TRITON_CORE_REPO_TAG="r${NVIDIA_TRITON_SERVER_VERSION}" -D TRITON_BACKEND_REPO_TAG="r${NVIDIA_TRITON_SERVER_VERSION}" .. && make -j"$(grep -c ^processor /proc/cpuinfo)" install' returned a non-zero code: 2

the whole logs:https://drive.google.com/file/d/1B1_vRKtnJc_O1HTlH2BF6aA4G2ON9gu8/view?usp=sharing

byshiue commented 1 year ago

Need to change

BackendModel(triton_model, true)

at https://github.com/triton-inference-server/fastertransformer_backend/blob/main/src/libfastertransformer.cc#L153 to

BackendModel(triton_model)
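(The compile log above shows why: the r20.12-era backend API declares only the one-argument constructor, hence "candidate expects 1 argument, 2 provided". The patched constructor in src/libfastertransformer.cc then reads roughly:)

    // src/libfastertransformer.cc, built against the r20.12 backend API
    ModelState::ModelState(TRITONBACKEND_Model* triton_model)
        : BackendModel(triton_model)  // was: BackendModel(triton_model, true)
    {
        // ... body unchanged ...
    }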
520jefferson commented 1 year ago

I can rebuild successfully now.

1. But when I use the config https://github.com/triton-inference-server/fastertransformer_backend/blob/main/all_models/t5/fastertransformer/config.pbtxt to start the server, I hit the error below:

CUDA_VISIBLE_DEVICES=0 mpirun -n 1 --allow-run-as-root /opt/tritonserver/bin/tritonserver --model-repository=/workspace/build/fastertransformer_backend/all_models/t5_545000

I1116 08:58:32.709152 752 metrics.cc:221] Collecting metrics for GPU 0: Tesla V100-SXM2-32GB
I1116 08:58:32.999338 752 libtorch.cc:945] TRITONBACKEND_Initialize: pytorch
I1116 08:58:32.999361 752 libtorch.cc:955] Triton TRITONBACKEND API version: 1.0
I1116 08:58:32.999366 752 libtorch.cc:961] 'pytorch' TRITONBACKEND API version: 1.0
2022-11-16 08:58:33.213690: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
I1116 08:58:33.256948 752 tensorflow.cc:1877] TRITONBACKEND_Initialize: tensorflow
I1116 08:58:33.256971 752 tensorflow.cc:1887] Triton TRITONBACKEND API version: 1.0
I1116 08:58:33.256976 752 tensorflow.cc:1893] 'tensorflow' TRITONBACKEND API version: 1.0
I1116 08:58:33.256980 752 tensorflow.cc:1917] backend configuration: {}
I1116 08:58:33.258658 752 onnxruntime.cc:1715] TRITONBACKEND_Initialize: onnxruntime
I1116 08:58:33.258677 752 onnxruntime.cc:1725] Triton TRITONBACKEND API version: 1.0
I1116 08:58:33.258682 752 onnxruntime.cc:1731] 'onnxruntime' TRITONBACKEND API version: 1.0
I1116 08:58:33.530428 752 pinned_memory_manager.cc:199] Pinned memory pool is created at '0x7fd95c000000' with size 268435456
I1116 08:58:33.531686 752 cuda_memory_manager.cc:103] CUDA memory pool is created on device 0 with size 67108864
[libprotobuf ERROR /tmp/tritonbuild/tritonserver/build/grpc-repo/src/grpc/third_party/protobuf/src/google/protobuf/text_format.cc:317] Error parsing text-format inference.ModelConfig: 49:13: Message type "inference.ModelInput" has no field named "optional".
E1116 08:58:33.535182 752 model_repository_manager.cc:1682] failed to read text proto from /workspace/build/fastertransformer_backend/all_models/t5_545000/fastertransformer/config.pbtxt
I1116 08:58:33.535260 752 server.cc:490]


2. When I use an old config https://drive.google.com/file/d/1gFfC2MDKdyXLflufQQI0BgT51RgO08ag/view?usp=sharing I hit another error below:

I1116 09:06:16.007486 786 metrics.cc:221] Collecting metrics for GPU 0: Tesla V100-SXM2-32GB
I1116 09:06:16.299886 786 libtorch.cc:945] TRITONBACKEND_Initialize: pytorch
I1116 09:06:16.299913 786 libtorch.cc:955] Triton TRITONBACKEND API version: 1.0
I1116 09:06:16.299918 786 libtorch.cc:961] 'pytorch' TRITONBACKEND API version: 1.0
2022-11-16 09:06:16.518536: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
I1116 09:06:16.563957 786 tensorflow.cc:1877] TRITONBACKEND_Initialize: tensorflow
I1116 09:06:16.563982 786 tensorflow.cc:1887] Triton TRITONBACKEND API version: 1.0
I1116 09:06:16.563987 786 tensorflow.cc:1893] 'tensorflow' TRITONBACKEND API version: 1.0
I1116 09:06:16.563991 786 tensorflow.cc:1917] backend configuration: {}
I1116 09:06:16.565688 786 onnxruntime.cc:1715] TRITONBACKEND_Initialize: onnxruntime
I1116 09:06:16.565710 786 onnxruntime.cc:1725] Triton TRITONBACKEND API version: 1.0
I1116 09:06:16.565715 786 onnxruntime.cc:1731] 'onnxruntime' TRITONBACKEND API version: 1.0
I1116 09:06:16.838932 786 pinned_memory_manager.cc:199] Pinned memory pool is created at '0x7fbb24000000' with size 268435456
I1116 09:06:16.840197 786 cuda_memory_manager.cc:103] CUDA memory pool is created on device 0 with size 67108864
I1116 09:06:16.845078 786 model_repository_manager.cc:787] loading: fastertransformer:1
I1116 09:06:16.988110 786 libfastertransformer.cc:1479] TRITONBACKEND_Initialize: fastertransformer
I1116 09:06:16.988127 786 libfastertransformer.cc:1489] Triton TRITONBACKEND API version: 1.0
I1116 09:06:16.988131 786 libfastertransformer.cc:1495] 'fastertransformer' TRITONBACKEND API version: 1.0
I1116 09:06:16.988839 786 libfastertransformer.cc:1527] TRITONBACKEND_ModelInitialize: fastertransformer (version 1)
I1116 09:06:16.989918 786 libfastertransformer.cc:219] Instance group type: KIND_CPU count: 1
I1116 09:06:16.991597 786 libfastertransformer.cc:249] Sequence Batching: disabled
E1116 09:06:16.991613 786 libfastertransformer.cc:374] Invalid configuration argument 'data_type':
I1116 09:06:16.991617 786 libfastertransformer.cc:421] Before Loading Weights: after allocation : free: 30.43 GB, total: 31.75 GB, used: 1.32 GB
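(The "Invalid configuration argument 'data_type'" line suggests this older backend reads a data_type entry from the parameters section of config.pbtxt; an illustrative entry, with the value depending on the checkpoint, would look like:)

    parameters {
      key: "data_type"
      value: { string_value: "fp32" }
    }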

byshiue commented 1 year ago

Old Triton does not support optional inputs. You need to remove them from the config file.
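(For example, an input entry in config.pbtxt marked optional would need that field deleted for a 20.12-era server; the input name and shape here are illustrative:)

    input [
      {
        name: "bad_words_list"
        data_type: TYPE_INT32
        dims: [ 2, -1 ]
        optional: true  # remove this line for old Triton releases
      }
    ]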

520jefferson commented 1 year ago

I can start the model after dropping the optional fields. Then I tested it and hit the errors below.

The server error (CUDA_VISIBLE_DEVICES=0 mpirun -n 1 --allow-run-as-root /opt/tritonserver/bin/tritonserver --model-repository=/workspace/build/fastertransformer_backend/all_models/t5_545000):

I1116 09:18:02.879483 1062 metrics.cc:221] Collecting metrics for GPU 0: Tesla V100-SXM2-32GB
I1116 09:18:03.167786 1062 libtorch.cc:945] TRITONBACKEND_Initialize: pytorch
I1116 09:18:03.167808 1062 libtorch.cc:955] Triton TRITONBACKEND API version: 1.0
I1116 09:18:03.167814 1062 libtorch.cc:961] 'pytorch' TRITONBACKEND API version: 1.0
2022-11-16 09:18:03.381970: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
I1116 09:18:03.426128 1062 tensorflow.cc:1877] TRITONBACKEND_Initialize: tensorflow
I1116 09:18:03.426152 1062 tensorflow.cc:1887] Triton TRITONBACKEND API version: 1.0
I1116 09:18:03.426158 1062 tensorflow.cc:1893] 'tensorflow' TRITONBACKEND API version: 1.0
I1116 09:18:03.426162 1062 tensorflow.cc:1917] backend configuration: {}
I1116 09:18:03.427825 1062 onnxruntime.cc:1715] TRITONBACKEND_Initialize: onnxruntime
I1116 09:18:03.427845 1062 onnxruntime.cc:1725] Triton TRITONBACKEND API version: 1.0
I1116 09:18:03.427850 1062 onnxruntime.cc:1731] 'onnxruntime' TRITONBACKEND API version: 1.0
I1116 09:18:03.701320 1062 pinned_memory_manager.cc:199] Pinned memory pool is created at '0x7f3e98000000' with size 268435456
I1116 09:18:03.702576 1062 cuda_memory_manager.cc:103] CUDA memory pool is created on device 0 with size 67108864
I1116 09:18:03.707453 1062 model_repository_manager.cc:787] loading: fastertransformer:1
I1116 09:18:03.851516 1062 libfastertransformer.cc:1479] TRITONBACKEND_Initialize: fastertransformer
I1116 09:18:03.851533 1062 libfastertransformer.cc:1489] Triton TRITONBACKEND API version: 1.0
I1116 09:18:03.851537 1062 libfastertransformer.cc:1495] 'fastertransformer' TRITONBACKEND API version: 1.0
I1116 09:18:03.852231 1062 libfastertransformer.cc:1527] TRITONBACKEND_ModelInitialize: fastertransformer (version 1)
I1116 09:18:03.853446 1062 libfastertransformer.cc:219] Instance group type: KIND_CPU count: 1
I1116 09:18:03.855113 1062 libfastertransformer.cc:249] Sequence Batching: disabled
I1116 09:18:03.855365 1062 libfastertransformer.cc:421] Before Loading Weights: after allocation : free: 30.43 GB, total: 31.75 GB, used: 1.32 GB
I1116 09:18:08.910025 1062 libfastertransformer.cc:431] After Loading Weights: after allocation : free: 28.91 GB, total: 31.75 GB, used: 2.84 GB
I1116 09:18:08.910276 1062 libfastertransformer.cc:452] Before Loading Model: after allocation : free: 28.91 GB, total: 31.75 GB, used: 2.84 GB
[FT][WARNING] Async cudaMalloc/Free is not supported before CUDA 11.2. Using Sync cudaMalloc/Free. Note this may lead to hang with NCCL kernels launched in parallel; if so, try NCCL_LAUNCH_MODE=GROUP
[WARNING] gemm_config.in is not found; using default GEMM algo
I1116 09:18:09.367633 1062 libfastertransformer.cc:466] After Loading Model: after allocation : free: 28.76 GB, total: 31.75 GB, used: 2.99 GB
I1116 09:18:09.367784 1062 libfastertransformer.cc:713] Model instance is created on GPU [ 0 ]
I1116 09:18:09.367802 1062 libfastertransformer.cc:1591] TRITONBACKEND_ModelInstanceInitialize: fastertransformer_0 (count 1) (instance_id 0)
I1116 09:18:09.367935 1062 model_repository_manager.cc:960] successfully loaded 'fastertransformer' version 1
I1116 09:18:09.368035 1062 server.cc:490]
+-------------------+-----------------------------------------------------------------------------+------+
| Backend | Config | Path |
+-------------------+-----------------------------------------------------------------------------+------+
| pytorch | /opt/tritonserver/backends/pytorch/libtriton_pytorch.so | {} |
| onnxruntime | /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so | {} |
| tensorflow | /opt/tritonserver/backends/tensorflow1/libtriton_tensorflow1.so | {} |
| fastertransformer | /opt/tritonserver/backends/fastertransformer/libtriton_fastertransformer.so | {} |
+-------------------+-----------------------------------------------------------------------------+------+

I1116 09:18:09.368075 1062 server.cc:533]
+-------------------+---------+--------+
| Model             | Version | Status |
+-------------------+---------+--------+
| fastertransformer | 1       | READY  |
+-------------------+---------+--------+

I1116 09:18:09.368151 1062 tritonserver.cc:1620]
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.6.0 |
| server_extensions | classification sequence model_repository schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data statistics |
| model_repository_path[0] | /workspace/build/fastertransformer_backend/all_models/t5_545000 |
| model_control_mode | MODE_NONE |
| strict_model_config | 1 |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

I1116 09:18:09.369396 1062 grpc_server.cc:3979] Started GRPCInferenceService at 0.0.0.0:8001
I1116 09:18:09.369643 1062 http_server.cc:2717] Started HTTPService at 0.0.0.0:8000
I1116 09:18:09.449811 1062 http_server.cc:2736] Started Metrics Service at 0.0.0.0:8002
I1116 09:18:34.172028 1062 libfastertransformer.cc:1091] Start to forward
terminate called after throwing an instance of 'std::runtime_error'
  what():  [FT][ERROR] Assertion fail: /workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/models/t5/T5Decoding.cc:431


T5Decoding.cc:431:
FT_CHECK(decoding_weights->post_decoder_embedding.bias != nullptr);


The test error (python3 tools/t5_utils/t5_end_to_end_test.py --batch_size 32):
/usr/local/lib/python3.8/dist-packages/transformers/models/t5/tokenization_t5.py:164: FutureWarning: This tokenizer was incorrectly instantiated with a model max length of 512 which will be corrected in Transformers v5. For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with truncation is True.

_queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "tools/t5_utils/t5_end_to_end_test.py", line 211, in <module>
    translate(vars(args))
  File "tools/t5_utils/t5_end_to_end_test.py", line 150, in translate
    result = client.infer(model_name, inputs)
  File "/usr/local/lib/python3.8/dist-packages/tritonclient/http/__init__.py", line 1482, in infer
    response = self._post(request_uri=request_uri,
  File "/usr/local/lib/python3.8/dist-packages/tritonclient/http/__init__.py", line 309, in _post
    response = self._client_stub.post(request_uri=request_uri,
  File "/usr/local/lib/python3.8/dist-packages/geventhttpclient/client.py", line 272, in post
    return self.request(METHOD_POST, request_uri, body=body, headers=headers)
  File "/usr/local/lib/python3.8/dist-packages/geventhttpclient/client.py", line 226, in request
    sock = self._connection_pool.get_socket()
  File "/usr/local/lib/python3.8/dist-packages/geventhttpclient/connectionpool.py", line 166, in get_socket
    return self._create_socket()
  File "/usr/local/lib/python3.8/dist-packages/geventhttpclient/connectionpool.py", line 127, in _create_socket
    raise first_error
  File "/usr/local/lib/python3.8/dist-packages/geventhttpclient/connectionpool.py", line 114, in _create_socket
    sock = self._connect_socket(sock, sock_info[-1])
  File "/usr/local/lib/python3.8/dist-packages/geventhttpclient/connectionpool.py", line 136, in _connect_socket
    sock.connect(address)
  File "/usr/local/lib/python3.8/dist-packages/gevent/_socketcommon.py", line 607, in connect
    raise _SocketError(err, strerror(err))
ConnectionRefusedError: [Errno 111] Connection refused

byshiue commented 1 year ago

You can change

        FT_CHECK(decoding_weights->post_decoder_embedding.bias != nullptr);
        cudaD2Dcpy(padded_post_decoder_embedding_bias_, decoding_weights->post_decoder_embedding.bias, vocab_size_);

to

        if (decoding_weights->post_decoder_embedding.bias != nullptr) {
            cudaD2Dcpy(padded_post_decoder_embedding_bias_,
                        decoding_weights->post_decoder_embedding.bias,
                        vocab_size_,
                        stream_);
        }
520jefferson commented 1 year ago

Step 1: I modified the code as you said and rebuilt it, then hit the error below (it seems the number of parameters passed to cudaD2Dcpy is more than it takes; based on FasterTransformer v5.1.1):

[ 90%] Building CXX object _deps/repo-ft-build/src/fastertransformer/models/multi_gpu_gpt/CMakeFiles/ParallelGpt.dir/ParallelGpt.cc.o
/workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/models/t5/T5Decoding.cc: In instantiation of 'void fastertransformer::T5Decoding<T>::forward(std::unordered_map<std::__cxx11::basic_string<char>, fastertransformer::Tensor>*, const std::unordered_map<std::__cxx11::basic_string<char>, fastertransformer::Tensor>*, const fastertransformer::T5DecodingWeight<T>*) [with T = float]':
/workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/models/t5/T5Decoding.cc:899:16: required from here
/workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/models/t5/T5Decoding.cc:434:23: error: no matching function for call to 'cudaD2Dcpy(float*&, const float* const&, const size_t&, CUstream_st*&)'
  434 |             cudaD2Dcpy(padded_post_decoder_embedding_bias_,
      |             ~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  435 |                        decoding_weights->post_decoder_embedding.bias,
      |                        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  436 |                        vocab_size_,
      |                        ~~~~~~~~~~~~
  437 |                        stream_);
      |                        ~~~~~~~~
In file included from /workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/layers/FfnLayer.h:24,
                 from /workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/layers/TensorParallelGeluFfnLayer.h:19,
                 from /workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/models/t5/T5Decoder.h:24,
                 from /workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/models/t5/T5Decoding.h:24,
                 from /workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/models/t5/T5Decoding.cc:17:
/workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/utils/memory_utils.h:42:6: note: candidate: 'template<class T> void fastertransformer::cudaD2Dcpy(T*, const T*, int)'
   42 | void cudaD2Dcpy(T* tgt, const T* src, const int size);
      |      ^~~~~~~~~~
/workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/utils/memory_utils.h:42:6: note: template argument deduction/substitution failed:
/workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/models/t5/T5Decoding.cc:434:23: note: candidate expects 3 arguments, 4 provided
/workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/models/t5/T5Decoding.cc: In instantiation of 'void fastertransformer::T5Decoding<T>::forward(std::unordered_map<std::__cxx11::basic_string<char>, fastertransformer::Tensor>*, const std::unordered_map<std::__cxx11::basic_string<char>, fastertransformer::Tensor>*, const fastertransformer::T5DecodingWeight<T>*) [with T = half]':
/workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/models/t5/T5Decoding.cc:900:16: required from here
/workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/models/t5/T5Decoding.cc:434:23: error: no matching function for call to 'cudaD2Dcpy(half*&, const __half* const&, const size_t&, CUstream_st*&)'
In file included from /workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/layers/FfnLayer.h:24,
                 from /workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/layers/TensorParallelGeluFfnLayer.h:19,
                 from /workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/models/t5/T5Decoder.h:24,
                 from /workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/models/t5/T5Decoding.h:24,
                 from /workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/models/t5/T5Decoding.cc:17:
/workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/utils/memory_utils.h:42:6: note: candidate: 'template<class T> void fastertransformer::cudaD2Dcpy(T*, const T*, int)'
   42 | void cudaD2Dcpy(T* tgt, const T* src, const int size);
      |      ^~~~~~~~~~
/workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/utils/memory_utils.h:42:6: note: template argument deduction/substitution failed:
/workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/models/t5/T5Decoding.cc:434:23: note: candidate expects 3 arguments, 4 provided
make[2]: *** [_deps/repo-ft-build/src/fastertransformer/models/t5/CMakeFiles/T5Decoding.dir/build.make:82: _deps/repo-ft-build/src/fastertransformer/models/t5/CMakeFiles/T5Decoding.dir/T5Decoding.cc.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [CMakeFiles/Makefile2:4778: _deps/repo-ft-build/src/fastertransformer/models/t5/CMakeFiles/T5Decoding.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....


Step 2: Then I replaced

cudaD2Dcpy(padded_post_decoder_embedding_bias_, decoding_weights->post_decoder_embedding.bias, vocab_size_, stream_);

with

check_cuda_error(cudaMemcpy(padded_post_decoder_embedding_bias_, decoding_weights->post_decoder_embedding.bias, vocab_size_, stream_));

Then I hit the error below. What should I do?


/workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/models/t5/T5Decoding.cc:901:16: required from here
/workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/models/t5/T5Decoding.cc:434:138: error: cannot convert 'cudaStream_t' {aka 'CUstream_st*'} to 'cudaMemcpyKind'
  434 |         check_cuda_error(cudaMemcpy(padded_post_decoder_embedding_bias_, decoding_weights->post_decoder_embedding.bias, vocab_size_, stream_));
      |                                                                                                                                      ^~~~~~~
      |                                                                                                                                      |
      |                                                                                                                                      cudaStream_t {aka CUstream_st*}
/workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/utils/cuda_utils.h:117:38: note: in definition of macro 'check_cuda_error'
  117 | #define check_cuda_error(val) check((val), #val, __FILE__, __LINE__)
      |                                      ^~~
In file included from /usr/local/cuda/include/channel_descriptor.h:61,
                 from /usr/local/cuda/include/cuda_runtime.h:95,
                 from /workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/kernels/decoding_kernels.h:20,
                 from /workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/models/t5/T5Decoding.h:22,
                 from /workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/models/t5/T5Decoding.cc:17:
/usr/local/cuda/include/cuda_runtime_api.h:5781:112: note: initializing argument 4 of 'cudaError_t cudaMemcpy(void*, const void*, size_t, cudaMemcpyKind)'
 5781 | extern __host__ cudaError_t CUDARTAPI cudaMemcpy(void *dst, const void *src, size_t count, enum cudaMemcpyKind kind);
      |                                                                                            ~~~~~~~~~~~~~~~~~~~~^~~~
make[2]: *** [_deps/repo-ft-build/src/fastertransformer/models/t5/CMakeFiles/T5Decoding.dir/build.make:82: _deps/repo-ft-build/src/fastertransformer/models/t5/CMakeFiles/T5Decoding.dir/T5Decoding.cc.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [CMakeFiles/Makefile2:4778: _deps/repo-ft-build/src/fastertransformer/models/t5/CMakeFiles/T5Decoding.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....
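(For reference: cudaMemcpy's fourth parameter is a cudaMemcpyKind, not a stream, and its third is a byte count rather than an element count, so a direct cudaMemcpy version of this copy would have to look roughly like the following sketch:)

    // cudaMemcpy takes (dst, src, byte_count, kind); unlike FT's cudaD2Dcpy,
    // the size is in bytes and there is no stream argument.
    check_cuda_error(cudaMemcpy(padded_post_decoder_embedding_bias_,
                                decoding_weights->post_decoder_embedding.bias,
                                vocab_size_ * sizeof(T),
                                cudaMemcpyDeviceToDevice));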


Step 3: I just deleted the fourth parameter "stream_" from step 1's change, as you told me initially, and then it built successfully. I don't know whether that is okay. Then I started the server and tested it. Maybe the tokenizer is not right, but I got a BLEU score of 0.00, which is weird. I will test it with the t5-base model instead of my own model.
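(Putting the steps together, the final form of the patch in T5Decoding.cc, using the three-argument cudaD2Dcpy shown in the compiler's candidate note above, is roughly:)

    // T5Decoding.cc (FT v5.1.1): guard the copy instead of asserting, and
    // call the 3-argument cudaD2Dcpy(T* tgt, const T* src, int size).
    if (decoding_weights->post_decoder_embedding.bias != nullptr) {
        cudaD2Dcpy(padded_post_decoder_embedding_bias_,
                   decoding_weights->post_decoder_embedding.bias,
                   vocab_size_);
    }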

Server start log:

I1117 06:12:35.962853 258 server.cc:533]
+-------------------+---------+--------+
| Model             | Version | Status |
+-------------------+---------+--------+
| fastertransformer | 1       | READY  |
+-------------------+---------+--------+

Test script and log:

/workspace/fastertransformer_backend# python3 tools/t5_utils/t5_end_to_end_test.py --batch_size 32

...
set request
get request
bleu score: 0.00
bleu counts: [0, 0, 0, 0]
bleu totals: [0, 0, 0, 0]
bleu precisions: [0.0, 0.0, 0.0, 0.0]
bleu sys_len: 0; ref_len: 61287
[INFO] ft_triton translates 94 batches taking 34.75 sec to translate 0 tokens, BLEU score: 0.00, 0 tokens/sec.

Then I used the t5-small model (git lfs clone https://huggingface.co/t5-small), and the BLEU score looks right:

set request
get request
bleu score: 25.36
bleu counts: [35698, 18410, 10667, 6417]
bleu totals: [62022, 59018, 56014, 53010]
bleu precisions: [57.556995904678985, 31.19387305567793, 19.043453422358695, 12.105263157894736]
bleu sys_len: 62022; ref_len: 61287
[INFO] ft_triton translates 94 batches taking 9.06 sec to translate 62022 tokens, BLEU score: 25.36, 6847 tokens/sec.

Thanks!