triton-inference-server / fastertransformer_backend


All flan-t5 models don't work for me #114

Open PetroMaslov opened 1 year ago

PetroMaslov commented 1 year ago

Description

Hi everyone!
I tried to reproduce the code from https://github.com/triton-inference-server/fastertransformer_backend/blob/dev/t5_gptj_blog/notebooks/GPT-J_and_T5_inference.ipynb, but I couldn't get any of the flan-t5 models to work.

Reproduced Steps

I used the main branch of FasterTransformer to convert the model to FT format.
While converting any of the flan-t5 models I got these warnings:

```
Not save encoder.embed_tokens.weight, using shared.weight directly.
Not save decoder.embed_tokens.weight, using shared.weight directly.
```

The model was converted despite these warnings, but when I served it with Triton the output was degenerate, e.g. "sistem sistem sistem ...".
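For what it's worth, the warning itself looks benign: flan-t5 ties both embedding tables to shared.weight, so reusing it should be lossless. A quick check (a sketch, assuming transformers and torch are installed; flan-t5-small is used here only to keep the download light):

```python
# Verify that the encoder/decoder embedding tables are tied to shared.weight,
# which is what the converter's warning refers to.
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-small")
shared_ptr = model.shared.weight.data_ptr()
print(shared_ptr == model.encoder.embed_tokens.weight.data_ptr())  # expect: True
print(shared_ptr == model.decoder.embed_tokens.weight.data_ptr())  # expect: True
```

If both print True, the converter skipping those tensors shouldn't by itself cause garbage output.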
byshiue commented 1 year ago

Please provide the reproduced steps.

Chris113113 commented 1 year ago

@byshiue I am seeing the same behavior. My reproduction steps are below:

Conversion:

```
# Build FasterTransformer (release/v5.3_tag) with the PyTorch ops
git clone --depth 1 --branch release/v5.3_tag https://github.com/NVIDIA/FasterTransformer.git
mkdir -p FasterTransformer/build
cd FasterTransformer/build
git submodule init && git submodule update
cmake -DSM=60,61,70,75,80,86 -DCMAKE_BUILD_TYPE=Release -DBUILD_PYT=ON -DBUILD_MULTI_GPU=ON ..
make -j12

# Fetch the HF checkpoint and convert it to FT format (tensor parallelism 4, fp16)
git lfs clone https://huggingface.co/google/flan-t5-xxl
pip install -r ../examples/pytorch/t5/requirement.txt
python3 ../examples/pytorch/t5/utils/huggingface_t5_ckpt_convert.py \
        -saved_dir /workspace/all_models/1/ \
        -in_file flan-t5-xxl \
        -inference_tensor_para_size 4 \
        -weight_data_type fp16
```
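Before serving, it's worth sanity-checking what the converter wrote (a sketch; the 4-gpu subdirectory and config.ini layout are assumptions based on how the FT converters usually arrange checkpoints):

```python
# Inspect the converted checkpoint directory; the 4-gpu layout is an assumption.
import configparser
import os

ckpt_dir = "/workspace/all_models/1/4-gpu"
print(sorted(os.listdir(ckpt_dir))[:10])  # expect per-rank weight files plus config.ini

cfg = configparser.ConfigParser()
cfg.read(os.path.join(ckpt_dir, "config.ini"))
for section in cfg.sections():
    print(section, dict(cfg[section]))
```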

This model is then uploaded and consumed in another container.

The inference container runs fastertransformer_backend built on the triton 22.09 image, and the server is launched with the following arguments:

```
"/opt/tritonserver/bin/tritonserver", f"--model-repository={model_dir}", "--allow-vertex-ai=false", "--allow-http=true", "--http-port=8000"
```
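For context, a minimal sketch of how that argument list might be passed to a launcher (the subprocess wrapper and the model_dir value are assumptions; model_dir matches the model_repository_path shown in the log below):

```python
# Hypothetical launcher around the argument list above; only the tritonserver
# flags themselves come from the original report.
import subprocess

model_dir = "/workspace/all_models/flan-t5-xxl/"  # from model_repository_path in the log
subprocess.run([
    "/opt/tritonserver/bin/tritonserver",
    f"--model-repository={model_dir}",
    "--allow-vertex-ai=false",
    "--allow-http=true",
    "--http-port=8000",
], check=True)
```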

Server startup log:

```
I0517 20:54:19.018947 42 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f32c8000000' with size 268435456
I0517 20:54:19.021759 42 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0517 20:54:19.021785 42 cuda_memory_manager.cc:105] CUDA memory pool is created on device 1 with size 67108864
I0517 20:54:19.021792 42 cuda_memory_manager.cc:105] CUDA memory pool is created on device 2 with size 67108864
I0517 20:54:19.021797 42 cuda_memory_manager.cc:105] CUDA memory pool is created on device 3 with size 67108864
I0517 20:54:19.681873 42 model_lifecycle.cc:459] loading: fastertransformer:1
I0517 20:54:19.877877 42 libfastertransformer.cc:1828] TRITONBACKEND_Initialize: fastertransformer
I0517 20:54:19.877916 42 libfastertransformer.cc:1838] Triton TRITONBACKEND API version: 1.10
I0517 20:54:19.877923 42 libfastertransformer.cc:1844] 'fastertransformer' TRITONBACKEND API version: 1.10
I0517 20:54:19.880255 42 libfastertransformer.cc:1876] TRITONBACKEND_ModelInitialize: fastertransformer (version 1)
I0517 20:54:19.881399 42 libfastertransformer.cc:372] Instance group type: KIND_CPU count: 1
I0517 20:54:19.881424 42 libfastertransformer.cc:402] Sequence Batching: disabled
I0517 20:54:19.881448 42 libfastertransformer.cc:412] Dynamic Batching: disabled
I0517 20:54:19.881731 42 libfastertransformer.cc:438] Before Loading Weights:
I0517 20:54:24.251597 140418994583296 _internal.py:224] 10.56.4.1 - - [17/May/2023 20:54:24] "GET /health HTTP/1.1" 200 -
after allocation    : free: 15.29 GB, total: 15.77 GB, used:  0.49 GB
I0517 20:54:30.392812 42 libfastertransformer.cc:448] After Loading Weights:
W0517 20:54:30.392962 42 libfastertransformer.cc:572] skipping model configuration auto-complete for 'fastertransformer': not supported for fastertransformer backend
after allocation    : free:  9.48 GB, total: 15.77 GB, used:  6.29 GB
[FT][WARNING] Custom All Reduce only supports 8 Ranks currently. Using NCCL as Comm.
I0517 20:54:31.534883 42 libfastertransformer.cc:472] Before Loading Model:
after allocation    : free:  9.16 GB, total: 15.77 GB, used:  6.62 GB
after allocation    : free:  9.34 GB, total: 15.77 GB, used:  6.43 GB
after allocation    : free:  9.34 GB, total: 15.77 GB, used:  6.43 GB
after allocation    : free:  9.16 GB, total: 15.77 GB, used:  6.62 GB
[WARNING] gemm_config.in is not found; using default GEMM algo
after allocation    : free:  9.17 GB, total: 15.77 GB, used:  6.60 GB
[WARNING] gemm_config.in is not found; using default GEMM algo
after allocation    : free:  8.98 GB, total: 15.77 GB, used:  6.79 GB
[WARNING] gemm_config.in is not found; using default GEMM algo
I0517 20:54:32.846577 42 libfastertransformer.cc:489] After Loading Model:
after allocation    : free:  9.17 GB, total: 15.77 GB, used:  6.60 GB
[WARNING] gemm_config.in is not found; using default GEMM algo
I0517 20:54:33.102062 42 libfastertransformer.cc:824] Model instance is created on GPU [ 0 1 2 3 ]
I0517 20:54:33.102099 42 libfastertransformer.cc:1940] TRITONBACKEND_ModelInstanceInitialize: fastertransformer_0 (count 1) (instance_id 0)
after allocation    : free:  8.98 GB, total: 15.77 GB, used:  6.79 GB
I0517 20:54:33.102468 42 model_lifecycle.cc:693] successfully loaded 'fastertransformer' version 1
I0517 20:54:33.102593 42 server.cc:563]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0517 20:54:33.102683 42 server.cc:590]
+-------------------+-----------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Backend           | Path                                                                        | Config                                                                                                                                                        |
+-------------------+-----------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
| fastertransformer | /opt/tritonserver/backends/fastertransformer/libtriton_fastertransformer.so | {"cmdline":{"auto-complete-config":"true","min-compute-capability":"6.000000","backend-directory":"/opt/tritonserver/backends","default-max-batch-size":"4"}} |
+-------------------+-----------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0517 20:54:33.102713 42 server.cc:633]
+-------------------+---------+--------+
| Model             | Version | Status |
+-------------------+---------+--------+
| fastertransformer | 1       | READY  |
+-------------------+---------+--------+

I0517 20:54:33.116054 42 metrics.cc:864] Collecting metrics for GPU 0: Tesla V100-SXM2-16GB
I0517 20:54:33.116087 42 metrics.cc:864] Collecting metrics for GPU 1: Tesla V100-SXM2-16GB
I0517 20:54:33.116097 42 metrics.cc:864] Collecting metrics for GPU 2: Tesla V100-SXM2-16GB
I0517 20:54:33.116106 42 metrics.cc:864] Collecting metrics for GPU 3: Tesla V100-SXM2-16GB
I0517 20:54:33.117003 42 metrics.cc:757] Collecting CPU metrics
I0517 20:54:33.117188 42 tritonserver.cc:2264]
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                                                                                                |
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                                                                                                               |
| server_version                   | 2.26.0                                                                                                                                                                                               |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data statistics trace logging |
| model_repository_path[0]         | /workspace/all_models/flan-t5-xxl/                                                                                                                                                                   |
| model_control_mode               | MODE_NONE                                                                                                                                                                                            |
| strict_model_config              | 0                                                                                                                                                                                                    |
| rate_limit                       | OFF                                                                                                                                                                                                  |
| pinned_memory_pool_byte_size     | 268435456                                                                                                                                                                                            |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                                                                                             |
| cuda_memory_pool_byte_size{1}    | 67108864                                                                                                                                                                                             |
| cuda_memory_pool_byte_size{2}    | 67108864                                                                                                                                                                                             |
| cuda_memory_pool_byte_size{3}    | 67108864                                                                                                                                                                                             |
| response_cache_byte_size         | 0                                                                                                                                                                                                    |
| min_supported_compute_capability | 6.0                                                                                                                                                                                                  |
| strict_readiness                 | 1                                                                                                                                                                                                    |
| exit_timeout                     | 30                                                                                                                                                                                                   |
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0517 20:54:33.118868 42 grpc_server.cc:4820] Started GRPCInferenceService at 0.0.0.0:8001
I0517 20:54:33.119158 42 http_server.cc:3474] Started HTTPService at 0.0.0.0:8000
I0517 20:54:33.160478 42 http_server.cc:181] Started Metrics Service at 0.0.0.0:8002
```

Inference prompt: "Summarize: Sandwiched between a second-hand bookstore and record shop in Cape Town's charmingly grungy suburb of Observatory is a blackboard reading 'Tapi Tapi -- Handcrafted, authentic African ice cream.' The parlor has become one of Cape Town's most talked about food establishments since opening in October 2020. And in its tiny kitchen, Jeff is creating ice cream flavors like no one else. Handwritten in black marker on the shiny kitchen counter are today's options: Salty kapenta dried fish (blitzed), toffee and scotch bonnet chile Sun-dried blackjack greens and caramel, Malted millet ,Hibiscus, cloves and anise. Using only flavors indigenous to the African continent, Guzha's ice cream has become the tool through which he is reframing the narrative around African food. 'This (is) ice cream for my identity, for other people's sake,' Jeff tells CNN. 'I think the (global) food story doesn't have much space for Africa ... unless we're looking at the generic idea of African food,' he adds. 'I'm not trying to appeal to the global universe -- I'm trying to help Black identities enjoy their culture on a more regular basis."

Response: "The-royalty ens -but-e-but-but-but-but-but-but-but-but-but-but-but-but-but-but-but-but-but-but-but-but-but-but-but-but-but-but-but-but-but-but-but-but-but-but-but-but-but-but-but-but-but-but-but-but-but-but-but-but-but-but-but-but-but-but-but-but-but"

chunyat commented 11 months ago

Anyone managed to find a workaround for this yet?

I'm hitting a similar issue with a flan-t5-base fine-tuned on my own task: it works fine when loaded as a HF model, but after converting it to either fp16 or fp32 with the huggingface_t5_ckpt_convert.py script and serving it from the Triton inference server, it returns the same kind of nonsense, basically one word on repeat, as in the two examples above.
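One assumption worth ruling out (not a confirmed fix): flan-t5 uses a gated activation (gated-gelu) where the original t5 uses relu, and if the converted checkpoint's config doesn't record that, decoding could degenerate in exactly this way. A quick inspection, with the caveat that the key names below are guesses:

```python
# Look for an activation/gated flag in the converted config.ini; the exact key
# names vary between converter versions, so this just scans for anything plausible.
import configparser

cfg = configparser.ConfigParser()
cfg.read("/path/to/converted/4-gpu/config.ini")  # adjust to your saved_dir
for section in cfg.sections():
    for key, value in cfg[section].items():
        if "activation" in key or "gated" in key:
            print(section, key, value)
```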