triton-inference-server / fastertransformer_backend

BSD 3-Clause "New" or "Revised" License

GPT-J model produces garbage results #125

Open BDODigitalTeam opened 1 year ago

BDODigitalTeam commented 1 year ago

Description

branch: main
fastertransformer docker: 22.03

!tar -axf step_383500_slim.tar.zstd -C ./models/

I0501 19:22:32.031682 3840 libfastertransformer.cc:321] After Loading Model:
I0501 19:22:32.032232 3840 libfastertransformer.cc:537] Model instance is created on GPU NVIDIA RTX A4500 Laptop GPU
I0501 19:22:32.032376 3840 model_repository_manager.cc:1152] successfully loaded 'fastertransformer' version 1
I0501 19:22:32.043795 3840 model_repository_manager.cc:997] loading: ensemble:1
I0501 19:22:32.144280 3840 model_repository_manager.cc:1152] successfully loaded 'ensemble' version 1
I0501 19:22:32.144386 3840 server.cc:524] 
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0501 19:22:32.144417 3840 server.cc:551] 
+-------------------+-----------------------------------------------------------------------------+--------+
| Backend           | Path                                                                        | Config |
+-------------------+-----------------------------------------------------------------------------+--------+
| pytorch           | /opt/tritonserver/backends/pytorch/libtriton_pytorch.so                     | {}     |
| onnxruntime       | /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so             | {}     |
| openvino          | /opt/tritonserver/backends/openvino_2021_4/libtriton_openvino_2021_4.so     | {}     |
| tensorflow        | /opt/tritonserver/backends/tensorflow1/libtriton_tensorflow1.so             | {}     |
| python            | /opt/tritonserver/backends/python/libtriton_python.so                       | {}     |
| fastertransformer | /opt/tritonserver/backends/fastertransformer/libtriton_fastertransformer.so | {}     |
+-------------------+-----------------------------------------------------------------------------+--------+

I0501 19:22:32.144435 3840 server.cc:594] 
+-------------------+---------+--------+
| Model             | Version | Status |
+-------------------+---------+--------+
| ensemble          | 1       | READY  |
| fastertransformer | 1       | READY  |
| postprocessing    | 1       | READY  |
| preprocessing     | 1       | READY  |
+-------------------+---------+--------+

I0501 19:22:32.172109 3840 metrics.cc:651] Collecting metrics for GPU 0: NVIDIA RTX A4500 Laptop GPU
I0501 19:22:32.173122 3840 tritonserver.cc:1962] 
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                                                                                        |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                                                                                                       |
| server_version                   | 2.20.0                                                                                                                                                                                       |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data statistics trace |
| model_repository_path[0]         | ./triton-model-store/gptj                                                                                                                                                                    |
| model_control_mode               | MODE_NONE                                                                                                                                                                                    |
| strict_model_config              | 1                                                                                                                                                                                            |
| rate_limit                       | OFF                                                                                                                                                                                          |
| pinned_memory_pool_byte_size     | 268435456                                                                                                                                                                                    |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                                                                                     |
| response_cache_byte_size         | 0                                                                                                                                                                                            |
| min_supported_compute_capability | 6.0                                                                                                                                                                                          |
| strict_readiness                 | 1                                                                                                                                                                                            |
| exit_timeout                     | 30                                                                                                                                                                                           |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0501 19:22:32.180397 3840 grpc_server.cc:4421] Started GRPCInferenceService at 0.0.0.0:8001
I0501 19:22:32.181622 3840 http_server.cc:3113] Started HTTPService at 0.0.0.0:8000
I0501 19:22:32.266530 3840 http_server.cc:178] Started Metrics Service at 0.0.0.0:8002
# Import libraries
import tritonclient.http as httpclient

# Initialize client
client = httpclient.InferenceServerClient("localhost:8000",
                                           concurrency=1,
                                           verbose=False)
# ...

# Request a text prompt from the user
print("Write any input prompt for the model and press ENTER:")
# Prepare tokens for sending to the server
inputs = prepare_inputs([[input()]])
# Send the request
result = client.infer(MODEL_GPTJ_FASTERTRANSFORMER, inputs)
print(result.as_numpy("OUTPUT_0"))

Write any input prompt for the model and press ENTER:

What is the square root of 4

I0501 21:00:34.424219 3840 libfastertransformer.cc:834] Start to forward

[b'What is the square root of 4DOWNunodesDOWNuno Lawyerscher Lawyers Lawyers Lawyers Lawyers Lawyers Lawyers Lawyers Lawyers Lawyers Lawyers Lawyers Lawyers Lawyers Lawyers Lawyers Lawyers Lawyers Lawyers Lawyers Lawyers Lawyerscheriframeiframeiframe Lawyers Lawyers Lawyers Lawyers Lawyers Lawyers Lawyers Lawyers Lawyers Lawyersiframe Lawyersiframeiframeiframeiframe Lawyersiframe Lawyers Lawyers Lawyersiframe Lawyersuno Lawyers Lawyers Lawyersdes-|iframe Lawyersdesiframedesdesdesdescheriframeiframeiframeiframeiframe Lawyersdescheriframeiframeiframeiframe Lawyersdesiframeiframeiframeiframeiframeiframeiframeiframeiframeiframeiframeiframeiframeiframeiframeiframeiframeiframe Lawyersdescheriframeiframeiframeiframeiframeiframeiframeiframeiframeiframeiframeiframeiframeiframeiframeiframeiframeiframeiframeiframeiframeunoiframe']

I0501 21:00:38.273056 3840 libfastertransformer.cc:836] Stop to forward
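The `prepare_inputs` helper called in the client snippet above is not shown. A minimal sketch of the raw arrays it typically builds for the GPT-J ensemble — assuming, per the NVIDIA blog post being followed, that the preprocessing model takes the prompt as a BYTES tensor plus a uint32 requested output length (the helper name and tensor layout here are assumptions, not the blog's exact code):

```python
import numpy as np

def prepare_raw_inputs(prompts, output_len=32):
    """Hedged sketch: build the raw arrays the ensemble's preprocessing
    step consumes. `prompts` is a list of rows of strings, matching the
    [[input()]] call shape in the snippet above. Wrapping these arrays
    in tritonclient InferInput objects (and setting the tensor names
    from the ensemble config) is omitted here.
    """
    # Shape [batch, 1]: one UTF-8-encoded prompt per request row.
    input0 = np.array(
        [[s.encode("utf-8") for s in row] for row in prompts],
        dtype=object,
    )
    # Shape [batch, 1]: number of tokens to generate for each request.
    request_output_len = np.full((input0.shape[0], 1), output_len,
                                 dtype=np.uint32)
    return input0, request_output_len
```

If the arrays are well-formed but the output is still garbage, a common thing to double-check is that the checkpoint conversion and the container version match.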


### Reproduced Steps

```shell
I followed the article at
https://developer.nvidia.com/blog/deploying-gpt-j-and-t5-with-fastertransformer-and-triton-inference-server/
but I am having no luck.
Any help would be appreciated.
Thanks