towhee-io / towhee

Towhee is a framework that is dedicated to making neural data processing pipelines simple and fast.
https://towhee.io
Apache License 2.0

[Bug]: Error: Start the triton server #2673

Closed Mrzhiyao closed 10 months ago

Mrzhiyao commented 10 months ago

Is there an existing issue for this?

Current Behavior

root@85d70c862b32:/opt/tritonserver# tritonserver --model-repository `pwd`/models
W1109 05:31:06.568839 124 pinned_memory_manager.cc:236] Unable to allocate pinned system memory, pinned memory pool will not be available: CUDA driver version is insufficient for CUDA runtime version
I1109 05:31:06.568981 124 cuda_memory_manager.cc:115] CUDA memory pool disabled
I1109 05:31:06.569292 124 tritonserver.cc:2176]
+----------------------------------+------------------------------------------------------------------+
| Option                           | Value                                                            |
+----------------------------------+------------------------------------------------------------------+
| server_id                        | triton                                                           |
| server_version                   | 2.24.0                                                           |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data statistics trace |
| model_repository_path[0]         | /opt/tritonserver/models                                         |
| model_control_mode               | MODE_NONE                                                        |
| strict_model_config              | 0                                                                |
| rate_limit                       | OFF                                                              |
| pinned_memory_pool_byte_size     | 268435456                                                        |
| response_cache_byte_size         | 0                                                                |
| min_supported_compute_capability | 6.0                                                              |
| strict_readiness                 | 1                                                                |
| exit_timeout                     | 30                                                               |
+----------------------------------+------------------------------------------------------------------+

I1109 05:31:06.569348 124 server.cc:257] No server context available. Exiting immediately.
error: creating server: Internal - failed to stat file /opt/tritonserver/models

Expected Behavior

I'm following the official documentation to deploy the Triton server and use Towhee with it to speed up encoding.

I get an error at the "Start the Triton server" step after entering the server container. I can still use Towhee for encoding in my local environment when I don't go through the Triton server. The error message mentions that the CUDA driver version is insufficient for the CUDA runtime version; is that why the server cannot start, and how can I continue?

Steps To Reproduce

1. Build the Docker image
from towhee import pipe, ops, AutoConfig
import numpy as np

p = (
    pipe.input('text')
    .map('text', 'vec', ops.sentence_embedding.sbert(model_name='paraphrase-multilingual-mpnet-base-v2'), config=AutoConfig.TritonGPUConfig())
    .map('vec', 'vec', lambda x: x / np.linalg.norm(x, axis=0))
    .output('vec')
)

import towhee

towhee.build_docker_image(
    dc_pipeline=p,
    image_name='clip:v1',
    cuda_version='11.7', # '117dev' for developer
    format_priority=['onnx'],
    parallelism=4,
    inference_server='triton'
)
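
For reference, once the image is built, a container is typically started from it along the lines of the sketch below; the --gpus flag and the 8000 port mapping are assumptions based on Triton's defaults rather than details from this exact setup.

# Illustrative only: start a container from the image built above,
# exposing the GPUs and Triton's default HTTP port.
docker run -td --gpus all -p 8000:8000 clip:v1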

2. Create the models

import towhee
from towhee import pipe, ops, AutoConfig
import numpy as np
p = (
    pipe.input('text')
    .map('text', 'vec', ops.sentence_embedding.sbert(model_name='paraphrase-multilingual-mpnet-base-v2'), config=AutoConfig.TritonGPUConfig())
    .map('vec', 'vec', lambda x: x / np.linalg.norm(x, axis=0))
    .output('vec')
)

towhee.build_pipeline_model(
    dc_pipeline=p,
    model_root='models',
    format_priority=['onnx'],
    parallelism=4,
    server='triton'
)
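
For reference, a minimal sketch of how the generated models directory can then be served with the stock Triton container; the image tag matches the 22.07 release listed in the environment below, but the mount path and flags are illustrative assumptions.

# Illustrative only: mount the generated models directory into the
# NGC Triton container and point the server at it.
docker run -td --gpus all -p 8000:8000 \
    -v $(pwd)/models:/opt/tritonserver/models \
    nvcr.io/nvidia/tritonserver:22.07-py3 \
    tritonserver --model-repository=/opt/tritonserver/models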

Environment

- Towhee version: 1.1.2
- OS (Ubuntu or CentOS): Ubuntu
- GPU: RTX 3090
- tritonserver: 22.07
- CUDA: 11.7
- CUDA driver: 535.129.03

(base) eg@eg-HP-Z8-G4-Workstation:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_May__3_18:49:52_PDT_2022
Cuda compilation tools, release 11.7, V11.7.64
Build cuda_11.7.r11.7/compiler.31294372_0

(base) eg@eg-HP-Z8-G4-Workstation:~$ nvidia-smi
Thu Nov  9 14:03:48 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |

root@85d70c862b32:/opt/tritonserver# nvcc -v
nvcc fatal   : No input files specified; use option --help for more information
root@85d70c862b32:/opt/tritonserver# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Jun__8_16:49:14_PDT_2022
Cuda compilation tools, release 11.7, V11.7.99
Build cuda_11.7.r11.7/compiler.31442593_0

root@85d70c862b32:/opt/tritonserver# nvidia-smi
bash: nvidia-smi: command not found

Anything else?

No response

junjiejiangjjj commented 10 months ago

[image] Did you use the --gpus flag when starting Docker?

Mrzhiyao commented 10 months ago

This problem was solved after I restarted the container, but a new error occurred when executing the program.

Traceback (most recent call last):
  File "/home/eg/PycharmProjects/Towhee/triton_endcod.py", line 8, in <module>
    res = client(data)
  File "/home/eg/anaconda3/envs/towhee38/lib/python3.8/site-packages/towhee/serve/triton/pipeline_client.py", line 81, in __call__
    return self._loop.run_until_complete(self._call(inputs))[0]
  File "/home/eg/anaconda3/envs/towhee38/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
    return future.result()
  File "/home/eg/anaconda3/envs/towhee38/lib/python3.8/site-packages/towhee/serve/triton/pipeline_client.py", line 68, in _call
    response = await self._client.infer(self._model_name, inputs)
  File "/home/eg/anaconda3/envs/towhee38/lib/python3.8/site-packages/tritonclient/http/aio/__init__.py", line 757, in infer
    response = await self._post(
  File "/home/eg/anaconda3/envs/towhee38/lib/python3.8/site-packages/tritonclient/http/aio/__init__.py", line 209, in _post
    res = await self._stub.post(
  File "/home/eg/anaconda3/envs/towhee38/lib/python3.8/site-packages/aiohttp/client.py", line 586, in _request
    await resp.start(conn)
  File "/home/eg/anaconda3/envs/towhee38/lib/python3.8/site-packages/aiohttp/client_reqrep.py", line 920, in start
    self._continue = None
  File "/home/eg/anaconda3/envs/towhee38/lib/python3.8/site-packages/aiohttp/helpers.py", line 725, in __exit__
    raise asyncio.TimeoutError from None
asyncio.exceptions.TimeoutError
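
For reference, a minimal sketch of the client call that produces this traceback; the triton_client helper and the localhost:8000 URL are assumptions based on the Towhee serving docs, so the host port should be adjusted to whatever the container actually maps.

# Illustrative client sketch -- the URL and input text are placeholders.
from towhee import triton_client

client = triton_client('localhost:8000')  # use the host port mapped to the container
data = 'hello towhee'
res = client(data)                        # this is the call that times out above
print(res)
client.close()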

Mrzhiyao commented 10 months ago

[image] Did you use the --gpus flag when starting Docker?

Yes, the problem was solved after I recreated the container, but a new problem appeared. Do you know how to solve this problem?

junjiejiangjjj commented 10 months ago

It seems that access to the Triton server timed out. Are there any logs on the server?
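
For reference, the server-side log can be pulled from the running container; the container name below is a placeholder.

# Illustrative: print the Triton container's logs.
docker logs <triton_container_name>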

Mrzhiyao commented 10 months ago

It seems that access to the Triton server timed out. Are there any logs on the server?

docker logs shows that:

NVIDIA Release 22.07 (build 41737377) Triton Server Version 2.24.0

Copyright (c) 2018-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License. By pulling and using the container, you accept the terms and conditions of this license: https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

I1109 06:53:09.532688 1 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f6a4e000000' with size 268435456
I1109 06:53:09.533016 1 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I1109 06:53:09.536004 1 model_repository_manager.cc:1206] loading: pipeline:1
I1109 06:53:09.536049 1 model_repository_manager.cc:1206] loading: sentence-embedding.sbert-0:1
/usr/lib/python3/dist-packages/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (2.0.7) or chardet (3.0.4) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
I1109 06:53:11.225232 1 onnxruntime.cc:2458] TRITONBACKEND_Initialize: onnxruntime
I1109 06:53:11.225295 1 onnxruntime.cc:2468] Triton TRITONBACKEND API version: 1.10
I1109 06:53:11.225317 1 onnxruntime.cc:2474] 'onnxruntime' TRITONBACKEND API version: 1.10
I1109 06:53:11.225331 1 onnxruntime.cc:2504] backend configuration: {"cmdline":{"auto-complete-config":"true","min-compute-capability":"6.000000","backend-directory":"/opt/tritonserver/backends","default-max-batch-size":"4"}}
I1109 06:53:11.259270 1 onnxruntime.cc:2560] TRITONBACKEND_ModelInitialize: sentence-embedding.sbert-0 (version 1)
W1109 06:53:14.630221 1 onnxruntime.cc:787] autofilled max_batch_size to 4 for model 'sentence-embedding.sbert-0' since batching is supporrted but no max_batch_size is specified in model configuration. Must specify max_batch_size to utilize autofill with a larger max batch size
I1109 06:53:14.685000 1 python_be.cc:1767] TRITONBACKEND_ModelInstanceInitialize: pipeline_0_0 (CPU device 0)
/usr/lib/python3/dist-packages/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (2.0.7) or chardet (3.0.4) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
I1109 06:53:17.996107 1 onnxruntime.cc:2603] TRITONBACKEND_ModelInstanceInitialize: sentence-embedding.sbert-0_0 (GPU device 0)
I1109 06:53:20.312004 1 python_be.cc:1767] TRITONBACKEND_ModelInstanceInitialize: pipeline_0_1 (CPU device 0)
I1109 06:53:20.312255 1 model_repository_manager.cc:1352] successfully loaded 'sentence-embedding.sbert-0' version 1
/usr/lib/python3/dist-packages/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (2.0.7) or chardet (3.0.4) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
I1109 06:53:23.568245 1 python_be.cc:1767] TRITONBACKEND_ModelInstanceInitialize: pipeline_0_2 (CPU device 0)
/usr/lib/python3/dist-packages/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (2.0.7) or chardet (3.0.4) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
I1109 06:53:26.839855 1 python_be.cc:1767] TRITONBACKEND_ModelInstanceInitialize: pipeline_0_3 (CPU device 0)
/usr/lib/python3/dist-packages/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (2.0.7) or chardet (3.0.4) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
I1109 06:53:30.081773 1 model_repository_manager.cc:1352] successfully loaded 'pipeline' version 1
I1109 06:53:30.082043 1 server.cc:559]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I1109 06:53:30.082215 1 server.cc:586]

+-------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+
| Backend     | Path                                                            | Config                                                                                                                 |
+-------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+
| python      | /opt/tritonserver/backends/python/libtriton_python.so           | {"cmdline":{"auto-complete-config":"true","min-compute-capability":"6.000000","backend-directory":"/opt/tritonserver/backends","default-max-batch-size":"4"}} |
| onnxruntime | /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so | {"cmdline":{"auto-complete-config":"true","min-compute-capability":"6.000000","backend-directory":"/opt/tritonserver/backends","default-max-batch-size":"4"}} |
+-------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+

I1109 06:53:30.082348 1 server.cc:629]
+----------------------------+---------+--------+
| Model                      | Version | Status |
+----------------------------+---------+--------+
| pipeline                   | 1       | READY  |
| sentence-embedding.sbert-0 | 1       | READY  |
+----------------------------+---------+--------+

I1109 06:53:30.135753 1 metrics.cc:650] Collecting metrics for GPU 0: NVIDIA GeForce RTX 3090
I1109 06:53:30.136027 1 tritonserver.cc:2176]
I1109 06:53:30.137643 1 grpc_server.cc:4608] Started GRPCInferenceService at 0.0.0.0:8001
I1109 06:53:30.137940 1 http_server.cc:3312] Started HTTPService at 0.0.0.0:8000
I1109 06:53:30.179419 1 http_server.cc:178] Started Metrics Service at 0.0.0.0:8002

junjiejiangjjj commented 10 months ago

Run curl http://0.0.0.0:8000/v2/models/stats to check that the server is available.
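
For reference, Triton also exposes standard readiness and liveness endpoints that can be used for the same kind of check; the port below assumes the default 8000 HTTP mapping.

# Illustrative health checks against Triton's standard KServe v2 endpoints.
curl -v http://localhost:8000/v2/health/ready   # HTTP 200 once models can serve requests
curl -v http://localhost:8000/v2/health/live    # HTTP 200 while the server process is up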

Mrzhiyao commented 10 months ago

Run curl http://0.0.0.0:8000/v2/models/stats to check that the server is available.

I mapped the local port to 8010, so I get the result below. What might be the cause of the error in this case? Thank you for your help.

(base) eg@eg-HP-Z8-G4-Workstation:~$ curl http://0.0.0.0:8010/v2/models/stats {"model_stats":[{"name":"pipeline","version":"1","last_inference":0,"inference_count":0,"execution_count":0,"inference_stats":{"success":{"count":0,"ns":0},"fail":{"count":0,"ns":0},"queue":{"count":0,"ns":0},"compute_input":{"count":0,"ns":0},"compute_infer":{"count":0,"ns":0},"compute_output":{"count":0,"ns":0},"cache_hit":{"count":0,"ns":0},"cache_miss":{"count":0,"ns":0}},"batch_stats":[]},{"name":"sentence-embedding.sbert-0","version":"1","last_inference":0,"inference_count":0,"execution_count":0,"inference_stats":{"success":{"count":0,"ns":0},"fail":{"count":0,"ns":0},"queue":{"count":0,"ns":0},"compute_input":{"count":0,"ns":0},"compute_infer":{"count":0,"ns":0},"compute_output":{"count":0,"ns":0},"cache_hit":{"count":0,"ns":0},"cache_miss":{"count":0,"ns":0}},"batch_stats":[]}]}

junjiejiangjjj commented 10 months ago

Try ops.sentence_embedding.transformers; sbert has some bugs. [image] This pipeline works fine.
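
For reference, a minimal sketch of the same pipeline with the sbert operator swapped for ops.sentence_embedding.transformers; the model name is simply carried over from the original pipeline and may need to be changed for the transformers operator.

# Illustrative only: the original pipeline with the operator swapped to
# sentence_embedding.transformers. Verify the model name against the operator docs.
from towhee import pipe, ops, AutoConfig
import numpy as np

p = (
    pipe.input('text')
    .map('text', 'vec',
         ops.sentence_embedding.transformers(model_name='paraphrase-multilingual-mpnet-base-v2'),
         config=AutoConfig.TritonGPUConfig())
    .map('vec', 'vec', lambda x: x / np.linalg.norm(x, axis=0))
    .output('vec')
)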

Mrzhiyao commented 10 months ago

Try ops.sentence_embedding.transformers; sbert has some bugs. [image] This pipeline works fine.

Thank you for your help; I think my problem has been resolved. One more question: which parameters can I tune to further improve encoding speed when accelerating model inference through the Triton server?

junjiejiangjjj commented 10 months ago

It is possible to optimize performance by adjusting parameters such as the number of instances and batch size. For more information, please refer to the Triton documentation: https://github.com/triton-inference-server/server
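
For reference, a minimal sketch of passing such knobs through AutoConfig.TritonGPUConfig; the parameter names below are taken from my reading of the Towhee AutoConfig API and should be verified against the current docs.

# Illustrative only: tune instance count and batching for the Triton-served op.
# Parameter names are assumptions -- check them against towhee's AutoConfig docs.
from towhee import AutoConfig

config = AutoConfig.TritonGPUConfig(
    device_ids=[0],              # GPU(s) to place model instances on
    num_instances_per_device=3,  # more instances can raise throughput
    max_batch_size=128,          # upper bound for dynamic batching
    batch_latency_micros=100000, # how long to wait while forming a batch
)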

Mrzhiyao commented 10 months ago

It is possible to optimize performance by adjusting parameters such as the number of instances and batch size. For more information, please refer to the Triton documentation: https://github.com/triton-inference-server/server

Thank you very much for your help. I think my problem has been resolved.