rkoyanagi opened this issue 1 month ago
Hey @rkoyanagi! I'm here to help you with any bugs, questions, or contributions you have for the repository. Let's tackle this issue together while we wait for a human maintainer.
To resolve the dependency conflicts and use the NvidiaTriton integration with the `llama-index` package, follow these steps:

1. **Upgrade `llama-index` and `llama-index-core` to compatible versions:** Ensure you are using versions that are compatible with each other. For the NvidiaTriton integration, you should use `llama-index` version 0.10.0 or higher and `llama-index-core` version 0.10.1 or higher.

   ```shell
   pip install llama-index==0.10.0 llama-index-core==0.10.1
   ```

2. **Install the `llama-index-llms-nvidia-triton` package:** Make sure you have the version of the NvidiaTriton package that is compatible with your `llama-index` and `llama-index-core` versions.

   ```shell
   pip install llama-index-llms-nvidia-triton==0.1.5
   ```

3. **Use the NvidiaTriton integration:** Here is an example of how to use it:

   ```python
   from llama_index.llms.nvidia_triton import NvidiaTriton

   # Provide the correct URL for your Triton server instance
   triton_url = "localhost:8001"

   # Instantiate the NvidiaTriton class
   triton_client = NvidiaTriton(server_url=triton_url)

   # Call the complete method with a prompt
   resp = triton_client.complete("The tallest mountain in North America is ")
   print(resp.text)
   ```

By following these steps, you should be able to resolve the dependency conflicts and use the NvidiaTriton integration with the `llama-index` package [1][2][3].
To continue talking to Dosu, mention @dosu.
I tried, but `pip install llama-index-llms-nvidia-triton==0.1.5` failed with the error below.

```
ERROR: Cannot install tritonclient[all]==2.41.1, tritonclient[all]==2.42.0, tritonclient[all]==2.43.0, tritonclient[all]==2.44.0, tritonclient[all]==2.45.0, tritonclient[all]==2.46.0 and tritonclient[all]==2.47.0 because these package versions have conflicting dependencies.
```
To resolve the dependency conflicts with `tritonclient`, you can try the following steps:

1. **Install `llama-index` and `llama-index-core` first:** Ensure you have the compatible versions of `llama-index` and `llama-index-core` installed.

   ```shell
   pip install llama-index==0.10.0 llama-index-core==0.10.1
   ```

2. **Install `tritonclient` separately:** Manually install a specific version of `tritonclient` that is known to work well with your setup. For example, you can try installing version 2.41.1 (quoting the extras specifier so the shell does not expand the brackets):

   ```shell
   pip install "tritonclient[all]==2.41.1"
   ```

3. **Install `llama-index-llms-nvidia-triton` without dependencies:** Use the `--no-deps` option to prevent pip from installing conflicting dependencies:

   ```shell
   pip install llama-index-llms-nvidia-triton==0.1.5 --no-deps
   ```

By following these steps, you should be able to avoid the conflicting dependencies and successfully use the NvidiaTriton integration with the `llama-index` package.
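Because `--no-deps` bypasses pip's dependency resolver, it is worth verifying the resulting environment yourself. A small sketch of such a check — the `check_versions` helper is hypothetical, built only on the standard library:

```python
from importlib.metadata import version, PackageNotFoundError

def check_versions(expected, get_version=None):
    """Return a dict of package -> problem for any pin that is not satisfied.

    get_version defaults to importlib.metadata.version and is injectable
    so the check can run without the real packages installed.
    """
    if get_version is None:
        get_version = version
    problems = {}
    for name, want in expected.items():
        try:
            got = get_version(name)
        except PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if got != want:
            problems[name] = f"installed {got}, expected {want}"
    return problems

# The pins from the steps above:
pins = {
    "llama-index": "0.10.0",
    "llama-index-core": "0.10.1",
    "tritonclient": "2.41.1",
    "llama-index-llms-nvidia-triton": "0.1.5",
}
```

Running `check_versions(pins)` after the three installs should return an empty dict when every pin is satisfied.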
@rkoyanagi v0.9.x is not compatible with any of the integration packages; v0.9.x is from before every integration was split into its own package. You'll need to upgrade to v0.10.x overall to use Nvidia Triton.
I tried with this version and ran the proposed code, but the error below occurred. Which library version is wrong?
**Code**

```python
from llama_index.llms.nvidia_triton import NvidiaTriton

# Provide the correct URL for your Triton server instance
triton_url = "localhost:8001"

# Instantiate the NvidiaTriton class
triton_client = NvidiaTriton(server_url=triton_url)

# Call the complete method with a prompt
resp = triton_client.complete("The tallest mountain in North America is ")
print(resp.text)
```
**Version**

```
llama-index==0.10.55
llama-index-core==0.10.55
llama-index-llms-nvidia-triton==0.1.5
grpcio-health-checking==1.64.1
grpcio-tools==1.64.1
tritonclient==2.47.0
```
**Error**

```
Traceback (most recent call last):
  File "/home/ubuntu/dev_weaviate/a.py", line 10, in <module>
    resp = triton_client.complete("The tallest mountain in North America is ")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/dev_weaviate/venv/lib/python3.12/site-packages/llama_index/core/instrumentation/dispatcher.py", line 230, in wrapper
    result = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/dev_weaviate/venv/lib/python3.12/site-packages/llama_index/llms/nvidia_triton/base.py", line 270, in complete
    client.stop_stream(model_params["model_name"], request_id)
  File "/home/ubuntu/dev_weaviate/venv/lib/python3.12/site-packages/llama_index/llms/nvidia_triton/utils.py", line 376, in stop_stream
    self._send_stop_signals(model_name, request_id)
  File "/home/ubuntu/dev_weaviate/venv/lib/python3.12/site-packages/llama_index/llms/nvidia_triton/utils.py", line 268, in _send_stop_signals
    self._client.async_stream_infer(
  File "/home/ubuntu/dev_weaviate/venv/lib/python3.12/site-packages/tritonclient/grpc/_client.py", line 1934, in async_stream_infer
    self._stream._enqueue_request(request)
  File "/home/ubuntu/dev_weaviate/venv/lib/python3.12/site-packages/tritonclient/grpc/_infer_stream.py", line 121, in _enqueue_request
    raise_error(
  File "/home/ubuntu/dev_weaviate/venv/lib/python3.12/site-packages/tritonclient/utils/__init__.py", line 40, in raise_error
    raise InferenceServerException(msg=msg) from None
tritonclient.utils.InferenceServerException: The stream is no longer in valid state, the error detail is reported through provided callback. A new stream should be started after stopping the current stream.
Exception ignored in: <function InferenceServerClient.__del__ at 0x7529d3d58cc0>
Traceback (most recent call last):
  File "/home/ubuntu/dev_weaviate/venv/lib/python3.12/site-packages/tritonclient/grpc/_client.py", line 257, in __del__
  File "/home/ubuntu/dev_weaviate/venv/lib/python3.12/site-packages/tritonclient/grpc/_client.py", line 265, in close
  File "/home/ubuntu/dev_weaviate/venv/lib/python3.12/site-packages/grpc/_channel.py", line 2250, in close
  File "/home/ubuntu/dev_weaviate/venv/lib/python3.12/site-packages/grpc/_channel.py", line 2231, in _close
AttributeError: 'NoneType' object has no attribute 'StatusCode'
```
In the Triton Inference Server logs, the error below occurred. How do I stop streaming mode?
```
terminate called after throwing an instance of 'std::runtime_error'
  what():  Streaming is only supported if model is deployed using decoupled mode.
Signal (6) received.
 0# 0x000055AE18C9EA9D in /opt/tritonserver/bin/tritonserver
 1# 0x00007BA54A15B520 in /usr/lib/x86_64-linux-gnu/libc.so.6
 2# pthread_kill in /usr/lib/x86_64-linux-gnu/libc.so.6
 3# raise in /usr/lib/x86_64-linux-gnu/libc.so.6
 4# abort in /usr/lib/x86_64-linux-gnu/libc.so.6
 5# 0x00007BA54C076B9E in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 6# 0x00007BA54C08220C in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 7# 0x00007BA54C082277 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 8# 0x00007BA54C0824D8 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 9# 0x00007BA4D34696FA in /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so
10# 0x00007BA4D3490F9C in /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so
11# 0x00007BA4D34912FB in /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so
12# 0x00007BA4D3492BDB in /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so
13# 0x00007BA4D3496C2C in /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so
14# TRITONBACKEND_ModelInstanceExecute in /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so
15# 0x00007BA54A92BF04 in /opt/tritonserver/bin/../lib/libtritonserver.so
16# 0x00007BA54A92C26B in /opt/tritonserver/bin/../lib/libtritonserver.so
17# 0x00007BA54AA3E88D in /opt/tritonserver/bin/../lib/libtritonserver.so
18# 0x00007BA54A930694 in /opt/tritonserver/bin/../lib/libtritonserver.so
19# 0x00007BA54C0B0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
20# 0x00007BA54A1ADAC3 in /usr/lib/x86_64-linux-gnu/libc.so.6
21# clone in /usr/lib/x86_64-linux-gnu/libc.so.6
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
I0718 01:00:40.759750 391 pb_stub.cc:1971] Non-graceful termination detected.
I0718 01:00:40.759849 390 pb_stub.cc:1971] Non-graceful termination detected.
I0718 01:00:40.886518 426 pb_stub.cc:1971] Non-graceful termination detected.
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node ip-10-0-3-52 exited on signal 6 (Aborted).
--------------------------------------------------------------------------
```
@dosu.
To resolve the issues with the NvidiaTriton integration in the `llama-index` package, specifically the streaming-mode error shown in the Triton Inference Server logs, you need to ensure that the `stream_complete` method is properly implemented. The current implementation raises a `NotImplementedError`, which means streaming mode is not supported.

Here is a possible implementation for the `stream_complete` method:
```python
def stream_complete(
    self, prompt: str, formatted: bool = False, **kwargs: Any
) -> CompletionResponseGen:
    from tritonclient.utils import InferenceServerException

    client = self._get_client()
    invocation_params = self._get_model_default_parameters
    invocation_params.update(kwargs)
    invocation_params["prompt"] = [[prompt]]
    model_params = self._identifying_params
    model_params.update(kwargs)
    request_id = str(random.randint(1, 9999999))  # nosec

    if self.triton_load_model_call:
        client.load_model(model_params["model_name"])

    result_queue = client.request_streaming(
        model_params["model_name"], request_id, **invocation_params
    )

    def gen() -> CompletionResponseGen:
        text = ""
        for token in result_queue:
            if isinstance(token, InferenceServerException):
                client.stop_stream(model_params["model_name"], request_id)
                raise token
            text += token
            yield CompletionResponse(text=text, delta=token)

    return gen()
```
This implementation sets up the streaming request and yields tokens as they are received from the Triton Inference Server. This should address the streaming mode error in the Triton Inference Server logs [1].
Additionally, ensure that you have followed the setup steps for the Triton Inference Server and installed the necessary dependencies as outlined in the LlamaIndex Nvidia Triton documentation:

1. Launch the Triton Inference Server:

   ```shell
   docker run -ti --gpus all --network=host --shm-size=1g --ulimit memlock=-1 nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3
   ```

2. Install dependencies:

   ```shell
   pip install \
       "psutil" \
       "pynvml>=11.5.0" \
       "torch==2.1.2" \
       "tensorrt_llm==0.8.0" --extra-index-url https://pypi.nvidia.com/
   ```

3. Install the Triton CLI:

   ```shell
   pip install git+https://github.com/triton-inference-server/triton_cli.git
   ```

4. Generate the model repository and start the Triton server:

   ```shell
   triton remove -m all
   triton import -m gpt2 --backend tensorrtllm
   triton start &
   ```

5. Install `tritonclient`:

   ```shell
   pip install "tritonclient[all]"
   ```

6. Install the `llama-index` connector:

   ```shell
   pip install llama-index-llms-nvidia-triton
   ```

By ensuring that the Triton Inference Server is correctly set up and all dependencies are installed, you can avoid common issues and ensure smooth integration with the `llama-index` package [2].
@dosu
The proposed `stream_complete` method has a syntax error, and the error I got is not a `NotImplementedError`. I got the error below:

```
tritonclient.utils.InferenceServerException: The stream is no longer in valid state, the error detail is reported through provided callback. A new stream should be started after stopping the current stream.
Exception ignored in: <function InferenceServerClient.__del__ at 0x7529d3d58cc0>
```
To resolve the "Streaming is only supported if model is deployed using decoupled mode" error and properly stop streaming mode when using the NvidiaTriton integration with the `llama-index` package, you need to ensure that the model is deployed in decoupled mode and use the provided methods to stop the streaming.
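In Triton, decoupled mode is enabled per model in its `config.pbtxt` via the `model_transaction_policy` field. A minimal fragment of that config — which model needs it depends on your deployment; for the TensorRT-LLM backend it is typically the `tensorrt_llm` model:

```
model_transaction_policy {
  decoupled: true
}
```

After editing the config, reload the model or restart the server so the setting takes effect.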
Here is the relevant code to stop the streaming mode:
```python
def stop_stream(
    self, model_name: str, request_id: str, signal: bool = True
) -> None:
    """Close the streaming connection."""
    if signal:
        self._send_stop_signals(model_name, request_id)
    self._client.stop_stream()
```
To stop streaming mode, you can call the `stop_stream` method with the appropriate parameters:
```python
# Assuming you have an instance of GrpcTritonClient
client = GrpcTritonClient(server_url="your_triton_server_url")

# Model name and request ID used during the streaming request
model_name = "your_model_name"
request_id = "your_request_id"

# Stop the streaming
client.stop_stream(model_name, request_id)
```
Make sure that the model is deployed in decoupled mode on the Triton server to support streaming. Additionally, the `StreamingResponseGenerator` class in the `llama-index` package handles the streaming response and stops the stream when necessary:
```python
class StreamingResponseGenerator(Queue):
    """A Generator that provides the inference results from an LLM."""

    def __init__(
        self, client: "GrpcTritonClient", request_id: str, force_batch: bool
    ) -> None:
        """Instantiate the generator class."""
        super().__init__()
        self._client = client
        self.request_id = request_id
        self._batch = force_batch

    def __iter__(self) -> "StreamingResponseGenerator":
        """Return self as a generator."""
        return self

    def __next__(self) -> str:
        """Return the next retrieved token."""
        val = self.get()
        if val is None or val in STOP_WORDS:
            self._stop_stream()
            raise StopIteration
        return val

    def _stop_stream(self) -> None:
        """Drain and shutdown the Triton stream."""
        self._client.stop_stream(
            "tensorrt_llm", self.request_id, signal=not self._batch
        )
```
This class ensures that the streaming connection is properly closed when the end of the stream is reached or a stop signal is encountered.

Additionally, the `NvidiaTriton` class in the `llama-index` package provides methods to interact with the Triton server, including streaming completions:
```python
class NvidiaTriton(LLM):
    """Nvidia Triton LLM."""

    def __init__(self, server_url: str = DEFAULT_SERVER_URL, model: str = DEFAULT_MODEL, **kwargs: Any) -> None:
        # Initialization code
        self._client = GrpcTritonClient(server_url)

    def complete(self, prompt: str, **kwargs: Any) -> CompletionResponse:
        client = self._get_client()
        invocation_params = self._get_model_default_parameters
        invocation_params.update(kwargs)
        invocation_params["prompt"] = [[prompt]]
        model_params = self._identifying_params
        model_params.update(kwargs)
        request_id = str(random.randint(1, 9999999))  # nosec

        if self.triton_load_model_call:
            client.load_model(model_params["model_name"])

        result_queue = client.request_streaming(
            model_params["model_name"], request_id, **invocation_params
        )

        response = ""
        for token in result_queue:
            if isinstance(token, InferenceServerException):
                client.stop_stream(model_params["model_name"], request_id)
                raise token
            response += token

        return CompletionResponse(text=response)

    def stream_complete(self, prompt: str, **kwargs: Any) -> CompletionResponseGen:
        client = self._get_client()
        invocation_params = self._get_model_default_parameters
        invocation_params.update(kwargs)
        invocation_params["prompt"] = [[prompt]]
        model_params = self._identifying_params
        model_params.update(kwargs)
        request_id = str(random.randint(1, 9999999))  # nosec

        if self.triton_load_model_call:
            client.load_model(model_params["model_name"])

        result_queue = client.request_streaming(
            model_params["model_name"], request_id, **invocation_params
        )

        def gen() -> CompletionResponseGen:
            text = ""
            for token in result_queue:
                if isinstance(token, InferenceServerException):
                    client.stop_stream(model_params["model_name"], request_id)
                    raise token
                text += token
                yield CompletionResponse(text=text, delta=token)

        return gen()
```
This ensures that the streaming connection is properly managed and closed when necessary [1][2][3].
I edited the Triton Inference Server to use decoupled mode and got it working for the time being, but I get the exception below. What should I change?
```
Exception ignored in: <function InferenceServerClient.__del__ at 0x72910cd24cc0>
Traceback (most recent call last):
  File "/home/ubuntu/dev_weaviate/venv/lib/python3.12/site-packages/tritonclient/grpc/_client.py", line 257, in __del__
  File "/home/ubuntu/dev_weaviate/venv/lib/python3.12/site-packages/tritonclient/grpc/_client.py", line 265, in close
  File "/home/ubuntu/dev_weaviate/venv/lib/python3.12/site-packages/grpc/_channel.py", line 2250, in close
  File "/home/ubuntu/dev_weaviate/venv/lib/python3.12/site-packages/grpc/_channel.py", line 2231, in _close
AttributeError: 'NoneType' object has no attribute 'StatusCode'
```
@dosu
To resolve the `AttributeError: 'NoneType' object has no attribute 'StatusCode'` exception when using the Triton Inference Server in decoupled mode, ensure that the Triton server is running and accessible at the specified URL. The setup and code should be configured as follows:
1. Ensure the Triton server is running:

   ```shell
   docker run -ti --gpus all --network=host --shm-size=1g --ulimit memlock=-1 nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3
   ```

2. Install the necessary dependencies:

   ```shell
   pip install \
       "psutil" \
       "pynvml>=11.5.0" \
       "torch==2.1.2" \
       "tensorrt_llm==0.8.0" --extra-index-url https://pypi.nvidia.com/
   ```

3. Install the Triton CLI:

   ```shell
   pip install git+https://github.com/triton-inference-server/triton_cli.git
   ```

4. Generate the model repository and start the Triton server:

   ```shell
   triton remove -m all
   triton import -m gpt2 --backend tensorrtllm
   triton start &
   ```

5. Install the `tritonclient` package:

   ```shell
   pip install "tritonclient[all]"
   ```

6. Install the `llama-index` connector:

   ```shell
   pip install llama-index-llms-nvidia-triton
   ```

7. Use the following Python code to interact with the Triton server:

   ```python
   from llama_index.llms.nvidia_triton import NvidiaTriton

   # Ensure a Triton server instance is running and provide the correct URL
   triton_url = "localhost:8001"
   model_name = "gpt2"

   # Instantiate the NvidiaTriton class
   triton_client = NvidiaTriton(server_url=triton_url, model_name=model_name, tokens=32)

   # Call the complete method with a prompt
   resp = triton_client.complete("The tallest mountain in North America is ")
   print(resp)
   ```
If the server is not running or the URL is incorrect, you may encounter the `NoneType` error. Ensure the server is accessible and the URL is correct. Additionally, make sure that the `NvidiaTriton` class is properly instantiated and the `complete` method is called correctly as shown in the example.
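Separately, the `AttributeError` in `InferenceServerClient.__del__` fires at interpreter shutdown, when garbage collection runs the destructor after grpc's module globals have already been torn down. Explicitly closing the underlying gRPC client before the program exits usually avoids it. A hedged sketch — the `run_completion` helper is hypothetical, and the `_client._client` attribute chain is inferred from the traceback above and may change between versions:

```python
def run_completion(triton_client, prompt):
    """Run a completion, then explicitly close the underlying gRPC client.

    Closing before interpreter shutdown means InferenceServerClient.__del__
    has nothing left to do once grpc's globals have been torn down.
    """
    try:
        return triton_client.complete(prompt)
    finally:
        # NvidiaTriton._client is llama-index's GrpcTritonClient; its own
        # _client is tritonclient's InferenceServerClient (per the traceback).
        wrapper = getattr(triton_client, "_client", None)
        grpc_client = getattr(wrapper, "_client", None)
        if grpc_client is not None:
            grpc_client.close()
```

With this pattern the channel is closed deterministically instead of at garbage-collection time.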
The `NvidiaTriton` class from the `llama_index` library includes several properties and methods to handle the Triton client connection and model invocation. Ensure that the `_client` attribute is properly initialized and that the `complete` method is correctly implemented to handle the Triton client calls. Here is a snippet from the `NvidiaTriton` class:
```python
class NvidiaTriton(LLM):
    server_url: str = Field(
        default=DEFAULT_SERVER_URL,
        description="The URL of the Triton inference server to use.",
    )
    model_name: str = Field(
        default=DEFAULT_MODEL,
        description="The name of the Triton hosted model this client should use",
    )
    # Other fields...

    _client: Optional[GrpcTritonClient] = PrivateAttr()

    def __init__(self, server_url: str = DEFAULT_SERVER_URL, model: str = DEFAULT_MODEL, **kwargs: Any) -> None:
        super().__init__(server_url=server_url, model=model, **kwargs)
        try:
            self._client = GrpcTritonClient(server_url)
        except ImportError as err:
            raise ImportError(
                "Could not import triton client python package. "
                "Please install it with `pip install tritonclient`."
            ) from err

    def _get_client(self) -> Any:
        if not self.reuse_client:
            return GrpcTritonClient(self.server_url)
        if self._client is None:
            self._client = GrpcTritonClient(self.server_url)
        return self._client

    def complete(self, prompt: str, **kwargs: Any) -> CompletionResponse:
        from tritonclient.utils import InferenceServerException

        client = self._get_client()
        invocation_params = self._get_model_default_parameters
        invocation_params.update(kwargs)
        invocation_params["prompt"] = [[prompt]]
        model_params = self._identifying_params
        model_params.update(kwargs)
        request_id = str(random.randint(1, 9999999))  # nosec

        if self.triton_load_model_call:
            client.load_model(model_params["model_name"])

        result_queue = client.request_streaming(
            model_params["model_name"], request_id, **invocation_params
        )

        response = ""
        for token in result_queue:
            if isinstance(token, InferenceServerException):
                client.stop_stream(model_params["model_name"], request_id)
                raise token
            response = response + token

        return CompletionResponse(text=response)
```
Ensure that the `GrpcTritonClient` is correctly instantiated and that the `complete` method handles the Triton client calls properly. This should help resolve the `AttributeError: 'NoneType' object has no attribute 'StatusCode'` exception [1][2][3].
**Bug Description**

First I used llama-index 0.9.13 and ran `pip install llama-index-llms-nvidia-triton` (version 0.0.1 was installed, along with llama-index-core==0.9.56). But I could not import the client (the code below errored). I saw in the repository that the Nvidia Triton client was added in version 0.10, so I updated to `llama_index==0.10.0`, but the error below occurred. So I tried `pip install llama-index-core==0.9.56`, but the error below occurred. I think there is no compatible version set for the NvidiaTriton integration. How can I use the NvidiaTriton integration?

**Version**

0.10.0

**Steps to Reproduce**

pip install the version set above.

**Relevant Logs/Tracebacks**

No response