run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai

[Bug]: No available version integration of NvidiaTriton #14716

Open rkoyanagi opened 1 month ago

rkoyanagi commented 1 month ago

Bug Description

First, I was using llama-index 0.9.13 and ran pip install llama-index-llms-nvidia-triton (version 0.0.1 was installed, along with llama-index-core 0.9.56).

But I could not import the client; the import below failed:

from llama_index.llms import NvidiaTriton

Looking at the repository, I saw that the nvidia_triton client was added in version 0.10, so I updated to llama-index==0.10.0, but the error below occurred.

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
llama-index-llms-nvidia-triton 0.0.1 requires llama-index-core<0.10.0,>=0.9.32, but you have llama-index-core 0.10.54.post1 which is incompatible.

So I tried pip install llama-index-core==0.9.56, but the error below occurred.

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
llama-index 0.10.0 requires llama-index-core<0.11.0,>=0.10.0, but you have llama-index-core 0.9.56 which is incompatible.
llama-index-agent-openai 0.1.7 requires llama-index-core<0.11.0,>=0.10.1, but you have llama-index-core 0.9.56 which is incompatible.
llama-index-embeddings-openai 0.1.10 requires llama-index-core<0.11.0,>=0.10.1, but you have llama-index-core 0.9.56 which is incompatible.
llama-index-llms-openai 0.1.25 requires llama-index-core<0.11.0,>=0.10.24, but you have llama-index-core 0.9.56 which is incompatible.
llama-index-multi-modal-llms-openai 0.1.7 requires llama-index-core<0.11.0,>=0.10.1, but you have llama-index-core 0.9.56 which is incompatible.
llama-index-program-openai 0.1.6 requires llama-index-core<0.11.0,>=0.10.1, but you have llama-index-core 0.9.56 which is incompatible.
llama-index-question-gen-openai 0.1.3 requires llama-index-core<0.11.0,>=0.10.1, but you have llama-index-core 0.9.56 which is incompatible.
llama-index-readers-file 0.1.30 requires llama-index-core<0.11.0,>=0.10.37.post1, but you have llama-index-core 0.9.56 which is incompatible.

I think there is no compatible version set for the NvidiaTriton integration. How can I use the NvidiaTriton integration?

Version

0.10.0

Steps to Reproduce

pip install the version set listed above.

Relevant Logs/Tracebacks

No response

dosubot[bot] commented 1 month ago

Hey @rkoyanagi! I'm here to help you with any bugs, questions, or contributions you have for the repository. Let's tackle this issue together while we wait for a human maintainer.

To resolve the dependency conflicts and use the NvidiaTriton integration with the llama-index package, follow these steps:

  1. Upgrade llama-index and llama-index-core to compatible versions: Ensure you are using versions that are compatible with each other. For NvidiaTriton integration, you should use llama-index version 0.10.0 or higher and llama-index-core version 0.10.1 or higher.

    pip install llama-index==0.10.0 llama-index-core==0.10.1
  2. Install the llama-index-llms-nvidia-triton package: Make sure you have the correct version of the NvidiaTriton package that is compatible with the llama-index and llama-index-core versions.

    pip install llama-index-llms-nvidia-triton==0.1.5
  3. Use the NvidiaTriton integration: Here is an example of how to use the NvidiaTriton integration:

    from llama_index.llms.nvidia_triton import NvidiaTriton
    
    # Provide the correct URL for your Triton server instance
    triton_url = "localhost:8001"
    
    # Instantiate the NvidiaTriton class
    triton_client = NvidiaTriton(server_url=triton_url)
    
    # Call the complete method with a prompt
    resp = triton_client.complete("The tallest mountain in North America is ")
    print(resp.text)

By following these steps, you should be able to resolve the dependency conflicts and use the NvidiaTriton integration with the llama-index package [1][2][3].

To continue talking to Dosu, mention @dosu.

rkoyanagi commented 1 month ago

I tried this, but pip install llama-index-llms-nvidia-triton==0.1.5 failed with the error below.

ERROR: Cannot install tritonclient[all]==2.41.1, tritonclient[all]==2.42.0, tritonclient[all]==2.43.0, tritonclient[all]==2.44.0, tritonclient[all]==2.45.0, tritonclient[all]==2.46.0 and tritonclient[all]==2.47.0 because these package versions have conflicting dependencies.

dosubot[bot] commented 1 month ago

To resolve the dependency conflicts with tritonclient, you can try the following steps:

  1. Install llama-index and llama-index-core first: Ensure you have the compatible versions of llama-index and llama-index-core installed.

    pip install llama-index==0.10.0 llama-index-core==0.10.1
  2. Install tritonclient separately: Manually install a specific version of tritonclient that is known to work well with your setup. For example, you can try installing version 2.41.1:

    pip install tritonclient[all]==2.41.1
  3. Install llama-index-llms-nvidia-triton without dependencies: Use the --no-deps option to prevent pip from installing conflicting dependencies:

    pip install llama-index-llms-nvidia-triton==0.1.5 --no-deps

By following these steps, you should be able to avoid the conflicting dependencies and successfully use the NvidiaTriton integration with the llama-index package.
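
As a quick sanity check after installing (a minimal sketch; it simply verifies that the integration imports in the current environment):

    python -c "from llama_index.llms.nvidia_triton import NvidiaTriton; print('ok')"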

To continue talking to Dosu, mention @dosu.

logan-markewich commented 1 month ago

@rkoyanagi v0.9.X is not compatible with any of the integration packages. v0.9.x is from before every integration was split into packages.

You'll need to update everything to v0.10.x to use Nvidia Triton.
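
For example, a minimal upgrade sketch (exact versions will vary; pin them as needed for your environment):

    # Upgrade the core packages and the Triton integration together
    pip install -U llama-index llama-index-core llama-index-llms-nvidia-triton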

rkoyanagi commented 1 month ago

I tried with these versions and ran the proposed code, but the error below occurred. Which library version is wrong?

Code

from llama_index.llms.nvidia_triton import NvidiaTriton

# Provide the correct URL for your Triton server instance
triton_url = "localhost:8001"

# Instantiate the NvidiaTriton class
triton_client = NvidiaTriton(server_url=triton_url)

# Call the complete method with a prompt
resp = triton_client.complete("The tallest mountain in North America is ")
print(resp.text)

Version

llama-index==0.10.55
llama-index-core==0.10.55
llama-index-llms-nvidia-triton==0.1.5
grpcio-health-checking==1.64.1
grpcio-tools==1.64.1
tritonclient==2.47.0

Error

Traceback (most recent call last):
  File "/home/ubuntu/dev_weaviate/a.py", line 10, in <module>
    resp = triton_client.complete("The tallest mountain in North America is ")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/dev_weaviate/venv/lib/python3.12/site-packages/llama_index/core/instrumentation/dispatcher.py", line 230, in wrapper
    result = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/dev_weaviate/venv/lib/python3.12/site-packages/llama_index/llms/nvidia_triton/base.py", line 270, in complete
    client.stop_stream(model_params["model_name"], request_id)
  File "/home/ubuntu/dev_weaviate/venv/lib/python3.12/site-packages/llama_index/llms/nvidia_triton/utils.py", line 376, in stop_stream
    self._send_stop_signals(model_name, request_id)
  File "/home/ubuntu/dev_weaviate/venv/lib/python3.12/site-packages/llama_index/llms/nvidia_triton/utils.py", line 268, in _send_stop_signals
    self._client.async_stream_infer(
  File "/home/ubuntu/dev_weaviate/venv/lib/python3.12/site-packages/tritonclient/grpc/_client.py", line 1934, in async_stream_infer
    self._stream._enqueue_request(request)
  File "/home/ubuntu/dev_weaviate/venv/lib/python3.12/site-packages/tritonclient/grpc/_infer_stream.py", line 121, in _enqueue_request
    raise_error(
  File "/home/ubuntu/dev_weaviate/venv/lib/python3.12/site-packages/tritonclient/utils/__init__.py", line 40, in raise_error
    raise InferenceServerException(msg=msg) from None
tritonclient.utils.InferenceServerException: The stream is no longer in valid state, the error detail is reported through provided callback. A new stream should be started after stopping the current stream.
Exception ignored in: <function InferenceServerClient.__del__ at 0x7529d3d58cc0>
Traceback (most recent call last):
  File "/home/ubuntu/dev_weaviate/venv/lib/python3.12/site-packages/tritonclient/grpc/_client.py", line 257, in __del__
  File "/home/ubuntu/dev_weaviate/venv/lib/python3.12/site-packages/tritonclient/grpc/_client.py", line 265, in close
  File "/home/ubuntu/dev_weaviate/venv/lib/python3.12/site-packages/grpc/_channel.py", line 2250, in close
  File "/home/ubuntu/dev_weaviate/venv/lib/python3.12/site-packages/grpc/_channel.py", line 2231, in _close
AttributeError: 'NoneType' object has no attribute 'StatusCode'

rkoyanagi commented 1 month ago

The error below occurred in the Triton Inference Server logs. How do I stop streaming mode?

terminate called after throwing an instance of 'std::runtime_error'
  what():  Streaming is only supported if model is deployed using decoupled mode.
Signal (6) received.
 0# 0x000055AE18C9EA9D in /opt/tritonserver/bin/tritonserver
 1# 0x00007BA54A15B520 in /usr/lib/x86_64-linux-gnu/libc.so.6
 2# pthread_kill in /usr/lib/x86_64-linux-gnu/libc.so.6
 3# raise in /usr/lib/x86_64-linux-gnu/libc.so.6
 4# abort in /usr/lib/x86_64-linux-gnu/libc.so.6
 5# 0x00007BA54C076B9E in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 6# 0x00007BA54C08220C in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 7# 0x00007BA54C082277 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 8# 0x00007BA54C0824D8 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 9# 0x00007BA4D34696FA in /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so
10# 0x00007BA4D3490F9C in /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so
11# 0x00007BA4D34912FB in /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so
12# 0x00007BA4D3492BDB in /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so
13# 0x00007BA4D3496C2C in /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so
14# TRITONBACKEND_ModelInstanceExecute in /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so
15# 0x00007BA54A92BF04 in /opt/tritonserver/bin/../lib/libtritonserver.so
16# 0x00007BA54A92C26B in /opt/tritonserver/bin/../lib/libtritonserver.so
17# 0x00007BA54AA3E88D in /opt/tritonserver/bin/../lib/libtritonserver.so
18# 0x00007BA54A930694 in /opt/tritonserver/bin/../lib/libtritonserver.so
19# 0x00007BA54C0B0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
20# 0x00007BA54A1ADAC3 in /usr/lib/x86_64-linux-gnu/libc.so.6
21# clone in /usr/lib/x86_64-linux-gnu/libc.so.6

--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
I0718 01:00:40.759750 391 pb_stub.cc:1971]  Non-graceful termination detected.
I0718 01:00:40.759849 390 pb_stub.cc:1971]  Non-graceful termination detected.
I0718 01:00:40.886518 426 pb_stub.cc:1971]  Non-graceful termination detected.
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node ip-10-0-3-52 exited on signal 6 (Aborted).
--------------------------------------------------------------------------

rkoyanagi commented 1 month ago

@dosu.

dosubot[bot] commented 1 month ago

To resolve the issues with the NvidiaTriton integration in the llama-index package, specifically addressing the error related to streaming mode as indicated by the Triton Inference Server logs, you need to ensure that the stream_complete method is properly implemented. The current implementation raises a NotImplementedError, which means streaming mode is not supported.

Here is a possible implementation for the stream_complete method:

def stream_complete(
    self, prompt: str, formatted: bool = False, **kwargs: Any
) -> CompletionResponseGen:
    from tritonclient.utils import InferenceServerException

    client = self._get_client()
    invocation_params = self._get_model_default_parameters
    invocation_params.update(kwargs)
    invocation_params["prompt"] = [[prompt]]
    model_params = self._identifying_params
    model_params.update(kwargs)
    request_id = str(random.randint(1, 9999999))  # nosec

    if self.triton_load_model_call:
        client.load_model(model_params["model_name"])

    result_queue = client.request_streaming(
        model_params["model_name"], request_id, **invocation_params
    )

    def gen() -> CompletionResponseGen:
        text = ""
        for token in result_queue:
            if isinstance(token, InferenceServerException):
                client.stop_stream(model_params["model_name"], request_id)
                raise token
            text += token
            yield CompletionResponse(text=text, delta=token)

    return gen()

This implementation sets up the streaming request and yields tokens as they are received from the Triton Inference Server. This should address the streaming mode error in the Triton Inference Server logs [1].

Additionally, ensure that you have followed the setup steps for the Triton Inference Server and installed the necessary dependencies as outlined in the LlamaIndex Nvidia Triton documentation:

  1. Launch Triton Inference Server:

    docker run -ti --gpus all --network=host --shm-size=1g --ulimit memlock=-1 nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3
  2. Install dependencies:

    pip install \
      "psutil" \
      "pynvml>=11.5.0" \
      "torch==2.1.2" \
      "tensorrt_llm==0.8.0" --extra-index-url https://pypi.nvidia.com/
  3. Install Triton CLI:

    pip install git+https://github.com/triton-inference-server/triton_cli.git
  4. Generate model repository and start Triton Server:

    triton remove -m all
    triton import -m gpt2 --backend tensorrtllm
    triton start &
  5. Install tritonclient:

    pip install tritonclient[all]
  6. Install llama-index connector:

    pip install llama-index-llms-nvidia-triton

By ensuring that the Triton Inference Server is correctly set up and all dependencies are installed, you can avoid common issues and ensure smooth integration with the llama-index package [2].

To continue talking to Dosu, mention @dosu.

rkoyanagi commented 1 month ago

@dosu The proposed stream_complete method has a syntax error, and the error I am getting is not a NotImplementedError. I get the error below: tritonclient.utils.InferenceServerException: The stream is no longer in valid state, the error detail is reported through provided callback. A new stream should be started after stopping the current stream. Exception ignored in: <function InferenceServerClient.__del__ at 0x7529d3d58cc0>.

dosubot[bot] commented 1 month ago

To resolve the issue of "Streaming is only supported if the model is deployed using decoupled mode" and properly stop the streaming mode when using the NvidiaTriton integration with the llama-index package, you need to ensure that the model is deployed in decoupled mode and use the provided methods to stop the streaming.
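
As a rough sketch (assuming the TensorRT-LLM backend; this excerpt is illustrative only, not a complete model configuration), decoupled mode is enabled through the model transaction policy in the model's config.pbtxt:

    # config.pbtxt (illustrative excerpt)
    model_transaction_policy {
      decoupled: true
    }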

Here is the relevant code to stop the streaming mode:

def stop_stream(
    self, model_name: str, request_id: str, signal: bool = True
) -> None:
    """Close the streaming connection."""
    if signal:
        self._send_stop_signals(model_name, request_id)
    self._client.stop_stream()

To stop the streaming mode, you can call the stop_stream method with the appropriate parameters:

# Assuming you have an instance of GrpcTritonClient
client = GrpcTritonClient(server_url="your_triton_server_url")

# Model name and request ID used during the streaming request
model_name = "your_model_name"
request_id = "your_request_id"

# Stop the streaming
client.stop_stream(model_name, request_id)

Make sure that the model is deployed in decoupled mode on the Triton server to support streaming. Additionally, the StreamingResponseGenerator class in the llama-index package handles the streaming response and stops the stream when necessary:

class StreamingResponseGenerator(Queue):
    """A Generator that provides the inference results from an LLM."""

    def __init__(
        self, client: "GrpcTritonClient", request_id: str, force_batch: bool
    ) -> None:
        """Instantiate the generator class."""
        super().__init__()
        self._client = client
        self.request_id = request_id
        self._batch = force_batch

    def __iter__(self) -> "StreamingResponseGenerator":
        """Return self as a generator."""
        return self

    def __next__(self) -> str:
        """Return the next retrieved token."""
        val = self.get()
        if val is None or val in STOP_WORDS:
            self._stop_stream()
            raise StopIteration
        return val

    def _stop_stream(self) -> None:
        """Drain and shutdown the Triton stream."""
        self._client.stop_stream(
            "tensorrt_llm", self.request_id, signal=not self._batch
        )

This class ensures that the streaming connection is properly closed when the end of the stream is reached or a stop signal is encountered.

Additionally, the NvidiaTriton class in the llama-index package provides methods to interact with the Triton server, including streaming completions:

class NvidiaTriton(LLM):
    """Nvidia Triton LLM."""

    def __init__(self, server_url: str = DEFAULT_SERVER_URL, model: str = DEFAULT_MODEL, **kwargs: Any) -> None:
        # Initialization code
        self._client = GrpcTritonClient(server_url)

    def complete(self, prompt: str, **kwargs: Any) -> CompletionResponse:
        client = self._get_client()
        invocation_params = self._get_model_default_parameters
        invocation_params.update(kwargs)
        invocation_params["prompt"] = [[prompt]]
        model_params = self._identifying_params
        model_params.update(kwargs)
        request_id = str(random.randint(1, 9999999))  # nosec

        if self.triton_load_model_call:
            client.load_model(model_params["model_name"])

        result_queue = client.request_streaming(
            model_params["model_name"], request_id, **invocation_params
        )
        response = ""
        for token in result_queue:
            if isinstance(token, InferenceServerException):
                client.stop_stream(model_params["model_name"], request_id)
                raise token
            response += token

        return CompletionResponse(text=response)

    def stream_complete(self, prompt: str, **kwargs: Any) -> CompletionResponseGen:
        client = self._get_client()
        invocation_params = self._get_model_default_parameters
        invocation_params.update(kwargs)
        invocation_params["prompt"] = [[prompt]]
        model_params = self._identifying_params
        model_params.update(kwargs)
        request_id = str(random.randint(1, 9999999))  # nosec

        if self.triton_load_model_call:
            client.load_model(model_params["model_name"])

        result_queue = client.request_streaming(
            model_params["model_name"], request_id, **invocation_params
        )

        def gen() -> CompletionResponseGen:
            text = ""
            for token in result_queue:
                if isinstance(token, InferenceServerException):
                    client.stop_stream(model_params["model_name"], request_id)
                    raise token
                text += token
                yield CompletionResponse(text=text, delta=token)

        return gen()

This ensures that the streaming connection is properly managed and closed when necessary [1][2][3].

To continue talking to Dosu, mention @dosu.

rkoyanagi commented 1 month ago

I edited the Triton Inference Server to use decoupled mode and got it to work for the time being, but I get the exception below. Where should I make changes?

Exception ignored in: <function InferenceServerClient.__del__ at 0x72910cd24cc0>
Traceback (most recent call last):
  File "/home/ubuntu/dev_weaviate/venv/lib/python3.12/site-packages/tritonclient/grpc/_client.py", line 257, in __del__
  File "/home/ubuntu/dev_weaviate/venv/lib/python3.12/site-packages/tritonclient/grpc/_client.py", line 265, in close
  File "/home/ubuntu/dev_weaviate/venv/lib/python3.12/site-packages/grpc/_channel.py", line 2250, in close
  File "/home/ubuntu/dev_weaviate/venv/lib/python3.12/site-packages/grpc/_channel.py", line 2231, in _close
AttributeError: 'NoneType' object has no attribute 'StatusCode'

@dosu

dosubot[bot] commented 1 month ago

To resolve the AttributeError: 'NoneType' object has no attribute 'StatusCode' exception when using the Triton Inference Server in decoupled mode, ensure that the Triton server is running and accessible at the specified URL. The provided setup and code should be correctly configured as follows:

  1. Ensure the Triton server is running:

    docker run -ti --gpus all --network=host --shm-size=1g --ulimit memlock=-1 nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3
  2. Install the necessary dependencies:

    pip install \
      "psutil" \
      "pynvml>=11.5.0" \
      "torch==2.1.2" \
      "tensorrt_llm==0.8.0" --extra-index-url https://pypi.nvidia.com/
  3. Install Triton CLI:

    pip install git+https://github.com/triton-inference-server/triton_cli.git
  4. Generate the model repository and start the Triton server:

    triton remove -m all
    triton import -m gpt2 --backend tensorrtllm
    triton start &
  5. Install the tritonclient package:

    pip install tritonclient[all]
  6. Install the llama index connector:

    pip install llama-index-llms-nvidia-triton
  7. Use the following Python code to interact with the Triton server:

    from llama_index.llms.nvidia_triton import NvidiaTriton
    
    # Ensure a Triton server instance is running and provide the correct URL for your Triton server instance
    triton_url = "localhost:8001"
    model_name = "gpt2"
    
    # Instantiate the NvidiaTriton class
    triton_client = NvidiaTriton(server_url=triton_url, model_name=model_name, tokens=32)
    
    # Call the complete method with a prompt
    resp = triton_client.complete("The tallest mountain in North America is ")
    print(resp)

If the server is not running or the URL is incorrect, you may encounter the NoneType error. Ensure the server is accessible and the URL is correct. Additionally, make sure that the NvidiaTriton class is properly instantiated and the complete method is called correctly as shown in the example.
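
For a quick connectivity check, here is a minimal sketch using the tritonclient gRPC API directly (the URL and model name are assumptions; adjust them to your deployment):

    import tritonclient.grpc as grpcclient

    # Assumes the Triton gRPC endpoint is exposed on localhost:8001
    client = grpcclient.InferenceServerClient(url="localhost:8001")
    print("Server live:", client.is_server_live())
    print("Server ready:", client.is_server_ready())
    # "gpt2" is only an example model name from earlier in this thread
    print("Model ready:", client.is_model_ready("gpt2"))
    client.close()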

The NvidiaTriton class from the llama_index library includes several properties and methods to handle the Triton client connection and model invocation. Ensure that the _client attribute is properly initialized and that the complete method is correctly implemented to handle the Triton client calls. Here is a snippet from the NvidiaTriton class:

class NvidiaTriton(LLM):
    server_url: str = Field(
        default=DEFAULT_SERVER_URL,
        description="The URL of the Triton inference server to use.",
    )
    model_name: str = Field(
        default=DEFAULT_MODEL,
        description="The name of the Triton hosted model this client should use",
    )
    # Other fields...

    _client: Optional[GrpcTritonClient] = PrivateAttr()

    def __init__(self, server_url: str = DEFAULT_SERVER_URL, model: str = DEFAULT_MODEL, **kwargs: Any) -> None:
        super().__init__(server_url=server_url, model=model, **kwargs)
        try:
            self._client = GrpcTritonClient(server_url)
        except ImportError as err:
            raise ImportError(
                "Could not import triton client python package. "
                "Please install it with `pip install tritonclient`."
            ) from err

    def _get_client(self) -> Any:
        if not self.reuse_client:
            return GrpcTritonClient(self.server_url)
        if self._client is None:
            self._client = GrpcTritonClient(self.server_url)
        return self._client

    def complete(self, prompt: str, **kwargs: Any) -> CompletionResponse:
        from tritonclient.utils import InferenceServerException

        client = self._get_client()
        invocation_params = self._get_model_default_parameters
        invocation_params.update(kwargs)
        invocation_params["prompt"] = [[prompt]]
        model_params = self._identifying_params
        model_params.update(kwargs)
        request_id = str(random.randint(1, 9999999))  # nosec

        if self.triton_load_model_call:
            client.load_model(model_params["model_name"])

        result_queue = client.request_streaming(
            model_params["model_name"], request_id, **invocation_params
        )

        response = ""
        for token in result_queue:
            if isinstance(token, InferenceServerException):
                client.stop_stream(model_params["model_name"], request_id)
                raise token
            response = response + token

        return CompletionResponse(text=response)

Ensure that the GrpcTritonClient is correctly instantiated and that the complete method handles the Triton client calls properly. This should help resolve the AttributeError: 'NoneType' object has no attribute 'StatusCode' exception [1][2][3].

To continue talking to Dosu, mention @dosu.