triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Implementing whole RAG pipeline #7088

Closed sourabh-patil closed 1 month ago

sourabh-patil commented 5 months ago

Hi team,

I wish to move my whole RAG pipeline onto the Triton server. I suppose we can use the Python backend to run multiple models, but I have a few queries. Can I use a local database like ChromaDB or a remote database like Qdrant from the Python backend? I also want to use vLLM to accelerate the inference speed of my Llama 2 model. I know there is a vLLM backend for this, but can we do this in the Python backend? Any suggestions here? Thanks!

P.S. It would be great if there are any resources to follow.

Tabrizian commented 5 months ago

Can I use a local database like ChromaDB or a remote database like Qdrant from the Python backend?

You can use a remote or local database depending on what is suitable for your needs. Python backend doesn't have restrictions on this front.
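
Not from this thread, just a minimal sketch of what that could look like with a local ChromaDB instance inside a Python-backend model; the on-disk path, collection name, and tensor names are made up, and the calls assume a recent chromadb client API:

import chromadb
import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # hypothetical on-disk DB path and collection name
        self.client = chromadb.PersistentClient(path="/models/retriever/1/chroma")
        self.collection = self.client.get_or_create_collection("docs")

    def execute(self, requests):
        responses = []
        for request in requests:
            query = pb_utils.get_input_tensor_by_name(
                request, "text_input").as_numpy()[0]
            if isinstance(query, bytes):
                query = query.decode("utf-8")

            # retrieve the top-3 most similar documents and join them
            hits = self.collection.query(query_texts=[query], n_results=3)
            context = "\n".join(hits["documents"][0])

            out = pb_utils.Tensor(
                "text_output", np.array([context], dtype=np.object_))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses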

I know there is a vLLM backend for this, but can we do this in the Python backend?

Yes, vllm_backend is itself a Python-based backend. You can basically copy the contents of its model.py into your own Python model and you should get the same functionality.
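
For a rough idea only (this is not the vllm_backend's actual model.py, which drives vLLM's async engine and streams decoupled responses): a stripped-down sketch of a Python-backend model wrapping vLLM's synchronous offline API; the weights path and tensor names are assumptions.

import numpy as np
import triton_python_backend_utils as pb_utils
from vllm import LLM, SamplingParams


class TritonPythonModel:
    def initialize(self, args):
        # hypothetical local path to the Llama 2 weights
        self.llm = LLM(model="/models/llama2/1/weights")
        self.sampling = SamplingParams(temperature=0.7, max_tokens=256)

    def execute(self, requests):
        responses = []
        for request in requests:
            prompt = pb_utils.get_input_tensor_by_name(
                request, "text_input").as_numpy()[0]
            if isinstance(prompt, bytes):
                prompt = prompt.decode("utf-8")

            # blocking generation; the real vllm_backend streams results instead
            outputs = self.llm.generate([prompt], self.sampling)
            text = outputs[0].outputs[0].text

            out = pb_utils.Tensor(
                "text_output", np.array([text], dtype=np.object_))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses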

Please let us know if you run into any issues when creating RAG pipelines in Triton.

(cc @oandreeva-nv for viz)

oandreeva-nv commented 5 months ago

I agree with @Tabrizian. One potential solution would be using BLS, where you first query your DB for relevant content, then enhance the prompt with the context you just retrieved, and then send the composed prompt to a vLLM model (deployed through the vLLM backend or through a Python backend).

Alternatively, you can deploy a single, more complex Python model that does all of those things. In this case, make sure that the execute function performs all the necessary steps of the RAG process; a rough sketch of that option follows below.
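
Purely as an illustration of that second option (none of this code is from the issue): an execute that retrieves context, composes the prompt, and then makes a BLS call to a generation model. The model name "vllm_model", the tensor names, and the _retrieve stub are placeholders.

import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def _retrieve(self, question):
        # placeholder: query your vector DB (ChromaDB, Qdrant, marqo, ...) here
        return "retrieved context goes here"

    def execute(self, requests):
        responses = []
        for request in requests:
            question = pb_utils.get_input_tensor_by_name(
                request, "text_input").as_numpy()[0]
            if isinstance(question, bytes):
                question = question.decode("utf-8")

            # 1. retrieval step
            context = self._retrieve(question)

            # 2. prompt composition
            prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

            # 3. BLS call to the generation model (assumed to be named "vllm_model")
            llm_request = pb_utils.InferenceRequest(
                model_name="vllm_model",
                requested_output_names=["text_output"],
                inputs=[pb_utils.Tensor(
                    "text_input", np.array([prompt], dtype=np.object_))])
            llm_response = llm_request.exec()
            if llm_response.has_error():
                raise pb_utils.TritonModelException(llm_response.error().message())

            answer = pb_utils.get_output_tensor_by_name(
                llm_response, "text_output").as_numpy()[0]
            responses.append(pb_utils.InferenceResponse(output_tensors=[
                pb_utils.Tensor("text_output", np.array([answer], dtype=np.object_))]))
        return responses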

sourabh-patil commented 5 months ago

Sure! Thanks for the inputs, @Tabrizian @oandreeva-nv. I will get back to you if I run into any issues.

MatteoPagliani commented 2 months ago

Hi @Tabrizian and @oandreeva-nv. I am replying to this issue instead of creating another one since my questions are related to the same exact topic.

My colleagues and I have been working on the deployment of an LLM-based application leveraging NVIDIA Triton Inference Server and TensorRT-LLM backend. Specifically, we are trying to integrate RAG capabilities and guardrails in the system.

I searched for open-source projects implementing this use case in order to take inspiration for our needs, but I really struggled to find any examples of integrating RAG and/or guardrails into a Triton deployment pipeline.

If I understood correctly, @oandreeva-nv suggested implementing the RAG logic using Business Logic Scripting (BLS) and embedding it into the tensorrt_llm_bls model.py script that gets called by tritonserver.exe at inference time.

I think this approach is somehow similar to what has been implemented here with Redis (the use-case is different though).

Would you still recommend this approach to me?

The issue that I think we are going to face with this approach is that we are going to add new packages inside the tritonserver Docker container (e.g., nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3 in our case) and it might not be trivial to manage all the requirements and dependencies. For example, assuming that I need llama-index package to develop our RAG logic, if I trivially run pip install llama-index I don't have the guarantee that the versions of the packages required by llama-index match the ones already pre-installed in the Docker container. Is there a smart way of dealing with this problem? Or do you think it is better to design a different, more modular architecture where each component of the system (e.g., RAG, guardrails, etc.) lives in an independent Docker container such that the tritonserver container is not touched? In this case we would need to find an effective way to let the different containers communicate with each other.

Do you have some suggestions about the architecture needed to build a system like the one mentioned above?

Then, if you know about similar open-source applications that are already in-place, can you kindly point me towards them such that I can have a look? I am sure I am not the first one to integrate RAG into Triton so I am curious about what the community has been up to.

Thanks for your time.

oandreeva-nv commented 2 months ago

Hi @MatteoPagliani,

BLS is one way of doing it; ensembles also work. The idea is that one of your models (or this part can also be implemented in the main BLS model) should be a DB retriever. In that case, in the initialize function you should define all parameters that will not change at runtime, i.e. set up your DB client and, if required, the index name. For example, in one of my experiments I'm using marqo DB, and here's my initialize function:

    # assumes module-level imports: json, numpy as np, marqo as mq,
    # and triton_python_backend_utils as pb_utils
    def initialize(self, args):
        self.args = args
        # name of the pre-built index in the local marqo instance
        self.index_name = 'marqo-simplewiki-demo-all'
        self.client = mq.Client()
        # read the model config to find the dtype of the "text_output" tensor
        model_config = json.loads(args['model_config'])
        output0_config = pb_utils.get_output_config_by_name(
            model_config, "text_output")
        self.output0_dtype = pb_utils.triton_string_to_numpy(
            output0_config['data_type'])

My DB runs locally, so the client is initialized with the default localhost URL.

Then, in the execute function, I query the DB with the received prompt to extract the most relevant content:

    def execute(self, requests):
        responses = []
        for request in requests:
            # the incoming prompt arrives as a BYTES tensor
            prompt = pb_utils.get_input_tensor_by_name(
                request, "text_input"
            ).as_numpy()[0]
            if isinstance(prompt, bytes):
                prompt = prompt.decode("utf-8")

            # search the index and return the content of the top hit
            results = self.client.index(self.index_name).search(prompt)
            out_tensor_0 = pb_utils.Tensor(
                "text_output",
                np.array(results['hits'][0]['content']).astype(self.output0_dtype))
            inference_response = pb_utils.InferenceResponse(
                output_tensors=[out_tensor_0])

            responses.append(inference_response)

        return responses

And the top-level BLS model sends a request to this content_retrieval model like this:

    def execute(self, requests):
        responses = []
        for request in requests:
            prompt = pb_utils.get_input_tensor_by_name(
                request, "text_input"
            )
            # BLS call to the content_retrieval model shown above
            infer_request = pb_utils.InferenceRequest(
                model_name="content_retrieval",
                requested_output_names=["text_output"],
                inputs=[prompt]
            )

            infer_response = infer_request.exec()

            context = pb_utils.get_output_tensor_by_name(
                infer_response, "text_output").as_numpy().item()
            ...

Note, this is just a reference implementation; feel free to adapt it however you like.

As far as dependencies go, unfortunately it is a case-by-case issue. That being said, could you describe in more detail how you plan to use llama-index? If I remember correctly, it belongs to the client side of the setup, or would you like to use it to compose your entire pipeline? Additionally, it should have Triton connectors, although I'm not sure of their reliability.

MatteoPagliani commented 2 months ago

Thanks @oandreeva-nv for the insights. Your example has definitely shed some light on the approach to follow.

Just adding a couple of notes on our setup. In our use-case the documents for RAG will not be provided by the end users. Rather, we will collect the documents in advance, create and host the vectorDB locally server-side and then we want to be able to query it at inference time. I guess this is somehow similar to your example.

Looking at your example, I realized we might get rid of llama-index; we just need a client that interacts with the vector DB server. Does this server run in another container in the example you provided?

Coming to the requirements issue, I gather that the only viable approach is to run pip install package_name and hope that the dependencies don't break the installation. I tried to figure out whether pip-tools can solve the problem, but I haven't come to a solution yet. The idea would be to collect all the Python packages pre-installed in the tritonserver container together with their versions, add these to a requirements.in file with all the versions pinned, then add the new packages (e.g., marqo) to the requirements, and then run pip-compile on this file to get a requirements.txt file containing versions of the new packages whose dependencies match the versions of the already installed packages. Unfortunately, I am stuck because many pre-installed packages are not found by pip-compile on PyPI, so the compilation stops with an error. If a reasonable approach to installing new packages in the tritonserver container comes to your mind, please let me know. At the moment we are basically crossing our fingers and hoping nothing breaks, and if we get errors we have to manually search for a version of the package that does not break the installation (quite painful).

Moreover, I am curious about how you are debugging the model.py script above. I haven't found a way to debug the calls to execute since the calls come from tritonserver.exe. There are related issues about this: here, here and here. Can you describe your approach to debugging?

Really appreciate your help.

MatteoPagliani commented 1 month ago

Hi again @oandreeva-nv. I'm going to add a couple of questions to the above ones since I think they are related to the same use case.

Considering the deployment of a TensorRT-LLM model with in-flight batching, we should modify the count field of the instance_group block in this config.pbtxt file such that ${bls_instance_count} is equal to max_batch_size. Do we also need to do this in the config.pbtxt files related to the preprocessing, tensorrt_llm, and postprocessing models? If not, can you explain why?

Another question related to instance_group: should we keep the KIND_CPU value for the kind field in the config.pbtxt files of preprocessing, tensorrt_llm, tensorrt_llm_bls, and postprocessing? If we deploy a TensorRT-LLM engine, it makes sense to me to change KIND_CPU to KIND_GPU in the tensorrt_llm config.pbtxt file, but I am not really sure this is right.

Looking forward to your reply. Thanks.

oandreeva-nv commented 1 month ago

Apologies for the late reply.

Looking at your example, I realized we might get rid of llama-index; we just need a client that interacts with the vector DB server. Does this server run in another container in the example you provided?

In my local setup, the vector DB service was started in a separate Docker container.

Regarding dependencies: to check which dependencies Triton installs, you can call build.py with the --dryrun flag; please refer here. Please also note that backends come with their own dependencies, but --dryrun should catch them as well.

Do we also need to do this in the config.pbtxt files related to the preprocessing, tensorrt_llm, and postprocessing models?

Yes, we have a tutorial that helps you set up the minimal set of parameters needed for the trtllm backend. Unfortunately, in one of its latest releases TRT-LLM also made the backend field customizable, so make sure you fill in that field as well via: python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm

If we deploy a TensorRT-LLM engine, it makes sense to me to change KIND_CPU to KIND_GPU in the tensorrt_llm config.pbtxt file, but I am not really sure this is right.

The TensorRT-LLM backend only handles KIND_CPU, and the backend will just utilize the GPU for you. You would have to modify CUDA_VISIBLE_DEVICES if you want to use specific GPUs. preprocessing and postprocessing are deployed via the Python backend AFAIK, so if you want to deploy them on GPUs, I would recommend trying to change KIND_CPU to KIND_GPU and seeing if you get any boost in performance.

MatteoPagliani commented 1 month ago

Regarding dependencies: to check which dependencies Triton installs, you can call build.py with the --dryrun flag; please refer here. Please also note that backends come with their own dependencies, but --dryrun should catch them as well.

Great, I'll have a look!

Yes, we have a tutorial that helps you set up the minimal set of parameters needed for the trtllm backend.

In this tutorial, INSTANCE_COUNT is set to 1 while MAX_BATCH_SIZE is 4. I think they did not set INSTANCE_COUNT = MAX_BATCH_SIZE because they assume the ensemble model is used (this is confirmed by the endpoint localhost:8000/v2/models/ensemble/generate). In fact, as far as I understood, when using the BLS model instead of the ensemble, we should set the number of model instances in the config.pbtxt of tensorrt_llm_bls equal to the maximum batch size supported by the TensorRT-LLM engine, to allow concurrent request execution with in-flight batching. Just to be sure, can you confirm we also need to do this in the config.pbtxt files related to the preprocessing, tensorrt_llm, and postprocessing models when using BLS instead of the ensemble? Sorry to repeat myself; I want to avoid misunderstandings on my side.

Moreover, I am curious about how you are debugging the model.py script above. I haven't found a way to debug the calls to execute since the calls come from tritonserver.exe. There are related issues about this: here, here and here. Can you describe your approach to debugging?

Finally, any guidelines about this?

Once again, thanks a lot for the support @oandreeva-nv

oandreeva-nv commented 1 month ago

In fact, as far as I understood, when using the BLS model instead of the ensemble, we should set the number of model instances in the config.pbtxt of tensorrt_llm_bls equal to the maximum batch size supported by the TensorRT-LLM engine, to allow concurrent request execution with in-flight batching.

Correct.

can you confirm we also need to do this in the config.pbtxt files related to the preprocessing, tensorrt_llm, and postprocessing models when using BLS instead of the ensemble?

tensorrt_llm - No. preprocessing, postprocessing - Not necessary. You can experiment with your workload and see whether count = 1 is enough to reach the max batch size with in-flight batching. To do this, use launch_triton_server.py with the --log and --log-file parameters and monitor the "Active Request Count" number.

Finally, any guidelines about this?

I would recommend starting your server with --log-verbose 1; this way, if something doesn't work, you'll see the logs printed. Additionally, try to use logging to your advantage, e.g. from inside your Python models (a small sketch follows below).
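
For instance, a minimal sketch assuming the Python backend's pb_utils.Logger API, which writes into the same server log (the message contents are made up):

import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        logger = pb_utils.Logger
        logger.log_info(f"Received {len(requests)} request(s)")
        # log_verbose messages only show up when the server runs with --log-verbose 1
        logger.log_verbose("Entering the retrieval step")
        responses = []
        # ... build responses as usual ...
        return responses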

MatteoPagliani commented 1 month ago

@oandreeva-nv I will do all the tests you suggested.

May I ask how you configured the config.pbtxt file of the content_retrieval model in your experiment with marqo DB? I would structure it in the following way:

name: "content_retrieval"
backend: "python"
max_batch_size: 1

input [
    {
        name: "text_input"
        data_type: TYPE_STRING
        dims: [ -1 ]
    }
]

output [
    {
        name: "text_output"
        data_type: TYPE_STRING
        dims: [ -1 ]
    }
]

instance_group [
    {
        count: 1
        kind: KIND_CPU
    }
]

Thanks.

oandreeva-nv commented 1 month ago

The above config looks good to me. In my case I simplified it to:

instance_group [
    {
        count: 1
        kind: KIND_CPU
    }
]

And put the max_batch_size, input, and output definitions into auto_complete_config. The nobatch_model.py and batch_model.py model files are heavily commented with explanations of how to use the set_max_batch_size, add_input, and add_output functions to set the max_batch_size, input, and output properties of the model.
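
For reference, a minimal sketch of that pattern using the tensor names and shapes from the config above (this is just the documented set_max_batch_size / add_input / add_output usage, not the exact code from those files):

class TritonPythonModel:
    @staticmethod
    def auto_complete_config(auto_complete_model_config):
        # mirror the config.pbtxt above: one string input, one string output
        auto_complete_model_config.set_max_batch_size(1)
        auto_complete_model_config.add_input(
            {"name": "text_input", "data_type": "TYPE_STRING", "dims": [-1]})
        auto_complete_model_config.add_output(
            {"name": "text_output", "data_type": "TYPE_STRING", "dims": [-1]})
        return auto_complete_model_config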

The model's name can be inferred from the directory name (in case your model repository is local), and Triton can infer the Python backend as well, based on the .py extension of the model file.

MatteoPagliani commented 1 month ago

Thank you so much @oandreeva-nv for the support, really appreciated. With your suggestions I was able to reproduce your workflow. I think you can close this. If new questions/issues arise I will comment again.

oandreeva-nv commented 1 month ago

Happy to hear, @MatteoPagliani! Yes, feel free to ask questions here or open a new issue.