run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Bug]: Not able to import Llama index modules in tensor rt llms docker image. #14371

Closed: rvssridatta closed this issue 1 day ago

rvssridatta commented 3 months ago

Bug Description

Even though I am following the latest LlamaIndex documentation, I am still not able to import LlamaIndex.

Issue 1
LlamaIndex documentation link: https://docs.llamaindex.ai/en/stable/examples/llm/nvidia_tensorrt/
Versions: Ubuntu 22.04, CUDA 12.2, architecture: Arm64, JetPack 6.0

Issue 2
NVIDIA Jetson containers link: https://github.com/dusty-nv/jetson-containers?tab=readme-ov-file
Versions: Ubuntu 22.04, CUDA 12.2, architecture: Arm64, JetPack 6.0

Please provide a standard solution to deploy TensorRT-LLM integrated with LlamaIndex RAG modules.
Device used: Advantech Jetson Orin NX (16 GB variant)

(1 screenshot attached)

Version

llama-index 0.10.50

Steps to Reproduce

Issue 1:
1. Followed the LlamaIndex documentation, running the provided commands step by step.
2. Got an error at the third command from the documentation: pip3 install tensorrt_llm==0.8.0 -U --extra-index-url https://pypi.nvidia.com

Issue 2:
1. Followed the dusty-nv documentation, running the provided commands step by step.
2. Got the same error shown in the relevant logs below when running: $ jetson-containers run $(autotag tensorrt-llm)

Relevant Logs/Tracebacks

Error:
Looking in indexes: https://pypi.org/simple, https://pypi.nvidia.com
Collecting tensorrt_llm==0.8.0
  Downloading tensorrt-llm-0.8.0.tar.gz (6.9 kB)
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [6 lines of output]
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-install-r7zpl9ve/tensorrt-llm_382951b6d5f34b8798d95f1967eb0620/setup.py", line 90, in <module>
          raise RuntimeError("Bad params")
      RuntimeError: Bad params
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
dosubot[bot] commented 3 months ago

To address the issues you're encountering with importing LlamaIndex modules in the TensorRT-LLM Docker image on an Advantech Jetson Orin NX (16 GB variant), follow these steps:

  1. Ensure CUDA 12.2 or Higher is Installed: Verify that your device has Nvidia CUDA 12.2 or higher installed, as it is required to run TensorRT-LLM.

  2. Install tensorrt_llm: Use the following pip command to install the tensorrt_llm package:

    pip3 install tensorrt_llm -U --extra-index-url https://pypi.nvidia.com
  3. Prepare Llama2 Model Files: Follow the instructions provided in the TensorRT-LLM GitHub repository to create the necessary model files. The following files should be generated:

    • Llama_float16_tp1_rank0.engine
    • config.json
    • model.cache
  4. Organize Model Files: Create a directory named model and move the generated files into this directory:

    mkdir model
    mv Llama_float16_tp1_rank0.engine config.json model.cache model/
  5. Install Llama Index Integration: Install the Llama Index integration for NVIDIA TensorRT:

    pip install llama-index-llms-nvidia-tensorrt
  6. Run the Example Code: Use the following Python code to initialize and run the LocalTensorRTLLM:

    from llama_index.llms.nvidia_tensorrt import LocalTensorRTLLM
    
    def completion_to_prompt(completion):
       return f"<s> [INST] {completion} [/INST] "
    
    def messages_to_prompt(messages):
       content = ""
       for message in messages:
           content += str(message) + "\n"
       return f"<s> [INST] {content} [/INST] "
    
    llm = LocalTensorRTLLM(
       model_path="./model",
       engine_name="llama_float16_tp1_rank0.engine",
       tokenizer_dir="meta-llama/Llama-2-13b-chat",
       completion_to_prompt=completion_to_prompt,
       messages_to_prompt=messages_to_prompt,
    )
    
    resp = llm.complete("Who is Paul Graham?")
    print(str(resp))
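
Since the original question asks for TensorRT-LLM combined with LlamaIndex RAG modules, here is a minimal RAG sketch that reuses the llm object from the example above. It is only an illustration, not part of the official docs: the ./data directory, the llama-index-embeddings-huggingface package, and the BAAI/bge-small-en-v1.5 embedding model are assumptions you can swap for your own setup.

from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Route all LlamaIndex calls through the TensorRT-LLM model configured above
# and use a local HuggingFace embedding model (example choice) for retrieval.
Settings.llm = llm
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Index the documents in ./data (assumed to exist) and run a query.
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine()
print(query_engine.query("What do these documents say about Paul Graham?"))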

Troubleshooting RuntimeError: Bad params

If you encounter the RuntimeError: Bad params error during the installation of tensorrt_llm, follow these steps:

  1. Ensure Correct Installation: Reinstall tensorrt_llm:

    pip3 install tensorrt_llm -U --extra-index-url https://pypi.nvidia.com
  2. Check CUDA Version: Ensure that you have Nvidia CUDA 12.2 or higher installed.

  3. Verify Model Path: Ensure that the model_path provided exists and contains the necessary files (Llama_float16_tp1_rank0.engine, config.json, model.cache); see the sanity-check sketch after this list.

  4. Check Configuration File: Ensure that the config.json file in the model directory is correctly formatted and contains all necessary configurations.

  5. Validate Tensor Parallel and Pipeline Parallel Sizes: Ensure that the tensor_parallel and pipeline_parallel sizes in the config.json match the runtime sizes.

  6. Check GPU Availability: Ensure that a CUDA-enabled GPU is available and set the correct device.

  7. Verify Tokenizer Directory: Ensure that the tokenizer_dir provided is correct and contains the necessary tokenizer files.

  8. Check Sampling Configuration: Ensure that the sampling configuration parameters are correctly set.
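
For items 3 through 6 above, a quick sanity-check script can save time. This is only a sketch: the file names follow the engine built in the earlier steps, the parallel-size key names inside config.json vary between TensorRT-LLM versions, and torch is assumed to be installed.

import json
import os

import torch

model_dir = "./model"

# Item 3: confirm the expected files exist in the model directory.
for name in ("llama_float16_tp1_rank0.engine", "config.json", "model.cache"):
    path = os.path.join(model_dir, name)
    print(f"{name}: {'found' if os.path.exists(path) else 'MISSING'}")

# Item 4: confirm config.json parses; json.load raises if it is malformed.
with open(os.path.join(model_dir, "config.json")) as f:
    config = json.load(f)
print("config.json top-level keys:", sorted(config))
# Look here for the tensor/pipeline parallel sizes (item 5); the exact key
# names depend on the TensorRT-LLM version that built the engine.

# Item 6: confirm a CUDA-enabled GPU is visible.
print("CUDA available:", torch.cuda.is_available())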

Enable Debug Mode

To get more insights into the RuntimeError: Bad params error, enable the debug_mode when creating the GenerationSession object:

# Passing debug_mode=True enables additional debug output from the runtime.
decoder = tensorrt_llm.runtime.GenerationSession(
    self._model_config, engine_buffer, runtime_mapping, debug_mode=True
)
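
Note that the GenerationSession call above uses self._model_config, which indicates it is made inside the integration's own code rather than in user code, so enabling debug_mode means editing or patching that call site. If you just want more runtime logging first, TensorRT-LLM ships a logger whose verbosity can be raised; treat the exact helper below as an assumption and check it against the tensorrt_llm version you have installed.

# Raise TensorRT-LLM's log level before constructing the LLM; the logger
# helper is assumed to be available in your tensorrt_llm build.
from tensorrt_llm.logger import logger

logger.set_level("verbose")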

By following these steps, you should be able to deploy TensorRT-LLM integrated with LlamaIndex RAG modules on your Advantech Jetson Orin NX (16 GB variant) and troubleshoot any issues that arise [1][2].

To continue talking to Dosu, mention @dosu.