noamgat / lm-format-enforcer

Enforce the output format (JSON Schema, Regex etc) of a language model
MIT License

Add NVIDIA TRT-LLM integration #68

Closed aerdem4 closed 7 months ago

aerdem4 commented 8 months ago

Added TRT-LLM support and fixed the long freezes that occurred when tokenizers contain very long tokens.
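For context on the freeze the PR mentions: a format enforcer must decide, at every decoding step, which vocabulary tokens are legal next, which typically means walking each candidate token character by character against the parser state. A single pathologically long token makes that walk expensive on every step. The sketch below is purely illustrative pure Python (not the library's actual code; the cap-and-skip strategy and all names are assumptions for illustration):

```python
# Illustrative sketch, NOT lm-format-enforcer's implementation:
# a character-level acceptor decides which vocabulary tokens may come next.

def allowed_tokens(vocab, accepts_char, max_token_len=16):
    """Return indices of tokens whose every character is currently legal.

    Capping the scan at max_token_len characters illustrates why very
    long tokens are a problem: without some cutoff or caching, one huge
    token forces a long character-by-character walk on every step.
    """
    allowed = []
    for idx, token in enumerate(vocab):
        if len(token) > max_token_len:
            continue  # skip pathologically long tokens instead of scanning them
        if all(accepts_char(ch) for ch in token):
            allowed.append(idx)
    return allowed

# Example: only digits are legal next (as when a JSON schema expects a number).
vocab = ["1", "23", "a", "456", "x" * 10_000]
print(allowed_tokens(vocab, str.isdigit))  # → [0, 1, 3]
```

Whether the actual fix caps, caches, or prunes differently is internal to the PR; the sketch only shows where the per-step cost comes from.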

noamgat commented 8 months ago

Wow, this is a great contribution, thanks! Can you adapt one of the sample notebooks to use trt-llm as the LLM API, so we have a sample on how to use it and an easy way to test support when trt-llm updates are released?

aerdem4 commented 7 months ago

Thank you! Just added a sample notebook to the PR. Please let me know if it is a good example.

noamgat commented 7 months ago

I merged it into a side branch so I can easily make changes before I approve. My goal is to have a Colab-ready notebook so people can try it out before downloading. Is this even possible with trt-llm? After installing the package in a Colab notebook with `!pip install --no-cache-dir --extra-index-url https://pypi.nvidia.com tensorrt-llm`, I was able to import the library, but only the Module class existed; the LLM class did not. From the Triton README it looks like this can only run in NVIDIA containers. If so, can you please provide instructions on how I can test this before approving, and what the easiest way is for users to test it?

aerdem4 commented 7 months ago

Thanks! You can use `--pre` to get the latest version: `!pip install tensorrt_llm --pre --extra-index-url https://pypi.nvidia.com --extra-index-url https://download.pytorch.org/whl/cu122`

I have tested it on my local machine using the Kaggle GPU Docker image, so an NVIDIA container is not a must. If you want to use the Kaggle Docker image and replicate my notebook, you may need these before installing tensorrt_llm:

!sudo apt-get update --allow-releaseinfo-change
!apt-get update && apt-get -y install openmpi-bin libopenmpi-dev
!pip install torch torchvision -U
noamgat commented 7 months ago

OK, I'm making some progress. I was able to get [TensorRT-LLM] TensorRT-LLM version: 0.9.0.dev2024020600 working with the LLM class in free Colab. Will report if I am able to load a model and actually run something. I saw there are some tensorrt-llm models ready on the HF Hub, but I think there's a version mismatch with what they were built with.

noamgat commented 7 months ago

colab_trtllm_integration.zip This is where I'm at right now. If you want to try to continue getting it working on free Google Colab, I'll be very happy! I'll try to continue as well. The reason I insist on having a free Colab notebook for each integration is that it lets potential users try it out instantly in their browser, and makes it much easier for me to catch regressions over time.

aerdem4 commented 7 months ago

Unfortunately, it doesn't work on Colab due to RAM limitations. The build process uses around 20 GB, and Colab has only 12 GB. I then switched to Kaggle Notebooks, which have 29 GB of RAM, but hit an error on GPU RAM, which is around 15 GB. The notebook works on my local machine with 64 GB RAM and an NVIDIA RTX 3090.

noamgat commented 7 months ago

Maybe a lighter model will work? I'm not sure which ones are compatible with TensorRT-LLM.


aerdem4 commented 7 months ago

I am using the High-Level API (HLAPI) of TRT-LLM, and unfortunately HLAPI only supports Llama models so far. It is in beta, and I believe it will support other models and quantization in the future. For now, it seems we can only demo with the smallest Llama model, which is probably 7B, right?

noamgat commented 7 months ago

OK, I see that the LLM class you used is new in 0.9.1, but 0.7.1 (the current stable version) supports logits processing via the ModelRunner class. Any chance of modifying the notebook to use that, so it runs on the current stable version?
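For reference, logits-processing hooks across backends generally share the same shape: a callable that receives the next-token logits and masks disallowed entries to negative infinity, which is exactly the hook a format enforcer plugs into. The sketch below is a generic, version-agnostic illustration in pure Python; the `(step, input_ids, logits)` signature and all names are assumptions, not the actual TRT-LLM ModelRunner API, whose exact callback signature varies between releases:

```python
import math

def make_mask_processor(allowed_ids):
    """Build a logits processor that bans every token not in allowed_ids.

    The (step, input_ids, logits) signature is illustrative; adapt it to
    whatever logits-processing hook the backend actually exposes.
    """
    allowed = set(allowed_ids)

    def processor(step, input_ids, logits):
        # Mask disallowed tokens to -inf so they can never be sampled.
        return [v if i in allowed else -math.inf
                for i, v in enumerate(logits)]

    return processor

# Example: only tokens 0 and 2 are legal at this step.
proc = make_mask_processor({0, 2})
print(proc(0, [], [1.0, 2.0, 3.0]))  # → [1.0, -inf, 3.0]
```

In a real integration the `allowed_ids` set would be recomputed each step from the parser state, rather than fixed up front.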

noamgat commented 7 months ago

I got the demo working on my computer, modified it in a way that it should work on colab pro. Merged into the main branch and it will be part of the next release in a few days. Thank you so much for the contribution!

aerdem4 commented 7 months ago

Thank you very much! I hope it will be useful to the community.

diandianliu commented 4 months ago

> I am using High Level API of TRT-LLM and unfortunately HLAPI only supports Llama models so far. It is in beta stage and I believe it will support the other models and quantization in the future. But for now, it seems we can only demo with the smallest Llama model, which is probably 7B, right?

Hi, does TRT-LLM inference support Qwen models? @aerdem4