triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend
Apache License 2.0

Example of LoRa weights #399

Open TheCodeWrangler opened 5 months ago

TheCodeWrangler commented 5 months ago

I would like to send LoRA weights through to a compiled TensorRT-LLM model, but I am unsure how to load the .bin weights and pass them to Triton. An example of loading them and passing the weights in would be very helpful.

byshiue commented 5 months ago

Here is an example: https://github.com/triton-inference-server/tensorrtllm_backend/tree/main/inflight_batcher_llm#running-lora-inference-with-inflight-batching

TheCodeWrangler commented 5 months ago

Thank you for pointing me to this! Things this helped clear up (which may help someone in the future):

Starting with .safetensors from Hugging Face, you first need to convert them to a .bin adapter:

import torch
from safetensors.torch import load_file

# Load the safetensors adapter and re-save it as a PyTorch .bin checkpoint
torch.save(load_file("adapter_model.safetensors"), "adapter_model.bin")

Then you need to convert that into .npy format using examples/hf_lora_convert.py.
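For reference, here is a minimal sketch of what the converted artifacts are assumed to look like before they are attached to a Triton request. The input names (`lora_weights`, `lora_config`) come from the inflight-batching LoRA README linked above; the concrete dimensions and the `[module_id, layer_idx, adapter_size]` config-row layout are assumptions, so check them against your TensorRT-LLM version.

```python
import numpy as np

# Illustrative dimensions (assumptions, not taken from this thread):
# two adapted modules (e.g. attn_q and attn_v) across 32 layers.
num_adapted_modules = 2 * 32
max_lora_dim = 4096  # assumed upper bound on flattened LoRA weights per module

# hf_lora_convert.py is expected to emit two .npy arrays shaped like these:
lora_weights = np.zeros((1, num_adapted_modules, max_lora_dim), dtype=np.float16)
# Each config row is assumed to be [module_id, layer_idx, adapter_size].
lora_config = np.zeros((1, num_adapted_modules, 3), dtype=np.int32)

print(lora_weights.shape, lora_weights.dtype)  # (1, 64, 4096) float16
print(lora_config.shape, lora_config.dtype)    # (1, 64, 3) int32
```

With a Triton client, arrays like these would then be set as the `lora_weights` and `lora_config` request inputs (alongside a `lora_task_id`, which lets the server cache the adapter for later requests), as described in the linked README.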