Tagging @kaiyux @byshiue to help triage and/or add to review board, thanks!
Curious to get any feedback here. This update is also related to a performance issue I am seeing:
https://github.com/NVIDIA/TensorRT-LLM/issues/1957
This PR gets results much closer to the expected outputs, but they are still not fully in line with the Hugging Face / pre-compiled results. I would love feedback on the process for preparing the adapter weights.

This PR adds documentation for converting LoRA adapters from a Hugging Face checkpoint into a warmup that can be used with the Triton Inference Server TensorRT-LLM backend.
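For context, here is a minimal sketch of the kind of conversion the documentation describes, assuming a standard PEFT checkpoint layout and the flattened weight/config layout used by TensorRT-LLM's LoRA tooling. The `MODULE_IDS` mapping, file names, and shapes below are illustrative assumptions, not the backend's exact contract:

```python
import numpy as np
import torch

# Hypothetical module-name -> module-id mapping; the authoritative mapping
# lives in TensorRT-LLM's conversion scripts and may differ.
MODULE_IDS = {"q_proj": 0, "k_proj": 1, "v_proj": 2, "o_proj": 3}


def convert_hf_lora(adapter_path: str, out_dir: str) -> None:
    """Flatten a PEFT LoRA checkpoint into the two tensors the
    TensorRT-LLM backend expects: lora_weights and lora_config."""
    state = torch.load(f"{adapter_path}/adapter_model.bin", map_location="cpu")

    weight_rows, config_rows = [], []
    for name, tensor in state.items():
        if "lora_A" not in name:
            continue
        module = next((m for m in MODULE_IDS if m in name), None)
        if module is None:
            continue  # skip modules we have no id for in this sketch
        # Pair each lora_A ("in", r x hidden) with its lora_B ("out", hidden x r).
        lora_a = tensor.float()
        lora_b = state[name.replace("lora_A", "lora_B")].float()
        layer_idx = int(name.split(".layers.")[1].split(".")[0])
        rank = lora_a.shape[0]

        # One flattened row per (layer, module): [in weights, out weights].
        weight_rows.append(torch.cat([lora_a.flatten(), lora_b.flatten()]))
        config_rows.append([MODULE_IDS[module], layer_idx, rank])

    # Pad rows to a common length so they stack into a rectangular tensor.
    max_len = max(row.numel() for row in weight_rows)
    weights = torch.stack(
        [torch.nn.functional.pad(r, (0, max_len - r.numel())) for r in weight_rows]
    )

    # Cast to float16 before saving: the Triton Python backend round-trips
    # tensors through NumPy, which has no bfloat16 dtype (the motivation here).
    np.save(f"{out_dir}/model.lora_weights.npy",
            weights.unsqueeze(0).to(torch.float16).numpy())
    np.save(f"{out_dir}/model.lora_config.npy",
            np.asarray([config_rows], dtype=np.int32))
```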
This approach means the LoRA weights never need to be supplied by clients of the Triton Inference Server backend, and it avoids loading or passing those weights through any of the `python` backend models (e.g. `preprocessing`), which would force a conversion through NumPy datatypes (NumPy does not support `bfloat16`).
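For the warmup itself, a sketch along these lines (file names and paths are assumptions) writes the converted tensors as raw binary files into the model's `warmup/` directory. A matching `model_warmup` entry in the model's `config.pbtxt` can then point the `lora_weights`/`lora_config` inputs at those files via `input_data_file`, so the server loads the adapter once at startup instead of receiving it from clients:

```python
import os
import numpy as np

# Hypothetical repository layout; adjust to your model repo. Triton reads each
# warmup input_data_file as raw bytes from the model's warmup/ directory, so
# the tensors are written with .tobytes() rather than np.save().
model_dir = "triton_model_repo/tensorrt_llm"
warmup_dir = os.path.join(model_dir, "warmup")
os.makedirs(warmup_dir, exist_ok=True)

# Files produced by the conversion sketch above (paths are illustrative).
weights = np.load("out/model.lora_weights.npy")
config = np.load("out/model.lora_config.npy")

with open(os.path.join(warmup_dir, "raw_lora_weights"), "wb") as f:
    f.write(weights.tobytes())
with open(os.path.join(warmup_dir, "raw_lora_config"), "wb") as f:
    f.write(config.tobytes())
```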