Open TheCodeWrangler opened 2 months ago
I am hoping to understand how to perform warmup within the triton-inference-server framework with the LoRA weights.
It would be nice for my client code to not need to be aware of the adapter weights and only need to know the indices.
I have been able to get warmup to load within triton-inference-server and initialize my weights, but because my model outputs are degraded, I suspect there is an issue with the LoRA weights after conversion.
I converted my LoRA weights using the hf_lora_convert.py script.
I am also saving with "npy=False" (https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/hf_lora_convert.py#L41) to get a binary file, which I use for warmup on my LoRA weights.
I have tried both the "model.lora_weights.bin" and the "model.lora_weights.npy" files as input to my warmup. I am using bfloat16 weights so I also had to update the datatype of the lora_weights input to the model.
The model gets through warmup, and I can now call it for inference by passing only the lora_task_id, but I am concerned that my adapter weights are not being set correctly (in particular, I am worried about datatype conversions).
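One cheap sanity check for the datatype concern is to compare the size of each warmup file with the size implied by its `data_type` and `dims`. The sketch below assumes that Triton reads `input_data_file` as raw row-major bytes with no header; under that assumption a `.npy` file, which carries a header before the data, would not match. The helper names `expected_nbytes` and `check_raw_file` are made up for illustration.

```python
import math
import os

# Bytes per element for the Triton data types that appear in the warmup config.
ELEMENT_SIZE = {
    "TYPE_BF16": 2,
    "TYPE_INT32": 4,
    "TYPE_UINT32": 4,
    "TYPE_UINT64": 8,
}


def expected_nbytes(data_type, dims):
    """Size in bytes of a raw tensor with the given type and dims."""
    return math.prod(dims) * ELEMENT_SIZE[data_type]


def check_raw_file(path, data_type, dims):
    """Return (matches, actual_size, expected_size) for a warmup data file."""
    actual = os.path.getsize(path)
    expected = expected_nbytes(data_type, dims)
    return actual == expected, actual, expected


if __name__ == "__main__":
    # Values taken from the lora_weights and lora_config entries in this issue.
    print(expected_nbytes("TYPE_BF16", [224, 589824]))
    print(expected_nbytes("TYPE_INT32", [224, 3]))
```

For the config in this issue, a raw bfloat16 `lora_weights` file would be 264241152 bytes and a raw int32 `lora_config` file would be 2688 bytes. A mismatch points at a wrong dtype, wrong dims, or a file that is not raw binary.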
I would appreciate any guidance
My warmup config for the tensorrt_llm model in Triton is as follows:
...
model_warmup [
  {
    name: "lora_0_warmup"
    batch_size: 1
    inputs: {
      key: "input_ids"
      value: {
        data_type: TYPE_INT32
        dims: [ 4 ]
        input_data_file: "raw_input_ids"
      }
    }
    inputs: {
      key: "input_lengths"
      value: {
        data_type: TYPE_INT32
        dims: [ 1 ]
        input_data_file: "raw_input_lengths"
      }
    }
    inputs: {
      key: "request_output_len"
      value: {
        data_type: TYPE_UINT32
        dims: [ 1 ]
        input_data_file: "raw_request_output_len"
      }
    }
    inputs: {
      key: "beam_width"
      value: {
        data_type: TYPE_UINT32
        dims: [ 1 ]
        input_data_file: "raw_beam_width"
      }
    }
    inputs: {
      key: "lora_task_id"
      value: {
        data_type: TYPE_UINT64
        dims: [ 1 ]
        input_data_file: "raw_lora_task_id_0"
      }
    }
    inputs: {
      key: "lora_weights"
      value: {
        data_type: TYPE_BF16
        dims: [ 224, 589824 ]
        input_data_file: "model.lora_weights.bin"
      }
    }
    inputs: {
      key: "lora_config"
      value: {
        data_type: TYPE_INT32
        dims: [ 224, 3 ]
        input_data_file: "model.lora_config.npy"
      }
    }
  }
]
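For anyone reproducing this, the small `raw_*` files referenced in the config can be produced with a few lines of Python. This is a minimal sketch, assuming Triton expects little-endian raw bytes and looks for the files in a `warmup` subdirectory of the model directory. The token ids, output length, beam width, and task id below are placeholder values, not the ones used in this issue, and `write_warmup_inputs` is a hypothetical helper.

```python
import os
import struct


def write_raw(path, fmt, values):
    """Write values as little-endian raw binary using a struct format char."""
    with open(path, "wb") as f:
        f.write(struct.pack("<" + fmt * len(values), *values))


def write_warmup_inputs(warmup_dir):
    """Create the scalar/vector warmup inputs named in the config."""
    os.makedirs(warmup_dir, exist_ok=True)

    def out(name):
        return os.path.join(warmup_dir, name)

    # "i" = int32, "I" = uint32, "Q" = uint64, matching the config data types.
    write_raw(out("raw_input_ids"), "i", [1, 2, 3, 4])   # placeholder token ids
    write_raw(out("raw_input_lengths"), "i", [4])        # length of input_ids
    write_raw(out("raw_request_output_len"), "I", [1])   # placeholder value
    write_raw(out("raw_beam_width"), "I", [1])           # placeholder value
    write_raw(out("raw_lora_task_id_0"), "Q", [0])       # placeholder task id


if __name__ == "__main__":
    # Hypothetical model directory layout; adjust to your repository.
    write_warmup_inputs(os.path.join("tensorrt_llm", "warmup"))
```

The format characters are chosen so that each file's size equals the product of its `dims` times the element size of its `data_type`.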
To perform inference with a specific LoRA for the first time, lora_task_id, lora_weights, and lora_config must all be given. The LoRA will be cached, so that subsequent requests for the same task only require lora_task_id. If the cache is full, the oldest LoRA will be evicted to make space for new ones. An error is returned if lora_task_id is not cached.
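The caching behavior described above can be illustrated with a toy model. This is not the backend's implementation, only a sketch of the stated rules; `ToyLoraCache` is a made-up class, and "oldest" is interpreted here as oldest by insertion order, which is an assumption.

```python
from collections import OrderedDict


class ToyLoraCache:
    """Toy model of the described rules; not the real backend cache."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()  # task_id -> (weights, config)

    def request(self, task_id, weights=None, config=None):
        if weights is not None and config is not None:
            # First use of a task: weights and config are supplied and cached.
            if task_id not in self._entries and len(self._entries) >= self.capacity:
                self._entries.popitem(last=False)  # evict the oldest entry
            self._entries[task_id] = (weights, config)
        elif task_id not in self._entries:
            # Only the id was given, but nothing is cached for it.
            raise KeyError(f"lora_task_id {task_id} is not cached")
        return self._entries[task_id]


if __name__ == "__main__":
    cache = ToyLoraCache(capacity=2)
    cache.request(0, weights="w0", config="c0")  # full request caches task 0
    cache.request(0)                             # id-only request now works
    cache.request(1, weights="w1", config="c1")
    cache.request(2, weights="w2", config="c2")  # cache full: task 0 evicted
    try:
        cache.request(0)
    except KeyError as err:
        print("error:", err)
```

In this picture, a warmup request that carries `lora_weights` and `lora_config` plays the role of the first full request, which is why later client calls can pass only `lora_task_id`.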
As for the performance degradation you mentioned, I wonder if the result is correct? How much has the performance decreased?
Thanks.
I was able to resolve this. The degradation was caused by alpha scaling not being applied by the provided conversion scripts (now resolved in a PR).
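For readers unfamiliar with the term, alpha scaling refers to the standard LoRA factor `alpha / r` applied to the low-rank update `B @ A`. The sketch below uses plain Python lists to show that folding the factor into `B` before export gives the same update as scaling the product, and that omitting it changes the result. Which matrix the actual conversion script scales is not stated in this thread, so treating `B` as the scaled matrix is an assumption, and all function names are illustrative.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def scale(matrix, factor):
    """Multiply every element of a matrix by a scalar."""
    return [[x * factor for x in row] for row in matrix]


def lora_delta(a, b, alpha):
    """Standard LoRA update: (alpha / r) * B @ A, where r = number of rows of A."""
    rank = len(a)
    return scale(matmul(b, a), alpha / rank)


if __name__ == "__main__":
    a = [[1.0, 2.0]]        # A has shape (r, in) with r = 1
    b = [[3.0], [4.0]]      # B has shape (out, r)
    alpha = 2.0

    reference = lora_delta(a, b, alpha)
    folded = matmul(scale(b, alpha / len(a)), a)  # scale folded into B
    unscaled = matmul(b, a)                       # what you get if scaling is skipped

    print(reference == folded)    # folding the factor into B is equivalent
    print(reference == unscaled)  # skipping it differs unless alpha == r
```

When `alpha` differs from the rank, unscaled weights still load and run without error, which is consistent with the symptom here: warmup succeeds but output quality drops.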
I was able to get a warmup file by using the conversion script (hf_lora_convert.py) but saving out the .bin instead of the default .npy format.
Is warmup supported for the tensorrtllm_backend? If so, it would be nice to have an example of how to upload LoRA adapters as a warmup step.