triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend
Apache License 2.0

Warmup Example of loading LoRa weights #417

Open TheCodeWrangler opened 2 months ago

TheCodeWrangler commented 2 months ago

Is warmup supported for the tensorrtllm_backend? If so, it would be nice to have an example of how to upload LoRA adapters as a warmup step.

byshiue commented 2 months ago

The document is here

TheCodeWrangler commented 2 months ago

I am actually hoping to understand how to perform warmup within the triton-inference-server framework with the LoRA weights.

It would be nice for my client code to not need to be aware of the adapter weights and only need to know the indices.

TheCodeWrangler commented 1 month ago

I have been able to get warmup to load within triton-inference-server and initialize my weights, but because the quality of my model outputs is degraded, I suspect there is an issue with the LoRA weights after conversion.

I have performed the conversion of my lora weights using the script.

I am also saving with "npy=False" to get a binary file, which I am using for warmup on my LoRA weights.

I have tried both the "model.lora_weights.bin" and the "model.lora_weights.npy" files as input to my warmup. I am using bfloat16 weights so I also had to update the datatype of the lora_weights input to the model.
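For reference on the datatype concern: bfloat16 is the upper 16 bits of an IEEE-754 float32, so a raw bf16 file holds 2 bytes per value. The following is a minimal sketch of that encoding, assuming little-endian storage and round-to-nearest-even. The helper names are hypothetical and are not part of the conversion script or the backend.

```python
import struct


def float32_to_bf16_bits(x: float) -> int:
    """Return the 16-bit bfloat16 pattern for x (round-to-nearest-even)."""
    if x != x:  # NaN: return a canonical quiet NaN pattern
        return 0x7FC0
    # Reinterpret the float32 encoding of x as an unsigned 32-bit integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest, ties to even, before dropping the low 16 bits.
    bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + bias) >> 16) & 0xFFFF


def bf16_bits_to_float32(b: int) -> float:
    """Expand a bfloat16 bit pattern back to a Python float."""
    return struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))[0]


def write_bf16_raw(values, path):
    """Write values as headerless little-endian bf16 (hypothetical helper)."""
    encoded = [float32_to_bf16_bits(v) for v in values]
    with open(path, "wb") as f:
        f.write(struct.pack("<%dH" % len(encoded), *encoded))
```

Decoding a few values from the raw file with `bf16_bits_to_float32` and comparing them against the original adapter is one way to check that no unintended cast happened during conversion.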

The model gets through warmup and I am able to call it in inference now by only passing the lora_task_id but I am concerned my adapter weights are not being set correctly (concerned about datatype conversions).

I would appreciate any guidance

My warmup config for the tensorrt_llm model in triton is as follows


model_warmup [
  {
    name: "lora_0_warmup"
    batch_size: 1
    inputs: {
      key: "input_ids"
      value: {
        data_type: TYPE_INT32
        dims: [ 4 ]
        input_data_file: "raw_input_ids"
      }
    }
    inputs: {
      key: "input_lengths"
      value: {
        data_type: TYPE_INT32
        dims: [ 1 ]
        input_data_file: "raw_input_lengths"
      }
    }
    inputs: {
      key: "request_output_len"
      value: {
        data_type: TYPE_UINT32
        dims: [ 1 ]
        input_data_file: "raw_request_output_len"
      }
    }
    inputs: {
      key: "beam_width"
      value: {
        data_type: TYPE_UINT32
        dims: [ 1 ]
        input_data_file: "raw_beam_width"
      }
    }
    inputs: {
      key: "lora_task_id"
      value: {
        data_type: TYPE_UINT64
        dims: [ 1 ]
        input_data_file: "raw_lora_task_id_0"
      }
    }
    inputs: {
      key: "lora_weights"
      value: {
        data_type: TYPE_BF16
        dims: [ 224, 589824 ]
        input_data_file: "model.lora_weights.bin"
      }
    }
    inputs: {
      key: "lora_config"
      value: {
        data_type: TYPE_INT32
        dims: [ 224, 3 ]
        input_data_file: "model.lora_config.npy"
      }
    }
  }
]

VincentJing commented 1 month ago

To perform inference with a specific LoRA for the first time, lora_task_id, lora_weights, and lora_config must all be given. The LoRA will be cached, so that subsequent requests for the same task only require lora_task_id. If the cache is full, the oldest LoRA will be evicted to make space for new ones. An error is returned if lora_task_id is not cached.
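The caching behavior described above can be illustrated with a toy model. This is only a sketch of the semantics as stated in this comment, not the backend's implementation: the class name, the capacity parameter, and the choice to evict by load order are all assumptions, and the real eviction policy may differ.

```python
from collections import OrderedDict


class LoraCacheSketch:
    """Toy model of the described LoRA cache (hypothetical, for illustration)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()  # task_id -> (weights, config), oldest first

    def request(self, task_id, weights=None, config=None):
        if weights is not None and config is not None:
            # First use of a task: weights and config are supplied and cached.
            if task_id not in self._entries and len(self._entries) >= self.capacity:
                self._entries.popitem(last=False)  # evict the oldest entry
            self._entries[task_id] = (weights, config)
        elif task_id not in self._entries:
            # Only the id was given, but nothing is cached under it.
            raise KeyError("lora_task_id %r is not cached" % (task_id,))
        return self._entries[task_id]


cache = LoraCacheSketch(capacity=2)
cache.request(0, weights="w0", config="c0")  # e.g. done once by a warmup request
cache.request(0)                             # later requests pass only the id
```

Under this model, a warmup request that carries all three inputs plays the role of the first call, which is why client requests afterwards only need `lora_task_id`, as long as the adapter has not been evicted.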

VincentJing commented 1 month ago

As for the performance degradation you mentioned, I wonder if the result is correct? How much has the performance decreased?


TheCodeWrangler commented 1 week ago

I was able to resolve this. The degradation was caused by alpha scaling not being applied by the provided conversion scripts (now resolved in a PR).
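For context, the standard LoRA formulation scales the low-rank update by alpha / r, so the effective weight delta is (alpha / r) * B * A. If a runtime applies only B * A, the factor has to be folded into the exported weights. A minimal sketch of that idea in plain Python follows; folding the factor into B is an assumption made for illustration, and the function names are hypothetical rather than taken from the conversion script or the PR.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def fold_alpha_into_b(lora_b, alpha, rank):
    """Return a copy of B pre-multiplied by alpha / rank (hypothetical helper)."""
    scale = alpha / rank
    return [[scale * v for v in row] for row in lora_b]


# Tiny example: rank r = 2, alpha = 4, so the scale is 2.0.
lora_a = [[1.0, 2.0], [3.0, 4.0]]  # shape (r, in_features)
lora_b = [[1.0, 0.0], [0.0, 1.0]]  # shape (out_features, r)

unscaled_delta = matmul(lora_b, lora_a)
scaled_delta = matmul(fold_alpha_into_b(lora_b, alpha=4, rank=2), lora_a)
```

Here `scaled_delta` is twice `unscaled_delta`. Omitting the factor silently changes the strength of the adapter, which would be consistent with outputs that are degraded rather than outright broken.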

I was able to get a warmup file by using the conversion script, but saving out the .bin instead of the default .npy format.
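The distinction matters because Triton reads a warmup `input_data_file` as raw tensor bytes in row-major order, while a .npy file starts with a header. A rough sanity check is sketched below, assuming `batch_size: 1` as in the config earlier in the thread and that the expected size is the product of `dims` times the element size. The helper names and the dtype table are illustrative only.

```python
import os

# Element sizes for the data types used in the warmup config in this thread.
DTYPE_NBYTES = {"TYPE_INT32": 4, "TYPE_UINT32": 4, "TYPE_UINT64": 8, "TYPE_BF16": 2}
NPY_MAGIC = b"\x93NUMPY"  # first bytes of every .npy file


def expected_nbytes(dims, data_type):
    """Byte count of one raw tensor with the given dims and data type."""
    count = 1
    for d in dims:
        count *= d
    return count * DTYPE_NBYTES[data_type]


def check_warmup_file(path, dims, data_type):
    """Return a list of problems found with a raw warmup file (empty if none)."""
    problems = []
    with open(path, "rb") as f:
        if f.read(len(NPY_MAGIC)) == NPY_MAGIC:
            problems.append("file starts with a .npy header, not raw data")
    actual = os.path.getsize(path)
    expected = expected_nbytes(dims, data_type)
    if actual != expected:
        problems.append("expected %d bytes, found %d" % (expected, actual))
    return problems
```

For example, `check_warmup_file("warmup/model.lora_weights.bin", [224, 589824], "TYPE_BF16")` should return an empty list for a correctly exported raw bf16 file; the path here is a placeholder.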