triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend
Apache License 2.0

Warmup Example of loading LoRa weights #417

Open TheCodeWrangler opened 2 months ago

TheCodeWrangler commented 2 months ago

Is warmup supported for the tensorrtllm_backend? If so, it would be nice to have an example of how to upload LoRA adapters as a warmup step.

byshiue commented 2 months ago

The documentation is here: https://github.com/triton-inference-server/tensorrtllm_backend/tree/main/inflight_batcher_llm#running-lora-inference-with-inflight-batching

TheCodeWrangler commented 2 months ago

I am actually hoping to understand how to perform warmup of the LoRA weights within the triton-inference-server framework:

https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/model_configuration.html#model-warmup

It would be nice if my client code did not need to be aware of the adapter weights and only needed to know the task indices.

TheCodeWrangler commented 1 month ago

I have been able to get warmup to run within triton-inference-server and initialize my weights, BUT due to the degraded quality of my model outputs I suspect there is an issue with the LoRA weights after conversion.

I converted my LoRA weights using the hf_lora_convert.py script.

I am also saving with "npy=False" (https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/hf_lora_convert.py#L41) to get a binary file, which I am using for warmup of my LoRA weights.

I have tried both the "model.lora_weights.bin" and the "model.lora_weights.npy" files as input to my warmup. I am using bfloat16 weights, so I also had to update the datatype of the lora_weights input to the model.
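
For context, Triton reads each warmup input_data_file as raw row-major bytes, so the small fixed inputs referenced in the warmup config below (raw_input_ids, raw_input_lengths, etc.) are headerless binary dumps. They can be generated with something like the following sketch (the token values here are placeholders, and the paths assume the files live in the model's warmup/ directory):

import numpy as np

def save_raw(array: np.ndarray, path: str) -> None:
    # Warmup input files are raw row-major bytes with no .npy header
    with open(path, "wb") as f:
        f.write(array.tobytes())

save_raw(np.array([1, 2, 3, 4], dtype=np.int32), "warmup/raw_input_ids")
save_raw(np.array([4], dtype=np.int32), "warmup/raw_input_lengths")
save_raw(np.array([64], dtype=np.uint32), "warmup/raw_request_output_len")
save_raw(np.array([1], dtype=np.uint32), "warmup/raw_beam_width")
save_raw(np.array([0], dtype=np.uint64), "warmup/raw_lora_task_id_0")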

The model gets through warmup, and I am now able to call it for inference by passing only the lora_task_id, but I am concerned my adapter weights are not being set correctly (specifically about datatype conversions).

I would appreciate any guidance.

My warmup config for the tensorrt_llm model in Triton is as follows:

...

model_warmup [
  {
    name: "lora_0_warmup"
    batch_size: 1
    inputs: {
      key: "input_ids"
      value: {
        data_type: TYPE_INT32
        dims: [ 4 ]
        input_data_file: "raw_input_ids"
      }
    }
    inputs: {
      key: "input_lengths"
      value: {
        data_type: TYPE_INT32
        dims: [ 1 ]
        input_data_file: "raw_input_lengths"
      }
    }
    inputs: {
      key: "request_output_len"
      value: {
        data_type: TYPE_UINT32
        dims: [ 1 ]
        input_data_file: "raw_request_output_len"
      }
    }
    inputs: {
      key: "beam_width"
      value: {
        data_type: TYPE_UINT32
        dims: [ 1 ]
        input_data_file: "raw_beam_width"
      }
    }
    inputs: {
      key: "lora_task_id"
      value: {
        data_type: TYPE_UINT64
        dims: [ 1 ]
        input_data_file: "raw_lora_task_id_0"
      }
    }
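    # Assumption based on how hf_lora_convert.py lays these tensors out:
    # each lora_config row is [module id, layer index, adapter rank], and the
    # matching lora_weights row holds that module's flattened in/out weights.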
    inputs: {
      key: "lora_weights"
      value: {
        data_type: TYPE_BF16
        dims: [ 224, 589824 ]
        input_data_file: "model.lora_weights.bin"
      }
    }
    inputs: {
      key: "lora_config"
      value: {
        data_type: TYPE_INT32
        dims: [ 224, 3 ]
        input_data_file: "model.lora_config.npy"
      }
    }
  }
]

VincentJing commented 1 month ago

To perform inference with a specific LoRA for the first time, lora_task_id, lora_weights, and lora_config must all be given. The LoRA will be cached, so subsequent requests for the same task only require lora_task_id. If the cache is full, the oldest LoRA will be evicted to make space for new ones. An error is returned if a request provides only a lora_task_id that is not cached.
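
For example, once the LoRA has been cached, a follow-up client request only needs lora_task_id. A rough sketch with the Python gRPC client (server address, token IDs, output length, and the task id value are placeholders; this assumes the tensorrt_llm model is not running in decoupled/streaming mode):

import numpy as np
import tritonclient.grpc as grpcclient
from tritonclient.utils import np_to_triton_dtype

def make_input(name, data):
    # Wrap a numpy array as a Triton gRPC input tensor
    t = grpcclient.InferInput(name, list(data.shape), np_to_triton_dtype(data.dtype))
    t.set_data_from_numpy(data)
    return t

client = grpcclient.InferenceServerClient("localhost:8001")

input_ids = np.array([[1, 2, 3, 4]], dtype=np.int32)
inputs = [
    make_input("input_ids", input_ids),
    make_input("input_lengths", np.array([[input_ids.shape[1]]], dtype=np.int32)),
    make_input("request_output_len", np.array([[64]], dtype=np.uint32)),
    # No lora_weights / lora_config here: the adapter is already cached
    make_input("lora_task_id", np.array([[0]], dtype=np.uint64)),
]

result = client.infer("tensorrt_llm", inputs)
print(result.as_numpy("output_ids"))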

VincentJing commented 1 month ago

As for the performance degradation you mentioned, are the results still correct? How much has the output quality decreased?

Thanks.

TheCodeWrangler commented 1 week ago

I was able to resolve this. The degradation was caused by alpha scaling not being applied by the provided conversion scripts (now resolved in a PR).

I was able to get a warmup file by using the conversion script (hf_lora_convert.py) but saving out the .bin file instead of the default .npy format.