ray-project / ray-llm

RayLLM - LLMs on Ray
https://aviary.anyscale.com
Apache License 2.0

Minor Update for Model Warmup #59

Closed. FerdinandZhong closed this pull request 11 months ago.

FerdinandZhong commented 11 months ago

Context

I'm running aviary (ray-llm) on GPU instances rented from an online platform that only provides a running Docker container. Therefore I can't run the "anyscale/aviary" image directly inside the instance; instead, I installed all dependencies locally.

The text-generation-inference I installed is cloned from the repo listed in the aviary README.md: https://github.com/Yard1/text-generation-inference/tree/main. The installed version is 0.9.4.

However, I encountered the error shown in the screenshot below while the model was warming up: [Screenshot 2023-09-28 at 10 35 27 PM]

By checking the source code of text-generation-server, I found that for both the regular model and the flash-attention-based model, the warmup function takes only the batch parameter besides self. In versions <= 0.9.3, the warmup function took the parameters (self, batch, max_total_tokens).

The warmup functions of v0.9.4 are shown below: [Screenshot 2023-09-28 at 11 57 46 PM] [Screenshot 2023-09-29 at 12 03 07 AM]
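
For reference, a minimal sketch of the signature change described above (the class names here are illustrative, not from the TGI codebase; bodies are elided):

from typing import Optional

class ModelV093:
    # text-generation-inference <= 0.9.3: warmup receives the token
    # budget as an argument.
    def warmup(self, batch, max_total_tokens: int) -> None:
        ...

class ModelV094:
    # text-generation-inference 0.9.4: warmup takes only the batch and
    # instead returns a suggested max batch total tokens value.
    def warmup(self, batch) -> Optional[int]:
        ...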

Change

Only a one-line change in /aviary/backend/llm/continuous/tgi/tgi_worker.py: from

suggested_max_batch_total_tokens = self._model.warmup(
    batch_state, max_batch_total_tokens
)

to

suggested_max_batch_total_tokens = self._model.warmup(batch_state)
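
If a deployment might run against either TGI release, a version-tolerant call could look like the sketch below. This is my own suggestion, not part of this PR; it assumes the <= 0.9.3 parameter is named max_total_tokens (as noted above) and that the snippet sits inside the same worker method, so self._model, batch_state, and max_batch_total_tokens are already in scope:

import inspect

# Pass the token budget only when the installed warmup still accepts it
# (TGI <= 0.9.3); otherwise use the 0.9.4-style single-argument call.
if "max_total_tokens" in inspect.signature(self._model.warmup).parameters:
    suggested_max_batch_total_tokens = self._model.warmup(
        batch_state, max_batch_total_tokens
    )
else:
    suggested_max_batch_total_tokens = self._model.warmup(batch_state)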

Verification

Verified on my GPU instance with the models light-GPT and llama2-13b-chat-hf.

Yard1 commented 11 months ago

Hi, thank you for this PR! As we have moved to the vLLM backend with the latest release, this should no longer be necessary.