predibase / lorax

Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs
https://loraexchange.ai
Apache License 2.0

start porting latest tgi #480

Closed: flozi00 closed this 1 month ago

flozi00 commented 1 month ago

What does this PR do?

@tgaddair this is just for you to track progress; please do not merge at the moment.

This PR also introduces an FP8 Linear layer and an FP8 KV cache from vLLM.
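For readers unfamiliar with FP8 quantization, here is a rough sketch of the per-tensor scaling idea behind FP8 linear weights and KV caches. This is an illustration only, not the vLLM implementation (which runs fused GPU kernels); the helper names and the omission of E4M3 rounding are assumptions for clarity.

```python
# Illustrative sketch of per-tensor FP8 (E4M3-style) quantization.
# NOT the vLLM kernels; helper names are hypothetical.

FP8_E4M3_MAX = 448.0  # largest finite value representable in the E4M3 format

def compute_scale(values):
    """Per-tensor scale mapping the observed absmax onto the FP8 range."""
    absmax = max(abs(v) for v in values)
    return absmax / FP8_E4M3_MAX if absmax > 0 else 1.0

def quantize(values, scale):
    """Divide by the scale and clamp to the FP8 range.
    A real kernel would additionally round each value to the nearest
    representable E4M3 number; that step is omitted here."""
    return [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in values]

def dequantize(qvalues, scale):
    """Recover approximate original values by multiplying the scale back."""
    return [q * scale for q in qvalues]

weights = [0.5, -2.0, 3.25, -0.125]
scale = compute_scale(weights)
restored = dequantize(quantize(weights, scale), scale)
```

The same scale-and-clamp scheme applies to KV-cache entries, where storing 8-bit values roughly halves cache memory versus FP16 at a small accuracy cost.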

Fixes # (issue)

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

flozi00 commented 1 month ago

Mistral + EETQ tested and working.

flozi00 commented 1 month ago

Llama tested too.

flozi00 commented 1 month ago

Benchmark vs main branch:

| Branch | Input tokens/s | Output tokens/s |
| --- | --- | --- |
| main | 14643 | 218 |
| This PR | 15003 | 236 |
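For context, a small sketch computing the relative change implied by the benchmark numbers above (the dictionary and variable names are hypothetical, not part of the lorax benchmark harness):

```python
# Throughput numbers reported in this thread.
main_branch = {"input_tokens_per_second": 14643, "output_tokens_per_second": 218}
this_pr = {"input_tokens_per_second": 15003, "output_tokens_per_second": 236}

def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100

input_gain = pct_change(main_branch["input_tokens_per_second"],
                        this_pr["input_tokens_per_second"])
output_gain = pct_change(main_branch["output_tokens_per_second"],
                         this_pr["output_tokens_per_second"])
# input_gain is roughly +2.5%, output_gain roughly +8.3%
```

So this PR improves input throughput by about 2.5% and output throughput by about 8.3% over main.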

flozi00 commented 1 month ago

AWQ tested. Sharding tested.