predibase / lorax

Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs
https://loraexchange.ai
Apache License 2.0

start porting latest tgi #480

Closed: flozi00 closed this 1 month ago

flozi00 commented 1 month ago

What does this PR do?

@tgaddair this is just for you to track progress; please do not merge at the moment.

This PR also introduces an FP8 Linear layer and an FP8 KV cache from vLLM.
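For readers unfamiliar with FP8 quantization, here is a rough sketch of the per-tensor scaling idea behind FP8 linear weights and KV caches. This is an illustration only, not the vLLM implementation (which runs fused GPU kernels); the helper names and the omission of E4M3 rounding are assumptions for clarity.

```python
# Illustrative sketch of per-tensor FP8 (E4M3-style) quantization.
# NOT the vLLM kernels; helper names are hypothetical.

FP8_E4M3_MAX = 448.0  # largest finite value representable in the E4M3 format

def compute_scale(values):
    """Per-tensor scale mapping the observed absmax onto the FP8 range."""
    absmax = max(abs(v) for v in values)
    return absmax / FP8_E4M3_MAX if absmax > 0 else 1.0

def quantize(values, scale):
    """Divide by the scale and clamp to the FP8 range.
    A real kernel would additionally round each value to the nearest
    representable E4M3 number; that step is omitted here."""
    return [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in values]

def dequantize(qvalues, scale):
    """Recover approximate original values by multiplying the scale back."""
    return [q * scale for q in qvalues]

weights = [0.5, -2.0, 3.25, -0.125]
scale = compute_scale(weights)
restored = dequantize(quantize(weights, scale), scale)
```

The same scale-and-clamp scheme applies to KV-cache entries, where storing 8-bit values roughly halves cache memory versus FP16 at a small accuracy cost.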

Fixes # (issue)

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

flozi00 commented 1 month ago

Mistral + EETQ tested and working.

flozi00 commented 1 month ago

Llama tested too.

flozi00 commented 1 month ago

Benchmark vs main branch:

| Branch | Input tokens/s | Output tokens/s |
| --- | --- | --- |
| main | 14643 | 218 |
| This PR | 15003 | 236 |
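For context, a small sketch computing the relative change implied by the benchmark numbers above (the dictionary and variable names are hypothetical, not part of the lorax benchmark harness):

```python
# Throughput numbers reported in this thread.
main_branch = {"input_tokens_per_second": 14643, "output_tokens_per_second": 218}
this_pr = {"input_tokens_per_second": 15003, "output_tokens_per_second": 236}

def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100

input_gain = pct_change(main_branch["input_tokens_per_second"],
                        this_pr["input_tokens_per_second"])
output_gain = pct_change(main_branch["output_tokens_per_second"],
                         this_pr["output_tokens_per_second"])
# input_gain is roughly +2.5%, output_gain roughly +8.3%
```

So this PR improves input throughput by about 2.5% and output throughput by about 8.3% over main.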

flozi00 commented 1 month ago

AWQ tested. Sharding tested.