vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

TE FP8 support? #448

Open SinanAkkoyun opened 1 year ago

SinanAkkoyun commented 1 year ago

Hi! Is adding FP8 Transformer Engine (H100) speedup to inference planned? If not, could you please give me an outline of what needs to be done so that I can work on it? A rough sketch of the Transformer Engine FP8 API is included below for reference.

Thank you!
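For context, Transformer Engine exposes FP8 through an autocast context manager. The snippet below is only a minimal sketch of that API on an H100 (layer sizes, dtypes, and the scaling recipe are arbitrary assumptions), not vLLM code.

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# Toy layer and input sizes, chosen only to satisfy TE's FP8 shape constraints
# (leading dim divisible by 8, hidden dims divisible by 16).
layer = te.Linear(4096, 4096, bias=True, params_dtype=torch.float16).cuda()
x = torch.randn(8, 4096, device="cuda", dtype=torch.float16)

# DelayedScaling tracks amax history to pick per-tensor FP8 scaling factors.
fp8_recipe = recipe.DelayedScaling(fp8_format=recipe.Format.HYBRID)

with torch.no_grad(), te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)  # the GEMM runs in FP8 on Hopper tensor cores
print(y.shape)
```

Presumably the integration work would mostly mean routing vLLM's linear projections through te.Linear under such an autocast, while the custom attention/KV-cache kernels stay in FP16, but the vLLM team would know the details best.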

WoosukKwon commented 1 year ago

Hi @SinanAkkoyun, thanks for raising the issue! We are not familiar with Transformer Engine and do not have access to H100 GPUs at the moment. We will let you know after we investigate this further.

casper-hansen commented 11 months ago

Hi @WoosukKwon, any update on this issue? FP8 FLOPS are theoretically 2x those of FP16, and H100s can quite easily be rented from RunPod, Azure, AWS, or the like. A rough micro-benchmark sketch is included below.
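To sanity-check the claimed ~2x, one could time the same GEMM with FP8 autocast on and off on an H100. This is only a sketch (the sizes, iteration counts, and use of te.Linear are my assumptions), not vLLM code:

```python
import time
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

layer = te.Linear(8192, 8192, bias=False, params_dtype=torch.float16).cuda()
x = torch.randn(4096, 8192, device="cuda", dtype=torch.float16)
fp8_recipe = recipe.DelayedScaling(fp8_format=recipe.Format.HYBRID)

def bench(use_fp8: bool, iters: int = 50) -> float:
    """Average seconds per forward pass, with or without FP8 autocast."""
    with torch.no_grad(), te.fp8_autocast(enabled=use_fp8, fp8_recipe=fp8_recipe):
        for _ in range(5):              # warm-up
            layer(x)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            layer(x)
        torch.cuda.synchronize()
        return (time.perf_counter() - start) / iters

print(f"fp16: {bench(False):.6f} s/iter")
print(f"fp8:  {bench(True):.6f} s/iter")
```

For small, memory-bound shapes the measured ratio will likely be well below 2x; the full speedup only shows up on large, compute-bound GEMMs.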

wuchaooooo commented 8 months ago

Hi @WoosukKwon, I would also like to know when vLLM will support FP8 on the H100 (H800). FP8 is 2x faster than FP16.

uncensorie commented 5 months ago

Any update on FP8, @WoosukKwon?

bdambrosio commented 5 months ago

Seconding this; it would be helpful for my batch biomed ontology extraction project.