huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

FP8 inference and FP8 KV cache #23660

Open SinanAkkoyun opened 1 year ago

SinanAkkoyun commented 1 year ago

Feature request

Hi! Could anyone please help me with running HuggingFace models (LLaMA, or if LLaMA is difficult, MPT-7B) with NVIDIA TransformerEngine (TE) FP8 inference? We really need the speedup.

A closely related issue on the TransformerEngine side: https://github.com/NVIDIA/TransformerEngine/issues/199
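
For context, here is a minimal sketch of what TE FP8 inference looks like on a single layer. This is not a LLaMA port; it only exercises TransformerEngine's documented `fp8_autocast` API on one `te.Linear`, and it assumes an H100-class (Hopper/Ada, FP8-capable) GPU with the `transformer-engine` PyTorch package installed. Porting a full HF model would mean swapping its `nn.Linear`/`LayerNorm` modules for the TE equivalents:

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# E4M3 format for the forward pass (inference); DelayedScaling tracks
# per-tensor FP8 scaling factors across iterations.
fp8_recipe = recipe.DelayedScaling(fp8_format=recipe.Format.E4M3)

layer = te.Linear(4096, 4096).cuda()   # drop-in replacement for nn.Linear
x = torch.randn(8, 4096, device="cuda")

# Inside fp8_autocast, the GEMM executes in FP8 on FP8-capable hardware.
with torch.no_grad(), te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)

print(y.shape)  # torch.Size([8, 4096])
```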

Motivation

Faster inference and more specialized tensor operations mean lower cost and lower latency.

Your contribution

I would really love to test suggestions out, as I have temporary access to an H100 cloud GPU. I am not proficient enough to port the models myself, which is why I created this issue.

I really appreciate any help, thank you very much.

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

AhsanAli1288 commented 1 year ago

@SinanAkkoyun have you found a solution for using TransformerEngine with LLaMA?

maxpain commented 1 year ago

Any updates?

amyeroberts commented 3 months ago

Gentle ping @fxmarty

amyeroberts commented 2 months ago

Another ping @fxmarty. Could you nominate someone to take this over for you?

amyeroberts commented 1 month ago

cc @IlyasMoutawwakil