Open SinanAkkoyun opened 1 year ago
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
@SinanAkkoyun have you found a solution for using TransformerEngine with Llama?
Any updates?
Gentle ping @fxmarty
Another ping @fxmarty. Could you nominate someone to take this over for you?
cc @IlyasMoutawwakil
Feature request
Hi! Could anyone please help me with running Hugging Face models (LLaMA, or MPT-7B if LLaMA is difficult) with TransformerEngine (TE) FP8 inference? We really need the speedup. A rough sketch of the kind of approach I have in mind is below, after the related issue link.
A somewhat related issue on the TE side: https://github.com/NVIDIA/TransformerEngine/issues/199
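For concreteness, here is a minimal, untested sketch of what I am imagining: swap every `torch.nn.Linear` in a LLaMA checkpoint for TE's `te.Linear` and run generation under `te.fp8_autocast`. The checkpoint name is a placeholder, FP8 execution requires a Hopper GPU (e.g. H100), and TE imposes divisibility constraints on the GEMM dimensions, so treat this as a starting point rather than a working solution:

```python
# Minimal, untested sketch: replace every torch.nn.Linear in a Hugging Face
# LLaMA model with transformer_engine's te.Linear, then generate under FP8
# autocast. Assumes a Hopper GPU (H100) and that all linear dimensions meet
# TE's FP8 divisibility requirements (true for standard LLaMA sizes).
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
).cuda().eval()

def swap_linears(module: torch.nn.Module) -> None:
    """Recursively replace nn.Linear submodules with te.Linear, copying weights."""
    for name, child in module.named_children():
        if isinstance(child, torch.nn.Linear):
            te_linear = te.Linear(
                child.in_features,
                child.out_features,
                bias=child.bias is not None,
                params_dtype=torch.bfloat16,
            )
            with torch.no_grad():
                te_linear.weight.copy_(child.weight)
                if child.bias is not None:
                    te_linear.bias.copy_(child.bias)
            setattr(module, name, te_linear.cuda())
        else:
            swap_linears(child)

swap_linears(model)

# Delayed-scaling FP8 recipe: E4M3 forward / E5M2 backward ("hybrid").
fp8_recipe = DelayedScaling(
    fp8_format=Format.HYBRID, amax_history_len=16, amax_compute_algo="max"
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")
with torch.no_grad(), te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

I suspect single-token decode steps may violate TE's divisibility requirement on the token dimension, which is part of why I am asking for guidance rather than just running this myself.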
Motivation
Faster inference and more specialized tensor operations mean lower cost and lower latency.
Your contribution
I would really love to test suggestions out, as I have temporary access to an H100 cloud GPU. I am not proficient enough to port the models myself, which is why I created this issue.
I really appreciate any help, thank you very much.