QUESTION: LLaMA optimizations used?

huggingface / text-generation-inference

Large Language Model Text Generation Inference

http://hf.co/docs/text-generation-inference

Apache License 2.0

8.9k stars 1.05k forks source link

QUESTION: LLaMA optimizations used? #380

Closed SinanAkkoyun closed 1 year ago

SinanAkkoyun commented 1 year ago

Hi! I wanted to know what LLaMA optimizations are being used here? It runs super fast!

OlivierDehaene commented 1 year ago

use flash attention
no padding
fused rotary embeddings
fused rms norm

The implementation is here if you want to have a look.

SinanAkkoyun commented 1 year ago

@OlivierDehaene Thank you! ❤️ Is FP8 support coming soon or can I somehow help on implementing it (Transformer Engine H100)?

OlivierDehaene commented 1 year ago

Right now I don't have access to hopper gpus so it will have to wait unfortunately...

SinanAkkoyun commented 1 year ago

@OlivierDehaene Lambda provides "infinite" H100 GPUs for around $2.40/h https://lambdalabs.com/

Would you be okay with working on FP8 support?

SinanAkkoyun commented 1 year ago

If I can help in any way please let me know

SinanAkkoyun commented 1 year ago

Dear @OlivierDehaene ,

Thank you very much for all of your work on text-generation-inference!

Recognizing the effort required, I'm offering to cover the Lambda costs for a H100 instance for your use during implementation. Could you please estimate the timeline for this feature?

Thank you for considering this proposal. Best Regards

OlivierDehaene commented 1 year ago

I'm sorry to disapoint but OSS (or at least this repository) doesn't work like this.

SinanAkkoyun commented 1 year ago

@OlivierDehaene I believe that the FP8 inference speedup will greatly benefit many consumers, especially as the H100 will get more accessible with time. I am also working with the folks at lit-llama to try to get FP8 to work.

With the time estimation I just wanted to get a feeling of how much I would spend, but if you do not want to accept this offer I can totally understand that.

Just so that I know, will you ever be interested in implementing the FP8 speedup yourself? If not, I would be happy to try my best to do a PR, I am just too inexperienced, that's why I asked you. If you do not plan to do it yourself but are happy if someone would do a PR, could you please give me a high level outline of what I would roughly need to do to implement it here safely?

Thank you very much

SinanAkkoyun commented 1 year ago

I myself am only an individual but I am fascinated by open source LLMs catching up to "OpenAI", with great speed like the turbo models we would be one step closer

Narsil commented 1 year ago

Hi @SinanAkkoyun,

Thanks for the enthusiasm. In open source, if you really want something and cannot wait, just code it yourself.

If you want to have your feature merged because it's cool it usually requires some discussions with the maintainers (is it useful to add, is the code you wrote correct etc..) This may require some time.

Here it would definitely be cool to have FP8 support if it can provide some inference speedups, however, right now this is not our focus right now. The best course of action for you is to start coding yourself, on machines you own, and create a PR.

Don't hesitate to create PRs early so we can discuss architecture before too much work is put into it (we have large codebase changes coming up).

If those options are not available, just wait patiently that it rises high enough on our priority list that we actually tackle it (Right now Falcon is much higher and some refacto)