huggingface / optimum

🚀 Accelerate training and inference of 🤗 Transformers and 🤗 Diffusers with easy to use hardware optimization tools
https://huggingface.co/docs/optimum/main/
Apache License 2.0

BetterTransformer inference GPT-NeoX and LLaMa is not faster than huggingface transformers #1051

Closed: cdj0311 closed this issue 1 year ago

cdj0311 commented 1 year ago

Hi, I run inference on GPT-NeoX and LLaMA-7B with BetterTransformer, but I get the same latency as plain Hugging Face Transformers.

Environment: Python 3.10, PyTorch 2.0, CUDA 11.7, transformers 4.29, optimum: latest

My code is as follows:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.bettertransformer import BetterTransformer

checkpoint = "gpt-neox-6b"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)

with torch.no_grad():
    # Load the model in 8-bit across the available GPUs, then convert it to BetterTransformer.
    model_hf = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype="auto", device_map="auto", load_in_8bit=True)
    model = BetterTransformer.transform(model_hf, keep_original_model=False)

    text = "write flask server with python"
    inputs = tokenizer.encode(text, return_tensors="pt").to("cuda")

    # Beam search with 3 beams, returning all 3 sequences.
    outputs = model.generate(inputs, max_new_tokens=100, num_beams=3, num_return_sequences=3)

    for i in range(len(outputs)):
        print(tokenizer.decode(outputs[i]))
KeremTurgutlu commented 1 year ago

I am also interested in this. I don't think LLaMA is supported by BetterTransformer yet; here is the error I got:

NotImplementedError: The model type llama is not yet supported to be used with BetterTransformer. 

Feel free to open an issue at https://github.com/huggingface/optimum/issues if you would like this model type to be supported.

Currently supported models are: dict_keys(['albert', 'bart', 'bert', 'bert-generation', 'blenderbot', 'camembert', 'clip', 'codegen', 'data2vec-text', 'deit', 'distilbert', 'electra', 'ernie', 'fsmt', 'gpt2', 'gptj', 'gpt_neo', 'gpt_neox', 'hubert', 'layoutlm', 'm2m_100', 'marian', 'markuplm', 'mbart', 'opt', 'pegasus', 'rembert', 'prophetnet', 'roberta', 'roc_bert', 'roformer', 'splinter', 'tapas', 't5', 'vilt', 'vit', 'vit_mae', 'vit_msn', 'wav2vec2', 'whisper', 'xlm-roberta', 'yolos']).

Perhaps the reason is that LLaMA uses RoPE (rotary position embeddings); here is a reference that uses FlashAttention in a similar model.
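If it helps, here is a minimal sketch for probing whether a given model type is supported, simply by catching the NotImplementedError that transform() raises for unsupported architectures. The checkpoint name is a placeholder, not one from this thread:

import torch
from transformers import AutoModelForCausalLM
from optimum.bettertransformer import BetterTransformer

checkpoint = "huggyllama/llama-7b"  # placeholder checkpoint name

# Load the model (fp16 here to keep memory down), then try the conversion.
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.float16)
try:
    model = BetterTransformer.transform(model, keep_original_model=False)
    print("BetterTransformer: supported")
except NotImplementedError as e:
    # Unsupported model types raise NotImplementedError, as in the message above.
    print(f"BetterTransformer: not supported -> {e}")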

RiskySignal commented 1 year ago

@KeremTurgutlu Solved by installing Optimum from source: python -m pip install git+https://github.com/huggingface/optimum.git.
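To confirm that the source install actually took effect, you can print the installed versions (a ".dev" suffix usually indicates a source build, though the exact string may vary):

import importlib.metadata

# Print the versions that are actually importable in the current environment.
print("optimum:", importlib.metadata.version("optimum"))
print("transformers:", importlib.metadata.version("transformers"))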

KeremTurgutlu commented 1 year ago

@KeremTurgutlu Solved by installing Optimum from source: python -m pip install git+https://github.com/huggingface/optimum.git.

Thanks! I will immediately try it out ☺️

RiskySignal commented 1 year ago

I tried LLaMA-30B with BetterTransformer on multi-GPU setups (8×A100, 8×V100, and 8×P40) and found that it brought an 11.2% speedup on the A100s, had no impact on the V100s, and even performed worse on the P40s.
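For anyone who wants to reproduce this kind of comparison, here is a rough timing sketch (the checkpoint name is a placeholder and this is not a rigorous benchmark, just warm-up plus an averaged wall-clock measurement):

import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.bettertransformer import BetterTransformer

checkpoint = "EleutherAI/gpt-neox-20b"  # placeholder checkpoint

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.float16, device_map="auto")
inputs = tokenizer("write flask server with python", return_tensors="pt").to("cuda")

def timed_generate(m, n_warmup=1, n_runs=3):
    # Warm up first, then average a few runs; synchronize so GPU work is counted.
    for _ in range(n_warmup):
        m.generate(**inputs, max_new_tokens=100)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_runs):
        m.generate(**inputs, max_new_tokens=100)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_runs

baseline = timed_generate(model)
model = BetterTransformer.transform(model, keep_original_model=False)
print(f"vanilla: {baseline:.2f}s, bettertransformer: {timed_generate(model):.2f}s")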

KeremTurgutlu commented 1 year ago

I tried LLaMA-30B with BetterTransformer on multi-GPU setups (8×A100, 8×V100, and 8×P40) and found that it brought an 11.2% speedup on the A100s, had no impact on the V100s, and even performed worse on the P40s.

Great, thanks for sharing! Inference or training? If training, what distribution and/or partitioning strategy did you use?

RiskySignal commented 1 year ago

Only for inference

RiskySignal commented 1 year ago

I also found that if you use BetterTransformer with PyTorch 2.0 like this:

model = torch.compile(model)
model = BetterTransformer.transform(model)

you will encounter AttributeError: '_hf_hook'. Instead, you should reverse the order:

model = BetterTransformer.transform(model)
model = torch.compile(model)

Then it works.
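For reference, a minimal end-to-end sketch of the working order (the checkpoint name is a placeholder, reused from the original snippet):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.bettertransformer import BetterTransformer

checkpoint = "gpt-neox-6b"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.float16).to("cuda")

# 1) Convert to BetterTransformer first, 2) then compile. Compiling first can fail
#    because the compiled wrapper hides attributes (e.g. accelerate's _hf_hook)
#    that the transform step expects to find on the original modules.
model = BetterTransformer.transform(model, keep_original_model=False)
model = torch.compile(model)

inputs = tokenizer("write flask server with python", return_tensors="pt").to("cuda")
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0]))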

cdj0311 commented 1 year ago

I also found that if you use BetterTransformer with PyTorch 2.0 like this:

model = torch.compile(model)
model = BetterTransformer.transform(model)

you will encounter AttributeError: '_hf_hook'. Instead, you should reverse the order:

model = BetterTransformer.transform(model)
model = torch.compile(model)

Then it works.

Hi, I used this code, but the inference speed did not increase.

RiskySignal commented 1 year ago

@cdj0311 Check your CUDA devices first; I have shown my results above.
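A quick way to list the visible devices and their compute capability (as the numbers above suggest, BetterTransformer mainly helps on newer GPUs such as the A100):

import torch

# Print every visible CUDA device with its name and compute capability.
for i in range(torch.cuda.device_count()):
    name = torch.cuda.get_device_name(i)
    major, minor = torch.cuda.get_device_capability(i)
    print(f"cuda:{i} {name} (compute capability {major}.{minor})")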