I am also interested in this. I don't think LLaMA is supported by BetterTransformer yet; here is the error I got:
NotImplementedError: The model type llama is not yet supported to be used with BetterTransformer.
Feel free to open an issue at https://github.com/huggingface/optimum/issues if you would like this model type to be supported.
Currently supported models are: dict_keys(['albert', 'bart', 'bert', 'bert-generation', 'blenderbot', 'camembert', 'clip', 'codegen', 'data2vec-text', 'deit', 'distilbert', 'electra', 'ernie', 'fsmt', 'gpt2', 'gptj', 'gpt_neo', 'gpt_neox', 'hubert', 'layoutlm', 'm2m_100', 'marian', 'markuplm', 'mbart', 'opt', 'pegasus', 'rembert', 'prophetnet', 'roberta', 'roc_bert', 'roformer', 'splinter', 'tapas', 't5', 'vilt', 'vit', 'vit_mae', 'vit_msn', 'wav2vec2', 'whisper', 'xlm-roberta', 'yolos']).
Perhaps the reason is that it uses RoPE; here is a reference that uses flash attention in a similar model.
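For context, this is the kind of call that raises that error on an optimum release without LLaMA support; a minimal sketch, with an illustrative checkpoint name that is not taken from this thread:
import torch
from transformers import AutoModelForCausalLM
from optimum.bettertransformer import BetterTransformer

model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",  # illustrative LLaMA checkpoint
    torch_dtype=torch.float16,
)
# On releases without the LLaMA converter, the next line raises:
# NotImplementedError: The model type llama is not yet supported ...
model = BetterTransformer.transform(model)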
@KeremTurgutlu Solved by:
python -m pip install git+https://github.com/huggingface/optimum.git
Thanks! I will immediately try it out ☺️
I tried LLaMA-30B with BetterTransformer on multi-GPU setups (8×A100, 8×V100, and 8×P40), and found that it brought an 11.2% speedup on A100, had no impact on V100, and even performed worse on P40.
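For anyone who wants to reproduce this kind of comparison, here is a rough single-GPU timing sketch, assuming a CUDA device and an illustrative checkpoint and prompt (the numbers above came from LLaMA-30B spread over 8 GPUs, so they will not match exactly):
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.bettertransformer import BetterTransformer

model_id = "huggyllama/llama-7b"  # illustrative; the comment above used LLaMA-30B
tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("The quick brown fox", return_tensors="pt").to("cuda")

def bench(model, n_runs=5, max_new_tokens=64):
    # one warm-up generation, then average wall-clock latency over n_runs
    model.generate(**inputs, max_new_tokens=max_new_tokens)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_runs):
        model.generate(**inputs, max_new_tokens=max_new_tokens)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_runs

model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
vanilla = bench(model)
model = BetterTransformer.transform(model)
accelerated = bench(model)
print(f"vanilla: {vanilla:.3f}s  bettertransformer: {accelerated:.3f}s")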
Great, thanks for sharing! Inference or training? If training, what distribution and/or partitioning strategy did you use?
Only for inference
And I found that if you use BetterTransformer with PyTorch 2.0 in this order:
model = torch.compile(model)
model = BetterTransformer.transform(model)
you will encounter AttributeError: '_hf_hook'. Instead, change the order like this:
model = BetterTransformer.transform(model)
model = torch.compile(model)
and it will work.
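Putting it together, a minimal end-to-end sketch of the working order, assuming optimum installed from source (for LLaMA support), PyTorch 2.0, and an illustrative checkpoint name:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.bettertransformer import BetterTransformer

model_id = "huggyllama/llama-7b"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

model = BetterTransformer.transform(model)  # 1) swap in the fused attention path first
model = torch.compile(model)                # 2) then compile the transformed model

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))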
Hi, I tried this, but the inference speed did not increase.
@cdj0311 Check your CUDA devices first; I have shown my results above.
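A quick way to check what device you have and which scaled-dot-product-attention backends PyTorch 2.0 has enabled is something like the sketch below; note that the flash kernel generally needs an Ampere-class GPU (compute capability 8.0+), which matches the A100-vs-V100/P40 results above:
import torch

print(torch.cuda.get_device_name(0))
print(torch.cuda.get_device_capability(0))  # (8, 0) or higher for A100-class GPUs

# Which SDPA backends are currently enabled (PyTorch 2.0+)
print("flash:", torch.backends.cuda.flash_sdp_enabled())
print("mem_efficient:", torch.backends.cuda.mem_efficient_sdp_enabled())
print("math:", torch.backends.cuda.math_sdp_enabled())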
Hi, I ran inference with GPT-NeoX and LLaMA-7B using BetterTransformer, but got the same latency as plain Hugging Face Transformers. Python: 3.10, PyTorch: 2.0, CUDA: 11.7, transformers: 4.29, optimum: latest.
My code is as follows: