huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Llama 2 Transformers Neuron X issue #28396

Closed · liechtym closed this 8 months ago

liechtym commented 9 months ago

I was trying to use the generate API for Llama 2 using the same code from this example: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/transformers-neuronx/transformers-neuronx-developer-guide.html#features

My code:

import torch
from transformers import LlamaForCausalLM
from transformers_neuronx.llama.model import LlamaForSampling
from transformers_neuronx.generation_utils import HuggingFaceGenerationModelAdapter

# Load the original checkpoint on CPU to obtain its config
llama_model_cpu = LlamaForCausalLM.from_pretrained(
    'meta-llama/Llama-2-7b-chat-hf',
    torch_dtype=torch.float16,
)

# Load the pre-split checkpoint and compile it for Neuron
llama_model_neuron = LlamaForSampling.from_pretrained('/home/ubuntu/Llama-2-7b-chat-hf-split', batch_size=1, tp_degree=2, amp='f16')
llama_model_neuron.to_neuron()
print('Config: ', llama_model_cpu.config)

# Wrap the Neuron model behind the Hugging Face generate API
llama_model = HuggingFaceGenerationModelAdapter(llama_model_cpu.config, llama_model_neuron)
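
For reference, once the adapter is constructed, the developer guide linked above uses it through the standard generate API, roughly as follows (the prompt and generation arguments here are illustrative, not taken verbatim from the guide):

from transformers import AutoTokenizer

# Illustrative usage sketch; assumes the adapter above was constructed successfully
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-chat-hf')
inputs = tokenizer('Hello, how are you?', return_tensors='pt')
sample_output = llama_model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    do_sample=True,
    max_length=256,
)
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))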

Error:

Traceback (most recent call last):
  File "modular.py", line 107, in <module>
    chatbot = MiniGPT4LLama2Chatbot(cfg_path, gpu_id)
  File "modular.py", line 62, in __init__
    self.model = model_cls.from_config(model_config)
  File "/home/ubuntu/MiniGPT-4/minigpt4/models/minigpt4.py", line 173, in from_config
    model = cls(
  File "/home/ubuntu/MiniGPT-4/minigpt4/models/minigpt4.py", line 45, in __init__
    super().__init__(
  File "/home/ubuntu/MiniGPT-4/minigpt4/models/minigpt_base.py", line 43, in __init__
    self.llama_model, self.llama_tokenizer = self.init_llm(
  File "/home/ubuntu/MiniGPT-4/minigpt4/models/base_model.py", line 202, in init_llm
    llama_model = HuggingFaceGenerationModelAdapter(llama_model_cpu.config, llama_model_neuron)
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/generation_utils.py", line 18, in __init__
    super().__init__(config)
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers/modeling_utils.py", line 1190, in __init__
    config = self._autoset_attn_implementation(
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers/modeling_utils.py", line 1311, in _autoset_attn_implementation
    config = cls._check_and_enable_sdpa(
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers/modeling_utils.py", line 1464, in _check_and_enable_sdpa
    raise ValueError(
ValueError: HuggingFaceGenerationModelAdapter does not support an attention implementation through torch.nn.functional.scaled_dot_product_attention yet. Please open an issue on GitHub to request support for this architecture: https://github.com/huggingface/transformers/issues/new

Is there a workaround for this? Or is supporting this attention implementation the only way? I simply want to use the generate API with a Neuron-compiled model.
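
One possible workaround (a minimal, untested sketch; it assumes that explicitly requesting eager attention on the config makes transformers skip its SDPA check) would be to force eager attention before constructing the adapter:

# Hypothetical workaround: request eager attention explicitly so that
# transformers does not try to auto-enable SDPA for the adapter class
llama_model_cpu.config._attn_implementation = 'eager'
llama_model = HuggingFaceGenerationModelAdapter(llama_model_cpu.config, llama_model_neuron)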

ArthurZucker commented 9 months ago

Sorry, I don't think this is related to transformers, as there is a wrapper around it. sdpa is natively supported in transformers.

github-actions[bot] commented 8 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed, please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.