huggingface / optimum-neuron

Easy, fast and very cheap training and inference on AWS Trainium and Inferentia chips.
Apache License 2.0

optimum-neuron 0.0.23 has failure with Llama2 that was fine on 0.0.22 - HF DLAMI #617

Closed: jimburtoft closed this issue 3 months ago

jimburtoft commented 4 months ago

System Info

The original environment was HF DLAMI 20240531 with optimum-neuron 0.0.23

To produce the successful run shown below, I downgraded with:

pip install optimum-neuron==0.0.22

The environment details below reflect that downgraded install.

Platform:

- Platform: Linux-5.15.0-1056-aws-x86_64-with-glibc2.29
- Python version: 3.8.10

Python packages:

- `optimum-neuron` version: 0.0.22
- `neuron-sdk` version: 2.18.0
- `optimum` version: 1.18.1
- `transformers` version: 4.36.2
- `huggingface_hub` version: 0.23.2
- `torch` version: 2.1.2+cu121
- `aws-neuronx-runtime-discovery` version: 2.9
- `libneuronxla` version: 2.0.965
- `neuronx-cc` version: 2.13.66.0+6dfecc895
- `neuronx-distributed` version: 0.7.0
- `neuronx-hwm` version: NA
- `torch-neuronx` version: 2.1.2.2.1.0
- `torch-xla` version: 2.1.2
- `transformers-neuronx` version: 0.10.0.21

Neuron Driver:

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

aws-neuronx-collectives/unknown,now 2.20.22.0-c101c322e amd64 [installed]
aws-neuronx-dkms/unknown,now 2.16.7.0 amd64 [installed]
aws-neuronx-oci-hook/unknown,now 2.3.0.0 amd64 [installed]
aws-neuronx-runtime-lib/unknown,now 2.20.22.0-1b3ca6425 amd64 [installed]
aws-neuronx-tools/unknown,now 2.17.1.0 amd64 [installed]

Who can help?

@dacorvo @JingyaHuang

Reproduction (minimal, reproducible, runnable)

The following code works on optimum-neuron 0.0.22 but fails on 0.0.23. It fails on the HF DLAMI 20240531 on an inf2.8xlarge (it worked on the previous HF DLAMI).

Code:

from optimum.neuron import NeuronModelForCausalLM
from transformers import AutoTokenizer

model_to_test = "NousResearch/Llama-2-7b-chat-hf"
num_cores = 2
sequence_length = 4096
batch_size = 2

tokenizer = AutoTokenizer.from_pretrained(model_to_test)

compiler_args = {"num_cores": num_cores, "auto_cast_type": "fp16"}
input_shapes = {"batch_size": batch_size, "sequence_length": sequence_length}

model = NeuronModelForCausalLM.from_pretrained(model_to_test, export=True, **compiler_args, **input_shapes)

On 0.0.23:

>>> model = NeuronModelForCausalLM.from_pretrained(model_to_test, export=True, **compiler_args, **input_shapes)
/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 583/583 [00:00<00:00, 1.30MB/s]
model.safetensors.index.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 26.8k/26.8k [00:00<00:00, 58.4MB/s]
model-00001-of-00002.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.98G/9.98G [00:42<00:00, 236MB/s]
model-00002-of-00002.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3.50G/3.50G [00:14<00:00, 244MB/s]
Downloading shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:56<00:00, 28.44s/it]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.53it/s]
generation_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 179/179 [00:00<00:00, 186kB/s]
/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers/generation/configuration_utils.py:515: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
  warnings.warn(
/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers/generation/configuration_utils.py:520: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.6` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
  warnings.warn(
/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers/generation/configuration_utils.py:515: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.
  warnings.warn(
/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers/generation/configuration_utils.py:520: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.6` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`.
  warnings.warn(
Traceback (most recent call last):
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers/generation/configuration_utils.py", line 695, in save_pretrained
    raise ValueError(str([w.message for w in caught_warnings]))
ValueError: [UserWarning('`do_sample` is set to `False`. However, `temperature` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.'), UserWarning('`do_sample` is set to `False`. However, `top_p` is set to `0.6` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`.')]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/optimum/modeling_base.py", line 402, in from_pretrained
    return from_pretrained_method(
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/optimum/neuron/utils/require_utils.py", line 50, in wrapper
    return func(*args, **kwargs)
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/optimum/neuron/modeling_decoder.py", line 324, in _from_transformers
    return cls._export(*args, **kwargs)
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/optimum/neuron/utils/require_utils.py", line 50, in wrapper
    return func(*args, **kwargs)
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/optimum/neuron/modeling_decoder.py", line 358, in _export
    checkpoint_dir = cls._create_checkpoint(
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/optimum/neuron/modeling_decoder.py", line 256, in _create_checkpoint
    model.save_pretrained(checkpoint_dir.name)
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers/modeling_utils.py", line 2475, in save_pretrained
    model_to_save.generation_config.save_pretrained(save_directory)
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers/generation/configuration_utils.py", line 697, in save_pretrained
    raise ValueError(
ValueError: The generation config instance is invalid -- `.validate()` throws warnings and/or exceptions. Fix these issues to save the configuration.

Thrown during validation:
[UserWarning('`do_sample` is set to `False`. However, `temperature` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.'), UserWarning('`do_sample` is set to `False`. However, `top_p` is set to `0.6` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`.')]
>>> save_dir=("./BrokenLlama-2-7b-chat-hf-cores-" + str(num_cores)+ "-sq-" + str(sequence_length) + "-bs-" )
>>> model.save_pretrained(save_dir + str(batch_size))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'model' is not defined
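For context, the validation that trips during export can be sketched in plain Python (this is an illustrative sketch, not the actual transformers code): the shipped generation_config.json sets sampling-only fields while do_sample is False, which recent transformers versions reject on save.

```python
def find_invalid_sampling_flags(config: dict) -> list:
    """Return sampling-only keys that are set while do_sample is False.

    Mimics (loosely) the GenerationConfig.validate() checks that turn
    into a hard error on save_pretrained() in newer transformers.
    """
    sampling_only = ("temperature", "top_p", "top_k", "typical_p")
    if config.get("do_sample", False):
        return []  # sampling enabled: these fields are legitimate
    return [k for k in sampling_only if k in config]

# Llama-2-chat's shipped generation config looks roughly like this:
llama2_cfg = {"do_sample": False, "temperature": 0.9, "top_p": 0.6, "max_length": 4096}
print(find_invalid_sampling_flags(llama2_cfg))  # ['temperature', 'top_p']

# Setting do_sample=True (or unsetting temperature/top_p) makes it valid:
print(find_invalid_sampling_flags({**llama2_cfg, "do_sample": True}))  # []
```

On 0.0.22 the same condition only produced warnings, which is why the export there succeeds with noisy output rather than raising.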

Working on 0.0.22:

>>> model = NeuronModelForCausalLM.from_pretrained(model_to_test, export=True, **compiler_args, **input_shapes)
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.96it/s]
/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers/generation/configuration_utils.py:389: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
  warnings.warn(
/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers/generation/configuration_utils.py:394: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.6` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
  warnings.warn(
/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers/generation/configuration_utils.py:557: UserWarning: The generation config instance is invalid -- `.validate()` throws warnings and/or exceptions. Fix these issues to save the configuration. This warning will be raised to an exception in v4.34.

Thrown during validation:
`do_sample` is set to `False`. However, `temperature` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.
  warnings.warn(
/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers/generation/configuration_utils.py:389: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
  warnings.warn(
/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers/generation/configuration_utils.py:394: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.6` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
  warnings.warn(
2024-06-01 02:48:14.000183:  3274  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-06-01 02:48:33.000265:  3364  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-06-01 02:48:33.000313:  3365  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-06-01 02:48:33.000665:  3364  INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.13.66.0+6dfecc895/MODULE_af97e15eb5b056af300b+2c2d707e/model.neff. Exiting with a successfully compiled graph.
2024-06-01 02:48:33.000668:  3364  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-06-01 02:48:33.000893:  3365  INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.13.66.0+6dfecc895/MODULE_161d550e91fe728b06bd+2c2d707e/model.neff. Exiting with a successfully compiled graph.
2024-06-01 02:48:33.000897:  3365  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-Jun-01 02:48:36.0206 3274:3362 [1] nccl_net_ofi_init:1415 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2024-Jun-01 02:48:36.0206 3274:3362 [1] init.cc:137 CCOM WARN OFI plugin initNet() failed is EFA enabled?

Expected behavior

The output is included above. The export on 0.0.23 should succeed just as it does on 0.0.22.

dacorvo commented 4 months ago

This model's generation config is invalid (it always has been), and the latest version of transformers rejects it when we try to save it during export: https://github.com/huggingface/transformers/blob/96eb06286b63c9c93334d507e632c175d6ba8b28/src/transformers/generation/configuration_utils.py#L720

What could be done is to detect the invalid generation_config during export and skip it (in that case, a new one would be created from the model).
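That skip-and-rebuild idea can be sketched in plain Python (an illustrative sketch with hypothetical names, not the actual optimum-neuron implementation):

```python
def export_generation_config(shipped: dict, model_config: dict) -> dict:
    """Hypothetical helper: return a generation config safe to save at export.

    If the checkpoint's shipped generation config fails validation,
    discard it and rebuild a minimal one from the model config.
    """
    def is_valid(cfg: dict) -> bool:
        # Sampling-only fields must not be set when do_sample is False.
        if cfg.get("do_sample", False):
            return True
        return not any(k in cfg for k in ("temperature", "top_p", "top_k", "typical_p"))

    if is_valid(shipped):
        return dict(shipped)
    # Fall back to a fresh config derived from the model config.
    return {
        "bos_token_id": model_config.get("bos_token_id"),
        "eos_token_id": model_config.get("eos_token_id"),
    }

broken = {"do_sample": False, "temperature": 0.9, "top_p": 0.6}
print(export_generation_config(broken, {"bos_token_id": 1, "eos_token_id": 2}))
```

A user-side workaround along the same lines is to load the model, set a valid generation_config (e.g. do_sample=True), save it locally, and export from the fixed local checkpoint.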

dacorvo commented 4 months ago

I created a pull request to update the model generation config: https://huggingface.co/NousResearch/Llama-2-7b-chat-hf/discussions/9

And here is a pull request with a workaround for optimum-neuron, to be included in the next release: https://github.com/huggingface/optimum-neuron/pull/618

dacorvo commented 3 months ago

@jimburtoft since the Nous models have been updated, can we close this?