huggingface / optimum-neuron

Easy, fast and very cheap training and inference on AWS Trainium and Inferentia chips.
Apache License 2.0

optimum-neuron 0.0.23 has failure with Llama2 that was fine on 0.0.22 - HF DLAMI #617

Closed: jimburtoft closed this issue 3 months ago

jimburtoft commented 4 months ago

System Info

The original environment was HF DLAMI 20240531 with optimum-neuron 0.0.23

To produce the successful run shown below, I downgraded with:

pip install optimum-neuron==0.0.22

The environment details below reflect that downgraded install.

Platform:

- Platform: Linux-5.15.0-1056-aws-x86_64-with-glibc2.29
- Python version: 3.8.10

Python packages:

- `optimum-neuron` version: 0.0.22
- `neuron-sdk` version: 2.18.0
- `optimum` version: 1.18.1
- `transformers` version: 4.36.2
- `huggingface_hub` version: 0.23.2
- `torch` version: 2.1.2+cu121
- `aws-neuronx-runtime-discovery` version: 2.9
- `libneuronxla` version: 2.0.965
- `neuronx-cc` version: 2.13.66.0+6dfecc895
- `neuronx-distributed` version: 0.7.0
- `neuronx-hwm` version: NA
- `torch-neuronx` version: 2.1.2.2.1.0
- `torch-xla` version: 2.1.2
- `transformers-neuronx` version: 0.10.0.21

Neuron Driver:

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

aws-neuronx-collectives/unknown,now 2.20.22.0-c101c322e amd64 [installed]
aws-neuronx-dkms/unknown,now 2.16.7.0 amd64 [installed]
aws-neuronx-oci-hook/unknown,now 2.3.0.0 amd64 [installed]
aws-neuronx-runtime-lib/unknown,now 2.20.22.0-1b3ca6425 amd64 [installed]
aws-neuronx-tools/unknown,now 2.17.1.0 amd64 [installed]

Who can help?

@dacorvo @JingyaHuang

Reproduction (minimal, reproducible, runnable)

The following code works on optimum-neuron 0.0.22 but fails on 0.0.23. It fails on the HF DLAMI 20240531 on an inf2.8xlarge (it worked on the previous HF DLAMI).

Code:

from optimum.neuron import NeuronModelForCausalLM
from transformers import AutoTokenizer

model_to_test = "NousResearch/Llama-2-7b-chat-hf"
num_cores = 2
sequence_length = 4096
batch_size = 2

tokenizer = AutoTokenizer.from_pretrained(model_to_test)

compiler_args = {"num_cores": num_cores, "auto_cast_type": "fp16"}
input_shapes = {"batch_size": batch_size, "sequence_length": sequence_length}

model = NeuronModelForCausalLM.from_pretrained(model_to_test, export=True, **compiler_args, **input_shapes)

On 0.0.23:

>>> model = NeuronModelForCausalLM.from_pretrained(model_to_test, export=True, **compiler_args, **input_shapes)
/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 583/583 [00:00<00:00, 1.30MB/s]
model.safetensors.index.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 26.8k/26.8k [00:00<00:00, 58.4MB/s]
model-00001-of-00002.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.98G/9.98G [00:42<00:00, 236MB/s]
model-00002-of-00002.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3.50G/3.50G [00:14<00:00, 244MB/s]
Downloading shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:56<00:00, 28.44s/it]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.53it/s]
generation_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 179/179 [00:00<00:00, 186kB/s]
/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers/generation/configuration_utils.py:515: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
  warnings.warn(
/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers/generation/configuration_utils.py:520: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.6` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
  warnings.warn(
/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers/generation/configuration_utils.py:515: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.
  warnings.warn(
/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers/generation/configuration_utils.py:520: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.6` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`.
  warnings.warn(
Traceback (most recent call last):
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers/generation/configuration_utils.py", line 695, in save_pretrained
    raise ValueError(str([w.message for w in caught_warnings]))
ValueError: [UserWarning('`do_sample` is set to `False`. However, `temperature` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.'), UserWarning('`do_sample` is set to `False`. However, `top_p` is set to `0.6` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`.')]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/optimum/modeling_base.py", line 402, in from_pretrained
    return from_pretrained_method(
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/optimum/neuron/utils/require_utils.py", line 50, in wrapper
    return func(*args, **kwargs)
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/optimum/neuron/modeling_decoder.py", line 324, in _from_transformers
    return cls._export(*args, **kwargs)
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/optimum/neuron/utils/require_utils.py", line 50, in wrapper
    return func(*args, **kwargs)
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/optimum/neuron/modeling_decoder.py", line 358, in _export
    checkpoint_dir = cls._create_checkpoint(
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/optimum/neuron/modeling_decoder.py", line 256, in _create_checkpoint
    model.save_pretrained(checkpoint_dir.name)
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers/modeling_utils.py", line 2475, in save_pretrained
    model_to_save.generation_config.save_pretrained(save_directory)
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers/generation/configuration_utils.py", line 697, in save_pretrained
    raise ValueError(
ValueError: The generation config instance is invalid -- `.validate()` throws warnings and/or exceptions. Fix these issues to save the configuration.

Thrown during validation:
[UserWarning('`do_sample` is set to `False`. However, `temperature` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.'), UserWarning('`do_sample` is set to `False`. However, `top_p` is set to `0.6` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`.')]
>>> save_dir=("./BrokenLlama-2-7b-chat-hf-cores-" + str(num_cores)+ "-sq-" + str(sequence_length) + "-bs-" )
>>> model.save_pretrained(save_dir + str(batch_size))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'model' is not defined
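For context, the validation that trips during export can be sketched in plain Python (this is an illustrative sketch, not the actual transformers code): the shipped generation_config.json sets sampling-only fields while do_sample is False, which recent transformers versions reject on save.

```python
def find_invalid_sampling_flags(config: dict) -> list:
    """Return sampling-only keys that are set while do_sample is False.

    Mimics (loosely) the GenerationConfig.validate() checks that turn
    into a hard error on save_pretrained() in newer transformers.
    """
    sampling_only = ("temperature", "top_p", "top_k", "typical_p")
    if config.get("do_sample", False):
        return []  # sampling enabled: these fields are legitimate
    return [k for k in sampling_only if k in config]

# Llama-2-chat's shipped generation config looks roughly like this:
llama2_cfg = {"do_sample": False, "temperature": 0.9, "top_p": 0.6, "max_length": 4096}
print(find_invalid_sampling_flags(llama2_cfg))  # ['temperature', 'top_p']

# Setting do_sample=True (or unsetting temperature/top_p) makes it valid:
print(find_invalid_sampling_flags({**llama2_cfg, "do_sample": True}))  # []
```

On 0.0.22 the same condition only produced warnings, which is why the export there succeeds with noisy output rather than raising.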

Working on 0.0.22:

>>> model = NeuronModelForCausalLM.from_pretrained(model_to_test, export=True, **compiler_args, **input_shapes)
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.96it/s]
/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers/generation/configuration_utils.py:389: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
  warnings.warn(
/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers/generation/configuration_utils.py:394: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.6` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
  warnings.warn(
/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers/generation/configuration_utils.py:557: UserWarning: The generation config instance is invalid -- `.validate()` throws warnings and/or exceptions. Fix these issues to save the configuration. This warning will be raised to an exception in v4.34.

Thrown during validation:
`do_sample` is set to `False`. However, `temperature` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.
  warnings.warn(
/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers/generation/configuration_utils.py:389: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
  warnings.warn(
/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers/generation/configuration_utils.py:394: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.6` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
  warnings.warn(
2024-06-01 02:48:14.000183:  3274  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-06-01 02:48:33.000265:  3364  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-06-01 02:48:33.000313:  3365  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-06-01 02:48:33.000665:  3364  INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.13.66.0+6dfecc895/MODULE_af97e15eb5b056af300b+2c2d707e/model.neff. Exiting with a successfully compiled graph.
2024-06-01 02:48:33.000668:  3364  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-06-01 02:48:33.000893:  3365  INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.13.66.0+6dfecc895/MODULE_161d550e91fe728b06bd+2c2d707e/model.neff. Exiting with a successfully compiled graph.
2024-06-01 02:48:33.000897:  3365  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-Jun-01 02:48:36.0206 3274:3362 [1] nccl_net_ofi_init:1415 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2024-Jun-01 02:48:36.0206 3274:3362 [1] init.cc:137 CCOM WARN OFI plugin initNet() failed is EFA enabled?

Expected behavior

The output is included above. The export on 0.0.23 should succeed just as it does on 0.0.22.

dacorvo commented 4 months ago

This model's generation config is invalid (it always has been), and the latest version of transformers rejects it when we try to save it during export: https://github.com/huggingface/transformers/blob/96eb06286b63c9c93334d507e632c175d6ba8b28/src/transformers/generation/configuration_utils.py#L720

What could be done is to detect the invalid generation_config during export and skip it (in that case, a new one would be created from the model).
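That skip-and-rebuild idea can be sketched in plain Python (an illustrative sketch with hypothetical names, not the actual optimum-neuron implementation):

```python
def export_generation_config(shipped: dict, model_config: dict) -> dict:
    """Hypothetical helper: return a generation config safe to save at export.

    If the checkpoint's shipped generation config fails validation,
    discard it and rebuild a minimal one from the model config.
    """
    def is_valid(cfg: dict) -> bool:
        # Sampling-only fields must not be set when do_sample is False.
        if cfg.get("do_sample", False):
            return True
        return not any(k in cfg for k in ("temperature", "top_p", "top_k", "typical_p"))

    if is_valid(shipped):
        return dict(shipped)
    # Fall back to a fresh config derived from the model config.
    return {
        "bos_token_id": model_config.get("bos_token_id"),
        "eos_token_id": model_config.get("eos_token_id"),
    }

broken = {"do_sample": False, "temperature": 0.9, "top_p": 0.6}
print(export_generation_config(broken, {"bos_token_id": 1, "eos_token_id": 2}))
```

A user-side workaround along the same lines is to load the model, set a valid generation_config (e.g. do_sample=True), save it locally, and export from the fixed local checkpoint.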

dacorvo commented 4 months ago

I created a pull request to update the model generation config: https://huggingface.co/NousResearch/Llama-2-7b-chat-hf/discussions/9

And here is a pull request with a workaround for optimum-neuron, to be included in the next release: https://github.com/huggingface/optimum-neuron/pull/618

dacorvo commented 3 months ago

@jimburtoft since the Nous models have been updated, can we close this?