openvinotoolkit / nncf

Neural Network Compression Framework for enhanced OpenVINO™ inference
Apache License 2.0

IndexError: list index out of range when I try to quantize llama models using OVQuantizer #2755

Closed Alwahsh closed 3 months ago

Alwahsh commented 3 months ago

🐛 Describe the bug

Hello,

I'm trying to quantize llama models using OVQuantizer, but I'm facing an error:

IndexError: list index out of range

I tried Llama 3 and Llama 2.

Environment

about-time==4.2.1
accelerate==0.31.0
aiohttp==3.9.5
aiosignal==1.3.1
alive-progress==3.1.5
asttokens==2.4.1
async-timeout==4.0.3
attrs==23.2.0
auto-gptq==0.7.1
autograd==1.6.2
backcall==0.2.0
bitsandbytes==0.42.0
certifi==2024.6.2
charset-normalizer==3.3.2
click==8.1.7
cma==3.2.2
coloredlogs==15.0.1
comm==0.2.2
contourpy==1.1.1
cycler==0.12.1
datasets==2.19.2
debugpy==1.8.1
decorator==5.1.1
Deprecated==1.2.14
dill==0.3.6
docker-pycreds==0.4.0
evaluate==0.4.2
executing==2.0.1
filelock==3.14.0
fonttools==4.53.0
frozenlist==1.4.1
fsspec==2024.3.1
future==1.0.0
gekko==1.1.1
gitdb==4.0.11
GitPython==3.1.43
grapheme==0.6.0
huggingface-hub==0.23.2
humanfriendly==10.0
idna==3.7
importlib-metadata==7.1.0
importlib-resources==6.4.0
inquirerpy==0.3.4
ipykernel==6.29.4
ipython==8.12.3
jedi==0.19.1
jinja2==3.1.4
joblib==1.4.2
jsonschema==4.22.0
jsonschema-specifications==2023.12.1
jstyleson==0.0.2
jupyter-client==8.6.2
jupyter-core==5.7.2
kiwisolver==1.4.5
markdown-it-py==3.0.0
MarkupSafe==2.1.5
matplotlib==3.7.5
matplotlib-inline==0.1.7
mdurl==0.1.2
ml-dtypes==0.2.0
mpmath==1.3.0
multidict==6.0.5
multiprocess==0.70.16
natsort==8.4.0
nest-asyncio==1.6.0
networkx==3.1
ninja==1.10.2.4
-e git+https://github.com/openvinotoolkit/nncf.git@544d51417c221245bf52220905c9287d76b1ed31#egg=nncf
numpy==1.24.4
nvidia-cublas-cu11==11.11.3.6
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu11==11.8.87
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu11==11.8.89
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu11==11.8.89
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu11==8.7.0.84
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu11==10.9.0.58
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu11==10.3.0.86
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu11==11.4.1.48
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu11==11.7.5.86
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu11==2.20.5
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.5.40
nvidia-nvtx-cu11==11.8.86
nvidia-nvtx-cu12==12.1.105
onnx==1.16.1
onnxscript==0.1.0.dev20240611
openvino==2024.2.0
openvino-telemetry==2024.1.0
optimum==1.20.0
optimum-intel==1.18.0.dev0+906668b
packaging==24.0
pandas==2.0.3
parso==0.8.4
-e git+https://github.com/huggingface/peft.git@8221246f2f48b665c853605e1d0ddaf1d0ce39c9#egg=peft
pexpect==4.9.0
pfzy==0.3.4
pickleshare==0.7.5
pillow==10.3.0
pkgutil-resolve-name==1.3.10
platformdirs==4.2.2
prompt-toolkit==3.0.46
protobuf==4.25.3
psutil==5.9.8
ptyprocess==0.7.0
pure-eval==0.2.2
pyarrow==16.1.0
pyarrow-hotfix==0.6
pydot==2.0.0
pygments==2.18.0
pymoo==0.6.1.1
pyparsing==3.1.2
python-dateutil==2.9.0.post0
pytz==2024.1
PyYAML==6.0.1
pyzmq==26.0.3
referencing==0.35.1
regex==2024.5.15
requests==2.32.3
responses==0.18.0
rich==13.7.1
rouge==1.0.1
rpds-py==0.18.1
safetensors==0.4.3
scikit-learn==1.3.2
scipy==1.10.1
sentencepiece==0.2.0
sentry-sdk==2.4.0
setproctitle==1.3.3
six==1.16.0
smmap==5.0.1
stack-data==0.6.3
sympy==1.12.1
texttable==1.7.0
threadpoolctl==3.5.0
tokenizers==0.19.1
torch==2.1.2+cu118
tornado==6.4.1
tqdm==4.66.4
traitlets==5.14.3
transformers==4.41.2
triton==2.1.0
typing-extensions==4.12.1
tzdata==2024.1
urllib3==2.2.1
wandb==0.17.0
wcwidth==0.2.13
wrapt==1.16.0
xxhash==3.4.1
yarl==1.9.4
zipp==3.19.2

Minimal Reproducible Example

from functools import partial

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from optimum.intel.openvino import OVQuantizer
from optimum.intel.openvino import OVConfig

from openvino.runtime import Core

model_name = 'meta-llama/Meta-Llama-3-8B'

model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

def preprocess_function(examples, tokenizer):
    """
    Define a function that tokenizes the data and returns it in the format expected by the model.

    :param: examples: a dictionary containing the input data which are the items from caliration dataset.
            tokenizer: a tokenizer object that is used to tokenize the text data.
    :returns:
            the data that can be fed directly to the model.
    """
    return tokenizer(
        examples["text"], max_length=128, truncation=True
    )

# Create the quantization config (default) and the OVQuantizer.
# OVConfig is a wrapper class on top of the NNCF config.
# Use the "compression" field to control quantization parameters.
# For more information about the parameters, refer to the NNCF GitHub documentation.
quantization_config = OVConfig()
quantizer = OVQuantizer.from_pretrained(model, task='text-generation')

# Instantiate a dataset and convert it to a calibration dataset using the HF API.
# The latter produces model inputs.
dataset = load_dataset("wikitext", 'wikitext-103-raw-v1')
calibration_dataset = quantizer.get_calibration_dataset(
    "wikitext",
    dataset_config_name='wikitext-103-raw-v1',
    preprocess_function=partial(preprocess_function, tokenizer=tokenizer),
    num_samples=100,
    dataset_split="train",
)
# Apply static quantization and export the resulting quantized model to OpenVINO IR format
quantizer.quantize(
    calibration_dataset=calibration_dataset, save_directory='Meta-Llama-3-8B'
)

Are you going to submit a PR?

nikita-savelyevv commented 3 months ago

@Alwahsh I see you are using an old version of NNCF dating back to February 21. The issue you're facing was fixed in a later version of NNCF. I would suggest updating to the latest release.
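
For reference, replacing the editable checkout with the latest published release can be as simple as:

pip install --upgrade nncf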

As a side note: in the code example you're trying to apply PTQ (post-training quantization) to an LLM. In general this significantly degrades generation quality. Quantizing LLM activations introduces large quantization errors because activation value ranges differ drastically across channels. That's why, at the moment, we only apply weight compression to LLM models.
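
For illustration, weight-only INT8 compression of an already-exported OpenVINO IR can be sketched roughly like this (the IR paths below are placeholders, not from this issue):

import openvino as ov
import nncf

core = ov.Core()
ov_model = core.read_model("Meta-Llama-3-8B/openvino_model.xml")  # placeholder path to the exported IR

# compress_weights applies INT8 weight compression by default; activations stay in floating point
compressed_model = nncf.compress_weights(ov_model)
ov.save_model(compressed_model, "Meta-Llama-3-8B-int8/openvino_model.xml")

optimum-intel exposes the same weight compression through its 8-bit export options, so you don't have to drop down to the NNCF API if you prefer to stay at that level.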

Alwahsh commented 3 months ago

Upgrading resolved the problem. Thanks for the suggestion and the side note. You're right, but I'm trying to make use of a feature that requires activations to be in INT8.
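
For context, the full INT8 PTQ flow I'm after (quantizing activations as well) looks roughly like this when going through NNCF directly; the IR path and the model input handling below are placeholders:

import openvino as ov
import nncf
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
data = load_dataset("wikitext", "wikitext-103-raw-v1", split="train[:100]")

def transform_fn(example):
    # Map one dataset item to a dict of model inputs; the exact input names
    # depend on how the model was exported, so this is an assumption
    tokens = tokenizer(example["text"], max_length=128, truncation=True, return_tensors="np")
    return dict(tokens)

core = ov.Core()
ov_model = core.read_model("Meta-Llama-3-8B/openvino_model.xml")  # placeholder path to the exported IR
calibration_dataset = nncf.Dataset(data, transform_fn)

# nncf.quantize performs static PTQ: both weights and activations are quantized to INT8
quantized_model = nncf.quantize(ov_model, calibration_dataset)
ov.save_model(quantized_model, "Meta-Llama-3-8B-int8-ptq/openvino_model.xml")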