neuralmagic / sparseml

Libraries for applying sparsification recipes to neural networks with a few lines of code, enabling faster and smaller models
Apache License 2.0

How to export a GPTQ model to ONNX to run in DeepSparse #2293

Open Tangxinlu opened 1 month ago

Tangxinlu commented 1 month ago

Thanks for the great work!

Now that I have my own sparsified, GPTQ-quantized model, I'd like to run it in DeepSparse to see the inference speedup and other advantages. To export it to ONNX, I tried following https://github.com/neuralmagic/sparseml/tree/main/src/sparseml/transformers/sparsification/obcq#-how-to-export-the-one-shot-model, but it doesn't seem to work for GPTQ-quantized models. How do I export a GPTQ model (e.g., TheBloke/Llama-2-7B-Chat-GPTQ) to ONNX so that it can run in DeepSparse? Thanks.

dbogunowicz commented 1 month ago

Hey @Tangxinlu, `sparseml.export` is the appropriate pathway. Could you share your code and the stack trace so that I can reproduce the issue?

Tangxinlu commented 1 month ago

Hi @dbogunowicz, thanks for the quick reply!

Here is an example:

git clone https://github.com/neuralmagic/sparseml
pip install -e "sparseml[transformers]"
huggingface-cli download TechxGenus/Meta-Llama-3-8B-GPTQ --local-dir Meta-Llama-3-8B-GPTQ
# Add `"disable_exllama": true` to `"quantization_config"` in `Meta-Llama-3-8B-GPTQ/config.json` (see the sketch after these commands)

sparseml.export --task text-generation ./Meta-Llama-3-8B-GPTQ
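
For reference, the config edit in the comment above can be scripted. A minimal sketch, assuming the standard Hugging Face config.json layout with a top-level "quantization_config" object (the path comes from the commands above):

import json

# Patch the downloaded config so the GPTQ checkpoint loads without the
# exllama kernels, as noted in the comment above.
path = "Meta-Llama-3-8B-GPTQ/config.json"
with open(path) as f:
    config = json.load(f)

config["quantization_config"]["disable_exllama"] = True

with open(path, "w") as f:
    json.dump(config, f, indent=2)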

Error:

...
sparseml/src/sparseml/pytorch/torch_to_onnx_exporter.py", line 100, in pre_validate
    return deepcopy(module).to("cpu").eval()
...
TypeError: cannot pickle 'module' object
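
For context on where this error comes from: pre_validate in torch_to_onnx_exporter.py deepcopies the model, and copy.deepcopy falls back to pickle semantics for objects it has no special handling for. Pickle refuses module objects, so if any attribute reachable from the GPTQ model holds a reference to a Python module (a plausible reading of the trace; the exact attribute is not shown), the copy fails with exactly this message. A minimal sketch reproducing the failure mode, with FakeQuantizedLayer as a hypothetical stand-in:

import copy
import types

class FakeQuantizedLayer:
    # Hypothetical stand-in for a layer that keeps a handle to a backend
    # kernel module, as some quantized-inference backends do.
    def __init__(self):
        self.kernels = types  # any module object triggers the failure

layer = FakeQuantizedLayer()
copy.deepcopy(layer)  # TypeError: cannot pickle 'module' object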

envs: