neuralmagic / sparseml

Libraries for applying sparsification recipes to neural networks with a few lines of code, enabling faster and smaller models
Apache License 2.0

How to export a GPTQ model to ONNX to run in DeepSparse #2293

Open Tangxinlu opened 1 month ago

Tangxinlu commented 1 month ago

Thanks for the great work!

Now that I have my own sparsified, GPTQ-quantized model, I'd like to run it in DeepSparse to see the inference speedup and other advantages. To export it to ONNX, I tried following https://github.com/neuralmagic/sparseml/tree/main/src/sparseml/transformers/sparsification/obcq#-how-to-export-the-one-shot-model, but it doesn't seem to work for GPTQ-quantized models. How do I export a GPTQ model (e.g., TheBloke/Llama-2-7B-Chat-GPTQ) to ONNX so that it can run in DeepSparse? Thanks.

dbogunowicz commented 1 month ago

Hey @Tangxinlu, `sparseml.export` is the appropriate pathway. Could you share your code and the stack trace so that I can reproduce the issue?

Tangxinlu commented 1 month ago

Hi @dbogunowicz, thanks for the quick reply!

Here is an example:

git clone https://github.com/neuralmagic/sparseml
pip install -e "sparseml[transformers]"
huggingface-cli download TechxGenus/Meta-Llama-3-8B-GPTQ --local-dir Meta-Llama-3-8B-GPTQ
# Add `"disable_exllama": true` to `"quantization_config"` in `Meta-Llama-3-8B-GPTQ/config.json` (see the sketch after these commands)

sparseml.export --task text-generation ./Meta-Llama-3-8B-GPTQ
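
For reference, the config edit in the comment above can be scripted. A minimal sketch, assuming the standard Hugging Face config.json layout with a top-level "quantization_config" object (the path comes from the commands above):

import json

# Patch the downloaded config so the GPTQ checkpoint loads without the
# exllama kernels, as noted in the comment above.
path = "Meta-Llama-3-8B-GPTQ/config.json"
with open(path) as f:
    config = json.load(f)

config["quantization_config"]["disable_exllama"] = True

with open(path, "w") as f:
    json.dump(config, f, indent=2)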

Error:

...
sparseml/src/sparseml/pytorch/torch_to_onnx_exporter.py", line 100, in pre_validate
    return deepcopy(module).to("cpu").eval()
...
TypeError: cannot pickle 'module' object
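
For context on where this error comes from: pre_validate in torch_to_onnx_exporter.py deepcopies the model, and copy.deepcopy falls back to pickle semantics for objects it has no special handling for. Pickle refuses module objects, so if any attribute reachable from the GPTQ model holds a reference to a Python module (a plausible reading of the trace; the exact attribute is not shown), the copy fails with exactly this message. A minimal sketch reproducing the failure mode, with FakeQuantizedLayer as a hypothetical stand-in:

import copy
import types

class FakeQuantizedLayer:
    # Hypothetical stand-in for a layer that keeps a handle to a backend
    # kernel module, as some quantized-inference backends do.
    def __init__(self):
        self.kernels = types  # any module object triggers the failure

layer = FakeQuantizedLayer()
copy.deepcopy(layer)  # TypeError: cannot pickle 'module' object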

envs: