vllm-project / llm-compressor

Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM

Perplexity (ppl) Calculation of Local Sparse Model: NaN issue #853

Open HengJayWang opened 3 days ago

HengJayWang commented 3 days ago

👋 Hello Neural Magic community developers,

I encountered an issue while calculating perplexity for a locally converted Llama3-8B sparse model using the llm-compressor library. I followed the sparse conversion example script and changed the model to meta-llama/Meta-Llama-3-8B-Instruct myself; the sparse conversion takes about 1.2 hours to finish. Here's a detailed breakdown:

**Describe the bug**
While trying to compute the WikiText2 perplexity for a Llama3-8B model that has been sparsified (loading the local model from disk), the resulting perplexity values always turn out to be NaN. I suspect that some configuration is not properly set when using the custom SparseAutoModelForCausalLM class in combination with the compressed-tensors library.

**Expected behavior**
I expected the perplexity values to be reasonable and comparable to the official Hugging Face models. For example, when testing with the standard Llama-3.2-3B model from Hugging Face (without sparsification), I got a perplexity of around ~8.8 with the following parameters:

- max_length=16K
- stride=1, 2, 4, 8, 16K

I expected similar results for the sparse model, not NaN values.
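
For reference, my ppl loop follows the sliding-window recipe from the Hugging Face guide linked in the reproduction steps below. Here is a trimmed sketch of it (the model ID, max_length, and stride are placeholders for the values above, not my exact notebook code):

```python
# Sliding-window perplexity on WikiText-2, following the Hugging Face guide.
# Placeholders: model_id, max_length, stride.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-3B"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
encodings = tokenizer("\n\n".join(test["text"]), return_tensors="pt")

max_length = 16384   # 16K context window
stride = 16384       # one of the stride values listed above
seq_len = encodings.input_ids.size(1)

nlls = []
prev_end = 0
for begin in range(0, seq_len, stride):
    end = min(begin + max_length, seq_len)
    trg_len = end - prev_end                       # only score tokens not seen in the previous window
    input_ids = encodings.input_ids[:, begin:end].to(model.device)
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100                # mask the overlapping context from the loss

    with torch.no_grad():
        nlls.append(model(input_ids, labels=target_ids).loss)

    prev_end = end
    if end == seq_len:
        break

ppl = torch.exp(torch.stack(nlls).mean())
print(f"WikiText-2 ppl: {ppl.item():.2f}")
```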

**Environment**
RunPod online environment with 2× A100-80GB-SXM GPUs.

**To Reproduce**
Steps to reproduce the behavior:

1.  Convert the Llama3-8B model to a sparse version using llm-compressor.
2.  Load the sparse model with **_SparseAutoModelForCausalLM_** (same process as [here](https://github.com/vllm-project/llm-compressor/tree/main/examples/quantization_24_sparse_w4a16); see the sketch after this list) and set up the environment to calculate perplexity.
3.  Run the perplexity calculation on the WikiText2 dataset following Hugging Face's [official perplexity guide](https://huggingface.co/docs/transformers/perplexity), but using the custom sparse model.
4.  Observe the NaN perplexity values in the output.
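
A minimal sketch of the loading in step 2, assuming the conversion script saved the sparse model to a local folder (the path and dtype below are placeholders; the import path is the one used in the linked example, as far as I can tell):

```python
# Load a locally saved sparse checkpoint; path and dtype are placeholders.
import torch
from transformers import AutoTokenizer
from llmcompressor.transformers import SparseAutoModelForCausalLM

model_path = "./output_llama3_8b_2_4_sparse"  # hypothetical local output folder
model = SparseAutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,  # assumption: dtype the checkpoint was saved in
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_path)
```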

**Errors**
Here's the output I receive when running the perplexity calculation (see the attached images). The perplexity of the local Llama-8B model (loaded with the SparseAutoModelForCausalLM class) is always NaN, while the test with the Llama-3B model (loaded with the AutoModelForCausalLM class) successfully produces a ppl value.

Sparse Llama 8B (loaded with the SparseAutoModelForCausalLM class): ppl is NaN

(screenshot: LoadSparseLlama8BModel)

(screenshot: PerplexityNaNOfSparseLlama8B)

Online Llama 3B (loaded with the AutoModelForCausalLM class): ppl is computed successfully

(screenshot: LoadLlama3BModel)

(screenshot: PerplexityOfLlama3B)

**Additional context**
The same perplexity calculation process works perfectly when using the Hugging Face Llama-3.2-3B model without sparsification, which gives a perplexity value of ~8.8. I believe the issue lies either in the custom sparse model class or in the integration with compressed-tensors. Maybe I am missing some additional configuration/setting for the sparse model? 🧐 Any guidance on this would be appreciated! 🥰

**Additional Question**
How do I correctly load the final quantized model (i.e. the model saved in the stage_quantization folder)? I am also interested in the ppl of the final quantized model, but when I try to load it with SparseAutoModelForCausalLM it does not work 😢 and shows a message along the lines of "... class not supported ...". So how do I load the final quantized model correctly? Is there any documentation I can refer to? 🙏🏼
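
For context, this is roughly what I tried (the folder path below is only a placeholder for the local stage_quantization output); loading the compressed checkpoint with vLLM instead is just my guess, not something I found in the docs:

```python
# Placeholder path to the local stage_quantization output folder
quant_path = "./output_model/stage_quantization"

# What I tried: this is where the "class not supported"-style message appears
from llmcompressor.transformers import SparseAutoModelForCausalLM
try:
    model = SparseAutoModelForCausalLM.from_pretrained(quant_path, device_map="auto")
except Exception as err:
    print("Loading with SparseAutoModelForCausalLM failed:", err)

# My guess: the w4a16 compressed checkpoint may be meant to be consumed by vLLM directly
from vllm import LLM, SamplingParams
llm = LLM(model=quant_path)
out = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```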

robertgshaw2-neuralmagic commented 1 day ago

Can you share the model and perhaps some text output from the model? Does the text look reasonable?

HengJayWang commented 4 hours ago

Hi @robertgshaw2-neuralmagic Robert, you were right to question this. I retested the original llama-7B sparse conversion example from llm-compressor today, along with a simple model.generate test to check the model's text output. It turns out the model doesn't seem to generate any correct outputs, and as expected, I couldn't calculate the model's perplexity under these circumstances.
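
For reference, the generate check was roughly along these lines (a sketch, assuming `model` and `tokenizer` are the sparse model and tokenizer loaded in the screenshots below; the prompt here is arbitrary, not my exact one):

```python
# Quick generation sanity check on the already-loaded sparse model and tokenizer
inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```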

Load local Sparse Llama-7B model

(screenshot: Sparse-Llama2-7B-load)

Test Model Output (Ref)

(screenshot: Sparse-Llama2-7B-text-output)

Calculating Perplexity

(screenshot: Sparse-Llama2-7B-ppl-calculating)

NaN Result

(screenshot: Sparse-Llama2-7B-ppl-result)

I think the issue is now clearer. I believe the problem lies in how I load the local sparse model & tokenizer. Does llm-compressor have any examples or documentation I can refer to? Any suggestions would be appreciated, thank you! 🥰

Also, I apologize for not providing the exact sparse model I used. After running it in the online RunPod environment, I didn't download the model. However, this process should be easy to replicate. Here are the steps I followed for testing:

Step 1: Execute the official llama-7B sparse conversion example from llm-compressor: run `python llama7b_sparse_w4a16.py`
Step 2: After about an hour, the sparse conversion finishes, and you'll find the model saved in three stages in the output folder output_llama7b_2:4_w4a16_channel, which I renamed to output_llama7b_2_4_w4a16_channel for easier use.
Step 3: Load the stage_finetuning sparse model and tokenizer from output_llama7b_2_4_w4a16_channel/stage_finetuning and follow the Hugging Face process to calculate perplexity (a rough loading sketch follows below).
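
Roughly what my notebook does for step 3, plus a quick logits check that would confirm whether the weights themselves already produce NaN (this is a sketch, not the exact notebook code):

```python
# Load the stage_finetuning checkpoint and check whether its logits contain NaN
import torch
from transformers import AutoTokenizer
from llmcompressor.transformers import SparseAutoModelForCausalLM

stage_path = "output_llama7b_2_4_w4a16_channel/stage_finetuning"
model = SparseAutoModelForCausalLM.from_pretrained(
    stage_path, torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(stage_path)

ids = tokenizer("Hello, world!", return_tensors="pt").to(model.device)
with torch.no_grad():
    logits = model(**ids).logits
# If the logits are already NaN, the perplexity loop can only return NaN
print("NaNs in logits:", torch.isnan(logits).any().item())
```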

The Success Case with Llama3-3B online model

(screenshot: Llama3-3B-load)

Test Model Output

(screenshot: Llama3-3B-text-output)

Calculating Perplexity

(screenshot: Llama3-3B-ppl-calculating)

Result

(screenshot: Llama3-3B-ppl-result)

Summary

I want to correctly load the local sparse model and calculate its perplexity as an evaluation metric. However, it seems that I haven't used the correct method to load the model (through the SparseAutoModelForCausalLM class) or the tokenizer. If there are any documents or resources I can refer to, please let me know. Thanks! 🥰

My testing Jupyter notebook is attached: Perplexity of model.zip