HengJayWang opened 3 days ago
Can you share the model and perhaps some text output from the model? Does the text look reasonable?
Hi @robertgshaw2-neuralmagic Robert, you were right to question this. I retested the original llama-7B sparse conversion example from llm-compressor today, along with a simple model.generate
test to check the model's text output. It turns out the model doesn't generate any correct outputs, and, as expected, I couldn't calculate the model's perplexity under these circumstances.
I think the issue is now clearer. I believe the problem lies in how I load the local sparse model and tokenizer. Does llm-compressor
have any examples or documentation I can refer to? Any suggestions would be appreciated, thank you! 🥰
Also, I apologize for not providing the exact sparse model I used. After running it in the online RunPod environment, I didn't download the model. However, this process should be easy to replicate. Here are the steps I followed for testing:
Step 1: Execute the official llama-7B sparse conversion example from llm-compressor: run python llama7b_sparse_w4a16.py
Step 2: After about an hour, the sparse conversion finishes, and you'll find the model saved in three stages in the output folder output_llama7b_2:4_w4a16_channel, which I renamed to output_llama7b_2_4_w4a16_channel for easier use.
Step 3: Load the stage_finetuning sparse model and tokenizer from output_llama7b_2_4_w4a16_channel/stage_finetuning, then follow the Hugging Face process to calculate perplexity (a sketch of how I load the checkpoint is shown below).
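For concreteness, this is roughly how I load the stage_finetuning checkpoint and tokenizer. It is only a sketch: the device_map and torch_dtype arguments are my own assumptions, not values taken from the official example.

```python
from llmcompressor.transformers import SparseAutoModelForCausalLM
from transformers import AutoTokenizer

model_path = "output_llama7b_2_4_w4a16_channel/stage_finetuning"
model = SparseAutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",   # assumption: shard across the available GPUs
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Quick generation sanity check before computing perplexity
inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```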
I want to correctly load the local sparse model and calculate its perplexity as an evaluation metric. However, it seems that I haven't used the correct method to load the model (through the SparseAutoModelForCausalLM
class) or the tokenizer. If there are any documents or resources I can refer to, please let me know. Thanks! 🥰
My testing Jupyter notebook is in the attachment: Perplexity of model.zip
👋 Hello Neural Magic community developers,
I encountered an issue while calculating the perplexity of a locally converted Llama3-8B sparse model using the llm-compressor library. I followed the sparse conversion example script and changed the model to meta-llama/Meta-Llama-3-8B-Instruct myself; the sparse conversion takes about 1.2 hours to finish. Here's a detailed breakdown:
Describe the bug
While trying to compute the WikiText2 perplexity for a Llama3-8B model that has been sparsified (loading the local model from disk), the resulting perplexity values always turn out to be NaN. I suspect that some configuration might not be properly set when using the custom SparseAutoModelForCausalLM class in combination with the compressed-tensors library.
Expected behavior
I expected the perplexity values to be reasonable and comparable to the official Hugging Face models. For example, when testing with the standard Llama-3.2-3B model from Hugging Face (without sparsification), I got a perplexity of around ~8.8 with the following parameters:
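A minimal sketch of the sliding-window perplexity loop I use, following the Hugging Face perplexity guide; the max_length and stride values shown here are illustrative assumptions, not necessarily the exact ones from my run.

```python
import torch
from datasets import load_dataset
from tqdm import tqdm

# Assumes `model` and `tokenizer` are already loaded as in the steps above.
# max_length and stride are illustrative values.
max_length = 2048
stride = 512

test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
encodings = tokenizer("\n\n".join(test["text"]), return_tensors="pt")

seq_len = encodings.input_ids.size(1)
nlls = []
prev_end_loc = 0
for begin_loc in tqdm(range(0, seq_len, stride)):
    end_loc = min(begin_loc + max_length, seq_len)
    trg_len = end_loc - prev_end_loc  # number of tokens scored in this window
    input_ids = encodings.input_ids[:, begin_loc:end_loc].to(model.device)
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100  # mask pure-context tokens out of the loss

    with torch.no_grad():
        outputs = model(input_ids, labels=target_ids)
        nlls.append(outputs.loss * trg_len)  # loss is a per-token mean

    prev_end_loc = end_loc
    if end_loc == seq_len:
        break

ppl = torch.exp(torch.stack(nlls).sum() / end_loc)
print(f"WikiText2 perplexity: {ppl.item():.2f}")
```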
I expected similar results for the sparse model, not NaN values.
Environment
I use a RunPod online environment with 2x A100-80GB-SXM GPUs.
To Reproduce
Steps to reproduce the behavior:
Errors
Here's the output I receive when running the perplexity calculation; see the attached image. The perplexity of the local Llama-8B model (loaded with the SparseAutoModelForCausalLM class) is always NaN. Testing with the Llama-3B model (loaded with the AutoModelForCausalLM class) successfully produces a perplexity value.
Sparse Llama 8B (loaded with SparseAutoModelForCausalLM): ppl is NaN
Online Llama 3B (loaded with AutoModelForCausalLM): ppl is computed successfully
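To narrow down where the NaN enters, a quick check I can run on the raw logits (a sketch; assumes model and tokenizer are loaded as described above):

```python
import torch

# If the raw logits already contain NaN/Inf, the problem is in the sparse
# model's forward pass (e.g., weight loading or dtype), not in the
# perplexity loop itself.
inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
with torch.no_grad():
    logits = model(**inputs).logits

print("NaN in logits:", torch.isnan(logits).any().item())
print("Inf in logits:", torch.isinf(logits).any().item())
```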
Additional context
The same perplexity calculation process works perfectly when using the Hugging Face Llama-3.2-3B model without sparsification, giving a perplexity value of ~8.8. I believe the issue lies in either the custom sparse model class or its integration with compressed-tensors. Maybe I'm missing some additional configuration/setting for the sparse model? 🧐 Any guidance on this would be appreciated! 🥰
Additional Question
How do I correctly load the final quantized model (i.e., the model saved in the stage_quantization folder)? I'm also interested in the perplexity of the final quantized model, but when I try to load it with SparseAutoModelForCausalLM it does not work 😢; it shows a message along the lines of "... class not supported ...". So how do I load the final quantized model correctly? Is there any documentation I can refer to? 🙏🏼
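For reference, this is roughly the call that fails for me on the quantized stage. It is only a sketch: the path is illustrative, following the renamed output-folder naming from the 7B example above.

```python
from llmcompressor.transformers import SparseAutoModelForCausalLM

# Loading the final quantized stage the same way as the fine-tuned stage;
# this is the call that raises the "class not supported"-style error for me.
quant_path = "output_llama7b_2_4_w4a16_channel/stage_quantization"
model = SparseAutoModelForCausalLM.from_pretrained(
    quant_path,
    device_map="auto",
    torch_dtype="auto",
)
```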