horseee / LLM-Pruner

[NeurIPS 2023] LLM-Pruner: On the Structural Pruning of Large Language Models. Support Llama-3/3.1, Llama-2, LLaMA, BLOOM, Vicuna, Baichuan, TinyLlama, etc.
https://arxiv.org/abs/2305.11627
Apache License 2.0

Reproducing paper results #34

Open grigorn opened 10 months ago

grigorn commented 10 months ago

I ran LLM-Pruner with the command specified in the README to prune LLaMA-7B:

python hf_prune.py --pruning_ratio 0.25 \
      --block_wise \
      --block_mlp_layer_start 4 --block_mlp_layer_end 30 \
      --block_attention_layer_start 4 --block_attention_layer_end 30 \
      --pruner_type taylor \
      --test_after_train \
      --device cpu  --eval_device cuda \
      --save_ckpt_log_name llama_prune

I got the following results:

#Param before: 6738415616, #Param after: 5422977024, Ratio = 80.4785%
PPL after pruning: {'wikitext2': 19.96819234893607, 'ptb': 80.37625124290746}
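
(For reference, the ratio is just #Param after divided by #Param before. Below is a rough sketch of how it could be recomputed; the prune_log path and the 'model' key are my assumptions about what hf_prune.py saves, not something I verified.)

import torch
from transformers import AutoModelForCausalLM

def num_params(model):
    # Total number of parameters in the model.
    return sum(p.numel() for p in model.parameters())

# Original (unpruned) checkpoint used in this thread.
base = AutoModelForCausalLM.from_pretrained("yahma/llama-7b-hf", torch_dtype=torch.float16)

# Pruned model saved by hf_prune.py; path and dict key are assumptions.
pruned = torch.load("prune_log/llama_prune/pytorch_model.bin", map_location="cpu")["model"]

before, after = num_params(base), num_params(pruned)
print(f"#Param before: {before}, #Param after: {after}, Ratio = {after / before * 100.0:.4f}%")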

The perplexities reported in Table 1 of the paper are WikiText2 = 19.09 and PTB = 34.21. Is there any reason for the difference in these perplexities, especially on PTB? Thanks

horseee commented 10 months ago

Hi. May I check which LLaMA-7B checkpoint you used? decapoda-research/llama-7b-hf, the default in my code, is not available currently, and I'm not sure if that is what causes the difference.

grigorn commented 10 months ago

I am using 'yahma/llama-7b-hf'

horseee commented 10 months ago

Have you tried a copied version of decapoda-research/llama-7b-hf, e.g., https://huggingface.co/baffo32/decapoda-research-llama-7B-hf?

We will try that kind of checkpoint in the coming days to see whether the results are reproducible with the checkpoints that are still available.
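
(If it helps: assuming hf_prune.py still accepts the --base_model argument used in the README examples, the same command can simply be pointed at the mirror; only the checkpoint source changes.)

python hf_prune.py --pruning_ratio 0.25 \
      --block_wise \
      --block_mlp_layer_start 4 --block_mlp_layer_end 30 \
      --block_attention_layer_start 4 --block_attention_layer_end 30 \
      --pruner_type taylor \
      --test_after_train \
      --device cpu  --eval_device cuda \
      --base_model baffo32/decapoda-research-llama-7B-hf \
      --save_ckpt_log_name llama_prune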

grigorn commented 10 months ago

With the checkpoint you specified, I could replicate the metrics. Do you know what the difference between those two is? I thought there is only one LLaMA, so the checkpoints should be the same.

horseee commented 10 months ago

I have no idea about this 😢. I guess the possible reasons may be: (1) an EOS token issue, or (2) the weights of the two checkpoints are slightly different.

grigorn commented 10 months ago

I checked both the model and the tokenizer. The model weights and tokenizer.get_vocab() are the same, but the special tokens differ: for baffo32 all three special tokens are empty strings. Can this be the reason for these differences? If yes, do you know which one is the "true" LLaMA? (screenshot of the two tokenizers' special tokens, 2023-11-23)
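
Roughly, the tokenizer side of that check looks like the sketch below (not the exact script I ran):

from transformers import AutoTokenizer

a = AutoTokenizer.from_pretrained("yahma/llama-7b-hf")
b = AutoTokenizer.from_pretrained("baffo32/decapoda-research-llama-7B-hf")

# Vocabularies are identical...
print(a.get_vocab() == b.get_vocab())          # True

# ...but the special tokens are not.
print(a.bos_token, a.eos_token, a.unk_token)   # non-empty for yahma
print(b.bos_token, b.eos_token, b.unk_token)   # all three are '' for baffo32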