microsoft / TransformerCompression

For releasing code related to compression methods for transformers, accompanying our publications
MIT License

can't reproduce result #127

Closed MrGGLS closed 4 months ago

MrGGLS commented 5 months ago

Hello, I used the following configuration for the slice operation:

python run_slicegpt_perplexity.py \
    --model microsoft/phi-2 \
    --model-path .../phi-2 \
    --final-orientation pca \
    --cal-nsamples 1024 \
    --cal-max-seqlen 2048 \
    --save-dir ./test \
    --cal-dataset alpaca \
    --sparsity 0.3 \
    --cal-batch-size 4 \
    --no-wandb \
    --device cuda

However, the average task accuracy (avg per.) I obtained was only 55, while the paper reports 63. I am using lm-eval version 0.4.0, which should match the version the project depends on (I have tried both PCA and random orientations).

nailimixaM commented 5 months ago

That's odd - can you report the individual task accuracies? Are you able to reproduce this test result when slicing on wikitext2 at 20%?

MrGGLS commented 5 months ago

I suspect this might be an issue with the lm_eval version: results from the 0.4 series are generally lower. I switched to version 0.3.0 (the version used by the Hugging Face Open LLM Leaderboard) and obtained the following results:

method                          params (B)  ppl     piqa    wg      hs      arc-e   arc-c   avg per.
llama2-7b                       6.7         -       0.772   0.6709  0.7291  0.5345  0.4078  0.6228
phi-2-2.7b                      2.7         -       0.7943  0.7609  0.736   0.7854  0.5418  0.7236
slicegpt-llama2-0.3 (alpaca)    5.29        3.2465  0.7116  0.5943  0.5377  0.5316  0.3601  0.547
slicegpt-phi-0.3 (alpaca)       2.09        3.3796  0.7443  0.6212  0.5342  0.3865  0.6713  0.5914

(The orientation is random.) Although the phi-2 results are close to those in the paper, the llama2-7b results are much worse, and the results after slicing also seem less than ideal.
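
For reference, the "avg per." column is just the unweighted mean of the five task accuracies (reading wg as winogrande and hs as hellaswag); a quick check against the last row:

# "avg per." as the unweighted mean of the five task accuracies,
# using the slicegpt-phi-0.3 (alpaca) row from the table above.
scores = {
    "piqa": 0.7443,
    "winogrande": 0.6212,
    "hellaswag": 0.5342,
    "arc_easy": 0.3865,
    "arc_challenge": 0.6713,
}
avg = sum(scores.values()) / len(scores)
print(f"avg per. = {avg:.4f}")  # 0.5915, matching the table value up to rounding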

MrGGLS commented 5 months ago

By the way, test_experiments.py passes on my setup.

MrGGLS commented 5 months ago

Hi @nailimixaM, I tried varying a number of settings and finally found that the issue is with the batch size and the alpaca dataset.

The provided lm_eval only gives correct evaluation results when bs=1 (I suspect it's a bug in lm_eval).
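
A minimal sketch of how one could check this batch-size sensitivity directly with lm_eval 0.4.x is below; the task list and model args are my assumptions, and it evaluates the dense phi-2 rather than a sliced checkpoint, so it only illustrates the comparison rather than the repo's own eval path:

# Sketch: compare lm_eval 0.4.x zero-shot accuracies at batch size 1 vs 4.
# Assumes lm_eval 0.4.0/0.4.1 (initialize_tasks was replaced by TaskManager in later releases).
import lm_eval
import lm_eval.tasks

lm_eval.tasks.initialize_tasks()  # populate the task registry (needed in 0.4.0/0.4.1)

TASKS = ["piqa", "winogrande", "hellaswag", "arc_easy", "arc_challenge"]

for bs in (1, 4):
    results = lm_eval.simple_evaluate(
        model="hf",
        model_args="pretrained=microsoft/phi-2,dtype=float16",
        tasks=TASKS,
        batch_size=bs,
    )
    print(f"batch_size={bs}")
    for task, metrics in results["results"].items():
        print(f"  {task}: {metrics}")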

Additionally, there might be an issue with the alpaca dataset, as I can only reproduce the correct results on wikitext2.

I hope you can investigate the problem. 😃

MrGGLS commented 5 months ago

I think I know the reason why the results cannot be reproduced. The average performance of the phi-2-sparsity-30 (alpaca) without RFT, listed in Appendix A.5, is not correct...

nailimixaM commented 5 months ago

@MrGGLS

"The provided lm_eval only gives correct evaluation results when bs=1 (I suspect it's a bug in lm_eval)."

For the paper I ran these experiments with batch size > 1; some noise is expected across different batch sizes, but the results should be largely the same. Can you confirm you're using the lm_eval version pinned in our .toml file?

MrGGLS commented 5 months ago

@nailimixaM, yes, I used the lm_eval version you provide 😂. In any case, I can get the correct results now.

yaya-sy commented 3 months ago

@nailimixaM so if I want to reduce the model's parameters by about 20%, I need to use a slicing ratio of 30%? That is what I understand from the numbers posted by @MrGGLS, where 30% slicing corresponds to roughly a 20% parameter reduction. Correct me if I misunderstood something.
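
For reference, the quick arithmetic behind my reading, taking the params column (in billions) from the table above:

# Slicing ratio vs. actual parameter reduction, using the params column (in billions)
# from the table earlier in this thread.
dense = {"llama2-7b": 6.7, "phi-2-2.7b": 2.7}
sliced_at_30 = {"llama2-7b": 5.29, "phi-2-2.7b": 2.09}

for name, full in dense.items():
    reduction = 1 - sliced_at_30[name] / full
    print(f"{name}: 30% slicing -> {reduction:.1%} fewer parameters")
# llama2-7b:  30% slicing -> 21.0% fewer parameters
# phi-2-2.7b: 30% slicing -> 22.6% fewer parameters

So 30% slicing lands at roughly a 21-23% parameter reduction for these two models.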

Related to this issue: #165