Results don't match

jeohalves commented 3 months ago

Greetings,

I ran some experiments with PrunerZero and Wanda and found that some results don't match the paper. Here are the results I obtained:

Method	BoolQ	RTE	HellaSwag	WinoGrande	ARC-e	ARC-c	OBQA	Mean
Wanda	76.61%	53.07%	52.70%	67.96%	72.31%	38.91%	30.80%	56.05%
Wanda (norm)	76.61%	53.07%	70.93%	67.96%	69.19%	43.00%	43.20%	60.57%
PrunerZero	70.37%	53.07%	51.15%	66.22%	71.72%	36.69%	28.00%	53.89%
PrunerZero (norm)	70.37%	53.07%	68.90%	66.22%	67.89%	39.08%	40.80%	58.05%

And the results from the paper:

For tasks that report a normalized accuracy (red and purple in the screenshot), I put the normalized results on a separate row. For tasks that don't have one (yellow in the screenshot), I simply repeated the plain accuracy.

Maybe there was a mistake, and the normalized accuracy was reported only for PrunerZero. Could you check it?

Best regards!

jeohalves commented 1 month ago

Is there any update regarding this issue?

pprp commented 1 month ago

Hi, sorry for the late reply. We report the higher of the normalized and non-normalized results.

Basically, your results match ours. Differences in CUDA version, GPU, and other hardware can cause some deviation, which seems acceptable.

Here is my recipe:

CUDA 12.0
Python 3.9
A6000

Best regards

jeohalves commented 1 month ago

I'm sorry, but this is clearly wrong. If you use the higher one for Pruner-Zero, you should also apply the same rule to the other methods (like Wanda). As we can see, Wanda has a mean of 60.57% using the normalized accuracy. Other works didn't use the normalized accuracy. PrunerZero should be better than using magnitude alone, but it's worse than SparseGPT and Wanda.
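
To make the comparison concrete, here is a minimal sketch that applies the same reporting rule to both methods, using only the numbers from the table above. The labels `acc` and `acc_norm` and the helper names are assumptions chosen for illustration, not names taken from the Pruner-Zero code. Where a task has no normalized accuracy, the plain accuracy is repeated, as in the table.

```python
# Accuracies (%) copied from the table earlier in this thread.
# "acc" / "acc_norm" are illustrative labels, not names from the repo.
TASKS = ["BoolQ", "RTE", "HellaSwag", "WinoGrande", "ARC-e", "ARC-c", "OBQA"]

RESULTS = {
    "Wanda": {
        "acc":      [76.61, 53.07, 52.70, 67.96, 72.31, 38.91, 30.80],
        "acc_norm": [76.61, 53.07, 70.93, 67.96, 69.19, 43.00, 43.20],
    },
    "PrunerZero": {
        "acc":      [70.37, 53.07, 51.15, 66.22, 71.72, 36.69, 28.00],
        "acc_norm": [70.37, 53.07, 68.90, 66.22, 67.89, 39.08, 40.80],
    },
}


def mean(values):
    """Unweighted arithmetic mean over the tasks."""
    return sum(values) / len(values)


def best_of_both(acc, acc_norm):
    """Per task, keep the higher of the plain and normalized accuracy."""
    return [max(a, n) for a, n in zip(acc, acc_norm)]


def summarize(results):
    """Mean accuracy per method under three reporting conventions."""
    summary = {}
    for method, scores in results.items():
        best = best_of_both(scores["acc"], scores["acc_norm"])
        summary[method] = {
            "acc": round(mean(scores["acc"]), 2),
            "acc_norm": round(mean(scores["acc_norm"]), 2),
            "best": round(mean(best), 2),
        }
    return summary


summary = summarize(RESULTS)
for method, means in summary.items():
    print(method, means)
```

With these numbers, Wanda has the higher mean under all three conventions (plain, normalized, and per-task maximum), so the ranking only changes if the conventions are mixed between methods.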

pprp commented 1 month ago

Thank you for pointing that out. I will recheck it in the next few days. Maybe using downstream task performance as the fitness would be a better approach.

pprp / Pruner-Zero

Results don't match #3