tianyic / only_train_once


Is it normal to have different test results on same model and data? #63

Open iamanigeeit opened 3 months ago

iamanigeeit commented 3 months ago

Hello @tianyic,

I was running the sanity_check tests on test_convnexttiny.py and got different results despite using a fixed dummy_input.

dummy_input = 0.5 * torch.ones(size=(1, 3, 224, 224), dtype=torch.float32)
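
For reference, this is roughly the flow I am running, as I understand test_convnexttiny.py (a minimal sketch; the `OTO`, `random_set_zero_groups`, and `construct_subnet` names are my reading of the sanity-check helpers, not necessarily the exact test code):

```python
import torch
from torchvision.models import convnext_tiny
# OTO is the graph/pruning wrapper from this repo; the calls below reflect my
# understanding of the sanity-check tests and may not match them exactly.
from only_train_once import OTO

model = convnext_tiny(weights=None).eval()  # vanilla (non-pretrained) case
dummy_input = 0.5 * torch.ones(size=(1, 3, 224, 224), dtype=torch.float32)

oto = OTO(model=model, dummy_input=dummy_input)
oto.random_set_zero_groups()             # randomly zero out prunable node groups
oto.construct_subnet(out_dir='./cache')  # build the compressed sub-network

# The test then runs the full (group-sparse) model and the compressed model on
# the same dummy_input and reports the maximum output difference plus the
# FLOP / parameter reductions shown below.
```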

Test Run 1

Maximum output difference :  1.452776312828064
Size of full model        :  0.10655930824577808 GBs
Size of compress model    :  0.01566738821566105 GBs
FLOP  reduction (%)       :  0.5265925908672965
Param reduction (%)       :  0.8535776291678294

Test Run 2

OTO graph constructor
graph build
Maximum output difference :  1.4196803569793701
Size of full model        :  0.10655930824577808 GBs
Size of compress model    :  0.02001185156404972 GBs
FLOP  reduction (%)       :  0.5471079567314201
Param reduction (%)       :  0.8127864514599561

Test Run 3

OTO graph constructor
graph build
Maximum output difference :  1.5195997953414917
Size of full model        :  0.10655930824577808 GBs
Size of compress model    :  0.04547972418367863 GBs
FLOP  reduction (%)       :  0.4412873512719674
Param reduction (%)       :  0.5736049577741684

The difference is quite big, so I want to ask whether it's normal.

tianyic commented 3 months ago

@iamanigeeit

In general, if all minimally removal structures are zero-invariant, then the output deviation should be nearly 0, as you might have seen in the sanity checks for other DNNs (see Section 3 of the OTOv3 manuscript).
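
To make zero-invariance concrete, here is a toy, repo-independent sketch: for a Conv-BN block, the minimally removal structure of one channel is the conv filter plus the corresponding BN affine parameters; once those are zeroed, physically removing the channel cannot change the network output.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv1 = nn.Conv2d(3, 8, kernel_size=3, padding=1, bias=False)
bn1 = nn.BatchNorm2d(8).eval()   # eval mode: uses running statistics
conv2 = nn.Conv2d(8, 4, kernel_size=3, padding=1)

def forward(x):
    return conv2(torch.relu(bn1(conv1(x))))

x = torch.randn(1, 3, 16, 16)

# Zero the minimally removal structure of channel 0:
# the conv1 filter and the corresponding BN weight/bias.
with torch.no_grad():
    conv1.weight[0].zero_()
    bn1.weight[0] = 0.0
    bn1.bias[0] = 0.0
out_zeroed = forward(x)

# Physically remove channel 0: slice every tensor that touches it.
conv1_p = nn.Conv2d(3, 7, kernel_size=3, padding=1, bias=False)
bn1_p = nn.BatchNorm2d(7).eval()
conv2_p = nn.Conv2d(7, 4, kernel_size=3, padding=1)
with torch.no_grad():
    conv1_p.weight.copy_(conv1.weight[1:])
    bn1_p.weight.copy_(bn1.weight[1:])
    bn1_p.bias.copy_(bn1.bias[1:])
    bn1_p.running_mean.copy_(bn1.running_mean[1:])
    bn1_p.running_var.copy_(bn1.running_var[1:])
    conv2_p.weight.copy_(conv2.weight[:, 1:])
    conv2_p.bias.copy_(conv2.bias)
out_pruned = conv2_p(torch.relu(bn1_p(conv1_p(x))))

# Zero-invariant: removing the zeroed structure leaves the output unchanged.
print((out_zeroed - out_pruned).abs().max())  # ~0 (floating-point noise at most)
```

When a structure is not zero-invariant (for example, when a downstream normalization or softmax still reacts to the zeroed channel), this difference no longer vanishes, which is what a large "Maximum output difference" in the sanity check indicates.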

For convnexttiny, we provide two sanity-check cases: one that keeps the pretrained weights, and one for the vanilla network. The difference between them is disabling one singleton nn.Parameter, i.e., gamma, and excluding a few node groups from pruning. As I remember, both versions should yield near-zero deviation. But when I double-checked, the pretrained case is fine, yet the vanilla one yields a deviation of a similar magnitude to yours. That is indeed a bit abnormal.

My gut feeling is that this is due to the vanilla network also considering pruning over grouped conv, as I mentioned in another issue. Something may have been broken in a previous commit to cause such a large difference. :(

In addition, for transformers, all of the MLP layers I have encountered are zero-invariant. Attention layers are case-by-case, e.g., BERT's is zero-invariant, LLaMA's is not, TNLG's and Phi-2's are, etc.

iamanigeeit commented 3 months ago

@tianyic

I should only care about the maximum output difference, right? It's around 1e-6 or less.

For the pretrained model, I still get variation in FLOP / param reduction across runs.

| Model | FLOP reduction | Param reduction |
| --- | --- | --- |
| ConvNextTiny | 15-30% | 30-55% |
| ResNet18 | 55-80% | 65-90% |
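
In case it helps with reproducing this, the check I would try on my side is pinning every RNG right before constructing the OTO object and running the sanity check, assuming the random group selection draws from the standard Python/NumPy/PyTorch generators (just a sketch, not the repo's code):

```python
import random
import numpy as np
import torch

def seed_everything(seed: int = 0) -> None:
    # Pin the generators the random group selection could plausibly draw from.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

# Call this immediately before the sanity check. If the FLOP / param reduction
# then stays constant across runs, the spread above is just the random choice
# of zeroed groups rather than nondeterminism in the model itself.
seed_everything(0)
```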

If it's normal, I can close the issue.