Open ziqi-zhang opened 1 year ago
Thanks for your attention. That's a good question.
I tried once-for-all in the CVPR 2022 NAS workshop and found that the progressive shrinking strategy used in once-for-all is actually a long pipeline, which includes elastic resolution, kernel size, depth, and width. Each stage has different hyperparameters. And in this repo we change the dataset from large-scale ImageNet to CIFAR-10, so the hyperparameters might not work as before. As reported in your experiments, the drastic drop in performance may be attributed to improper hyperparameter settings. There are some possible solutions:
- Reduce the learning rate, e.g. by 10 or by 100.

Besides, I prefer the sandwich rule proposed in BigNAS, which has fewer hyperparameters and can converge faster than the progressive shrinking strategy.
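As a concrete illustration of the learning-rate suggestion, a minimal PyTorch sketch (toy model and values, not this repo's actual training code):

```python
import torch

# Toy model and optimizer; names and values are illustrative only.
model = torch.nn.Linear(8, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Divide the current learning rate by 10 (or 100) in place.
for group in optimizer.param_groups:
    group["lr"] /= 10
# group["lr"] is now ~0.01
```

In this repo the learning rate would instead be set through the training script's config or command-line arguments.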
Let me know if there is any new progress.
Thanks very much for your quick and detailed answer! I guess I didn't correctly load the pre-trained model, and I will rerun the code to check the results. I will update this issue if I get any new results.
BTW I saw your commit message that "autoaugment 影响训练集非常大" ("autoaugment has a very large effect on the training set"). What does it mean? Does it mean the autoaugmentation techniques can improve the final accuracy? Besides, the original OFA repo doesn't seem to have these autoaugmentations?
Hi, I found that after initializing the net with the weights of the pre-trained teacher (except some mismatched weights), top1 accuracy increases to about 70%.
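Loading only the weights whose names and shapes match is a common pattern for this; a sketch with stand-in models (in practice you would build the OFA net and load the teacher checkpoint from disk):

```python
import torch
import torch.nn as nn

# Stand-ins for the supernet and the pre-trained teacher; the last
# layer's shape intentionally differs to mimic mismatched weights.
student = nn.Sequential(nn.Linear(4, 8), nn.Linear(8, 10))
teacher = nn.Sequential(nn.Linear(4, 8), nn.Linear(8, 5))

teacher_sd = teacher.state_dict()
student_sd = student.state_dict()

# Keep only entries that exist in the student with the same shape.
matched = {k: v for k, v in teacher_sd.items()
           if k in student_sd and v.shape == student_sd[k].shape}
student_sd.update(matched)
student.load_state_dict(student_sd)

print(sorted(matched))  # only the first layer's weight and bias match
```

Alternatively, `load_state_dict(..., strict=False)` skips missing/unexpected keys, but it still errors on same-name tensors with different shapes, so shape filtering as above is the safer route.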
@ziqi-zhang
As for the influence of autoaugmentation, I did test it and achieved 89% training accuracy and 81% validation accuracy. The capacity of the current OFA model is slightly larger than the size of the CIFAR dataset, which means that adopting more diverse data would boost the performance.
And it's great to hear that loading the pre-trained model can alleviate the ~20% performance drop 👍. You can try autoaugmentation or tune the hyperparameters in the next step.
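If torchvision is available, CIFAR-10 AutoAugment can be dropped into the training transforms like this (a sketch; the exact pipeline and normalization statistics in this repo may differ):

```python
import torchvision.transforms as T

# Hypothetical CIFAR-10 training pipeline with AutoAugment inserted
# before tensor conversion; AutoAugment operates on PIL images here.
train_transform = T.Compose([
    T.RandomCrop(32, padding=4),
    T.RandomHorizontalFlip(),
    T.AutoAugment(T.AutoAugmentPolicy.CIFAR10),
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])
```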
@pprp Thanks very much for your explanation! BTW I read about the sandwich rule in BigNAS, but I have a small question: is the sandwich rule progressive (like OFA) or one-stage?
As we know, OFA needs to train for four stages (resolution, kernel, depth, and width), but the sandwich rule seems not to have this requirement: it trains only once, and in each iteration it samples the largest, the smallest, and some random intermediate child models. If that is the case, the sandwich rule is much more convenient than OFA (one stage vs. four stages). But I guess the total training time of the sandwich rule should be comparable to the sum of the time of OFA's four stages?
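That per-iteration sampling can be sketched in plain Python (the kernel/depth choices below are made up, not this repo's actual search space):

```python
import random

# Hypothetical discrete search space per layer (illustrative only).
KERNEL_CHOICES = [3, 5, 7]
DEPTH_CHOICES = [2, 3, 4]

def sample_subnet(mode, rng=random):
    """Return one subnet config for a sandwich-rule role."""
    if mode == "max":   # largest child: max of every choice
        return {"kernel": max(KERNEL_CHOICES), "depth": max(DEPTH_CHOICES)}
    if mode == "min":   # smallest child: min of every choice
        return {"kernel": min(KERNEL_CHOICES), "depth": min(DEPTH_CHOICES)}
    # random intermediate child
    return {"kernel": rng.choice(KERNEL_CHOICES),
            "depth": rng.choice(DEPTH_CHOICES)}

def sandwich_batch(n_random=2):
    """Subnets trained in one iteration: largest + smallest + n random."""
    return ([sample_subnet("max"), sample_subnet("min")] +
            [sample_subnet("random") for _ in range(n_random)])

configs = sandwich_batch()
print(len(configs))  # 4 subnets trained per iteration
```

Gradients from all sampled subnets are accumulated before a single optimizer step, which is why one training run covers the whole space.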
@ziqi-zhang From my experiments, I think the sandwich rule should be quicker than OFA because of inplace distillation. Inplace distillation was quite useful during the CVPR NAS Workshop.
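Inplace distillation means the largest subnet's predictions (detached) serve as soft labels for the smaller subnets in the same step, so no separate teacher forward pass is needed. A minimal sketch with stand-in modules (hypothetical, not the repo's code):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(4, 8)                 # a toy batch
largest = torch.nn.Linear(8, 10)      # stands in for the largest subnet
smaller = torch.nn.Linear(8, 10)      # stands in for a sampled subnet

# The largest subnet trains on the real labels; its (detached)
# outputs become soft targets for the smaller subnets.
soft_targets = F.softmax(largest(x).detach(), dim=1)

# Smaller subnets minimize KL divergence to those soft targets.
log_probs = F.log_softmax(smaller(x), dim=1)
distill_loss = F.kl_div(log_probs, soft_targets, reduction="batchmean")
```

Because the teacher signal is produced inside the same iteration, the extra cost over plain training is small, which is consistent with the sandwich rule being quicker overall.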
Hi,
Thanks for sharing the awesome code with us. I tried to run the code but got low accuracy, so I was wondering whether you met a similar problem.
I successfully trained the teacher model and got a val top1 accuracy of 91%. Then I ran
python train_ofa_net.py --task kernel
to train the elastic kernel, but I only got a top1 accuracy of 52%, which is far from 91%. How can I improve the accuracy?

Best Regards