Open ziqi-zhang opened 1 year ago
Thanks for your attention. That's a good question.
I tried once-for-all in the CVPR 2022 NAS workshop and found that the progressive shrinking strategy used in once-for-all is actually a long pipeline, which includes elastic resolution, kernel size, depth, and width. Each stage has different hyperparameters. And in this repo we change the dataset from large-scale ImageNet to CIFAR-10, so the hyperparameters might not work as before. As reported in your experiments, the drastic drop in performance may be attributed to improper hyperparameter settings. There are some possible solutions:
- Reduce the learning rate, e.g. by 10 or by 100.

Besides, I prefer the sandwich rule proposed in BigNAS, which has fewer hyperparameters and can converge faster than the progressive shrinking strategy.
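As a concrete illustration of the learning-rate suggestion, a minimal PyTorch sketch (toy model and values, not this repo's actual training code):

```python
import torch

# Toy model and optimizer; names and values are illustrative only.
model = torch.nn.Linear(8, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Divide the current learning rate by 10 (or 100) in place.
for group in optimizer.param_groups:
    group["lr"] /= 10
# group["lr"] is now ~0.01
```

In this repo the learning rate would instead be set through the training script's config or command-line arguments.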
Let me know if there is any new progress.
Thanks very much for your quick and detailed answer! I guess I didn't correctly load the pre-trained model, and I will rerun the code to check the results. I will update this issue if I get any new results.
BTW I saw your commit message that "autoaugment 影响训练集非常大" ("autoaugment has a very large effect on the training set"). What does it mean? Does it mean the autoaugmentation techniques can improve the final accuracy? Besides, the original OFA repo doesn't seem to have these autoaugmentations?
Hi, I found that after initializing the net with the weights of the pre-trained teacher (except some mismatched weights), top1 accuracy increases to about 70%.
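Loading only the weights whose names and shapes match is a common pattern for this; a sketch with stand-in models (in practice you would build the OFA net and load the teacher checkpoint from disk):

```python
import torch
import torch.nn as nn

# Stand-ins for the supernet and the pre-trained teacher; the last
# layer's shape intentionally differs to mimic mismatched weights.
student = nn.Sequential(nn.Linear(4, 8), nn.Linear(8, 10))
teacher = nn.Sequential(nn.Linear(4, 8), nn.Linear(8, 5))

teacher_sd = teacher.state_dict()
student_sd = student.state_dict()

# Keep only entries that exist in the student with the same shape.
matched = {k: v for k, v in teacher_sd.items()
           if k in student_sd and v.shape == student_sd[k].shape}
student_sd.update(matched)
student.load_state_dict(student_sd)

print(sorted(matched))  # only the first layer's weight and bias match
```

Alternatively, `load_state_dict(..., strict=False)` skips missing/unexpected keys, but it still errors on same-name tensors with different shapes, so shape filtering as above is the safer route.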
@ziqi-zhang
As for the influence of autoaugmentation, I did test it and achieved 89% training accuracy and 81% validation accuracy. The capacity of the current OFA model is slightly larger than the size of the CIFAR dataset, which means that adopting more diverse data would boost the performance.
And it's great to hear that loading the pre-trained model can alleviate the ~20% performance drop 👍. You can try autoaugmentation or tune the hyperparameters in the next step.
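If torchvision is available, CIFAR-10 AutoAugment can be dropped into the training transforms like this (a sketch; the exact pipeline and normalization statistics in this repo may differ):

```python
import torchvision.transforms as T

# Hypothetical CIFAR-10 training pipeline with AutoAugment inserted
# before tensor conversion; AutoAugment operates on PIL images here.
train_transform = T.Compose([
    T.RandomCrop(32, padding=4),
    T.RandomHorizontalFlip(),
    T.AutoAugment(T.AutoAugmentPolicy.CIFAR10),
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])
```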
@pprp Thanks very much for your explanation! BTW I read about the sandwich rule in BigNAS, but I have a small question: is the sandwich rule progressive (like OFA) or one-stage?
As we know, OFA needs to train for four stages (resolution, kernel, depth, and width), but the sandwich rule seems not to have this requirement: it trains only once, and in each iteration it samples the largest, the smallest, and some random intermediate child models. If that is the case, the sandwich rule is much more convenient than OFA (one stage vs. four stages). But I guess the total training time of the sandwich rule should be comparable to the sum of the time of OFA's four stages?
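That per-iteration sampling can be sketched in plain Python (the kernel/depth choices below are made up, not this repo's actual search space):

```python
import random

# Hypothetical discrete search space per layer (illustrative only).
KERNEL_CHOICES = [3, 5, 7]
DEPTH_CHOICES = [2, 3, 4]

def sample_subnet(mode, rng=random):
    """Return one subnet config for a sandwich-rule role."""
    if mode == "max":   # largest child: max of every choice
        return {"kernel": max(KERNEL_CHOICES), "depth": max(DEPTH_CHOICES)}
    if mode == "min":   # smallest child: min of every choice
        return {"kernel": min(KERNEL_CHOICES), "depth": min(DEPTH_CHOICES)}
    # random intermediate child
    return {"kernel": rng.choice(KERNEL_CHOICES),
            "depth": rng.choice(DEPTH_CHOICES)}

def sandwich_batch(n_random=2):
    """Subnets trained in one iteration: largest + smallest + n random."""
    return ([sample_subnet("max"), sample_subnet("min")] +
            [sample_subnet("random") for _ in range(n_random)])

configs = sandwich_batch()
print(len(configs))  # 4 subnets trained per iteration
```

Gradients from all sampled subnets are accumulated before a single optimizer step, which is why one training run covers the whole space.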
@ziqi-zhang From my experiments, I think the sandwich rule should be quicker than OFA because of inplace distillation. Inplace distillation was quite useful during the CVPR NAS Workshop.
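Inplace distillation means the largest subnet's predictions (detached) serve as soft labels for the smaller subnets in the same step, so no separate teacher forward pass is needed. A minimal sketch with stand-in modules (hypothetical, not the repo's code):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(4, 8)                 # a toy batch
largest = torch.nn.Linear(8, 10)      # stands in for the largest subnet
smaller = torch.nn.Linear(8, 10)      # stands in for a sampled subnet

# The largest subnet trains on the real labels; its (detached)
# outputs become soft targets for the smaller subnets.
soft_targets = F.softmax(largest(x).detach(), dim=1)

# Smaller subnets minimize KL divergence to those soft targets.
log_probs = F.log_softmax(smaller(x), dim=1)
distill_loss = F.kl_div(log_probs, soft_targets, reduction="batchmean")
```

Because the teacher signal is produced inside the same iteration, the extra cost over plain training is small, which is consistent with the sandwich rule being quicker overall.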
Hi,
Thanks for sharing the awesome code with us. I tried to run the code but got low accuracy, so I was wondering whether you met a similar problem.
I successfully trained the teacher model and got a val top1 accuracy of 91%. Then I ran
python train_ofa_net.py --task kernel
to train the elastic kernel, but I only got a top1 accuracy of 52%, which is far from 91%. How can I improve the accuracy?

Best Regards