idstcv / ZenNAS


Hi MingLin #19

Closed · billbig closed this issue 2 years ago

billbig commented 2 years ago
Your proposed Zen-NAS is a very efficient way to search for neural network structures. I read your paper and GitHub code carefully and ran my own searches with your code. One thing I noticed is that in the searched structures, the deeper a block is in the network, the more times it tends to be repeated, while the first few blocks are repeated only once. For example, I used your code to search an MNas-style space (the search space was changed to follow MNas). The MNas0.35 reference structure is:

SuperConvK3BNRELU(3,16,2,1)SuperResMnasV1K3(16,8,1,16,1)SuperResMnasV3K3(8,8,2,8,3)SuperResMnasV3K5(8,16,2,8,3)SuperResMnasV6K5(16,32,2,16,3)SuperResMnasV6K3(32,32,1,32,2)SuperResMnasV6K5(32,64,2,32,4)SuperResMnasV6K3(64,112,1,64,1)SuperConvK1BNRELU(112,1280,1,1)

but the structure found with your search framework is:

SuperConvK3BNRELU(3,8,2,1)SuperResMnasV1K3(8,8,1,8,1)SuperResMnasV3K5(8,16,2,8,1)SuperResMnasV3K5(16,24,2,8,1)SuperResMnasV3K5(24,64,2,40,1)SuperResMnasV3K5(64,24,1,48,1)SuperResMnasV3K5(24,64,2,176,4)SuperResMnasV3K5(64,48,1,256,5)SuperConvK1BNRELU(48,2048,1,1)

The searched structure is not as good as the original one, and it shows the pattern mentioned above: the shallow blocks are repeated only once, and only the deeper blocks are repeated. I also tested your code at FLOPs budgets of 400M, 600M, and 900M and found the same issue. Why does this happen?

MingLin-home commented 2 years ago

Hi billbig, Thank you for your feedback! I will try my best to answer your questions below:

Q1: The ZenNAS output is too deep. A1: In theory, a deeper network has greater complexity, so it is important to limit the maximal depth of the network. This is data-dependent knowledge and can vary from task to task.
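A minimal sketch of such a depth cap (a hypothetical helper, not the released ZenNAS code), assuming the plain-text block format used in this thread, where the last number inside each `(...)` is the repeat count:

```python
import re

def count_layers(structure_str: str) -> int:
    """Count total block repetitions in a plain-text structure string.

    Assumes every block ends with '(...,repeat)' where the last integer is
    the repetition count, e.g. 'SuperResIDWE4K7(64,96,2,96,5)' contributes 5.
    """
    return sum(int(m.group(1)) for m in re.finditer(r',\s*(\d+)\)', structure_str))

def is_depth_feasible(structure_str: str, max_layers: int = 18) -> bool:
    """Data-dependent prior: reject a mutated candidate that is too deep."""
    return count_layers(structure_str) <= max_layers
```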

Q2: MNas is optimal. A2: 1) I do not understand why MNas should be optimal. 2) ZenNAS is a data-independent method, so it never tries to find an "optimal" structure. The optimality of a structure really depends on the task. For example, if your data is linearly separable, the optimal structure is a linear classifier. 3) ZenNAS gives you a max-capacity structure. According to statistical machine learning theory, this often gives you a good classifier, but it is not 100% guaranteed. On the other hand, no method can be optimal in every scenario.

Q3: The output of ZenNAS is not the same as MNas; it is deeper in the high-level layers. A3: Again, it depends on your search settings. Under the constraints you gave, the maximal-capacity network should look like that. That means MNas did not give you the maximal-capacity structure.

Q4: Why is maximal capacity better? A4: 1) When there is no prior knowledge about the data distribution, the maximal-capacity oracle is the best we can hope for. 2) With prior knowledge of the data distribution, it is often possible to do much better. However, this is a trade-off: if you have comprehensive knowledge of your problem, you may not need deep learning at all. 3) Training-based NAS methods bring in data knowledge during the search. This slows down their search and also limits their transferability to a totally different task.

I hope the above answers your concerns. Please do not hesitate to contact us if you have more questions.

billbig commented 2 years ago

Hello Ming Lin, thank you very much for your patient response. I greatly appreciate your contribution to neural architecture search. Because I did not express myself clearly, my problem remains unsolved, so let me restate it.

The problem arises when searching for network structures. For example, in your experiment with FLOPs = 400M, you obtained the following architecture:

SuperConvK3BNRELU(3,16,2,1)SuperResIDWE1K7(16,40,2,40,1)SuperResIDWE1K7(40,64,2,64,1)SuperResIDWE4K7(64,96,2,96,5)SuperResIDWE2K7(96,224,2,224,5)SuperConvK1BNRELU(224,2048,1,1)

The total down-sampling stride in your FLOPs=400M structure is 32, so I count 5 down-sampling stages. At stage 1, the block is SuperConvK3BNRELU(3,16,2,1), ..., at stage 4 (stride = 16), the block is SuperResIDWE4K7(64,96,2,96,5), repeated 5 times; at stage 5 (stride = 32), the block is again repeated 5 times.

I repeated your experiment (the FLOPs = 400M configuration is the same as yours), i.e. I searched with the file ZenNAS-main\scripts\Zen_NAS_ImageNet_flops400M, and obtained the following structure:

SuperConvK3BNRELU(3,8,2,1)SuperResIDWE1K7(8,24,2,40,1)SuperResIDWE1K7(24,64,2,64,1)SuperResIDWE4K7(64,96,2,96,1)SuperResIDWE6K7(96,192,2,152,4)SuperResIDWE6K7(192,120,1,120,5)SuperConvK1BNRELU(120,2048,1,1)

In short, at stages 1 through 4 the block repeat count is only 1, and all block repetitions are concentrated in the last stage (stride = 32), i.e. SuperResIDWE6K7(96,192,2,152,4)SuperResIDWE6K7(192,120,1,120,5); the block in this stage is repeated 9 times! This can be summarized by a formula: if max_layers = n, then stage5_repeats = n - 5 (the 5 accounts for the first four stages, each repeated once, plus the final SuperConvK1BNRELU, repeated once).

The part that differs from your structure is that your stage 4 (stride = 16) has multiple blocks. I searched several times, and the blocks were repeated only in the last stage. I also searched at 600M and 900M (exactly the same configuration as yours), but the result is still the same: the repeat count at stage 5 (stride = 32) is still n - 5! I would like to ask whether this structure is normal and why the search leads to such a result.
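For reference, a small hypothetical parsing sketch (not part of the ZenNAS code base) that groups blocks by down-sampling stage and counts layer repetitions, reproducing the per-stage pattern described above; it assumes the third number in each block is the stride and the last number is the repeat count:

```python
import re

BLOCK_RE = re.compile(r'(\w+)\(([\d,\s]+)\)')

def repeats_per_stage(structure_str: str) -> dict:
    """Assumes each block is 'Name(in,out,stride,...,repeat)': the 3rd number
    is the stride and the last number is the repeat count."""
    stage, counts = 0, {}
    for _name, args in BLOCK_RE.findall(structure_str):
        nums = [int(x) for x in args.replace(' ', '').split(',')]
        stride, repeat = nums[2], nums[-1]
        if stride == 2:               # each stride-2 block opens a new stage
            stage += 1
        counts[stage] = counts.get(stage, 0) + repeat
    return counts

s = ('SuperConvK3BNRELU(3,8,2,1)SuperResIDWE1K7(8,24,2,40,1)'
     'SuperResIDWE1K7(24,64,2,64,1)SuperResIDWE4K7(64,96,2,96,1)'
     'SuperResIDWE6K7(96,192,2,152,4)SuperResIDWE6K7(192,120,1,120,5)'
     'SuperConvK1BNRELU(120,2048,1,1)')
# Prints {1: 1, 2: 1, 3: 1, 4: 1, 5: 10}; stage 5 here also counts the final
# SuperConvK1BNRELU layer, so the searched blocks repeat 9 times in it.
print(repeats_per_stage(s))
```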

MingLin-home commented 2 years ago

Hi billbig,

It is very normal for ZenNAS to use more layers in the high-level stages, because this maximizes the model capacity with minimal FLOPs. The EA process might converge to different points. I can see that your results differ from ours in the 4th stage. It is very possible that we were using the distributed EA (which is not released), which leads to different solutions. This might be an interesting discovery: different EA settings can lead to very different convergence points.

In general, if the capacities of different models are the same, they should achieve similar top-1 accuracies. If you believe, as prior knowledge, that the 4th stage is more important, you can use a weighted complexity as your Zen-score; that is, use the 4th-stage Zen-score plus the final-stage Zen-score. We have follow-up work on this issue, please check here.
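A minimal sketch of that weighting idea (a hypothetical helper, not the released implementation): given Zen-scores measured on sub-networks truncated after selected stages, combine them with user-chosen weights.

```python
def weighted_zen_score(stage_scores: dict, stage_weights: dict = None) -> float:
    """Combine per-stage Zen-scores with user-chosen weights.

    stage_scores: {stage_index: Zen-score of the sub-network truncated after
    that down-sampling stage}. The default weights follow the suggestion
    above: 4th-stage score plus final-stage score.
    """
    if stage_weights is None:
        stage_weights = {4: 1.0, 5: 1.0}
    return sum(w * stage_scores.get(s, 0.0) for s, w in stage_weights.items())

# Placeholder values, only to show the call signature:
print(weighted_zen_score({4: 10.0, 5: 20.0}))  # 30.0
```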

billbig commented 2 years ago

Hi Ming Lin, it's me again, an avid fan. I ran some more experiments and found interesting phenomena. Following up on my last question, my layer repetitions are concentrated in the last stage, i.e. where the feature map is 7*7 (stage 5). I compared the structures I found with the structures in your paper at matched FLOPs. All configurations are the same as given in your source code, without the SE module, e.g. --optimizer sgd --bn_momentum 0.01 --wd 4e-5 --nesterov --weight_init custom. For the dataset, I chose the first 200 classes of ImageNet-1000 for training and evaluation, with epoch = 200.

At FLOPs = 400M, your structure's Zen-score is 89.8 (repeat=1) and mine is 93.65 (repeat=1), but the top-1 accuracy of your model is 82.58% while mine is 81.08%. The structure from my search at FLOPs = 400M is:

SuperConvK3BNRELU(3,8,2,1)SuperResIDWE1K7(8,24,2,40,1)SuperResIDWE1K7(24,64,2,64,1)SuperResIDWE4K7(64,96,2,96,1)SuperResIDWE6K7(96,192,2,152,4)SuperResIDWE6K7(192,120,1,120,5)SuperConvK1BNRELU(120,2048,1,1)

At FLOPs = 600M, your structure's Zen-score is 136.9 (repeat=1) and mine is 141.55 (repeat=1), but the top-1 accuracy of your model is 75.21% while mine is 74.66%. The structure from my search at FLOPs = 600M is:

SuperConvK3BNRELU(3,8,2,1)SuperResIDWE1K7(8,64,2,64,1)SuperResIDWE2K7(64,56,2,48,1)SuperResIDWE4K7(56,96,2,96,1)SuperResIDWE6K7(96,160,2,120,4)SuperResIDWE6K7(160,208,1,200,5)SuperConvK1BNRELU(208,2048,1,1)

At FLOPs = 900M, your structure's Zen-score is 110.42 (repeat=1) and mine is 117.19 (repeat=1), but the top-1 accuracy of your model is 83.22% while mine is 81.91%. The structure from my search at FLOPs = 900M is:

SuperConvK3BNRELU(3,8,2,1)SuperResIDWE6K7(8,24,2,32,1)SuperResIDWE2K7(24,96,2,96,1)SuperResIDWE4K7(96,128,2,120,1)SuperResIDWE6K7(128,144,2,160,3)SuperResIDWE4K7(144,224,1,328,4)SuperResIDWE4K7(224,224,1,200,4)SuperConvK1BNRELU(224,2048,1,1)

I have two questions. First, it seems that deepening the network in the final stage (stage 5) raises the Zen-score, but this did not work very well in practice. Second, a higher Zen-score does not necessarily mean better prediction accuracy. Your paper states that the Zen-score measures the expressivity of the network; can it be understood that the more linear regions the network has, the higher its Zen-score may be?
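My understanding of the Zen-score behind that last question can be sketched as a perturbation test like the one below (a rough approximation on my part, not your released implementation; it omits the BN-variance correction term used in the paper and assumes the model returns its pre-GAP feature map with random, untrained weights):

```python
import torch

@torch.no_grad()
def zen_score_proxy(model: torch.nn.Module, resolution: int = 224,
                    batch: int = 16, alpha: float = 1e-2) -> float:
    """Rough expressivity proxy: log of the expected change of the pre-GAP
    feature map under a small Gaussian input perturbation. A more expressive
    network (more linear regions) tends to produce a larger change."""
    model.eval()
    x = torch.randn(batch, 3, resolution, resolution)
    eps = torch.randn_like(x)
    f1 = model(x)
    f2 = model(x + alpha * eps)
    delta = ((f1 - f2) ** 2).flatten(1).sum(dim=1).sqrt().mean()
    return torch.log(delta).item()
```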

MingLin-home commented 2 years ago

Hi billbig,

Thank you for your feedback. We took a deep look into your problem and found that the expressivities at different scales should be considered as well (which is not considered by the Zen-score). Please check our most recent ICML 2022 paper MAE-DET (https://arxiv.org/abs/2111.13336) for more details. The source code is [here](https://github.com/alibaba/lightweight-neural-architecture-search).

Regarding your experiments, it seems that your Zen-score is fully maximized, which maximizes the expressivity of the output of the last layer only. In our released models, we implicitly early-stopped the EA process, so the Zen-score is not fully maximized, with the (good) side effect that the multi-scale expressivity is somewhat improved. My suggestion is to use a multi-scale Zen-score (or the multi-scale entropy from our ICML 2022 paper).

Please let me know if you have more questions! Thank you very much!

Best Regards, Ming Lin