Let's talk about optimization difficulty first. Increasing width makes optimization easier, adding more skip-connections makes optimization easier, and increasing depth makes optimization harder. Due to the memory constraint in DARTS, we have to decrease the width from 36 (augment) to 16 (search), which increases the optimization difficulty. So if we still use a depth of 20 during search, the optimization difficulty during search would be larger than during augment, and this pushes the system to choose more skip-connections than necessary to balance the difficulty. To let the search choose the proper number of skip-connections automatically, we need to keep the difficulty coming from width × depth the same. This is where gradient confusion comes in as a measure of that difficulty.
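To make this concrete, here is a minimal sketch of how gradient confusion can be estimated: compute the gradient on a few minibatches and take the most negative pairwise inner product. The helper names (`minibatch_grad`, `gradient_confusion`) are illustrative, not functions from this repo.

```python
import itertools

import torch


def minibatch_grad(model, loss_fn, x, y):
    """Flattened gradient of the loss on a single minibatch."""
    params = [p for p in model.parameters() if p.requires_grad]
    loss = loss_fn(model(x), y)
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])


def gradient_confusion(model, loss_fn, batches):
    """Most negative pairwise inner product between minibatch gradients.
    Lower (more negative) values mean higher gradient confusion,
    i.e. a harder optimization problem."""
    grads = [minibatch_grad(model, loss_fn, x, y) for x, y in batches]
    return min(torch.dot(gi, gj).item()
               for gi, gj in itertools.combinations(grads, 2))
```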
Hope this explains.
In the paper, you said that gradient confusion can be measured after 100 epochs, and that it is used in stage 1. But the final architecture is only searched after stages 1, 2, ... Do you mean that you first search with 8 cells to get an architecture, then use gradient confusion to find that 14 cells is best, and search again?
Hi. The gradient confusion of the final architecture is measured on a randomly generated architecture instead of the 'best' architecture.
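For reference, "randomly generated architecture" here could mean something like the sketch below: for each node in a DARTS-style cell, pick two distinct predecessors and a random op from the search space (the op list matches the stage-1 candidates mentioned later). The function name and structure are assumptions for illustration, not code from this repo.

```python
import random

# Stage-1 candidate ops of the search space.
OPS = ['sep_conv_3x3', 'dil_conv_3x3', 'max_pool_3x3', 'skip_connect']


def random_cell(num_nodes=4):
    """Sample a random DARTS-style cell: each intermediate node i
    picks two distinct predecessors from {0, ..., i+1} and one op each."""
    cell = []
    for i in range(num_nodes):
        for pred in random.sample(range(i + 2), 2):
            cell.append((random.choice(OPS), pred))
    return cell


# A random architecture is then trained (e.g. for 100 epochs) and its
# gradient confusion is measured as a proxy for the final architecture's.
genotype = random_cell()
```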
Thanks, there's one more question. Is the gradient confusion of stage 1 also obtained by randomly picking ops and training for 100 epochs? So the two subnets are similar, but the supernet may have a different gradient confusion, am I right?
I found a problem hidden in DARTS: 20 layers and 36 channels may not be suitable for every cell, so I want to use gradient confusion to find a better structure. I think it is more reasonable to adapt the final network to the cell searched by DARTS.
For the first stage, we have only one possible supernet, whose candidate ops are sep33, dil33, maxpool, and skip. Honestly, I don't think gradient confusion would help you do that. It is just a rough measure of optimization difficulty, and our approximation of it may be subject to variance. There should be better tools to achieve your goal.
I noticed there is an issue about gradient confusion, but I don't understand how gradient confusion can help you decide the number of layers. Can you explain it?
The second question is that you point out the network can decide the number of skip_connects automatically. However, choosing skip_connect is an overfitting phenomenon. On the ImageNet dataset there is usually only 0 or 1 skip_connect in the normal cell, which leads to FLOPs over 600M. So how can you control it?
Thanks in advance if you get a chance to look at my questions.