susan0199 / StacNAS


some small questions about paper #1

Closed: jjjjjie closed this issue 5 years ago

jjjjjie commented 5 years ago

Hello! Awesome work. I have two questions about StacNAS and hope you can answer them. Thank you!

  1. What does stage 3 in Algorithm 1 mean? I can't find any introduction to it. Stage 3 in Figure 1 is the architecture-evaluation part, but stage 3 in the Algorithm is still in the architecture-search part, so they are different. Also, in the experiment section you said 'using 14 cells for the first stage and 20 cells for the second stage', but what about stage 3?
  2. You said you use first-level optimization, but I cannot find a definition of this term either through Google or in the paper. Could you explain to me what it means? I think it is similar to the first-order optimization in the original DARTS; if I'm wrong, I hope you can correct me. Thanks in advance!
susan0199 commented 5 years ago

Hi, thanks.

  1. It is an optional way of training the final architecture. (I decided not to include it in the paper due to the length constraint; I found it only works sometimes.) In that part, I was trying to say that once the redundant operations are pruned and the best one is selected, it is optional to use the best operation alone or to mix it with the Zero operation. Since Zero can serve as a path-scaling/attention factor, mixing with Zero lets one learn the relative importance of the two paths kept for each node in the final architecture.
  2. The main difference between one-level and two-level optimization is the way they use the data. For two-level optimization, one needs to split the data into a training set and a validation set: when updating w we use training mini-batches, when updating alpha we use validation mini-batches, and alpha and w are updated alternately. For one-level optimization, w and alpha are updated jointly with gradients computed from the same training mini-batches. (A short sketch of both schemes follows below.)

Hope this clarifies it.
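
For concreteness, here is a minimal PyTorch-style sketch of the two update schemes. The toy model, optimizers, and hyperparameters are illustrative assumptions, not the actual StacNAS code, and the two-level step uses the simpler first-order approximation.

```python
import torch
import torch.nn as nn

# Toy stand-in for a search supernet, only to make the two update schemes concrete.
class TinySupernet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 3, 3, padding=1)          # candidate op 1
        self.pool = nn.AvgPool2d(3, stride=1, padding=1)   # candidate op 2
        self.head = nn.Linear(3, 10)
        self.alpha = nn.Parameter(1e-3 * torch.randn(2))   # architecture parameters

    def forward(self, x):
        w = torch.softmax(self.alpha, dim=0)
        h = w[0] * self.conv(x) + w[1] * self.pool(x)      # softmax-weighted mix of candidate ops
        return self.head(h.mean(dim=(2, 3)))

    def weights(self):
        return [p for n, p in self.named_parameters() if n != "alpha"]

    def arch_parameters(self):
        return [self.alpha]

model = TinySupernet()
w_opt = torch.optim.SGD(model.weights(), lr=0.025, momentum=0.9)
a_opt = torch.optim.Adam(model.arch_parameters(), lr=3e-4)
criterion = nn.CrossEntropyLoss()

def two_level_step(train_batch, valid_batch):
    """First-order bilevel step as in DARTS: alpha on validation data, w on training data."""
    (x_t, y_t), (x_v, y_v) = train_batch, valid_batch
    a_opt.zero_grad()
    criterion(model(x_v), y_v).backward()   # alpha uses a validation mini-batch
    a_opt.step()
    w_opt.zero_grad()
    criterion(model(x_t), y_t).backward()   # w uses a training mini-batch
    w_opt.step()

def one_level_step(train_batch):
    """One-level step: w and alpha share gradients from the same training mini-batch."""
    x_t, y_t = train_batch
    w_opt.zero_grad()
    a_opt.zero_grad()
    criterion(model(x_t), y_t).backward()
    w_opt.step()
    a_opt.step()
```

Note that the two-level version needs the training data split into training and validation halves, while the one-level version can use all training images for both w and alpha.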

jjjjjie commented 5 years ago

Thanks a lot. The answer to question 2 is very clear. As for question 1, I am still a little confused. The operations in stage 2 are the concrete operations from the group selected in stage 1 (e.g., if sep conv is selected in stage 1, then stage 2 contains sep conv 3x3 and sep conv 5x5), without the Zero op, while in stage 3 every edge has only the single operation selected in stage 2 plus the Zero operation (only two ops). Am I correct? So stage 2 is used to select the best op and stage 3 is used to select the best edge?

susan0199 commented 5 years ago

Stage 2 includes the Zero op. After the pruning of stage 2, both edges and operations are selected, exactly as in DARTS. Stage 3 proposes an alternative way of using the final architecture, where Zero is kept to learn the relative importance of the two selected edges for each node.
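
A minimal PyTorch-style sketch of this stage-3 variant (the class and the operations used here are illustrative assumptions, not taken from the StacNAS code):

```python
import torch
import torch.nn as nn

class EdgeWithZero(nn.Module):
    """One kept edge of the final architecture: the selected op mixed with Zero.

    Since Zero outputs all zeros, softmax(alpha)[0] simply scales the op's output,
    so alpha acts as a learnable importance (path-scaling) factor for this edge.
    """
    def __init__(self, selected_op: nn.Module):
        super().__init__()
        self.op = selected_op
        self.alpha = nn.Parameter(torch.zeros(2))  # logits for [selected op, Zero]

    def forward(self, x):
        w = torch.softmax(self.alpha, dim=0)
        return w[0] * self.op(x)                   # + w[1] * Zero(x), which is all zeros

# Each node sums its two kept edges; their relative importance is learned
# through the two alpha vectors above.
edge_a = EdgeWithZero(nn.Conv2d(16, 16, 3, padding=1))
edge_b = EdgeWithZero(nn.Conv2d(16, 16, 5, padding=2))

def node(h_a, h_b):
    return edge_a(h_a) + edge_b(h_b)
```

Dropping alpha and returning `self.op(x)` directly recovers the standard way of training the pruned architecture.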

jjjjjie commented 5 years ago

Thanks for your reply. I still have a question about one-level optimization. If I use one-level optimization to update the original DARTS, will the accuracy increase? One-level optimization seems to be a normal backpropagation process, except that the learning rates differ, and it is much simpler than two-level. So is two-level optimization a redundant step? Does it have any merit?

susan0199 commented 5 years ago

Two-level optimization produces results with high variance unless you search four times and pick the best, as DARTS does. Empirically, repeated experiments show that the average result of one-level is higher than that of two-level (training for 80 epochs with all 50k CIFAR-10 training images). Two-level optimization is supposed to prevent overfitting; however, overfitting for over-parameterized NNs trained with SGD does not yet seem to be well understood.

jjjjjie commented 5 years ago

Thank you very much for your clarification.