Hi, thanks.
Hope it clarifies.
Thanks a lot. The answer to question 2 is very clear. As for question 1, I'm still a little confused. The operations in stage 2 are the group operations (e.g., if Sep conv is selected in stage 1, then stage 2 uses sep conv 3x3 and sep conv 5x5) without the Zero op, while in stage 3 every edge has only the single operation selected in stage 2 plus the Zero operation (only two ops). Am I correct? So stage 2 is used to select the best op and stage 3 is used to select the best edge?
Stage 2 includes the zero op. After the pruning of stage 2, both edges and operations are selected exactly as in DARTS. Stage 3 here proposes an alternative way of using the final architecture, where zero is kept to learn the relative importance of the two selected edges for each node.
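If it helps to make this concrete, here is a minimal sketch (not the StacNAS code; the class names and the way edge importance is read off are my assumptions) of a stage-3 edge that keeps only the operation selected in stage 2 plus Zero, so that the softmax weight of the non-Zero op can be read as the relative importance of that edge:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Zero(nn.Module):
    """Zero op: outputs zeros of the same shape as the input."""
    def forward(self, x):
        return x * 0.0

class Stage3Edge(nn.Module):
    """Hypothetical stage-3 edge: the single op chosen in stage 2 plus Zero."""
    def __init__(self, selected_op: nn.Module):
        super().__init__()
        self.ops = nn.ModuleList([selected_op, Zero()])
        # Two architecture parameters: one for the selected op, one for Zero.
        self.alpha = nn.Parameter(torch.zeros(2))

    def forward(self, x):
        w = F.softmax(self.alpha, dim=0)
        return w[0] * self.ops[0](x) + w[1] * self.ops[1](x)

    def edge_importance(self):
        # Weight assigned to the non-Zero op, interpretable as edge importance.
        return F.softmax(self.alpha, dim=0)[0].item()
```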
Thanks for your reply. I still have a question about one-level optimization. If I use one-level optimization to update the original DARTS, will the accuracy increase? One-level optimization seems to be a normal backprop process, except that the learning rates differ, and it is much simpler than two-level. So is two-level optimization a redundant operation? Does it have any merit?
Two-level optimization produces results with high variance unless you search four times and pick the best, as DARTS does. Empirically, repeated experiments show that the average result of one-level optimization is higher than that of two-level (training for 80 epochs with all 50k training images of CIFAR-10). Two-level optimization is supposed to prevent overfitting; however, overfitting for overparameterized NNs trained with SGD does not seem to be well understood yet.
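For anyone who wants to see the difference concretely, here is a minimal sketch (not the authors' implementation; the optimizer split and the first-order approximation of the two-level update are my assumptions) contrasting a one-level step, where weights and architecture parameters are updated jointly on the same training batch, with a two-level step, where the architecture parameters are updated on a held-out validation batch:

```python
import torch

def one_level_step(model, w_opt, a_opt, train_batch, loss_fn):
    # One-level: weights and architecture parameters are updated together
    # on the same training batch -- ordinary backprop with two optimizers
    # (possibly using different learning rates).
    x, y = train_batch
    loss = loss_fn(model(x), y)
    w_opt.zero_grad()
    a_opt.zero_grad()
    loss.backward()
    w_opt.step()
    a_opt.step()

def two_level_step(model, w_opt, a_opt, train_batch, val_batch, loss_fn):
    # Two-level (first-order DARTS approximation): update the weights on
    # the training split, then update the architecture parameters on a
    # held-out validation split, alternating every step.
    x_t, y_t = train_batch
    w_opt.zero_grad()
    loss_fn(model(x_t), y_t).backward()
    w_opt.step()

    x_v, y_v = val_batch
    a_opt.zero_grad()
    loss_fn(model(x_v), y_v).backward()
    a_opt.step()
```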
Thank you very much for your clarification.
Hello! Awesome work! I have 2 questions about StacNAS and hope you can answer them. Thank you!