human-analysis / neural-architecture-transfer

Neural Architecture Transfer (arXiv'20), PyTorch Implementation
http://hal.cse.msu.edu/papers/neural-architecture-transfer/

Are the supernet weights trained during the search? #7

Closed AwesomeLemon closed 3 years ago

AwesomeLemon commented 3 years ago

Hello,

I was wondering whether the weights of the supernetwork are continuously trained during the search?

I noticed in the code of your previous paper (NSGANetV2), which you reference in another issue, that the supernet is not actually trained during the search; instead, the supernet weights are only used to initialize subnet weights, which are trained for 5 epochs, used for evaluation, and then discarded. The next subnet is again initialized from the original supernet weights.
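
To make sure I'm describing the same thing, here is a rough sketch of that evaluate-and-discard pattern (the helper names `extract`, `finetune`, and `validate` are placeholders of mine, not names from your code):

```python
import copy

def evaluate_candidate(supernet, encoding, extract, finetune, validate):
    """Sketch of the pattern I mean: inherit weights, fine-tune briefly, discard."""
    # Copy the relevant supernet weights into a standalone subnet.
    subnet = extract(copy.deepcopy(supernet), encoding)
    # Fine-tune the copy for a few epochs and evaluate it;
    # the supernet weights themselves are never updated.
    finetune(subnet, epochs=5)
    accuracy = validate(subnet)
    return accuracy  # the fine-tuned subnet weights are thrown away here
```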

That is why I'd like to know whether NAT does the same.

An additional question: if NAT doesn't discard the updated weights, how do you deal with the fact that the performances stored in the archive were measured with older weights? Doesn't this negatively impact the predictor's accuracy?

Thanks in advance!

gautamsreekumar commented 3 years ago

Hello,

Yes, the supernet weights are continuously updated during the search. Each search iteration works as follows (a rough sketch is given after the list):

  1. We evaluate all the architectures in the archive and train the accuracy predictor on them.
  2. This accuracy predictor then guides the evolutionary search algorithm, which produces a new batch of offspring architectures that are added to the archive.
  3. Subnets are sampled according to the categorical distribution of the encodings in the archive.
  4. After the supernet adaptation stage, which is 5 epochs, all the architectures in the archive are evaluated again.
  5. This cycle continues for 30 iterations.
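
A minimal sketch of that loop, assuming a trained supernet and an archive of architecture encodings; all helper functions here are placeholders of mine, not the repo's actual API:

```python
def nat_search(supernet, archive, evaluate, fit_predictor, search, adapt,
               num_iterations=30):
    """Paraphrase of the NAT cycle described in the list above."""
    for _ in range(num_iterations):
        # (1) Evaluate every archive member with the current supernet weights
        #     and fit the accuracy predictor on (encoding, accuracy) pairs.
        accuracies = [evaluate(supernet, enc) for enc in archive]
        predictor = fit_predictor(archive, accuracies)

        # (2) Predictor-assisted evolutionary search proposes offspring,
        #     which are added to the archive.
        archive = archive + search(predictor, archive)

        # (3)-(4) Adapt the supernet for 5 epochs, sampling subnets according
        #         to the categorical distribution over encodings in the archive.
        supernet = adapt(supernet, archive, epochs=5)

        # (5) Loop back: the archive is re-evaluated with the updated weights.
    return supernet, archive
```
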
AwesomeLemon commented 3 years ago

Thank you for the clarifications!

But your explanation makes me wonder about the number of evaluations: if all 300 architectures from the archive are evaluated on each iteration, by the end of the search 300 * 30 = 9000 evaluations will be performed, which is much more than reported in Fig. 14. Could you help me understand the discrepancy? Perhaps you only count the number of unique architectures?

mikelzc1990 commented 3 years ago

Fig. 14 corresponds to an ablative experiment comparing relative search efficiency in a bi-objective scenario. The compared methods originally use different means to gauge and select architectures during search: e.g., NSGANet uses proxy tasks constructed by down-scaling architectures and reducing the number of training epochs, while NAT uses the supernet. To make a fair comparison on the same x-axis (# of architectures evaluated to reach a certain hypervolume, the y-axis), we provide the trained supernet (one for each of the three datasets) to all three methods.

The random search baseline then simply samples architectures uniformly from the search space, while NSGANet uses genetic operations (crossover + mutation + EDA) to generate architectures; both methods query the supernet (i.e., count toward # of architectures evaluated) for every architecture created. NAT instead builds an accuracy predictor from the initial population and then evaluates only a small subset of the candidate pool that NSGA-III returns based on the accuracy predictor. All methods start with an initial population of 100 random architectures, and we terminate the search once NSGANet or random search catches up with NAT. We are sorry about the confusion.
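
If it helps, here is a purely illustrative way to think about how the x-axis accrues for each method; the per-generation counts (`pop_size`, `subset_size`) are made-up numbers for the example, not the actual settings:

```python
def count_supernet_queries(method, generations, pop_size=100, subset_size=8):
    """Rough bookkeeping of supernet queries (x-axis of Fig. 14) per method."""
    queries = pop_size  # all methods start by evaluating the initial random population
    for _ in range(generations):
        if method in ("random_search", "nsganet"):
            # Every architecture created is scored directly on the supernet.
            queries += pop_size
        else:  # "nat"
            # Candidates are ranked with the accuracy predictor; only a small
            # subset is actually evaluated on the supernet.
            queries += subset_size
    return queries
```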

AwesomeLemon commented 3 years ago

Ah, thanks for the explanation. Correct me if I'm wrong, but is it true that in this scenario the supernet weights are not trained continuously (when using NAT)? From your description it seems to me that this experiment is the same as the one in the NSGANetV2 paper (Fig. 5), where the supernet weights are fixed and are simply used to initialize the weights of the offspring in each generation.

mikelzc1990 commented 3 years ago

> Ah, thanks for the explanation. Correct me if I'm wrong, but is it true that in this scenario the supernet weights are not trained continuously (when using NAT)? From your description it seems to me that this experiment is the same as the one in the NSGANetV2 paper (Fig. 5), where the supernet weights are fixed and are simply used to initialize the weights of the offspring in each generation.

This experiment is essentially the same as the one in the NSGANetV2 paper, except for the choice of the accuracy predictor. The supernet is already trained for all three datasets and is used only to assist in validating the search components.

AwesomeLemon commented 3 years ago

Great, thanks again!