melodyguan / enas

TensorFlow Code for paper "Efficient Neural Architecture Search via Parameter Sharing"
https://arxiv.org/abs/1802.03268
Apache License 2.0

How does parameter sharing work #42

Closed harewei closed 6 years ago

harewei commented 6 years ago

Sorry that this isn't actually an issue with the code, but just a question which I'm unable to figure out by reading the paper and code.

I'm trying to understand how the parameter sharing works in ENAS. The first two questions partly build towards the third, main question.

  1. Are all nodes only used ONCE during macro search?
  2. For macro search, will every node definitely link to its previous node? (It seems so, given inputs=prev_layers[-1].)
  3. How are the parameters shared? Does each operation have its own weights, which are loaded whenever it is called (e.g. Conv2D 3x3 has one weight tensor that is applied every time it is called)? If so, which weights get updated and memorized during training when multiple instances of the same operation are used? Or is there a weight set for each unique connection, e.g. Node1 to Node3 (W13) has one weight set and Node2 to Node3 (W23) has another? If so, how does it handle skip connections (e.g. Node1 and Node2 are concatenated and then passed to Node3; will there be a W12-3)?
harewei commented 6 years ago

I've gone through the code quite a few times, so I guess I'll answer these myself, in case anyone sees this in the future.

  1. No, they can appear multiple times.
  2. Yes.
  3. The weights for every possible operation in every layer are stored; when a new child network is built, the relevant weights are extracted from that store and reused.
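
For anyone reading this later, here is a minimal, purely conceptual sketch of point 3 in plain Python/NumPy. It is not the repo's actual code; names like `shared_weights` and `build_child` are made up for illustration. The idea is that one weight tensor exists per (layer, operation) pair, and every sampled child architecture that picks that operation at that layer reuses the same tensor.

```python
import numpy as np

NUM_LAYERS = 4
OPS = ["conv3x3", "conv5x5", "maxpool"]  # the real macro search space has 6 ops

# Shared pool: weights exist for every op at every layer,
# whether or not a given child architecture uses them.
shared_weights = {
    (layer, op): (np.random.randn(3, 3, 16, 16) if op != "maxpool" else None)
    for layer in range(NUM_LAYERS)
    for op in OPS
}

def build_child(arc):
    """arc is a list of op names, one per layer; look up (not copy) the shared weights."""
    return [(op, shared_weights[(layer, op)]) for layer, op in enumerate(arc)]

child_1 = build_child(["conv3x3", "conv3x3", "maxpool", "conv5x5"])
child_2 = build_child(["conv3x3", "maxpool", "maxpool", "conv5x5"])

# Both children point to the exact same tensor for layer 0's conv3x3,
# so training either child updates that shared tensor.
assert child_1[0][1] is child_2[0][1]
```
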
manhquang144 commented 6 years ago

Hi @harewei: could you please elaborate on your explanations above? I have struggled to understand this parameter sharing too. Thank you ~

harewei commented 6 years ago

@manhquang144 The weights of a particular node operation in a particular layer are shared. For example, suppose you have 2 node operations, Conv2D and Maxpool (there are 6 in the original ENAS implementation), and you are trying to build a network of 2 layers.

In the first layer, you have layer1-Conv2D and layer1-Maxpool. In the second layer, you have layer2-Conv2D and layer2-Maxpool, and so on (if you want a larger network).

The first child network you create uses, say, layer1-Conv2D and layer2-Conv2D. If the second child wants a Conv2D followed by a Maxpool, it will reuse the layer1-Conv2D weights that the first child trained, then call layer2-Maxpool (which child 1 did not use, but if other child networks also use Maxpool at the 2nd layer, they will share these same weights).
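
To make that concrete, here is a rough sketch of how this kind of sharing can be done with TensorFlow 1.x variable scopes, which is roughly the mechanism this repo relies on (weights created with tf.get_variable under a fixed scope per layer/op and reused across children). The scope names and the helper functions conv3x3 and maxpool below are illustrative, not the repo's exact code:

```python
import tensorflow as tf  # assumes TensorFlow 1.x, as used by this repo

def conv3x3(x, layer_id, out_ch):
    # reuse=tf.AUTO_REUSE returns the existing "w" if it was already created,
    # so every child that picks Conv2D 3x3 at this layer shares the same weights.
    with tf.variable_scope("layer_{}_conv3x3".format(layer_id), reuse=tf.AUTO_REUSE):
        w = tf.get_variable("w", [3, 3, int(x.shape[-1]), out_ch])
        return tf.nn.relu(tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding="SAME"))

def maxpool(x):
    # pooling has no trainable weights, so there is nothing to share here
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding="SAME")

images = tf.placeholder(tf.float32, [None, 32, 32, 16])

# Child 1: layer1-Conv2D -> layer2-Conv2D
child_1 = conv3x3(conv3x3(images, 1, 16), 2, 16)

# Child 2: layer1-Conv2D -> layer2-Maxpool.
# Its first layer reuses the same "layer_1_conv3x3/w" variable that child 1
# trains; its second layer (maxpool) creates no new variables at all.
child_2 = maxpool(conv3x3(images, 1, 16))
```

So whichever child is being trained at a given step, the gradient updates land on the one shared variable for that (layer, op) pair, and that is what "parameter sharing" means here.
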

manhquang144 commented 6 years ago

@harewei: Thank you so much Harewei, now I get the idea of how it works.

remykarem commented 5 years ago

@harewei, how does parameter sharing work in skip connections for CNN macro search?

maurizio-zen commented 5 years ago

@harewei , Thank you for the detailed explanation above. I just have one more question. Suppose that we have 3 nodes, each having Conv2D and Maxpool ops. Further assume that the controller generates child model #1 which looks like this: Layer1-Conv2D => Layer2-Conv2D => Layer3-Maxpool. Now suppose that the controller generates child model #2 which looks like this: Layer1-Conv2D => Layer2-Conv2D + Layer1-Conv2D => Layer3-Maxpool (i.e. the first two layers are the same as in child model #1, but we connect both the first and second layer to the third layer using skip connections). In this scenario, child model #2 will share the weights for Layer1-Conv2D. But how about the weights for Layer3-Maxpool? I guess my dilemma is: Does weight sharing depend on skip connections?

wmcnally commented 4 years ago

@maurizio-zen MaxPool does not use weights, so in that case it would not be a problem.

@harewei, are all shared weights (even the ones that aren't being used) held in GPU memory while training the controller? If so, how much memory does that consume, for CIFAR-10? Also, is the number of channels constrained within each cell/layer? If not, there could be many different possible weight tensors for Layer1-Conv2D, for example?