quark0 / darts

Differentiable architecture search for convolutional and recurrent networks
https://arxiv.org/abs/1806.09055
Apache License 2.0

It seems the global moving average is used as opposed to what the original paper instructed #85

Open bombs-kim opened 5 years ago

bombs-kim commented 5 years ago

According to the original paper

  1. (In architecture search) we always use batch-specific statistics for batch normalization rather than the global moving average.
  2. Learnable affine parameters in all batch normalizations are disabled during the search process.

The second statement holds true in this implementation, but I couldn't find any code corresponding to the first. From what I can tell, all batch normalizations use the global moving average. For example, one batch-norm layer has this form: BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=False)

I think momentum (which acts as a decay factor here) should be set to 1 in the layer above, so that the running statistics always equal the statistics of the most recent batch, consistent with the paper.
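As a minimal sketch of why momentum=1 has this effect (assuming PyTorch's update rule, running = (1 - momentum) * running + momentum * batch_stat), the running statistics after one forward pass then equal the latest batch's statistics exactly:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustration only: with momentum=1.0 the running statistics are fully
# overwritten by the most recent batch's statistics, since
# running = (1 - momentum) * running + momentum * batch_stat.
bn = nn.BatchNorm2d(32, eps=1e-05, momentum=1.0, affine=False)

x = torch.randn(8, 32, 4, 4)
bn.train()
_ = bn(x)  # a training-mode forward pass updates the running statistics

# running_mean now equals the per-channel batch mean exactly
print(torch.allclose(bn.running_mean, x.mean(dim=(0, 2, 3)), atol=1e-6))
# running_var equals the unbiased per-channel batch variance
print(torch.allclose(bn.running_var, x.var(dim=(0, 2, 3), unbiased=True), atol=1e-5))
```

An alternative in PyTorch is to construct the layer with track_running_stats=False, which skips the moving average entirely and normalizes with batch statistics even in eval mode.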