I was reading the Darknet source code and found some issues in batchnorm_layer.c:
1, In backward_batchnorm_layer(), according to the formulas in the Batch Norm paper, variance_delta_cpu() should be called before mean_delta_cpu(), because dl/d_mean depends on dl/d_var.
2, In mean_delta_cpu(), which computes dl/d_mean, I modified the code like this:
    #include <math.h>

    /* dl/d_mean per filter, including the term that depends on dl/d_var. */
    void mean_delta_cpu(float *delta, float *variance, int batch, int filters, int spatial, float *mean_delta, float *variance_delta, float *x, float *mean)
    {
        int i,j,k;
        for(i = 0; i < filters; ++i){
            mean_delta[i] = 0;
            float sum = 0;
            for (j = 0; j < batch; ++j) {
                for (k = 0; k < spatial; ++k) {
                    int index = j*filters*spatial + i*spatial + k;
                    mean_delta[i] += delta[index];
                    sum += x[index] - mean[i];
                }
            }
            sum *= (-2.) / (spatial*batch);   /* sum(-2*(x - mean)) / m */
            mean_delta[i] *= (-1./sqrt(variance[i] + .00001f));
            mean_delta[i] += variance_delta[i] * sum;
        }
    }
Because the mean delta also depends on the variance, I corrected it based on my understanding and tested it on CIFAR with my own model. The result is almost the same, with no improvement. Still, epsilon is small too and you add it here, so I think the variance_delta[i] * sum term should not be omitted.
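One possible explanation for the unchanged result: the extra term multiplies dl/d_var by sum(-2 * (x - mean)) / m, and when mean is the batch mean of the same x, the deviations sum to zero up to rounding. The sketch below checks that factor for one channel. The helper names channel_mean and variance_term_factor are hypothetical and not part of Darknet; the buffer layout (batch x filters x spatial) is assumed to match the indexing used above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: mean of channel f in a batch x filters x spatial buffer. */
float channel_mean(const float *x, int batch, int filters, int spatial, int f)
{
    assert(batch > 0 && spatial > 0 && f >= 0 && f < filters);
    double acc = 0.0;
    for (int j = 0; j < batch; ++j) {
        for (int k = 0; k < spatial; ++k) {
            acc += x[((size_t)j * filters + f) * spatial + k];
        }
    }
    return (float)(acc / ((double)batch * spatial));
}

/* Hypothetical helper: sum(-2 * (x - mean)) / m, the factor applied to dl/d_var. */
float variance_term_factor(const float *x, float mean, int batch, int filters, int spatial, int f)
{
    assert(batch > 0 && spatial > 0 && f >= 0 && f < filters);
    double acc = 0.0;
    for (int j = 0; j < batch; ++j) {
        for (int k = 0; k < spatial; ++k) {
            acc += (double)x[((size_t)j * filters + f) * spatial + k] - mean;
        }
    }
    return (float)(-2.0 * acc / ((double)batch * spatial));
}
```

Passing the value returned by channel_mean as mean should give a factor very close to zero, so the added term contributes almost nothing to mean_delta[i]. Passing any other mean (for example a rolling average) gives a nonzero factor, which is the case where the term would matter.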
By the way, in normalize_cpu() in blas.c, according to the formula in the Batch Norm paper, (sqrt(variance[f]) + .000001f) should be sqrt(variance[f] + .000001f).