princeton-vl / DecorrelatedBN

Code for Decorrelated Batch Normalization
BSD 2-Clause "Simplified" License

keras layer implementation #2

Closed choltz95 closed 6 years ago

choltz95 commented 6 years ago

Hi, I reproduced this layer in Keras, but I am getting the wrong result. I thought my implementation looked fine... does anyone see any obvious issues? Additionally, the eigenvalue decomposition is very slow. Do you have any advice to speed it up? The input to the layer is a tensor of shape (batch_size, height, width, channels). Here is the code:

input_shape = K.int_shape(inputs)  # (batch_size, height, width, channels)
# unroll all dimensions except the feature-map dim
pool_shape = (-1, input_shape[-1])
x = K.reshape(inputs, pool_shape)
x = K.permute_dimensions(x, (1, 0))  # transpose to (c, b*h*w)

mean = K.mean(x,1,keepdims=True)     

# standard batch norm
#stddev = K.std(x,1,keepdims=True) + self.epsilon
#normed = (x - mean) / stddev
#normed = K.reshape(normed,((-1,)+ input_shape[1:]))

# center inputs
centered_inputs = x - mean 

#vvvvvERROR SOMEWHERE IN HEREvvvvv#
# compute covariance matrix for reshaped inputs xxt
covar = K.batch_dot(K.expand_dims(x, axis=-1), K.expand_dims(x, axis=-1),axes=(2,2))
# fuzz covariance matrix to prevent singularity
covar = covar + self.epsilon 

# execute eigenvalue decomposition
#Lambda, D,_ = tf.svd(covar,compute_uv=True)
Lambda, D = tf.self_adjoint_eig(covar)
Lambda = tf.linalg.diag(Lambda)

# calculate PCA-whitening matrix 1/sqrt(L) * D^T
U = K.batch_dot(1. / K.sqrt(Lambda), D, axes=(2,2))
# calculate PCA-whitened activation x_a = U(x - \mu)
x_a = K.batch_dot(U, centered_inputs,axes=(2,1))
# calculate ZCA-whitened output Dx_a
x_whitened = K.batch_dot(D, x_a)
#^^^^^ERROR SOMEWHERE IN HERE^^^^^# 

# reshape whitened activations back to input dimension
x_normed = K.permute_dimensions(x_whitened, (1, 0))  # permute back to (b*h*w, c)
x_normed = K.reshape(x_normed, (-1,) + input_shape[1:])  # reroll dimensions

AliaksandrSiarohin commented 6 years ago

It is not clear why you use batch_dot for computing the covariance matrix, and it looks like you forgot to divide by the number of samples in the batch. You can check my implementation: https://github.com/AliaksandrSiarohin/gan/blob/4a3253c1f077ce97a806d59f86f2c7b961fe5a56/conditional_layers.py#L606. It is a bit messy, but it seems to work.

The slow SVD decomposition is a well-known problem in TensorFlow; see for example https://github.com/tensorflow/tensorflow/issues/13222. You can try to run it on the CPU:

with tf.device('cpu'):
    Lambda, D = tf.self_adjoint_eig(covar)

but it will still be slow.
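
[Editor's note] For reference, a minimal sketch of the covariance and whitening step with the division by the number of samples included. This is a reconstruction, not the code from either repository; it assumes TF 1.x with the standalone Keras backend, an input x already reshaped to (channels, b*h*w), and it adds epsilon on the diagonal only:

import tensorflow as tf
from keras import backend as K

def zca_whiten(x, epsilon=1e-5):
    # x has shape (c, n) with n = batch_size * height * width
    n = K.cast(K.shape(x)[1], K.floatx())
    mean = K.mean(x, axis=1, keepdims=True)
    centered = x - mean
    # sample covariance, normalized by n; epsilon * I keeps it non-singular
    covar = K.dot(centered, K.transpose(centered)) / n
    covar = covar + epsilon * tf.eye(tf.shape(x)[0])
    # eigendecomposition pinned to the CPU, since the GPU kernel is slow
    with tf.device('/cpu:0'):
        Lambda, D = tf.self_adjoint_eig(covar)
    # ZCA whitening matrix: D diag(1/sqrt(Lambda)) D^T
    whitening = K.dot(D * (1.0 / K.sqrt(Lambda)), K.transpose(D))
    return K.dot(whitening, centered)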

choltz95 commented 6 years ago

Thank you for your help. I misunderstood batch_dot() in Keras. I was wondering whether you managed to replicate the performance reported in the paper with your implementation? Even with scaling & group normalization I have not been able to achieve better performance than standard batch norm.

AliaksandrSiarohin commented 6 years ago

The improvement is very marginal. You will not be able to see it unless you average over 5 runs. For CIFAR-10 classification I have only tried whitening based on the Cholesky decomposition. It gives the same marginal improvement, e.g. 7.3 (whitening) vs. 7.0 (batch norm) for res32.

JaeDukSeo commented 6 years ago

I would be cautious about using self_adjoint_eig, since there is an error when using the TF eig op: https://github.com/tensorflow/tensorflow/issues/16115

And it makes sense: if the cost function is not set, there is no derivative with respect to the eigenvalues and eigenvectors.
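
[Editor's note] A toy illustration of that point (an assumption-laden snippet, TF 1.x graph mode): tf.gradients only gives a gradient through tf.self_adjoint_eig when the loss actually depends on its outputs, and returns None otherwise.

import tensorflow as tf

covar = tf.Variable([[2.0, 0.5], [0.5, 1.0]])
eigvals, eigvecs = tf.self_adjoint_eig(covar)

# loss built from the decomposition: gradient w.r.t. covar is defined
loss = tf.reduce_sum(eigvals) + tf.reduce_sum(tf.square(eigvecs))
grad_defined = tf.gradients(loss, covar)         # [<Tensor ...>]

# loss that never touches the decomposition: gradient w.r.t. eigvals is None
unrelated = tf.reduce_sum(covar)
grad_missing = tf.gradients(unrelated, eigvals)  # [None]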