openai / pixel-cnn

Code for the paper "PixelCNN++: Improving the PixelCNN with Discretized Logistic Mixture Likelihood and Other Modifications"
https://arxiv.org/abs/1701.05517

7000 loss value when training cifar10 #31

Open zoli333 opened 6 years ago

zoli333 commented 6 years ago

Hello, I am trying to train CIFAR-10 with the nn conv2d and dense layers, and with the data-dependent initialization I get a loss of 7229.39 at the first step (and after a while it only goes down to about 5000). I am training with the same model architecture proposed in the weight normalization paper (Salimans & Kingma). However, with an older nn dense and conv2d implementation this does not happen (nor when I skip the initialization). This is the implementation I use that gives a reasonable loss (around 2.3 in the first steps):

import tensorflow as tf

def conv2d(x, num_filters, filter_size=[3, 3], pad='SAME', stride=[1, 1], nonlinearity=None, init_scale=1., init=False, name=''):
    with tf.variable_scope(name):
        V = tf.get_variable('V', shape=filter_size + [int(x.get_shape()[-1]), num_filters], dtype=tf.float32,
                            initializer=tf.random_normal_initializer(0, 0.05), trainable=True)
        g = tf.get_variable('g', shape=[num_filters], dtype=tf.float32,
                            initializer=tf.constant_initializer(1.), trainable=True)
        b = tf.get_variable('b', shape=[num_filters], dtype=tf.float32,
                            initializer=tf.constant_initializer(0.), trainable=True)

        if init:  # data-dependent initialization: standardize the pre-activation and store g, b
            v_norm = tf.nn.l2_normalize(V, [0, 1, 2])
            x = tf.nn.conv2d(x, v_norm, strides=[1] + stride + [1], padding=pad)
            m_init, v_init = tf.nn.moments(x, [0, 1, 2])
            scale_init = init_scale / tf.sqrt(v_init + 1e-08)
            g = g.assign(scale_init)
            b = b.assign(-m_init * scale_init)
            x = tf.reshape(scale_init, [1, 1, 1, num_filters]) * (x - tf.reshape(m_init, [1, 1, 1, num_filters]))
        else:
            # use weight normalization (Salimans & Kingma, 2016)
            W = tf.reshape(g, [1, 1, 1, num_filters]) * tf.nn.l2_normalize(V, [0, 1, 2])

            # calculate convolutional layer output
            x = tf.nn.bias_add(tf.nn.conv2d(x, W, [1] + stride + [1], pad), b)

        # apply nonlinearity
        if nonlinearity is not None:
            x = nonlinearity(x)

        return x
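
For reference, this is roughly the quick check I ran against the code above (shapes made up for the example): the init branch returns an output that is standardized explicitly with m_init and scale_init, which is where the numbers I report below come from.

import numpy as np
import tensorflow as tf

# quick sanity check of the conv2d above (made-up shapes): the init branch
# standardizes its output explicitly, so mean/var come out as ~0 and ~1 exactly
x_in = tf.constant(np.random.randn(100, 32, 32, 3).astype(np.float32))
y = conv2d(x_in, num_filters=96, init=True, name='conv_check')

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    out = sess.run(y)
    print(np.mean(out), np.var(out))  # ~0 and ~1 up to numerical precision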

I've investigated the most recent implementation of nn (the dense and conv2d layers). With that implementation, on a tiny example, the mean after initialization is about 0.001 and the variance is 0.95. With the code above I get about -10^-7 and 1.0005. Am I missing something here, or does the code in the nn library not do the same thing as the code above? Here is the demo code I used for the test:

import tensorflow as tf
import numpy as np

sess = tf.Session()

padding = 'SAME'
init = True
num_filters = 96
filter_size = [3, 3]
stride = [1, 1]
init_scale = 1.
pad = 'SAME'
x = tf.get_variable('x', shape=[100, 32, 32, 3], dtype=tf.float32,
                    initializer=tf.random_normal_initializer(0, 1.0), trainable=True)
V = tf.get_variable('V', shape=filter_size + [int(x.get_shape()[-1]), num_filters], dtype=tf.float32,
                    initializer=tf.random_normal_initializer(0, 0.05), trainable=True)
g = tf.get_variable('g', shape=[num_filters], dtype=tf.float32,
                    initializer=tf.constant_initializer(1.), trainable=True)
b = tf.get_variable('b', shape=[num_filters], dtype=tf.float32,
                    initializer=tf.constant_initializer(0.), trainable=True)

# use weight normalization (Salimans & Kingma, 2016)
W = tf.reshape(g, [1, 1, 1, num_filters]) * tf.nn.l2_normalize(V, [0, 1, 2])

# calculate convolutional layer output
x = tf.nn.bias_add(tf.nn.conv2d(x, W, [1] + stride + [1], pad), b)

if init:  # data-dependent init as I understand the current nn.py: assign g and b, keep x as computed above
    m_init, v_init = tf.nn.moments(x, [0, 1, 2])
    scale_init = init_scale / tf.sqrt(v_init + 1e-10)
    with tf.control_dependencies([g.assign(g * scale_init), b.assign_add(-m_init * scale_init)]):
        x = tf.identity(x)

init_op = tf.global_variables_initializer()  # renamed so it doesn't shadow the init flag above

sess.run(init_op)

# mean and var should be zero and one after initialization
a = sess.run(x)
print(np.mean(a))
print(np.var(a))
sess.close()
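
One more experiment I tried afterwards (my own toy code, not from nn.py; shapes made up): if the assign ops are run first and a second forward pass is then built from the updated g and b, that second pass comes out standardized almost exactly, so the 0.001 / 0.95 above seems to come from fetching the output that was built before the assigns took effect.

import numpy as np
import tensorflow as tf

# toy follow-up (my own code, made-up shapes): first fetch the pre-assignment
# output, then run the assigns and evaluate a fresh forward pass with the new g, b
tf.reset_default_graph()
x_in = tf.constant(np.random.randn(100, 32, 32, 3).astype(np.float32))
V = tf.get_variable('V', [3, 3, 3, 96], initializer=tf.random_normal_initializer(0, 0.05))
g = tf.get_variable('g', [96], initializer=tf.constant_initializer(1.))
b = tf.get_variable('b', [96], initializer=tf.constant_initializer(0.))

def forward():
    W = tf.reshape(g, [1, 1, 1, 96]) * tf.nn.l2_normalize(V, [0, 1, 2])
    return tf.nn.bias_add(tf.nn.conv2d(x_in, W, [1, 1, 1, 1], 'SAME'), b)

y0 = forward()                       # built while g == 1 and b == 0
m, v = tf.nn.moments(y0, [0, 1, 2])
scale = 1. / tf.sqrt(v + 1e-10)
assigns = [g.assign(g * scale), b.assign_add(-m * scale)]

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    a0 = sess.run(y0)
    print(np.mean(a0), np.var(a0))   # roughly 0 / 1, like the 0.001 / 0.95 above
    sess.run(assigns)                # g and b now hold the data-dependent init values
    a1 = sess.run(forward())         # second pass uses the updated g and b
    print(np.mean(a1), np.var(a1))   # very close to exactly 0 / 1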

Also, I don't understand why the code uses assign_add instead of assign for b. I think the forward pass is built before the assigns happen, so the moments in the init step are computed not from t = V*x/||V|| but from the output of the layer, which has already been scaled by g and shifted by the bias b.
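
That said, since g is initialized to 1 and b to 0, I think assign_add and assign should end up writing the same values at the init step. Here is the toy check of my understanding (my own code, not from the repo):

import tensorflow as tf

# toy check (my own): with the initial values g == 1 and b == 0,
# g.assign(g * s) equals g.assign(s) and b.assign_add(-m * s) equals b.assign(-m * s)
g = tf.Variable(tf.ones([4]))
b = tf.Variable(tf.zeros([4]))
m = tf.constant([0.5, -0.2, 1.0, 0.0])
s = tf.constant([2.0, 0.5, 1.5, 1.0])

updates = [g.assign(g * s), b.assign_add(-m * s)]   # what the current nn.py does
# equivalent at init time: [g.assign(s), b.assign(-m * s)]

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(updates))   # g -> s, b -> -m * s in both variants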