microsoft / CNTK

Microsoft Cognitive Toolkit (CNTK), an open source deep-learning toolkit
https://docs.microsoft.com/cognitive-toolkit/

Training RBM #646

Closed zpbappi closed 8 years ago

zpbappi commented 8 years ago

I am trying to write an RBM with CD-1 learning. It does have an error function, defined as Sum((features - v1_reconstruction).^2). However, my goal is _NOT_ to adapt the weights so that this error function is minimized. Rather, I would like to follow a specific update rule for the weights and biases. FYI, for CD-1 there are 3 parameters I need to learn: the weight matrix between the visible and hidden units, and the biases of the two layers.
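
To be concrete, these are the standard CD-1 update rules I am trying to implement (my transcription of Hinton's formulation, so please double-check):

    \Delta W_{vh} \propto \langle v h^\top \rangle_{data} - \langle v h^\top \rangle_{recon}
    \Delta B_v    \propto \langle v \rangle_{data}        - \langle v \rangle_{recon}
    \Delta B_h    \propto \langle h \rangle_{data}        - \langle h \rangle_{recon}

where v and h are the visible and hidden activities, and the angle brackets denote averages over the minibatch (data = positive phase, recon = negative phase).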

I was hoping to use DummyCriterion. Looking at the source code, it expects an _objective function_, a _gradient_, and a _target prediction_ node as its inputs. I thought this node would update "something" with the supplied gradient (applying, of course, the proper momentum and learning rate from the SGD parameters). I am kind of stuck here: I am not entirely sure how to tell it to update the weight parameter with the gradient I supply. There may be more than one such parameter in a network, and they may all require different gradient calculations.

Here is my network expression in (pseudo) BS:

features = ...
Wvh = ... # ???
Bv = ... # ???
Bh = ... # ???

# positive phase
uniform_dist = Parameter (some_dim1, some_dim2, init='uniform', learningRateMultiplier=0)
pos_hid_prob = Constant(1) / (Constant(1) + Exp(-features * Wvh - Bh))   # sigmoid
Hstate = pos_hid_prob > uniform_dist   # sample binary hidden states
pos_prods = features' * pos_hid_prob

# negative phase
features_reconstruction = Constant(1) / (Constant(1) + Exp(-Hstate * Wvh' - Bv))
neg_hid_prob = Constant(1) / (Constant(1) + Exp(-features_reconstruction * Wvh - Bh))
neg_prods = features_reconstruction' * neg_hid_prob

# error
err = SquareError(features, features_reconstruction)

# weight and bias updates
# ???

Now, the weight update should be a function of features, features_reconstruction, pos_hid_prob and neg_hid_prob. I think the gradient should be the minibatch average of features' * pos_hid_prob - features_reconstruction' * neg_hid_prob. Similar equations can be written for the visible bias (Bv) and the hidden bias (Bh). As you can see, if I want to declare the gradients when declaring Wvh, Bv and Bh, I end up with a circular dependency. Or am I missing something and doing it the wrong way?

I do not quite understand how to proceed further to implement an RBM. I would really appreciate any help/direction.

frankseide commented 8 years ago

OK, it seems that we cannot do this presently. The reason is that RBM pre-training is no longer considered necessary, so this was never added. Let me analyze what one would have to add to enable CD-1.

One missing piece is the sampling: uniform_dist is only randomly initialized at startup and never gets updated afterwards, so there is no fresh randomness per minibatch. This will require C++ code to be written. I wonder whether we could wing it with Dropout(), but probably not.

The other missing piece is the reduction over samples. To write the per-sample gradient for Wvh, you would say (translating your formula):

grad = TransposeTimes (features, pos_hid_prob) - TransposeTimes (features_reconstruction, neg_hid_prob)

which would then have to be reduced (summed) over all samples in the minibatch. Unfortunately, written this way, the above products would first be computed for every sample, which is very big in memory and inefficient. Internally, we already have code that can multiply and reduce directly, but there is presently no way to express that at the BS level.
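
To make the reduction concrete, here is a NumPy sketch of my own (not CNTK code) showing that the per-sample outer products, summed over the minibatch, collapse into one matrix product per phase; that is exactly the fused multiply-and-reduce we have internally:

    # NumPy sketch (illustration only): the minibatch reduction of the
    # per-sample CD-1 gradient is just one matrix product per phase.
    import numpy as np

    np.random.seed(0)
    n, v_dim, h_dim = 32, 784, 500               # minibatch size, layer sizes
    features       = np.random.rand(n, v_dim)
    pos_hid_prob   = np.random.rand(n, h_dim)
    features_recon = np.random.rand(n, v_dim)
    neg_hid_prob   = np.random.rand(n, h_dim)

    # naive: one big outer product per sample, then a sum over the minibatch
    grad_naive = sum(np.outer(features[i], pos_hid_prob[i])
                     - np.outer(features_recon[i], neg_hid_prob[i])
                     for i in range(n))

    # fused: the same reduction as two matrix products, no per-sample temporaries
    grad_fused = features.T @ pos_hid_prob - features_recon.T @ neg_hid_prob

    assert np.allclose(grad_naive, grad_fused)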

As for the criterion, I copy the gradient code and source-code comment of the DummyCriterionNode below; the comment describes the 3 inputs you would need to supply.

So my apologies that I cannot offer a better answer at this point in time. I will raise this internally and see if and how we can address the above two gaps in a future version.


From the source code of DummyCriterionNode:

        // predictionsGradient += userSuppliedGradient * scalarGradientFromTop
        auto gradient = Input(2)->GradientFor(fr);
        Matrix<ElemType>::Multiply1x1AndWeightedAdd(+1.0f, /*gradient from top:*/Gradient() /*1x1*/,
                          /*user-supplied gradient:*/Input(1)->ValueFor(fr), 1.0f,
                          /*add to:*/gradient);

// DummyCriterionNode (objectiveValues, userSuppliedGradient, prediction)
//
// Apply user-supplied gradient, computed as Forward(), as the gradient into 'prediction'.
//
// predictionsGradient += userSuppliedGradient * scalarGradientFromTop
//
// This training criterion node allows to compute objectives and gradient
// with custom CNTK expressions (as Forward() computations). It has 3 inputs:
// 1. custom objective values to be summed up and passed up
// 2. custom gradient values to be passed down as the gradient into 'prediction'
// 3. prediction: the node to pass the custom gradient into
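
In NumPy-like terms, the Backprop above amounts to the following (my paraphrase, not actual CNTK code):

    def dummy_criterion_backprop(scalar_grad_from_top, user_gradient, prediction_gradient):
        # predictionsGradient += userSuppliedGradient * scalarGradientFromTop
        prediction_gradient += scalar_grad_from_top * user_gradient
        return prediction_gradient
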
zpbappi commented 8 years ago

@frankseide you don't need to apologize. You guys are building something wonderful that allows me to run training within minutes on my GPU. I sincerely thank you for that.

I understand that it is currently not possible to train an RBM using CNTK. I will wait for future releases, if you plan to include it. Meanwhile, I can pre-train the RBM using Matlab/Octave, dump the weights to a file, and then load them in CNTK as initial weight values to train the actual model. I haven't tried it yet, but it should work.
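
For example, a NumPy sketch of that workaround might look like this (untested; hyper-parameters and file names are placeholders I made up, not anything CNTK prescribes):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def cd1_epoch(data, W, Bv, Bh, lr=0.1):
        """One CD-1 sweep over data (rows = samples); updates W, Bv, Bh in place."""
        for v in data:
            pos_hid_prob = sigmoid(v @ W + Bh)                       # positive phase
            h_state = (np.random.rand(*pos_hid_prob.shape) < pos_hid_prob) * 1.0
            v_recon = sigmoid(h_state @ W.T + Bv)                    # negative phase
            neg_hid_prob = sigmoid(v_recon @ W + Bh)
            W  += lr * (np.outer(v, pos_hid_prob) - np.outer(v_recon, neg_hid_prob))
            Bv += lr * (v - v_recon)
            Bh += lr * (pos_hid_prob - neg_hid_prob)

    np.random.seed(0)
    data = (np.random.rand(1000, 64) > 0.5) * 1.0        # toy binary training data
    W  = 0.01 * np.random.randn(64, 32)                  # visible-to-hidden weights
    Bv = np.zeros(64)                                    # visible bias
    Bh = np.zeros(32)                                    # hidden bias
    for epoch in range(5):
        cd1_epoch(data, W, Bv, Bh)

    np.savetxt("Wvh.txt", W)    # placeholder file names, to be loaded into CNTK
    np.savetxt("Bv.txt", Bv)
    np.savetxt("Bh.txt", Bh)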

However, can you please give me some reference/reasoning for why "RBM pre-training is no longer necessary"? I thought it helps the network train faster (which may not be an issue for CNTK), and that it reduces the vanishing/exploding gradient problem that can arise from random initialization of the weights, especially over a large number of epochs. I am sure G. Hinton had other reasons for introducing RBM pre-training as well. I looked at the CNTK TIMIT speech example: the Autoencoder example uses 3 layers of an SBFF net. What if someone uses a really deep autoencoder with many epochs and a lot of data? Would that cause vanishing/exploding gradients? I would like to know the "other" side of the argument. :)

frankseide commented 8 years ago

I think another thing you could try (I have never tried it myself) is to train the RBM with BP as an auto-encoder, instead of using CD-1. I am not sure this will work, and presumably it will converge more slowly, but it may still solve your problem.
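
Sketched in NumPy (my illustration, untested on real data): the same Wvh, Bv, Bh trained as a tied-weight auto-encoder with plain backprop on the squared reconstruction error:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def autoencoder_step(v, W, Bv, Bh, lr=0.1):
        """One BP step on ||v - reconstruction||^2 for a tied-weight auto-encoder."""
        h = sigmoid(v @ W + Bh)               # encode, same form as pos_hid_prob
        r = sigmoid(h @ W.T + Bv)             # decode through the transposed weights
        d_r = 2.0 * (r - v) * r * (1.0 - r)   # gradient at the decoder pre-activation
        d_h = (d_r @ W) * h * (1.0 - h)       # backprop through the tied weights
        W  -= lr * (np.outer(v, d_h) + np.outer(d_r, h))   # both uses of W contribute
        Bv -= lr * d_r
        Bh -= lr * d_h
        return np.sum((v - r) ** 2)           # reconstruction error, for monitoring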

In our experience, RBM pre-training is helpful for dense DNNs with a moderate number of layers (e.g. 8), but it is possible to train models of the same quality without it, especially if you have large data sets (systems trained on small data sets often don't work well for real-life use anyway). It does require more fiddling with hyper-parameters, though. An alternative is layer-wise discriminative pre-training, which works as well, although we have sometimes found it to over-train. Our original paper on training LVCSR speech systems with DNNs has a comparison of the methods.

Since then, the DNN world has moved on. Most recent models are not of the same kind as in the paper, and they are no longer trained with RBMs or discriminative pre-training. For one, neither would help with very deep models like the recent 152-layer ResNet model (RBM pre-training just gives you a handful of extra layers that you can train more easily). Secondly, recurrent networks have no RBM equivalent. Third, recent techniques like ReLUs, ResNet, LSTMs, and batch normalization all address the vanishing/exploding-gradient problem differently and no longer require pre-training.

Hinton himself describes RBMs as a catalyst that, for the first time, made it possible to train large deep models and thereby kick-started DNNs, but says they are no longer needed.

If that is OK with you, I will close this Issue now, but please do reopen it if you have more questions.