lmjohns3 / theanets

Neural network toolkit for Python
http://theanets.rtfd.org
MIT License

supervised layerwise pre-training details #28

Closed: AminSuzani closed this issue 9 years ago

AminSuzani commented 10 years ago

Hi,

First of all, thanks for this super useful library. I need to learn about (and also cite) the "layerwise pre-training" (when we set optimize='layerwise'). I have two basic questions:

1- Assume we have a network with 4 hidden layers: [100, 200, 300, 200, 50, 20]. At the first step, the [100, 200, 20] network is trained. At the second step, does it train a [100, 200, 300, 20] network (simply adding the next layer and training), or a [200, 300, 20] network (using the pre-trained first hidden layer as the input layer)?

2- Do you know of any paper or tutorial that explains this approach? I searched and also looked through Bengio's work, and found several papers on unsupervised layerwise pre-training. However, here it seems that we are doing supervised pre-training (because the output layer is always used in our pre-training steps). Am I wrong about this?

Thanks, Amin

lmjohns3 commented 9 years ago

Hi Amin -

Sorry for the delay in getting back to you! I just finished my thesis proposal, so I had some time last week to hunt for this citation. What I found was a NIPS paper that did investigate the performance of "supervised pre-training," but it found that such pre-training actually does worse than unsupervised pre-training! As a result I reworked the Layerwise pre-trainer a little bit last week to behave a bit more (but not totally) like an unsupervised pre-trainer. There is still some room for improvement here when training classifier-like models, but I don't have time to get to it at the moment.

The Layerwise trainer works by "injecting" an output layer into the network at successive hidden layers and then training the injected layer together with some subset of the layers below it. Originally, all layers below the injected layer were trained at each stage: for your example, training would modify the weights over the [100, 200, 20] subnetwork, then over the [100, 200, 300, 20] subnetwork, and so on.

The change I made last week restricts training to the topmost two layers (the injected output layer and the topmost hidden layer). The training sequence still passes over [100, 200, 20], then [100, 200, 300, 20], and so on, but the layers actually being updated are [100, 200, 20] at the first stage and then only [200, 300, 20], and so on. This avoids the problem of needing to re-encode the dataset at each stage, which can be problematic for datasets that are generated dynamically. Instead, the "fixed" portion of the network below the two topmost layers is used to deterministically compute a feedforward representation of the dataset.
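To make that concrete, here is a rough numpy sketch of the flow just described. It is illustrative only, not the actual theanets code; the helpers (init_layer, feedforward, train_layers) are hypothetical stand-ins.

```python
import numpy as np

# Illustrative sketch of the updated layerwise pre-trainer for the
# [100, 200, 300, 200, 50, 20] example -- not the actual theanets code.
sizes = [100, 200, 300, 200, 50, 20]
n_out = sizes[-1]

def init_layer(n_in, n_units):
    """Random weights and zero biases for one dense layer."""
    return np.random.randn(n_in, n_units) * 0.01, np.zeros(n_units)

def feedforward(x, layers):
    """Deterministically encode x through the already-trained (fixed) layers."""
    for w, b in layers:
        x = np.tanh(x.dot(w) + b)
    return x

def train_layers(trainable, x, y):
    """Stand-in for an SGD run that updates only the given layers."""
    return trainable  # a real trainer would adjust these weights here

x_train = np.random.randn(1000, sizes[0])  # stand-in dataset
y_train = np.random.randn(1000, n_out)

fixed = []  # hidden layers that have already been pre-trained and frozen
hidden = [init_layer(a, b) for a, b in zip(sizes[:-1], sizes[1:-1])]

for i, layer in enumerate(hidden):
    injected = init_layer(sizes[i + 1], n_out)  # temporary output layer
    # Encode the data once through the fixed lower portion, then update only
    # the topmost hidden layer plus the injected output layer.
    encoded = feedforward(x_train, fixed)
    layer, injected = train_layers([layer, injected], encoded, y_train)
    fixed.append(layer)  # freeze this layer and move up to the next one
```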

Anyway, the change is not that drastic, but it does seem to work a bit better than before, and if nothing else the training time is a bit reduced because fewer gradients need to be computed. Hope that answers your questions.

andi1400 commented 8 years ago

Hi, first of all thanks for this great library, which I am using a lot for my master's thesis project!

I'd like to cite your work: the library in general, and especially the supervised pre-trainer described above. Do you have a preferred way of citing this, such as a paper?

Thanks, Andi

lmjohns3 commented 8 years ago

I don't have a paper written about theanets; I ought to fix that now that my dissertation is finished. For the time being just include a link to the source: https://github.com/lmjohns3/theanets. (I'm happy you're finding the library useful!)

For the supervised pre-training citation, see Bengio, Lamblin, Popovici & Larochelle (NIPS 2006), "Greedy Layer-Wise Training of Deep Networks" (http://papers.nips.cc/paper/3048-greedy-layer-wise-training-of-deep-networks.pdf). The method used in the layerwise pre-trainer is the "TrainGreedySupervisedDeepNet" algorithm, described in the appendix.
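In case it helps for the write-up, invoking the pre-trainer looks roughly like the sketch below. This assumes the Experiment-style API that theanets had around the time of this thread, with the optimize='layerwise' setting mentioned above; names and arguments may differ in other releases.

```python
import numpy as np
import theanets

# Usage sketch only: assumes the Experiment-style API from this era of theanets.
exp = theanets.Experiment(
    theanets.Regressor,
    layers=(100, 200, 300, 200, 50, 20),
)

# Stand-in dataset: (inputs, targets) as float32 arrays.
train = (np.random.randn(1000, 100).astype('float32'),
         np.random.randn(1000, 20).astype('float32'))

exp.train(train, optimize='layerwise')  # greedy supervised pre-training
exp.train(train, optimize='sgd')        # then fine-tune the whole network
```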