Experiment - Githubissues

I have implemented a simple StarMixture pylearn2 layer. I tried it on the CIFAR-100 dataset. In that experiment, it is the only layer used. The experiment is still under way. The best validation accuracy I get is 18.62%. But it is hard to see why some hyperparameter configurations do better than others.

The kmeans gater is trained with on-line mini-batches, alongside the main backpropagation through the experts. I need a more suitable training algorithm than SGD such that I can disentangle gater from mixture hyper-parameters. It is very likely that an kmeans gated mixture of experts on the first layer of an image processor will not yield significant performance gains, if any, given the cost.

The images are high-dimensional (1024) and have many invariances to them, which makes them difficult to cluster.

We can experiment two approaches. The first is the traditional mixture of experts model which involves training different expert models and a gater model. The second involves the training the of parallel expert layers and a gater for each such layer of parallel layers. There is also the hybrid approach.

We can think of maxout as an instance of the second approach, but with a max gater. Which means that we could use the convmaxout layer to implement a conv mixture layer. The mixture of expert would be one of kernels, but where the gater is different at each input pixel. The gater could be local in the same way, or global to the whole image. The global gater would suffer from the same issues as mentioned above. The local gater on the other hand could perform gating at the level of kernels. Each kernel would be a mixture of experts. Maxout is such a model, but with a max winner-take-all gater. Because it is max, it requires no training. It is a heuristic.

The maxout paper thinks of these localized mixture of experts as special non-linear units (neurons). The maxout gater makes this non-linear. Our gater output will also provide the non-linearity. We might use the expert argmax of the gater as output, or a weighted average of expert outputs, where the weights of this average are the result of a softmax. Not sure if this is really necessary for non-linearity though.

To implement a conv kmeans mixture, we need two convnets. One like Ian's convmaxout layer. The other which has as outputs as experts. The gater convet should probably be smarter than the expert convnet. But maybe not. I fear using a convnet layer with 5 kernels, one per expert, will not yield a good gater.

Instead I should focus on the non-conv layers. The network should have 2 convolutional layers and one or 2 starmixture layers. I will need to change the change the base SGD algorithm to allow the use of momentum on the non-kmeans layers, i.e. all but the gater. Furthermore, I believe it would be interesting to see the effect of kmeans as hints as well as gaters. Possibly a specialized non-mixture expertiment. Testing star mixture on the output layers of a convnet should make it easier to cluster since convnets capture invariances, and output abstract features. This is not true for the first convolutional layers. Once we get that part working, we could try kernel gaters, but this is not our primary concern for now. Remember, time is limited.

I implemented a network with 2-3 convolutional layers and a final mixture layer. I was able to use the known_grads option of the theano.tensor.grad function within the MixtureCost to better control the combination of gradients from different directions. Two hyperparameters where added controlling the learning autonomy of experts and gaters with respect to each other in the gradient descent graph.

As I feared, the kmeans gater was not able to maintain useful clusters during training in the sense that a single cluster ended up owning all inputs. I tried many approaches to fix this including making the conv layers completely autonomous, i.e. they do not receive a backpropagated error gradient from the gater, only from the experts. I also tried tanh convolutional layers instead of rectified linear layers, to no avail. The solution was to add a pulling force on the centroids that brings them closer to the mean vector of each minibatch. Other solutions such as fuzzy k-means would probably have worked. The problem seemed to occur in the first epochs of learning where the gater input representations vary much, probably bringing outlier centroids furtheraway before the network stabilizes around smaller values owned by a single centroid. But this is not clear. I also thought momentum was to blame, but since I still use it, it seems that it was not the actual source of the issue.

We are running a batch of tests with various hyperparameters. We are very far from the state of the art on CIFAR-100.

A mixture of experts requires a gater and many expert models. Both gater and experts must work together to determine each expert's area of expertise. The gater must be powerful enough to distribute examples to experts. Gating is a kind of classifications, more so like clustering. In the classic mixture of experts model, the gater is trained to use experts that are likely to provide good results given the input. In the process, many experts are lost, which is to say rarely used, while the gater focuses its trust on the remaining subset of experts. This I have also noticed with our kmeans gater.

I am beginning to doubt some of my decisions. First there is the use of two-layer experts with a one-layer gater. Second I doubt my use of a kmeans gater. A gater trained using backpropagation seems to me more likely to yield better results than kmeans, especially if that gater has many layers. Kmeans is too weak.

A gater where each expert represents a class. The gater is then an imperfect classifier which each expert refines. (For a problem involving a great many classes, a form of semi-hard hierarchical gating would be interesting. But I still do not see how such a thing could be designed.) If the gater is first pre-trained as a normal classifier, and then an expert is pre-trained for each class on the examples predicted to be of that class by the classifier, we will end up with experts at refining the predictions of the gater. This would be very similar to boosting, except that boosting focuses on the examples that are wrong. However, in this case, the experts would be focusing on examples that are mostly of one class.

It seems that I am moving away from my original gating project. I will attempt to list my ideas so that they may be more easily discriminated:

Kmeans can be used as an attractor on hidden layers to encourage hidden representations to cluster.
- Kmeans can be used as a detractor on hidden layers to discourage hidden representations from clustering.
- Combination of the above
Using hard gaters to yield non-linear transformations without the need for non-linear activation functions for the experts
- Combining these into acyclic graphs, mainly trees.
- The input layer is a trunk, the hidden layers branch out until the leafs are recombined into output.
- Or each class to predict is one or many leafs.
- Vertices hold linear transformation matrices, nodes hold representations.
- Using soft gaters with hard gaters.
Using heuristics to keep the distribution of examples to experts balanced.
- Experts that are weak at all tasks could be reinitialized (from a more powerful expert)
- The gater generates a stochastic mask like dropout thus allowing the learning of different combinations of gates.
Using DAE or RBMs as layer-wise gaters
- The DAE learns the principal components.
- Each expert takes the input and maps it to a gate representing a principal component.
- Examples are propagated through the components that best represent it.
- Each gate contains only a small amount of units (~10).
- The concatenation of gate representations of a layer form a distributed representation for that layer.

Sadly, many of the above yield slow implementations when using Theano.

We are running out of time, here are our options:

Last layer is fuzzy kmeans gater with linear experts. This should fix the problem of convergence into a single cluster.
Gater is trained as a DAE. Experts are linear. We use stochastic DAE codes as masks, and average the remaining expert activations.
Continue with kmeans by adjusting hyperparameters.

Option 2 seems the most interesting to me. Whatever the case, we need to separate gater from expert training. For example, one iteration of expert training may require 2 iterations of gater training to compensate.

Algo:

From the input set, train a DAE with as many outputs as convolutional gates.
Train a CNN with DAE-sampled stochastic gate masks on output of first layer prior to spatial pooling.
Train DAE on input to second layer.

Discussed my situation with Roland Memisevic. He said that I should concentrate on using a typical backpropagation-trained gater. He said that gating the convolutional layers would be novel enough to merit interest. I told him that would be a piece of cake. Anyway, so this is the plan.

I am going to create a special MLP containing another MLP, i.e. the gater. A special layer will be constructed that contains a list of experts where each expert is just a normal MLP layer, in our case convrectifiedlinear and rectifiedlinear. Ideally I would like the mixing to occur before the convolutional pooling, but I wonder if it will really make a difference in our case. So lets just go with the easier solution. The gater will be as complex as an expert in each layer. This will make it powerful enough to effectively gate the propagation to the correct experts. It will receive error gradients from with the outputs from each layer. This should maximize learning.

Each mixture layer takes a composite input consisting of the the output of the gate network and the output of the previous mixture layer. The output is a not composite.

The baseline will be the state of the arts.

The bonus is stochastic masking.

Using recurrence for mixing expert outputs together.

nicholas-leonard / ift6085

Experiment #1