Moving on with this, as it will make clearer what needs to be done for sparse feature normalisation and centring (#24).
Notes:
Minibatch learning will work if a minibatch gives a good sample of the whole dataset, in which case each minibatch will act as a full epoch pass, and the training will be sped up by the number of minibatches per epoch. So a good minibatch is one that includes:
In general, when training a model using SGD (i.e. minibatches of size 1), it's good to try and present inputs that will have the biggest output error, since those are the inputs from which the network learns most.
In our case we can't just look at individual nodes since we want to infer connections, but looking at couples is enough since the model has no actual view or notion of communities (communities come from our visual clustering of node embeddings). So node couples can be seen as the items of the dataset (all the possible couples in our graph), and a minibatch in the current implementation is then a set of couples sampled from that dataset. We must then sample couples that are as different as possible from one another on the three dimensions listed above, without changing the frequency of the couples in the dataset: it's a matter of creating a sequence of couples where each one is as different as possible from the previous one.
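As a rough illustration (hypothetical helper, not the current implementation), one cheap approximation is to reshuffle the full list of couples each epoch and then greedily reorder it so consecutive couples share no node; this keeps couple frequencies intact while decorrelating neighbours in the sequence. Here "different" is crudely approximated as "not sharing a node"; the real criterion would use the dimensions listed above.

```python
import random

def epoch_couple_order(couples, seed=None, lookahead=10):
    """Reorder `couples` (a list of (node_a, node_b) tuples) so that
    consecutive couples tend not to share a node. Pure reordering:
    couple frequencies in the dataset are unchanged. Hypothetical sketch."""
    rng = random.Random(seed)
    pool = list(couples)
    rng.shuffle(pool)
    if not pool:
        return []

    ordered = [pool.pop()]
    while pool:
        prev = set(ordered[-1])
        for i, cand in enumerate(pool[:lookahead]):
            if prev.isdisjoint(cand):          # no shared node with previous couple
                ordered.append(pool.pop(i))
                break
        else:
            ordered.append(pool.pop(0))        # nothing disjoint nearby, take the next one
    return ordered

# Minibatches are then consecutive slices of the reordered sequence:
# batches = [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]
```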
Input features should be centred and normalised on each dimension across the dataset (and not inside a given node). Input dimensions should also be independent if possible (which we can improve on with a PCA). So it's mean cancellation → PCA → scaling.
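A minimal sketch of that pipeline with NumPy/scikit-learn, assuming the features fit in a dense array (variable names are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

def preprocess_features(X):
    """Mean cancellation -> PCA (decorrelation) -> scaling to unit variance,
    all computed per dimension across the dataset (rows = nodes)."""
    X = np.asarray(X, dtype=np.float64)
    X_centred = X - X.mean(axis=0)             # mean cancellation
    pca = PCA(whiten=False)
    X_rotated = pca.fit_transform(X_centred)   # decorrelate dimensions
    std = X_rotated.std(axis=0)
    std[std == 0] = 1.0                        # guard against constant dimensions
    return X_rotated / std                     # scale each dimension
```

`PCA(whiten=True)` would fold the scaling step into the PCA; the sparse case of #24 would need a sparse-friendly equivalent (e.g. a truncated SVD without explicit centring).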
It's not clear what activation to use. LeCun'98 recommends symmetric sigmoid functions (such as tanh, scaled so that it maps ±1 to ±1) to maintain mean zero in the output, possibly with an added linear term to avoid vanishing gradients. But the GCN and VAE articles use ReLU, possibly because they define parameters of a Gaussian (check this). Is it possible to reconcile these?
It turns out the VAE article uses tanh, which makes more sense. Only Kipf & Welling (2016) use ReLU in the GCN, for no particular reason it seems (other than ReLUs being good for image recognition maybe, and GCN being an adaptation of CNN). So we'll switch to tanh, in fact using 1.7159 tanh(2x / 3) as recommended by LeCun'98, and if we run into vanishing gradients we can add a small linear term + ax to skew the flat areas.

This also means changing the target values (target_func in the notebooks) to be (2 ± √3) / (3 ± √3) (≈ 0.21132 and 0.78868) instead of 0 and 1.

For the activation function, another solution is to use the selu activation, which should take care of normalisation on its own.
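A sketch of the scaled-tanh option, assuming a Keras-style backend (the slope `a` of the optional linear term is an arbitrary small value, not a recommendation from the paper):

```python
import numpy as np
from tensorflow.keras import backend as K

def scaled_tanh(x, a=0.0):
    """LeCun'98 activation: 1.7159 * tanh(2x/3), which maps ±1 to ±1.
    Set a > 0 (e.g. 0.01) to add a small linear term against the flat areas."""
    return 1.7159 * K.tanh(2.0 * x / 3.0) + a * x

# Usage: Dense(64, activation=scaled_tanh)

# Target values noted above, instead of 0 and 1:
targets = ((2 - np.sqrt(3)) / (3 - np.sqrt(3)),   # ≈ 0.21132
           (2 + np.sqrt(3)) / (3 + np.sqrt(3)))   # ≈ 0.78868
```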
Done reading "Efficient backprop". So the changes above need to be implemented now (it's all pretty simple). Next:
Closing this as it's read, and referenced in https://github.com/ixxi-dante/nw2vec/projects/1#card-13000883.
LeCun et al., 1998, "Efficient backprop" gives many tips and tricks for good neural networks.
For instance, some people on SO say that ReLU is not a good activation for auto-encoders as it loses more information than, say, tanh. Check if LeCun has opinions about this and more.