Moving on with this, as it will make clearer what needs to be done for sparse feature normalisation and centring (#24).
Notes:
Minibatch learning will work if a minibatch gives a good sample of the whole dataset, in which case each minibatch will act as a full epoch pass, and the training will be sped up by the number of minibatches per epoch. So a good minibatch is one that includes:
In general, when training a model using SGD (i.e. minibatches of size 1), it's good to try and present inputs that will have the biggest output error, since those are the inputs from which the network learns most.
In our case we can't just look at individual nodes since we want to infer connections, but looking at couples is enough since the model has no actual view or notion of communities (communities come from our visual clustering of node embeddings). So node couples can be seen as the items of the dataset (all the possible couples in our graph), and a minibatch in the current implementation is then a set of couples sampled from that dataset. We must then sample couples that are as different as possible from one another on the three dimensions listed above, without changing the frequency of the couples in the dataset: it's a matter of creating a sequence of couples where each one is as different as possible from the previous one.
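As a rough illustration (hypothetical helper, not the current implementation), one cheap approximation is to reshuffle the full list of couples each epoch and then greedily reorder it so consecutive couples share no node; this keeps couple frequencies intact while decorrelating neighbours in the sequence. Here "different" is crudely approximated as "not sharing a node"; the real criterion would use the dimensions listed above.

```python
import random

def epoch_couple_order(couples, seed=None, lookahead=10):
    """Reorder `couples` (a list of (node_a, node_b) tuples) so that
    consecutive couples tend not to share a node. Pure reordering:
    couple frequencies in the dataset are unchanged. Hypothetical sketch."""
    rng = random.Random(seed)
    pool = list(couples)
    rng.shuffle(pool)
    if not pool:
        return []

    ordered = [pool.pop()]
    while pool:
        prev = set(ordered[-1])
        for i, cand in enumerate(pool[:lookahead]):
            if prev.isdisjoint(cand):          # no shared node with previous couple
                ordered.append(pool.pop(i))
                break
        else:
            ordered.append(pool.pop(0))        # nothing disjoint nearby, take the next one
    return ordered

# Minibatches are then consecutive slices of the reordered sequence:
# batches = [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]
```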
Input features should be centred and normalised on each dimension across the dataset (and not inside a given node). Input dimensions should also be independent if possible (which we can improve on with a PCA). So it's mean cancellation → PCA → scaling.
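A minimal sketch of that pipeline with NumPy/scikit-learn, assuming the features fit in a dense array (variable names are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

def preprocess_features(X):
    """Mean cancellation -> PCA (decorrelation) -> scaling to unit variance,
    all computed per dimension across the dataset (rows = nodes)."""
    X = np.asarray(X, dtype=np.float64)
    X_centred = X - X.mean(axis=0)             # mean cancellation
    pca = PCA(whiten=False)
    X_rotated = pca.fit_transform(X_centred)   # decorrelate dimensions
    std = X_rotated.std(axis=0)
    std[std == 0] = 1.0                        # guard against constant dimensions
    return X_rotated / std                     # scale each dimension
```

`PCA(whiten=True)` would fold the scaling step into the PCA; the sparse case of #24 would need a sparse-friendly equivalent (e.g. a truncated SVD without explicit centring).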
It's not clear what activation to use. LeCun'98 recommends symmetric sigmoid functions (such as tanh, scaled so that it maps ±1 to ±1) to maintain mean zero in the output, possibly with an added linear term to avoid vanishing gradients. But the GCN and VAE articles use ReLU, possibly because they define parameters of a Gaussian (check this). Is it possible to reconcile these?
It turns out the VAE article uses tanh, which makes more sense. Only Kipf & Welling (2016) use ReLU in the GCN, for no particular reason it seems (other than ReLUs being good for image recognition maybe, and GCN being an adaptation of CNN). So we'll switch to tanh, in fact using 1.7159 tanh(2x / 3) as recommended by LeCun'98, and if we run into vanishing gradients we can add a small linear term + ax to skew the flat areas.

This also means changing the target values (target_func in the notebooks) to be (2 ± √3) / (3 ± √3) (≈ 0.21132 and 0.78868) instead of 0 and 1.

For the activation function, another solution is to use the selu activation, which should take care of normalisation on its own.
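A sketch of the scaled-tanh option, assuming a Keras-style backend (the slope `a` of the optional linear term is an arbitrary small value, not a recommendation from the paper):

```python
import numpy as np
from tensorflow.keras import backend as K

def scaled_tanh(x, a=0.0):
    """LeCun'98 activation: 1.7159 * tanh(2x/3), which maps ±1 to ±1.
    Set a > 0 (e.g. 0.01) to add a small linear term against the flat areas."""
    return 1.7159 * K.tanh(2.0 * x / 3.0) + a * x

# Usage: Dense(64, activation=scaled_tanh)

# Target values noted above, instead of 0 and 1:
targets = ((2 - np.sqrt(3)) / (3 - np.sqrt(3)),   # ≈ 0.21132
           (2 + np.sqrt(3)) / (3 + np.sqrt(3)))   # ≈ 0.78868
```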
Done reading "Efficient backprop". So the changes above need to be implemented now (it's all pretty simple). Next:
Closing this as it's read, and referenced in https://github.com/ixxi-dante/nw2vec/projects/1#card-13000883.
LeCun et al., 1998, "Efficient backprop" gives many tips and tricks for good neural networks.
For instance, some people on SO say that ReLU is not a good activation for auto-encoders as it loses more information than, say, tanh. Check if LeCun has opinions about this and more.