ixxi-dante / an2vec

Bringing node2vec and word2vec together for cool stuff
GNU General Public License v3.0

Test various fixes to blogcatalog training #48

Closed: wehlutyk closed this issue 6 years ago

wehlutyk commented 6 years ago

Run the training for only 2000 epochs, as most of the final quality seems to be reached by that point already.

wehlutyk commented 6 years ago

Currently running on grunch with the first two changes activated.

The third change should be implemented by creating a new Gaussian codec that doesn't take u as a parameter, then removing that layer from the model.
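
A minimal sketch of what such a codec could look like, assuming a Keras-style encoder that parameterises q(z|x) with a mean and log-variance only, and simply no longer produces a u output (layer sizes and names here are hypothetical, not the repository's actual API):

```python
import keras.backend as K
from keras.layers import Dense, Input, Lambda
from keras.models import Model

dim_in, dim_z = 128, 16  # hypothetical sizes

x = Input(shape=(dim_in,))
h = Dense(64, activation='relu')(x)
mu = Dense(dim_z)(h)       # mean of q(z|x)
logvar = Dense(dim_z)(h)   # log-variance of q(z|x); no u output any more

def sample(args):
    # Reparameterisation trick: z = mu + sigma * eps, eps ~ N(0, I)
    mu, logvar = args
    eps = K.random_normal(shape=K.shape(mu))
    return mu + K.exp(0.5 * logvar) * eps

z = Lambda(sample)([mu, logvar])
codec = Model(x, [mu, logvar, z])
```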

wehlutyk commented 6 years ago

The last run still showed the same poor performance. BUT!

It turns out there were two bugs: one in the ordering of the nodes in the adjacency matrix (thank you networkx for ordering nodes by insertion...), and one in the target feature values fed to the model. So both the adjacency and the features were being trained on buggy data. Fixing that, and removing u from the embedding parameters, seems to work beautifully with 200 nodes. A script is now running with those fixes on the full BlogCatalog dataset, and we'll see how things look tomorrow morning.
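
For the record, the ordering fix amounts to pinning one explicit node ordering and using it for both the adjacency matrix and the feature matrix. A small sketch (with a placeholder graph and label attribute, not the actual pipeline code):

```python
import networkx as nx
import numpy as np

g = nx.karate_club_graph()        # placeholder graph
nodes = sorted(g.nodes())         # fix one explicit node ordering

# Adjacency built in that ordering (networkx otherwise uses insertion order)
adj = nx.to_scipy_sparse_matrix(g, nodelist=nodes)

# Target features assembled in the *same* ordering, e.g. a one-hot label
labels = [g.nodes[n]['club'] for n in nodes]
classes = sorted(set(labels))
features = np.array([[label == c for c in classes] for label in labels],
                    dtype=float)
```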

wehlutyk commented 6 years ago

Commit 43f7d4efbf54dbfe2e21469c5124a9737fa25e9f adds the exploration notebooks with the fixes for this issue.

wehlutyk commented 6 years ago

Commit eb01a5a08cf5fa9826c096e64aa7f2dbfb305793 shows that training works reasonably well on crops of BlogCatalog, provided you train for long enough:

The main issue seems to be the 'star' nodes, which are connected to nearly everybody: the embedding can only represent them by dedicating a dimension of variance to them (look at the right half of their embedding, which holds the variances), so that they catch the other nodes in the scalar product. (This is my interpretation of what's going on.)
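
As a toy illustration of that interpretation (assuming an inner-product decoder with a sigmoid, which may not be exactly the model's decoder): a node with one dominant embedding component gets a high predicted link probability with nearly every other node.

```python
import numpy as np

rng = np.random.default_rng(0)
# 50 ordinary nodes with small positive weights in every dimension
z = np.abs(rng.normal(scale=0.3, size=(50, 8)))
z_star = np.zeros(8)
z_star[0] = 10.0                          # "star" node: one dominant dimension

probs = 1 / (1 + np.exp(-(z @ z_star)))   # sigmoid of scalar products
print(probs.mean())                       # high for most nodes: the star "catches" nearly everyone
```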

Another interesting point is how much the training improved on the 1000-node crop when the embedding dimension was increased: this is a good illustration of our idea of exploring the cost function of the compression.

The next steps now are:

wehlutyk commented 6 years ago

Results on the full dataset, with no adjacency scaling, are in 1d8467f989e4807e4b7239e9522893e450853f90: they are bad, which was expected for several reasons:

One last test is running with adjacency scaling and a 25D embedding, just to check for improvements, before closing this issue, resuming #29, and moving on to the next steps.
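
For context, a minimal sketch of one common form of adjacency scaling, the GCN-style symmetric normalisation (an assumption on my part; the run may use a different scaling):

```python
import numpy as np
import scipy.sparse as sp

def scale_adjacency(adj):
    """Return the symmetrically normalised adjacency D^-1/2 (A + I) D^-1/2."""
    adj = sp.csr_matrix(adj) + sp.eye(adj.shape[0])       # add self-loops
    degrees = np.asarray(adj.sum(axis=1)).flatten()       # row degrees (>= 1)
    d_inv_sqrt = sp.diags(1.0 / np.sqrt(degrees))
    return d_inv_sqrt @ adj @ d_inv_sqrt
```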

wehlutyk commented 6 years ago

Results for the latest run with adjacency scaling and a 25D embedding are in ac9894cca19250c382eba78e1a6501c8c465c59b. They are not good, but not in a surprising way: they indicate that for a network of this size we need to train for much longer. Closing this and moving on to the next steps.