Open gclen opened 7 years ago

I trained a doc2vec model on the large movie review dataset and then tried to use UMAP to reduce the dimensions of the resulting document vectors. I had hoped that it would be possible to separate the documents by sentiment (positive and negative), but unfortunately the embedding is one big blob. A notebook can be seen here and the rest of the files for training the doc2vec model are in that repository as well.
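Roughly, the pipeline looks like the sketch below; the toy corpus and the hyperparameters are placeholders, not the ones from the notebook:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import umap

# Stand-in corpus; in the real notebook this is the tokenised movie reviews
reviews = [f"sample review number {i}".split() for i in range(200)]
documents = [TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(reviews)]
model = Doc2Vec(documents, vector_size=100, window=5, min_count=1, epochs=20)

# One vector per document (use model.docvecs instead of model.dv on gensim < 4.0)
X = model.dv.vectors

embedding = umap.UMAP().fit_transform(X)  # shape (n_documents, 2)
```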
That definitely looks underwhelming. How do t-SNE or PCA compare? There may be less structure in the data than one might like. It looks more likely, however, that those two outliers are somehow messing everything up. I'll see if I can get some time to look into exactly what is going on internally. I am fairly busy at the moment with other projects, so I can't promise anything immediate. Sorry.
If you have some time, the relevant thing to do is run the internals yourself step by step and see where things are getting swamped. In particular, if you can build the fuzzy simplicial set and look at the result (a sparse matrix), I suspect the distribution of the non-zero entries will be suspicious (or, at least, the logs of them, since they are probably power-law distributed). In particular, you should look at the rows (and columns) associated with those two points that seem to end up at the extremes.
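Something along the following lines should get you there. This is just a sketch assuming a reasonably recent umap-learn (older versions of `fuzzy_simplicial_set` returned the graph directly rather than a tuple), and the suspect indices are placeholders you would swap for the two outlying points:

```python
import numpy as np
from sklearn.utils import check_random_state
from umap.umap_ import fuzzy_simplicial_set

result = fuzzy_simplicial_set(
    X,                                   # the doc2vec vectors
    n_neighbors=15,
    random_state=check_random_state(42),
    metric="euclidean",
)
graph = result[0] if isinstance(result, tuple) else result  # sparse membership matrix

# Overall distribution of the (log) non-zero membership strengths
log_weights = np.log(graph.data)
print(np.percentile(log_weights, [0, 25, 50, 75, 100]))

# Rows for the suspect points; these indices are placeholders
suspect_indices = [0, 1]
csr = graph.tocsr()
for idx in suspect_indices:
    print(idx, np.sort(np.log(csr[idx].data)))
```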
Another thing to look at is what happens if you don't use the spectral initialisation.
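That only requires changing the `init` parameter, e.g.:

```python
import umap

# Skip the spectral initialisation in favour of a random one
embedding_rand = umap.UMAP(init="random").fit_transform(X)
```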
I took a look at the things you suggested. Using a random initialisation still looks underwhelming, but there are no huge outliers. There is slightly better separation using PCA, but it is still not great (though I haven't tuned any parameters).
I constructed the fuzzy simplicial set and, as you suspected, the distribution of the logs of the non-zero entries is suspicious. To compare the outlying rows to "normal" rows, I calculated the sorted log distributions for the outlying rows and for 10 rows selected at random. What I found was that the largest values in the outlying rows' distributions were much bigger than the largest values of the other rows. I'm not sure what this means, but it's something. The updated notebook is located here. Let me know if you have any ideas for further tests.
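For reference, the comparison was roughly along these lines (a reconstruction rather than the exact notebook code; it reuses `graph` and `suspect_indices` from the sketch above):

```python
import numpy as np

rng = np.random.default_rng(0)
csr = graph.tocsr()  # `graph` from the fuzzy simplicial set sketch above

def sorted_log_row(i):
    """Sorted log of the non-zero entries in row i."""
    return np.sort(np.log(csr[i].data))

outlier_rows = [sorted_log_row(i) for i in suspect_indices]
random_rows = [sorted_log_row(i) for i in rng.choice(csr.shape[0], size=10, replace=False)]

# The tails tell the story: the largest entries of the outlying rows
# are much bigger than the largest entries of typical rows.
print(max(row[-1] for row in outlier_rows))
print(max(row[-1] for row in random_rows))
```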
Hi Graham,
Sorry for the very long delay in getting back to you on this. I got rather invested in building the new version of UMAP (which I was hoping would fix some of these issues) and then this fell off my radar for a while. The new UMAP, using numba, is now in place, and I think it does fix some of your issues, though not all. I believe some of the remaining apparent issues can be corrected by more careful plotting. The end result is that I don't believe you get the separation you want, but it looks less bad in getting there. In particular, the default UMAP on your data gives this:
This is, admittedly, somewhat underwhelming. If we turn down `n_neighbors` to 5 and set `min_dist` to 0.0 we get the following (which shows more structure, but certainly doesn't separate your classes):
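For reference, that is just the following call (assuming the standard umap-learn API):

```python
import umap

embedding = umap.UMAP(n_neighbors=5, min_dist=0.0).fit_transform(X)
```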
On the other hand, if we plot the PCA result in the same way we get this:
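(The PCA baseline here is just the usual two-component projection from scikit-learn:)

```python
from sklearn.decomposition import PCA

pca_embedding = PCA(n_components=2).fit_transform(X)
```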
I think in your original iteration the apparent separation was partly due to plotting artifacts, combined with the fact that the light blue class looks to have slightly larger variance (but ultimately they look like two overlaid Gaussian blobs).
Finally, the new version of UMAP does support cosine distance, which makes more sense for doc2vec vectors. Switching to that metric is a one-parameter change (sketch below), and results in the following:
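```python
import umap

embedding_cosine = umap.UMAP(metric="cosine").fit_transform(X)
```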
Still not much in the way of class separation, but given the PCA result and these results, I am not sure there is actually good separation to be found in 2D. I know that's not an ideal answer, or even what you were looking for, but hopefully it helps somewhat.
Those are some nice results @vb690; would you mind if I referenced them in the example uses section of the documentation?
Hi @lmcinnes , sure thing!