jkrijthe / Rtsne

R wrapper for Van der Maaten's Barnes-Hut implementation of t-Distributed Stochastic Neighbor Embedding

Using kNN matrix directly in Barnes-Hut #32

Closed: gokceneraslan closed this issue 6 years ago

gokceneraslan commented 6 years ago

It would be very useful to have a knn option, as an alternative to the experimental is_distance option, for very large matrices, so that a given n x k distance matrix can be plugged directly into bhTSNE.
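For illustration, a call to such an interface might look like the sketch below. The function name Rtsne_knn and its arguments are hypothetical (no such function exists at the time of this request); the kNN search here uses FNN, but any kNN library would do:

library(FNN)
X <- as.matrix(iris[, 1:4])
nn <- get.knn(X, k = 90)       # a common rule of thumb is k ~ 3 * perplexity
# nn$nn.index: n x k matrix of neighbour indices for each point
# nn$nn.dist:  n x k matrix of the corresponding distances
# Hypothetical interface -- Rtsne_knn() does not exist; it only
# illustrates the kind of input this feature request asks for:
out <- Rtsne_knn(nn$nn.index, nn$nn.dist, perplexity = 30)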

jkrijthe commented 6 years ago

Thanks. I agree this would be a nice addition, although I do not have time to work on this myself, so I will leave this open in case anyone is interested in contributing this feature.

LTLA commented 6 years ago

Before I start my second attempt at this, I'll run some of the proposed changes past @jkrijthe.

The plan is to add a new computeGaussianPerplexity function to accept NN results (obviously). This will require isolating the code in the inner loop (lines 554 to 616) into a separate function to avoid a copy-paste.

The normalization step (lines 101-107) makes things a bit troublesome. Users who compute a distance matrix or NN results from X would expect to get the same results as supplying X to Rtsne(), but they won't (at least not on the first attempt) due to the internal normalization. The simplest solution would be to:

  1. If a precomputed distance matrix or NN results are supplied, this is done via separate arguments in Rtsne() (possibly replacing is_distance= or is_neighbors=). X itself also needs to be supplied.
  2. Assume that any distances were computed from the supplied X.
  3. Perform the internal normalization for X, and then apply the scaling factor to the distances. This means that the user can use whatever they precomputed without needing to be exposed to the normalization.

I suspect that the C++-side interface will become quite cluttered at this point, so some further refactoring to separate the functionalities would be tempting, but I think I'll stop here for now.

jkrijthe commented 6 years ago

Thanks again for attempting this. Two thoughts:

LTLA commented 6 years ago
  1. Yes, that's also a valid approach. Though this will make computeGaussianPerplexity a bit tricky to read, because one function is now responsible for three different possible inputs: X, the precomputed distance matrix, and the nearest-neighbour results. By having distinct functions that operate on different inputs (but with the common perplexity calculation factored out), I was hoping to clarify the situation and make future maintenance easier.

    How about I put together a draft of what I'm thinking of, and you can tell me if the changes are too much?

  2. Sure, that's another way to do it. But my understanding from the code comments was that the normalization was being performed to avoid overflow/underflow problems in the downstream calculations. This seems to suggest that it is necessary regardless of what was supplied as input.

    Also, an intuitive expectation of Rtsne() behaviour would be that the results should be the same (give or take the effects of numerical imprecision) regardless of whether I supplied the data matrix X directly, or precomputed the distance matrix from X and supplied that instead; the snippet after this list shows the discrepancy. In fact, realizing that this was not the case has made me look back on some of my own analysis code with some concern! Oh well.

    Conversely, if the function is intended to respond to user normalization, it should arguably do so for X as well, not just for precomputed distance matrices. Perhaps a normalize= flag is in order? (I realize that I'm asking for a lot of extra arguments, but a single function that accepts many potential inputs is invariably going to be complicated.)
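A minimal sketch of the check mentioned above, assuming the current Rtsne() interface with the experimental is_distance= option (exact arguments may differ):

library(Rtsne)
X <- unique(as.matrix(iris[, 1:4]))  # iris contains a duplicated row
D <- as.matrix(dist(X))              # plain (unnormalized) Euclidean distances
set.seed(42)
res_x <- Rtsne(X, perplexity = 10, pca = FALSE)
set.seed(42)
res_d <- Rtsne(D, perplexity = 10, is_distance = TRUE)
# With the same seed, these would agree if the inputs matched, but they
# don't: X is centred and rescaled internally before distances are taken,
# while a supplied D is used as-is.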

jkrijthe commented 6 years ago
  1. Sounds good, will have a look at the result.

  2. I agree on making normalise an additional argument. For the distance matrices, however, I had a different idea of what the user should expect: currently you do get the same result whether you use X as a matrix or pre-compute "the" distance, but it might not have been clear what distance this was exactly (the "normalised" Euclidean distance, rather than the unnormalised Euclidean distance). Suppose we did want to do the normalisation for precomputed distances: how would this be done for general distance measures?

LTLA commented 6 years ago

Regarding the distance normalization: I was only thinking of Euclidean. Let's say we have X and precomputed distances D (nearest-neighbour distances, or the full distance matrix). We can compute a scaling factor from X:

# Can be done efficiently in C++:
zeroed <- sweep(X, 2, colMeans(X), "-")  # centre each column of X
scale <- max(abs(zeroed))                # largest absolute entry after centring
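# Completing the sketch: assuming D holds Euclidean (or Manhattan) distances
# computed from the original X, the same factor carries over directly:
D_scaled <- D / scale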

Dividing D by scale gives the distances that we would have gotten from zeroed/scale (i.e., the normalized X). This should work for the Euclidean and Manhattan distances, at least. I suppose people can turn this off with normalize=FALSE for other metrics that are scale-invariant.

Okay, that's starting to get a bit complicated, so if you're planning to accept other distance measures, then assuming that the input is already normalized is probably the safest bet. Exposing the normalization code as a separate function would probably be helpful, since that makes it easy for people to comply.
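A rough sketch of what such an exposed helper could look like, mirroring the centring-and-rescaling above (the name normalize_input_sketch is hypothetical):

normalize_input_sketch <- function(X) {
  # centre each column, then rescale the whole matrix so that the
  # largest absolute entry is 1, matching the internal normalization
  X <- sweep(X, 2, colMeans(X), "-")
  X / max(abs(X))
}
# Users could then precompute distances in a compliant way, e.g.:
# D <- dist(normalize_input_sketch(X))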

jkrijthe commented 6 years ago

I agree that having the normalisation code as a separate function, or at least an example in the docs of how to do it, is needed and currently lacking. But the fact that the normalisation cannot be done for general distance measures makes me wary of adding it for distance matrices.

LTLA commented 6 years ago

Agreed. And in fact, moving the normalization into a separate R function (called directly in the Rtsne R function before entry into the C++ code) would probably help to modularize the code even further.

gokceneraslan commented 6 years ago

Now that #39 is merged, we can close this, or wait until a new release. @jkrijthe, any estimates about the next release date?

jkrijthe commented 6 years ago

Good. Sorry, not sure about the next release yet, hopefully in the next 1-2 weeks.