pavlin-policar / openTSNE

Extensible, parallel implementations of t-SNE
https://opentsne.rtfd.io
BSD 3-Clause "New" or "Revised" License

Why does openTSNE use method=ball_tree and not method=auto for exact kNN? #162

Closed dkobak closed 3 years ago

dkobak commented 3 years ago

I realized that sklearn always uses brute-force kNN whenever the dimensionality d > 15, see https://scikit-learn.org/stable/modules/neighbors.html.

Why not have method='auto' in https://github.com/pavlin-policar/openTSNE/blob/master/openTSNE/nearest_neighbors.py#L107 and rely on sklearn to choose the method?

dkobak commented 3 years ago

Hmm, tried it out briefly and am not sure that their heuristic makes sense :-(

from sklearn.neighbors import NearestNeighbors

# exact kNN with ball trees (openTSNE's current default)
neigh = NearestNeighbors(n_neighbors=90, algorithm='ball_tree', n_jobs=-1)
%time neigh.fit(X[:10000, :784])
%time neigh.kneighbors_graph(mode='distance')

# let sklearn choose the algorithm itself
neigh = NearestNeighbors(n_neighbors=90, algorithm='auto', n_jobs=-1)
%time neigh.fit(X[:10000, :784])
%time neigh.kneighbors_graph(mode='distance')

Times (ball_tree fit / kneighbors first, then auto):

Wall time: 695 ms
Wall time: 21 s
Wall time: 774 ms
Wall time: 47.2 s

The same for 20k:

Wall time: 1.65 s
Wall time: 1min 38s
Wall time: 2.24 s
Wall time: 2min 58s

The same for full 70k:

Wall time: 25.8 s
Wall time: 23min 46s
Wall time: 23 s
Wall time: 37min 2s

Now I am very confused. They changed this relatively recently based on benchmarks showing that brute force is faster...

dkobak commented 3 years ago

I am going to close this issue... But if you have any thoughts, let me know please!

dkobak commented 3 years ago

Update: turns out my sklearn was 0.23 and not 0.24, and after installing the latest version brute-force became MUCH MUCH faster! https://github.com/scikit-learn/scikit-learn/issues/10044#issuecomment-794407653

Here are updated times for full MNIST (ball_tree fit / kneighbors first, then auto):

CPU times: user 23.3 s, sys: 83.6 ms, total: 23.4 s
Wall time: 23.3 s
CPU times: user 3h 9min 11s, sys: 11 s, total: 3h 9min 22s
Wall time: 25min 18s
CPU times: user 26.2 ms, sys: 0 ns, total: 26.2 ms
Wall time: 25.9 ms
CPU times: user 14min 21s, sys: 1min 40s, total: 16min 1s
Wall time: 3min 48s

I guess we should change it in openTSNE. Re-opening.

pavlin-policar commented 3 years ago

I realized that sklearn uses brute-force kNN always whenever dimensionality d>15

That seems really strange to me. I wanted to avoid brute-forcing kNN at all costs, because it involves forming the entire distance matrix, which leads to out-of-memory errors all over the place once you start working with moderately-sized data. Assuming we have users running this on laptops, this seems like a huge no-no to me. I would very much like to know their reasoning behind this.

The other algorithms scikit-learn offers are KD-trees and ball trees. From what I remember, KD-trees become really inefficient at dimensionalities above ~10, which is really low. Ball trees supposedly work better in higher dimensions and have the same time complexity as KD-trees, so I chose ball trees as the default.

Maybe ball trees aren't optimal in all cases, and maybe brute force is faster in some scenarios, but I would assume those are really small data sets where the computation happens almost instantaneously. In that case, I don't mind if the computation takes 2 seconds instead of 1 second, for instance. From the benchmarks you posted, it seems like ball trees are faster anyway for anything moderately sized.

In any case, I would probably prefer using annoy for anything above 10k samples, since it's probably going to be much faster. I really wouldn't worry about it too much. Even if we were to optimize the algorithm selection for finding the exact nearest neighbors, we're probably talking about shaving off a few seconds here and there, which really doesn't make a difference IMO.
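In case it's useful, this is roughly how an Annoy index gets built and queried on its own; the data, metric, and parameter values below are just placeholders, not what openTSNE uses internally:

from annoy import AnnoyIndex
import numpy as np

X = np.random.rand(10000, 50).astype(np.float32)  # placeholder data

index = AnnoyIndex(X.shape[1], "euclidean")
for i, row in enumerate(X):
    index.add_item(i, row)
index.build(50)  # more trees -> better recall, slower build

# approximate 90 nearest neighbours of point 0 (the query itself comes back
# with distance 0, so ask for one extra and drop it)
neighbors, distances = index.get_nns_by_item(0, 91, include_distances=True)
neighbors, distances = neighbors[1:], distances[1:]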

pavlin-policar commented 3 years ago

Your newer comment only loaded for me just now.

In that case, yes, it would definitely make sense to change the algorithm to "auto" and let scikit-learn do its thing. Do you have any idea what they did to make it so much faster? The only thing that would still make me somewhat hesitant is if they form the entire N x N distance matrix. That would still lead to memory problems.

dkobak commented 3 years ago

I don't think it forms the full NxN matrix; that wouldn't fit in my laptop's RAM for N=70000. It's also not needed, since the distances can be processed in batches.
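Something along these lines, for example, gives exact neighbours while only ever holding one batch of the distance matrix in memory; just a sketch of the idea, not what sklearn actually does:

import numpy as np
from scipy.spatial.distance import cdist

def knn_brute_batched(X, k, batch_size=1000):
    n = X.shape[0]
    indices = np.empty((n, k), dtype=np.int64)
    distances = np.empty((n, k))
    for start in range(0, n, batch_size):
        stop = min(start + batch_size, n)
        D = cdist(X[start:stop], X)  # only a (batch_size, n) block at a time
        D[np.arange(stop - start), np.arange(start, stop)] = np.inf  # exclude self
        idx = np.argpartition(D, k, axis=1)[:, :k]   # k smallest, unordered
        rows = np.arange(stop - start)[:, None]
        order = np.argsort(D[rows, idx], axis=1)     # sort those k by distance
        indices[start:stop] = idx[rows, order]
        distances[start:stop] = D[rows, indices[start:stop]]
    return indices, distances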

I have no idea what changed between 0.23 and 0.24 for brute-force but the speed increase is astonishing. Now SklearnTSNE is faster than MulticoreTSNE.

Looks like openTSNE with Barnes-Hut will be faster still, if we switch to "auto" ;)

dkobak commented 3 years ago

Update: apparently what changed is that in 0.23 "auto" was not running brute-force but one of the trees, likely KD.

pavlin-policar commented 3 years ago

I've found the PR: https://github.com/scikit-learn/scikit-learn/pull/17038. It seems that Cython gets confused when the GIL is acquired conditionally; I wasn't aware of that. I do still suspect that the entire distance matrix is formed, since they just call scipy's cdist function, but we'll leave it up to them to handle batching if they so choose :)

In openTSNE, I suppose we've got to keep the BallTree class as it is for backwards compatibility, then remove it in the next version. It would be confusing to have a class named BallTree that sometimes uses the brute-force algorithm as well. So we should make a new class, maybe named Exact or Sklearn. We can just copy over the code from BallTree and remove the bits that handle cosine distance; the brute-force implementation should have no issues with cosine distance. Annoy should remain the default, as it's still probably going to be faster for most use cases. What do you think?
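Roughly what I'm imagining, just as a sketch; the class and method names here are placeholders rather than the actual openTSNE interface:

import numpy as np
from sklearn.neighbors import NearestNeighbors

class Exact:  # or "Sklearn"
    def __init__(self, metric="euclidean", n_jobs=1):
        self.metric = metric
        self.n_jobs = n_jobs

    def build(self, data, k):
        # let scikit-learn pick between brute force, KD-trees and ball trees
        self.index = NearestNeighbors(
            algorithm="auto", metric=self.metric, n_jobs=self.n_jobs
        )
        self.index.fit(data)
        # the query points are part of the index, so ask for k + 1 and drop self
        distances, indices = self.index.kneighbors(data, n_neighbors=k + 1)
        return indices[:, 1:], distances[:, 1:]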

dkobak commented 3 years ago

Hmm but this PR seems to be about BallTrees... And I don't know what GIL is... Whatever :-)

I do still suspect that the entire distance matrix is formed, since they do just call scipy's cdist function,

Hmm, but for MNIST, 70000 * 70000 * 8 / 1024^3 ~ 36, so it would need ~36 GB (or maybe half of that), but I only have 16 GB on my laptop and less than half of it was used when running sklearn.
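Quick sanity check of that arithmetic:

n_bytes = 70000 * 70000 * 8     # float64 pairwise distance matrix
print(n_bytes / 1024**3)        # ~36.5 GB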

It would be confusing to have the class named BallTree and have it sometime use the brute force algorithm as well.

Haha, actually this would be my preferred option for simplicity and backwards compatibility :-) It just requires a one-line change in https://github.com/pavlin-policar/openTSNE/blob/master/openTSNE/nearest_neighbors.py#L107. Is there a way to add an ExactNN class that is simply equal to BallTree? That would be less confusing.

pavlin-policar commented 3 years ago

Is there a way to add an ExactNN class which is simply equal to BallTree? Then it'd be less confusing.

Yeah, we could definitely do ExactNN = BallTree, but then we'd inherit that cosine-metric workaround, which might even interfere. And we'd still have a class named BallTree that sometimes runs brute force, sometimes KD-trees, and only sometimes ball trees. Oh, and we'd have to update the valid distance metrics here. The more I think about it, the more I'm convinced that making a new class is the way to go. The new class will also be much simpler than the current one, since it drops all the cosine logic.

dkobak commented 3 years ago

Fine.

Regarding the NxN matrix, I think the batching might happen here: https://github.com/scikit-learn/scikit-learn/blob/95119c13af77c76e150b753485c662b7c52a41a2/sklearn/neighbors/_base.py#L705
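If I read that code right, I believe it goes through pairwise_distances_chunked, so the idea is roughly the following (a sketch, not their exact implementation; X and the neighbour count are placeholders):

import numpy as np
from sklearn.metrics import pairwise_distances_chunked

X = np.random.rand(20000, 50)  # placeholder data
k = 90

def reduce_func(D_chunk, start):
    # keep only the k smallest distances per query row of this chunk,
    # so the full N x N matrix never has to exist at once
    return np.argpartition(D_chunk, k, axis=1)[:, :k]

knn = np.vstack(list(pairwise_distances_chunked(X, reduce_func=reduce_func)))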

dkobak commented 3 years ago

Embedding full MNIST without PCA pre-processing (n=70000, d=784) on my laptop:

Nice!!