pavlin-policar / openTSNE

Extensible, parallel implementations of t-SNE
https://opentsne.rtfd.io
BSD 3-Clause "New" or "Revised" License

Advice on how to use this Repo with sparse data #259

Closed KukumavMozolo closed 3 months ago

KukumavMozolo commented 3 months ago

Hi there, I am looking for advice on how best to use t-SNE with high-dimensional sparse data. Maybe somebody here has some good directions? I am currently working on a high-dimensional dataset of size N×D (N ~ [100k, 1mil], D > 1.2 mil). The dataset is very sparse: about 1000 non-zero dimensions per sample on average, where roughly half of these non-zero values are 1 and the other half lie in [0, 1]. I would assume the dimensionality of the span of the data is roughly ~100k.
pavlin-policar commented 3 months ago

Hi, the default nearest neighbor search algorithms in openTSNE aren't the best at handling sparse data. They would likely densify the data first, which you probably don't want.

Instead, I would strongly recommend installing the pynndescent package, which has wonderful sparse data support. Then, you can pass TSNE(neighbors="pynndescent") and everything should work out-of-the-box as expected. In fact, if you have the pynndescent package installed and you pass in sparse data, openTSNE will default to pynndescent (using the default TSNE(neighbors="auto") parameter).
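For reference, here's a minimal sketch of that path (the random CSR matrix is just a stand-in for real sparse data):

import numpy as np
import scipy.sparse as sp
from openTSNE import TSNE

# Stand-in for a large sparse dataset: a random CSR matrix
x = sp.random(10_000, 50_000, density=0.001, format="csr", dtype=np.float32)

# With pynndescent installed, neighbors="pynndescent" (or the default
# neighbors="auto", which picks pynndescent for sparse input) keeps the
# data sparse throughout the neighbor search
tsne = TSNE(neighbors="pynndescent", n_jobs=-1)
embedding = tsne.fit(x)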

I hope this helps!

KukumavMozolo commented 3 months ago

Hi @pavlin-policar, thank you for your quick response. I almost got it to work the way you suggested. I am currently limited to a machine that has 256 GB of memory. I understand that pynndescent computes approximate nearest neighbors via a graph; while doing so, it tries to allocate memory for an array that is, unfortunately, a bit too large:

pynndescent/rp_trees.py", line 1549, in convert_tree_format
    hyperplanes = np.zeros((n_nodes, 2, hyperplane_dim), dtype=np.float32)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 267. GiB for an array with shape (32027, 2, 1119069) and data type float32

hyperplane_dim seems to correspond to the dimensionality of my data (and indeed, 32027 × 2 × 1119069 float32 values comes to ≈ 267 GiB). It is not impossible to change, but not very convenient, so maybe there is a way to reduce n_nodes a little?

pavlin-policar commented 3 months ago

Unfortunately, I don't think there is a way to pass parameters through to pynndescent in the current implementation of openTSNE. This has been on my todo list for a while and has already been discussed in #256, but unfortunately, I haven't had the time to get to it.

However, you can always precompute your neighbors with the KNN implementation of your choice and use them via the PrecomputedNeighbors object. For instance, here's a simple example using scikit-learn:

import openTSNE
from sklearn import datasets
from sklearn import neighbors

iris = datasets.load_iris()
x = iris.data

# Find the 15 nearest neighbors of each sample with scikit-learn
nn = neighbors.NearestNeighbors().fit(x)
distances, indices = nn.kneighbors(n_neighbors=15)

# Wrap the precomputed neighbors so openTSNE can use them directly
knn_index = openTSNE.nearest_neighbors.PrecomputedNeighbors(indices, distances)
tsne = openTSNE.TSNE(neighbors=knn_index)
z = tsne.fit(x)

So, you can use pynndescent directly to compute your nearest neighbors with whatever parameters you like, so that it doesn't use too much memory, and then pass those neighbors in as shown above.
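For example, something along these lines might work (an untested sketch: n_neighbors, n_trees, and low_memory are real NNDescent parameters, but the values here are guesses, and I can't promise how much memory the tree construction will actually save):

import pynndescent
import openTSNE

# x: your sparse (N x D) data matrix, e.g. a scipy.sparse CSR matrix

# Guessed settings aimed at keeping index construction within a 256 GB budget
index = pynndescent.NNDescent(
    x,
    n_neighbors=90,   # roughly 3 * perplexity for the default perplexity of 30
    n_trees=8,        # fewer random-projection trees
    low_memory=True,  # trade construction speed for memory
    n_jobs=-1,
)
indices, distances = index.neighbor_graph

knn_index = openTSNE.nearest_neighbors.PrecomputedNeighbors(indices, distances)
z = openTSNE.TSNE(neighbors=knn_index).fit(x)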

Hope this helps!

KukumavMozolo commented 3 months ago

This sounds tremendously useful, thanks a bunch!!

KukumavMozolo commented 3 months ago

Following @pavlin-policar's advice, I threw together these lines of code, using pynndescent instead of sklearn.neighbors:

import pynndescent
from openTSNE import TSNE, nearest_neighbors

# Build the approximate nearest neighbor index directly on the sparse data
index = pynndescent.NNDescent(embeddings, n_jobs=-1, verbose=True)
indices, distances = index.neighbor_graph

# Hand the precomputed neighbors to openTSNE
knn_index = nearest_neighbors.PrecomputedNeighbors(indices, distances)
tsne = TSNE(neighbors=knn_index, random_state=1, n_jobs=-1)
two_dim_embeddings = tsne.fit(embeddings)
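
One caveat I'm not sure about (an assumption on my part, not verified): NNDescent defaults to 30 neighbors, while openTSNE's perplexity-based affinities normally retrieve about 3 × perplexity neighbors (90 for the default perplexity of 30), so it might be worth passing n_neighbors to NNDescent explicitly, e.g.:

index = pynndescent.NNDescent(embeddings, n_neighbors=90, n_jobs=-1, verbose=True)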