Hi, the default nearest-neighbor search algorithms in openTSNE aren't the best at handling sparse data; they would probably densify the data first, which you probably don't want. Instead, I would strongly recommend installing the pynndescent package, which has wonderful sparse-data support. Then you can pass TSNE(neighbors="pynndescent") and everything should work out of the box as expected. In fact, if you have the pynndescent package installed and you pass in sparse data, openTSNE will default to pynndescent (using the default TSNE(neighbors="auto") parameter).
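Concretely, a minimal sketch of this (the random sparse matrix here is just a stand-in for your own data):

import openTSNE
from scipy import sparse

# Placeholder sparse data in CSR format; substitute your own matrix
x_sparse = sparse.random(1000, 2000, density=0.01, format="csr")

# With pynndescent installed, neighbors="auto" picks it automatically for sparse input
tsne = openTSNE.TSNE(neighbors="pynndescent")
embedding = tsne.fit(x_sparse)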
I hope this helps!
Hi @pavlin-policar, thank you for your quick response. I almost got it to work the way you suggested. I am currently limited to a machine that has 256 GB of memory. I understand that pynndescent computes approximate nearest neighbors via a graph-based method. While doing so, it tries to allocate memory for an array that is unfortunately a bit too large:
pynndescent/rp_trees.py", line 1549, in convert_tree_format
hyperplanes = np.zeros((n_nodes, 2, hyperplane_dim), dtype=np.float32)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 267. GiB for an array with shape (32027, 2, 1119069) and data type float32
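For what it's worth, a quick back-of-the-envelope check (assuming 4 bytes per float32 element) confirms the reported size:

# (n_nodes, 2, hyperplane_dim) float32 values, 4 bytes each
print(32027 * 2 * 1119069 * 4 / 2**30)  # ~267.0 GiB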
hyperplane_dim seems to correspond to the dimensionality of my data. That is not impossible to change, but it isn't very convenient, so maybe there is a way to reduce n_nodes a little?
Unfortunately, I don't think there is a way to change the parameters passed to pynndescent in the current implementation of openTSNE. This has been on my todo-list for a while and has already been discussed in #256, but unfortunately, I haven't had the time to get to it.
However, you can always precompute your neighbors with your KNN implementation of choice and use them via the PrecomputedNeighbors object. For instance, here's a simple example using scikit-learn:
import openTSNE
from sklearn import datasets
from sklearn import neighbors

iris = datasets.load_iris()
x = iris.data

# Compute the k-nearest-neighbor distances and indices ahead of time
nn = neighbors.NearestNeighbors().fit(x)
distances, indices = nn.kneighbors(n_neighbors=15)

# Hand the precomputed neighbors to openTSNE
knn_index = openTSNE.nearest_neighbors.PrecomputedNeighbors(indices, distances)
tsne = openTSNE.TSNE(neighbors=knn_index)
z = tsne.fit(x)
So you can use pynndescent directly to compute your nearest neighbors with whatever parameters keep the memory usage in check, and then pass those neighbors in as shown above.
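For example, a sketch along these lines (the specific parameter values are placeholders to tune for your data; x_sparse stands in for your sparse matrix):

import pynndescent
import openTSNE

# Build the approximate NN index with memory-conscious settings
index = pynndescent.NNDescent(
    x_sparse,
    n_neighbors=15,   # match this to what your perplexity setting needs (placeholder value)
    n_trees=8,        # fewer random-projection trees (assumption: shrinks the forest's memory footprint)
    low_memory=True,
    n_jobs=-1,
)
indices, distances = index.neighbor_graph

knn_index = openTSNE.nearest_neighbors.PrecomputedNeighbors(indices, distances)
tsne = openTSNE.TSNE(neighbors=knn_index)
z = tsne.fit(x_sparse)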
Hope this helps!
This sounds tremendously useful, thanks a bunch!!
Following @pavlin-policar's advice, I threw together these lines of code using pynndescent instead of sklearn.neighbors:
import pynndescent
from openTSNE import TSNE, nearest_neighbors

# Build the approximate nearest-neighbor index and pull out its neighbor graph
index = pynndescent.NNDescent(embeddings, n_jobs=-1, verbose=True)
indices, distances = index.neighbor_graph

# Pass the precomputed neighbors to openTSNE
knn_index = nearest_neighbors.PrecomputedNeighbors(indices, distances)
tsne = TSNE(neighbors=knn_index, random_state=1, n_jobs=-1)
two_dim_embeddings = tsne.fit(embeddings)
Hi there, I am looking for advice on how to best use t-SNE with high-dimensional sparse data. Maybe somebody here has some good directions? Currently I am working on a high-dimensional dataset of size N×D, where N ~ [100k, 1 mil] and D > 1.2 mil. The dataset is very sparse, e.g. 1000 non-zero dimensions per sample on average, where roughly half of these non-zero values are