pavlin-policar / openTSNE

Extensible, parallel implementations of t-SNE
https://opentsne.rtfd.io
BSD 3-Clause "New" or "Revised" License
1.44k stars 158 forks source link

Precomputed Matrix Feature #170

Closed fsvbach closed 3 years ago

fsvbach commented 3 years ago

Hi developers,

firstly, thanks a lot for this nice package! I have a request for another feature:

It seems to me, that openTSNE doesn't support a precomputed distance matrix as given input. However, this is available on sklearn, so I wondered if it was planned to offer the feature with openTSNE as well?

Thanks a lot, Fynn

dkobak commented 3 years ago

Actually -- it might already work with metric='precomputed', neighbors='exact' using sklearn implementation. Have you tried?

fsvbach commented 3 years ago

I get the ValueError: Sklearn does not support the precomputed metric. Please choose one of the supported metrics: braycurtis, canberra, chebyshev, cityblock, dice, euclidean etc.

if I run

from openTSNE.sklearn import TSNE tsne=TSNE(metric='precomputed', neighbors='exact') tsne.fit(M)

M being a distance matrix.

pavlin-policar commented 3 years ago

Thanks for bringing this up. This feature has been requested multiple times, and I do think it would be nice to have. So, I'll try to implement this soon and get it in the next release.

I've been thinking about how to structure the API for this. To implement this, we'll have make another affinity class which we'll be able to pass to in the precomputed distances. This is easy. However, I do not like the API we currently have, where we pass in affinity objects to the TSNE constructor. So, what we currently do is

aff = Affinity(...)
embedding = TSNE(affinity=aff, ...).fit(x)

The .fit(x) bothers me. Currently, we need it because we use it to calculate initialization. But even if we specify the initalization in the constructor, we still need to pass something there. E.g. for a graph with no features, we'd need to do something like

aff = Affinity(...)
init = np.random.random((N, 2))
embedding = TSNE(affinity=aff, init=init, ...).fit(np.ones(N))

I do not like this at all, and would like to change it to something like

embedding = TSNE().fit(affinity=aff, init=init)

This would be much cleaner IMO. So .fit(x) would change to .fit(x=None, affinity=None, init=None). Then, the standard t-SNE calls don't change at all. If we want a custom affinity model using PCA init, we could call

embedding = TSNE(init="pca").fit(x=x, affinity=aff)

And for any arbitrary graphs with no features, we could either 1) specify the .fit(init=...) parameter, or 2) calculate the spectral initalization e.g.

embedding = TSNE(init="pca").fit(affinity=aff)
# raises warning "pca needs `x` to be specified, falling back to spectral init

@dkobak Do you have any thoughts on this?

dkobak commented 3 years ago

ValueError: Sklearn does not support the precomputed metric.

But that's weird -- sklearn does support precomputed metric: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html#sklearn.neighbors.NearestNeighbors

dkobak commented 3 years ago

To implement this, we'll have make another affinity class which we'll be able to pass to in the precomputed distances.

Can't we keep the affinity classes as they are and let Sklearn find exact NNs using metric='precomputed' if affinity object is called with metric='precomputed'?

However, I do not like the API we currently have, where we pass in affinity objects to the TSNE constructor.

I think this is unrelated, isn't it? :-) But I fully agree with everything you wrote there, I think your suggested API is better.

dkobak commented 3 years ago

Oh I see that precomputed is not listed in the available sklearn metrics here: https://github.com/pavlin-policar/openTSNE/blob/master/openTSNE/nearest_neighbors.py#L96 -- hence the ValueError. What happens if we add it there?

pavlin-policar commented 3 years ago

Can't we keep the affinity classes as they are and let Sklearn find exact NNs using metric='precomputed' if affinity object is called with metric='precomputed'?

Hmm, theoretically, we could do that, but that would make this usage really convoluted. And then the PCA initialization might be calcualted by doing PCA on the distance matrix, which probably doesn't make much sense. I would prefer that there is one, clear way to do this.

I think this is unrelated, isn't it?

Yeah, kind of. But this is the way I propose we implement this, so maybe it's not that unrelated :)

dkobak commented 3 years ago

Hmm, theoretically, we could do that, but that would make this usage really convoluted. And then the PCA initialization might be calcualted by doing PCA on the distance matrix, which probably doesn't make much sense. I would prefer that there is one, clear way to do this.

Not sure I understand. What is convoluted about this?

# D is the NxN distance matrix
embedding = TSNE(metric='precomputed', init='spectal').fit(D)

PCA initialization definitely does not make sense here, so when metric='precomputed', there should be a warning if init='pca' and it should fall back to random or spectal.

Update: Actually PCA initialization could make sense because one can compute what is called "PCoA" -- eigenvectors after double-centering the distance matrix; but it's also fine to fall back to random or spectral, imho.

pavlin-policar commented 3 years ago

Not sure I understand. What is convoluted about this?

Nothing. That's exactly the interface I implemented in #173. But in the code, it's routed to a special PrecomputedDistanceMatrix class, so that we can properly error if somebody later tries to call .transform. I don't know what would happen in that case if we went through the existing Sklearn neighbors class.

Anyhow, I'll merge #173 when the tests pass (unless either of you have anything against it). Then I would ask @fsvbach to pull the latest master and test if it works as you expected.

fsvbach commented 3 years ago

Yes, thank you very much @pavlin-policar . When I install the latest master it accepts precomputed matrices!