pavlin-policar / openTSNE

Extensible, parallel implementations of t-SNE
https://opentsne.rtfd.io
BSD 3-Clause "New" or "Revised" License

Using a custom P matrix #183

Closed lihenryhfl closed 3 years ago

lihenryhfl commented 3 years ago

I want to use openTSNE with a custom P matrix. The documentation suggests that my custom P should be straightforward to substitute in. However, following the code example here, I noticed that the P matrix returned by openTSNE.affinity.PerplexityBasedNN, e.g. the P defined in the code snippet below:

import numpy as np
from openTSNE.affinity import PerplexityBasedNN

x = np.random.normal(size=(1000, 10))

affinities = PerplexityBasedNN(
    x,
    perplexity=30,
    metric="euclidean",
    n_jobs=8,
    random_state=42,
    verbose=True
)

P = affinities.P

does not have rows that sum to 1, nor does it have the correct row perplexities. That is,

P = np.asarray(affinities.P.todense())  # dense ndarray (not np.matrix), so * below is elementwise

row_sums = np.sum(P, axis=1)
avg_row_sum = np.mean(row_sums) # this is not ~approx = 1

row_entropies = -np.sum(P * np.log2(P + 1e-7), axis=1)  # Shannon entropy H of each row
row_perplexities = np.power(2, row_entropies)  # perplexity = 2^H
avg_row_perplexity = np.mean(row_perplexities) # this is not ~approx = 30

Therefore, I assume that P is normalized at some later point --- so I would also need to substitute my custom P at that later point. However, I couldn't find the code anywhere in openTSNE.TSNEEmbedding that performs the row-sum and perplexity normalization.

Is the current standard to not normalize P? Or am I missing something here?

Aside: I noticed a potential bug in PerplexityBasedNN and MultiscaleMixture --- in both cases, the set_perplexity(...) / set_perplexities(...) functions compute the P matrix with symmetrize=True, irrespective of how the symmetrize flag was set during initialization.
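
A minimal sketch of what I mean (assuming the current API, where PerplexityBasedNN accepts a symmetrize flag and exposes set_perplexity):

import numpy as np
from openTSNE.affinity import PerplexityBasedNN

x = np.random.normal(size=(1000, 10))

# Explicitly request an unsymmetrized (conditional) P matrix.
aff = PerplexityBasedNN(x, perplexity=30, symmetrize=False, random_state=42)
print((aff.P != aff.P.T).nnz > 0)  # True: P is asymmetric, as requested

# set_perplexity recomputes P with symmetrize=True, regardless of the flag above.
aff.set_perplexity(50)
print((aff.P != aff.P.T).nnz > 0)  # False: P has been silently symmetrized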

pavlin-policar commented 3 years ago

So, what you're getting there is the properly normalized P matrix. The normalization happens here.

However, the normalization doesn't happen on a per-row basis, but on the entire matrix as a whole. So, for instance, using your first snippet of code, running P.sum() gives 1, as expected. It's not a simple matter of scaling either. The steps to calculate the P matrix are as follows (see the sketch after the list):

  1. Calculate the conditional probabilities for each data point, calibrated so that each row has the target perplexity. This is the normalization you're expecting, I guess.
  2. Symmetrize the entire P matrix, i.e. p_{ij} = p_{j|i} + p_{i|j}.
  3. Renormalize so the entire matrix sums to 1. This is then used to calculate the KL divergence to the embedding similarities.
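
In numpy terms, the whole pipeline looks roughly like this (a sketch only; P_cond is a hypothetical row-stochastic matrix standing in for the conditional probabilities from step 1):

import numpy as np

n = 1000
rng = np.random.default_rng(42)

# Stand-in for step 1: a row-stochastic matrix of conditional probabilities p_{j|i}.
P_cond = rng.random((n, n))
P_cond /= P_cond.sum(axis=1, keepdims=True)

# Step 2: symmetrize, p_{ij} = p_{j|i} + p_{i|j}.
P = P_cond + P_cond.T

# Step 3: renormalize so the entire matrix sums to 1.
P /= P.sum()

print(P.sum())     # 1.0
print(P[0].sum())  # roughly 1/n, not 1 -- rows no longer sum to 1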

You would observe your expected behaviour when adding new data points into an existing embedding, where we calculate perplexities from new data points to existing data points. In this case, symmetrization is skipped, and all the data points are treated independently. So for instance, continuing your code snippet

tmp = affinities.to_new(x)  # conditional probabilities only; symmetrization is skipped
tmp.sum(axis=1)  # array of ones

If you want to plug in a precomputed P matrix, I recommend a simple wrapper class like the one below. This isn't included in openTSNE currently, but I might add it later on.

import numpy as np
import scipy.sparse as sp
import openTSNE


class ExistingAffinities(openTSNE.affinity.Affinities):
    """Affinity object wrapping a precomputed P matrix."""

    def __init__(self, P, verbose=False):
        super().__init__(verbose=verbose)
        self.P = self.normalize(P)

    def to_new(self, data, return_distances=False):
        raise NotImplementedError(
            "`ExistingAffinities.to_new` is not implemented!"
        )

    def normalize(self, P, symmetrize=True, normalization="pair-wise"):
        if symmetrize:
            P = (P + P.T) / 2

        if normalization == "pair-wise":
            # Normalize the entire matrix to sum to 1 (standard t-SNE).
            P /= np.sum(P)
        elif normalization == "point-wise":
            # Normalize each row to sum to 1 (as in `to_new`).
            P = sp.diags(np.asarray(1 / P.sum(axis=1)).ravel()) @ P

        return P
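
For completeness, here is a usage sketch with the advanced API (TSNEEmbedding plus optimize). The dense Gaussian kernel below is just a placeholder for whatever custom P you actually have:

import numpy as np
import scipy.sparse as sp
import openTSNE

x = np.random.normal(size=(1000, 10))

# Placeholder custom affinities: a plain Gaussian kernel with no self-affinity.
d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
K = np.exp(-d2 / 2)
np.fill_diagonal(K, 0)
P_custom = sp.csr_matrix(K)

affinities = ExistingAffinities(P_custom)
init = openTSNE.initialization.pca(x, random_state=42)
embedding = openTSNE.TSNEEmbedding(init, affinities, n_jobs=8)
embedding.optimize(n_iter=250, exaggeration=12, inplace=True)
embedding.optimize(n_iter=500, inplace=True)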

> Aside: I noticed a potential bug in PerplexityBasedNN and MultiscaleMixture --- in both cases, the set_perplexity(...) / set_perplexities(...) functions compute the P matrix with symmetrize=True, irrespective of how the symmetrize flag was set during initialization.

That's a good catch, it should respect the parameter. Thanks!

lihenryhfl commented 3 years ago

Ah! That is interesting! I certainly did not realize that there was a matrix-wise normalization step at the end. This makes sense, as Q is also normalized to sum to 1, so the KL divergence compares two proper probability distributions. It seems that this final matrix-wise normalization is crucial for good convergence of t-SNE. Thank you very much for the clarification :)

I shall close the issue.