rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[QST] TSNE perplexity parameter #2147

Open VisuMap opened 4 years ago

VisuMap commented 4 years ago

Does the TSNE algorithm implemented in cuML have a maximum perplexity? I used the following code to create a TSNE object, but the perplexity parameter seems to have no impact on the result.

    from cuml.manifold import TSNE

    tsne = TSNE(n_components=2, method='barnes_hut', perplexity=5000)
    embedding = tsne.fit_transform(A)

viclafargue commented 4 years ago

Thanks for reporting this issue. The perplexity parameter should normally be set between 5 and 50 and the n_neighbors parameter should be at least 3 * perplexity in order to get satisfying results. Tagging @danielhanchen for more details.

danielhanchen commented 4 years ago

Oh, so currently TSNE uses an adaptive approach to determine the best perplexity. [As the data size increases, so does the perplexity.] To set it manually, call TSNE(..., perplexity=1234, learning_rate_method=None, ...); by default it is TSNE(learning_rate_method="adaptive"). See https://rapidsai.github.io/projects/cuml/en/latest/api.html#tsne for more details.
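
A minimal sketch of that call (A is a placeholder array standing in for the real data):

    import numpy as np
    from cuml.manifold import TSNE

    A = np.random.rand(20_000, 50).astype(np.float32)   # placeholder for the real data

    # learning_rate_method=None disables the adaptive heuristic, so the perplexity
    # passed here is the one actually used (per the comment above).
    tsne = TSNE(n_components=2, method='barnes_hut',
                perplexity=1234, learning_rate_method=None)
    embedding = tsne.fit_transform(A)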

VisuMap commented 4 years ago

Does this mean that the maximal perplexity is 50 for BH-tSNE in RAPIDS?

danielhanchen commented 4 years ago

@VisuMap Oh, you can try any perplexity, e.g. 100, 1000, 10000; it's up to you. (Just don't forget to set TSNE(learning_rate_method=None, perplexity=100000).) However, as @wxbn mentioned, generally speaking TSNE should be run with perplexities from 3 to 50 or so. With larger datasets, though, one should select larger perplexity values, or else the algorithm cannot converge in a reasonable time.

VisuMap commented 4 years ago

Thanks. The software happily accepts any perplexity values, but seems to silently ignore large values.

danielhanchen commented 4 years ago

https://github.com/rapidsai/cuml/blob/1395f233f2b0f52b55053e77e3f51a5e4db59a96/python/cuml/manifold/t_sne.pyx#L336

Actually @VisuMap, you are correct. I rechecked TSNE, and it seems to clip any perplexity value larger than the dataset size down to the dataset size, i.e.:

    if self.perplexity > n:
        warnings.warn("Perplexity = {} should be less than the "
                      "# of datapoints = {}.".format(self.perplexity, n))
        self.perplexity = n

I'm pretty sure perplexity values over the # of datapoints become undefined. https://distill.pub/2016/misread-tsne/

the perplexity really should be smaller than the number of points.

However, in TSNE the update rule is as follows:

    y -= early_exaggeration * learning_rate * gains * dy

To mimic large perplexity values, you could instead try increasing early_exaggeration or learning_rate and see if that helps.

Another, more "hacky" approach is to manually increase the dataset size by padding it with zeros up to your desired perplexity.
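
A rough sketch of that padding idea (illustrative only; A stands in for the real data, and the amount of padding is an assumption):

    import numpy as np

    A = np.random.rand(1000, 20).astype(np.float32)   # placeholder for the real data
    desired_perplexity = 5000

    # Pad with all-zero rows until the row count is at least the desired perplexity,
    # so the perplexity is no longer clipped to the number of datapoints.
    n_pad = max(0, desired_perplexity - A.shape[0])
    A_padded = np.vstack([A, np.zeros((n_pad, A.shape[1]), dtype=A.dtype)])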

VisuMap commented 4 years ago

Thanks. My dataset (from flow cytometry) has over 250,000 datapoints, but TSNE seems to clip the perplexity to a value below 1000, which is way too small for my applications.

danielhanchen commented 4 years ago

@VisuMap Interesting. Is there a way to know whether the perplexity has been clipped? Does Python issue a warning message? If not, it's probably that higher perplexities are simply not affecting the final embedding. Also, did you try what I suggested:

    tsne = TSNE(n_components=2, method='barnes_hut', perplexity=5000, learning_rate_method=None)

VisuMap commented 4 years ago

Yes, I tried the suggestion, but it had no impact on the resulting map. It only increased the training time from 9 to 54 seconds. An indication that the perplexity has been clipped is that the training time always stayed at 54 seconds, regardless of whether the perplexity is 500, 5000, or 100,000. It is well known that the training time of barnes_hut t-SNE is sensitive to the perplexity.

danielhanchen commented 4 years ago

@VisuMap Hmm, fascinating. It's possible that overly large perplexity values are simply not affecting the output, although I can't say for sure. Have you tried increasing the learning_rate, or increasing/decreasing n_neighbors? https://rapidsai.github.io/projects/cuml/en/latest/api.html#tsne shows the whole list of tunable parameters.

Maybe standardising all columns could help, so that the scaling of each column is on par with the others during the perplexity search?

There is also the obvious option of using brute force, though I doubt that will be fast... You could try, say, brute force on 1,000 of your datapoints, increase the perplexity, and see whether even brute force changes anything.
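
A sketch of that experiment (assumptions: the exact solver is exposed as method='exact', and X stands in for the real dataset):

    import numpy as np
    from cuml.manifold import TSNE

    X = np.random.rand(250_000, 20).astype(np.float32)   # placeholder for the real data
    idx = np.random.choice(X.shape[0], size=1000, replace=False)
    subset = X[idx]

    # Run the brute-force solver on the subsample at several perplexities
    # and check whether the layouts actually differ.
    for perplexity in (30, 100, 300):
        tsne = TSNE(n_components=2, method='exact',
                    perplexity=perplexity, learning_rate_method=None)
        emb = tsne.fit_transform(subset)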

VisuMap commented 4 years ago

Thanks. Any n_neighbors larger than 1024 causes TSNE.fit_transform() to terminate prematurely (after 10 minutes). It looks like TSNE clips the perplexity to about 330.

danielhanchen commented 4 years ago

@VisuMap Oh yes, n_neighbors > 1024 is not recommended; 1024 is the maximum number of neighbors the GPU can handle. [Possibly all your GPU memory was full, i.e. 250,000 * 1024 = 1.02 GB * # of columns ==> >= 2 GB or so.] It's possible, then, that large perplexity values are not affecting the output.

How about the learning_rate [originally 200; maybe 400 or 1000]? Or early_exaggeration [originally 12; maybe 24 or 48]? You can also try increasing n_iter [from 1000 to, say, 2000].
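
A sketch of those parameter bumps (the values are illustrative, not recommendations):

    from cuml.manifold import TSNE

    tsne = TSNE(n_components=2, method='barnes_hut',
                perplexity=330,             # roughly the ceiling observed above
                learning_rate=1000,         # default 200
                early_exaggeration=24,      # default 12
                n_iter=2000,                # default 1000
                learning_rate_method=None)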

VisuMap commented 4 years ago

The software happily accepted n_neighbors=1000, but produced unusable maps. According to your estimate, my dataset would need at least 14 GB, which is above the 12 GB limit of Google Colab. This might be the reason for the unexpected output maps I got.

A better learning rate or early exaggeration might speed up the training, but it would not compensate for the limitation on the perplexity, which is the problem raised here.

danielhanchen commented 4 years ago

@VisuMap I'm mostly sure the code is correct in terms of perplexity, since it's only invoked during the bisection search for the best sigma for each datapoint. I can possibly recheck the code and get back to you.

But for now, it seems like maybe perplexities over 330 for that particular dataset aren't doing anything. It could be due to float32 precision or some other issue. I'll see what I can do.

dkobak commented 4 years ago

An indication that the perplexity has been clipped is that the training time always stayed at 54 seconds, regardless of whether the perplexity is 500, 5000, or 100,000.

Assuming that cuML uses the standard heuristic of 3*perplexity nearest neighbors, and assuming that n_neighbors is clipped to 1024 because that is the GPU limit, then increasing the perplexity over 1024/3=341 will not increase the computation time (because n_neighbors has already saturated), and increasing the perplexity over 1024 will not affect the embedding (because it will yield a uniform kernel over the 1024 nearest neighbors).

@danielhanchen If this guess is correct, it would be good to document this in the docs, and also show a warning whenever perplexity>341.
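
A hypothetical sketch of such a warning (not cuML code; the 1024 cap is the assumption discussed above):

    import warnings

    GPU_KNN_LIMIT = 1024          # assumed cap on n_neighbors
    perplexity = 500              # example value

    if perplexity > GPU_KNN_LIMIT / 3:
        warnings.warn("perplexity = {} exceeds {}/3 ~ 341; n_neighbors is capped, so "
                      "larger perplexities have little or no effect on the "
                      "embedding.".format(perplexity, GPU_KNN_LIMIT))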

dkobak commented 4 years ago

I briefly looked into the code, and it seems that n_neighbors is indeed clipped at 1024, but it also seems that the warning is being shown.

https://github.com/rapidsai/cuml/blob/branch-0.15/python/cuml/manifold/t_sne.pyx#L250 https://github.com/rapidsai/cuml/blob/branch-0.15/cpp/src/tsne/tsne.cu#L48

(I am not sure if n_neighbors is set to 3*perplexity for any non-default perplexity value. I did not find it in the code.)

danielhanchen commented 4 years ago

    if n <= 2000:
        self.n_neighbors = min(max(self.n_neighbors, 90), n)
    else:
        # A linear trend from (n=2000, neigh=100) to (n=60000, neigh=30)
        self.n_neighbors = max(int(102 - 0.0012 * n), 30)

Otherwise, n_neighbors defaults to 90 [i.e. 3 * 30]. Essentially, as n -> inf, n_neighbors drops to 30 (via a linear trend). This was determined empirically, since it seems to work for MNIST Digits / Fashion and other datasets.
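
Plugging a few dataset sizes into the large-n branch above (just arithmetic, for illustration):

    for n in (10_000, 60_000, 250_000):
        print(n, max(int(102 - 0.0012 * n), 30))
    # -> 90, 30, 30: for the ~250,000-point dataset discussed here, the adaptive
    #    heuristic drops n_neighbors all the way down to the floor of 30.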

dkobak commented 4 years ago

@danielhanchen I don't fully understand this heuristic here. The original BH t-SNE paper used n_neighbors = 3 * perplexity, and all existing implementations that I know of (including sklearn) use the same heuristic. That's the behaviour I would expect. Of course this means cuML can only meaningfully use perplexity up to 1024/3.

danielhanchen commented 4 years ago

@dkobak I think that, empirically, a large n_neighbors on extremely large datasets can cause excessive GPU memory usage. I can't remember which study I based my findings on, but when learning_rate_method = "adaptive", n_neighbors decreases linearly to 30 to conserve performance and memory.

On the other hand, to revert to the normally expected behaviour, setting learning_rate_method = None will use the old (i.e. 3 * perplexity) heuristic.

dkobak commented 4 years ago

I see. The problem here is that for large datasets your heuristic leaves the perplexity fixed at 30 but also sets n_neighbors to 30. This results in a uniform (and not Gaussian) kernel over the 30 nearest neighbors, so it's quite different from standard t-SNE behaviour. I personally am a big fan of using a uniform kernel (with 10-15 neighbors) because it's much faster, but that's not exactly what people expect from default t-SNE.
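
A small numpy sketch (not cuML code) of why this happens: the perplexity of a distribution over k neighbors is at most k, and that maximum is reached only by the uniform distribution, so asking for a perplexity close to n_neighbors pushes the sigma search towards a flat kernel.

    import numpy as np

    rng = np.random.default_rng(0)
    d = np.sort(rng.random(30))                 # distances to 30 nearest neighbors

    def p_given_sigma(d, sigma):
        w = np.exp(-d ** 2 / (2 * sigma ** 2))  # Gaussian affinities
        return w / w.sum()

    for sigma in (0.1, 1.0, 10.0):
        p = p_given_sigma(d, sigma)
        perp = 2 ** (-(p * np.log2(p)).sum())   # perplexity = 2^entropy
        print(sigma, round(perp, 1), round(p.max() / p.min(), 1))
    # As sigma grows, the perplexity approaches 30 (its maximum for 30 neighbors)
    # and the kernel becomes essentially uniform.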

By the way, where does the self.early_exaggeration = 24.0 if n > 10000 else 12.0 heuristic come from?

danielhanchen commented 4 years ago

@dkobak Hmm, interesting. I'll have to check where again, but it was from the original BH TSNE paper, where early exaggeration = 24 was used on larger datasets. I remember there was no mention of what constitutes "large", hence my heuristic of 10,000.

dkobak commented 4 years ago

@danielhanchen Hmm. I don't see it in the paper (http://jmlr.org/papers/volume15/vandermaaten14a/vandermaaten14a.pdf). I think it only says

In our experiments, we fix α = 12 (by contrast, van der Maaten and Hinton, 2008 used α=4).

More generally, increasing early exaggeration above 12 could potentially help for large datasets if the learning rate is held constant at 200. However, if the learning rate is scaled as N/early_exaggeration as we discussed in https://github.com/rapidsai/cuml/issues/2375, then I don't think that increasing the exaggeration value from 12 to 24 could make much of a difference in practice.
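
A tiny illustration of that point, using the update rule quoted earlier in the thread (y -= early_exaggeration * learning_rate * gains * dy) and purely illustrative numbers:

    N = 250_000                            # dataset size
    for alpha in (12, 24):
        learning_rate = N / alpha          # the N / early_exaggeration scaling from #2375
        print(alpha, round(alpha * learning_rate))
    # Both print 250000: the product early_exaggeration * learning_rate is unchanged,
    # so doubling the exaggeration barely changes the effective early-phase step size.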

github-actions[bot] commented 3 years ago

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

github-actions[bot] commented 3 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.