pavlin-policar / openTSNE

Extensible, parallel implementations of t-SNE
https://opentsne.rtfd.io
BSD 3-Clause "New" or "Revised" License
1.42k stars 157 forks source link

Implement binary uniform affinity #242

Closed dkobak closed 1 year ago

dkobak commented 1 year ago

This is a very minor suggestion, but I realized some time ago that in some papers we implement "uniform affinities" slightly differently: not as (P + P.T) / 2 but as (P + P.T > 0).astype(float) where P is the kNN adjacency matrix. The only difference is that the former has two possible non-zero values, whereas the latter has only one possible non-zero value, which is kind of neat.

This does not really make much of a different in practice, but I thought it may be nice to implement it in Uniform() using symmetrize="average" and symmetrize="or", in addition to symmetrize=True/False which this class has anyway.

tsne-uniform-two-kinds

pavlin-policar commented 1 year ago

I like this, but perhaps instead of "average" and "or", I would prefer "mean", and "max", which (to me) feel more standard. If we do go this route, I would also change the default parameter value from symmetrize=True to symmetrize="max". Do you think it would make more sense to default to the binary affinity kernel, or the current "mean" symmetrization?

dkobak commented 1 year ago

Makes sense, I changed it to "mean" and "max". Regarding the default value, I talked to @sdamrich and he suggested we use "max" as default, because that results in a conceptually simpler affinity matrix which can be seen as uniform on the "skNN" ("symmetrized kNN") graph. I don't have a strong opinion here myself. So I set "max" to default (but kept symmetrize=True as a possible input option that also defaults to "max"; symmetrize=False performs no symmetrization).

Note that this is going to change the default behaviour of this class, but I guess not many people will be affected.

pavlin-policar commented 1 year ago

I hate to be nit-picky, but could you please add a case if we get a string that isn't either "mean" or "max" and raise a ValueError with something like symmetrization {method} is not recognized. Just so we don't fail silently in case the user makes a typo or something like that.

Note that this is going to change the default behaviour of this class, but I guess not many people will be affected.

At first, I didn't really have any qualms about this, since probably not many people rely on this. But on second thought, it would probably be better to have a version with the same default as before, so "mean", but have a FutureWarning indicating that the default will change in the next version. So the roadmap might be 0.7.2 with the same behaviour as now, but with a warning, then I would change the default to "max" in the next version. Even though this change probably wouldn't affect many people, I don't want to get in this habit of assuming people don't use it, then we inadvertently break something.

dkobak commented 1 year ago

Makes sense. I changed it so that the default is True (that's how it was before) and this behaves like "mean" but triggers a FutureWarning.