jkrijthe / Rtsne

R wrapper for Van der Maaten's Barnes-Hut implementation of t-Distributed Stochastic Neighbor Embedding

why check_duplicates? #42

Closed brooksambrose closed 6 years ago

brooksambrose commented 6 years ago

There was some speculation over at this SO question about what, if any, problems could arise from setting check_duplicates to FALSE. I just wanted to hear it from the devs, since the man page doesn't offer a rationale. Is it an estimation concern, a speed/memory performance concern, or both?

My inclination would be to omit all duplicates and then assign them the same embedding vector solved for the one copy included in the estimation. Think that's valid?
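Roughly, I mean something like this (just a sketch to make the idea concrete; the row-key trick is only one way to match duplicates back up, and this is not something Rtsne does itself):

library(Rtsne)

X    <- as.matrix(iris[, 1:4])
key  <- apply(X, 1, paste, collapse = "\r")   # one string key per row
keep <- !duplicated(key)                      # first occurrence of each distinct row

fit <- Rtsne(X[keep, , drop = FALSE])                 # embed only the unique rows
Y   <- fit$Y[match(key, key[keep]), , drop = FALSE]   # duplicates reuse the coordinates of their first identical copy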

jkrijthe commented 6 years ago

I have not thought about this deeply and, to be perfectly honest, I do not remember why I made this choice initially. I seem to recall that one of the approximations potentially had problems with duplicates, but I'm not sure anymore. Of course, I suppose allowing for duplicates properly might require some decisions that are not explicitly considered in the current implementation (for instance, perhaps it makes more sense to represent the duplicated objects as a single object with more weight, rather than as separate points in the embedding, to ensure the duplicated objects end up in the same location).

Laurens van der Maaten (lvdmaaten/bhtsne) probably knows more about whether, or in what cases, duplicates could lead to performance issues.

brooksambrose commented 6 years ago

Fair enough!

mmahmoudian commented 4 months ago

@jkrijthe @brooksambrose I was about to file a bug report, but I thought I would ask you two first about the purpose of this duplicate checking. Is Rtsne trying to check for duplicated columns, duplicated rows, or duplicated values? I can understand checking for duplicates among columns (after all, we are trying to do some sort of dimensionality reduction), but the current code checks for duplicated values, as if the input were one long vector:

https://github.com/jkrijthe/Rtsne/blob/47ee0baee85e0a43914903a23ab23ba7ea9980b0/R/Rtsne.R#L157C9-L159
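Paraphrased, my reading of the check at those lines boils down to something like this (not a verbatim copy of the source):

if (check_duplicates) {
  if (any(duplicated(X))) {
    stop("Remove duplicates before running TSNE.")
  }
}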

This current behavior is hard for me to digest, and it breaks on many occasions. Here is a simple example:

library("Rtsne")
Rtsne::Rtsne(iris[, 1:4])
Error in Rtsne.default(X, ...) : Remove duplicates before running TSNE.

Here are two issues:

  1. The error is extremely vague to the end user
  2. Rtsne_cpp has zero issues processing this data frame:
    Rtsne::Rtsne(iris[, 1:4], check_duplicates = F)

By chance, in a large enough dataset it is very possible that two values in the same data frame/matrix are exactly equal. For the sake of argument, this issue is very easy to replicate in any dataset if you scale the data and reduce the decimal precision to really increase the chance of collisions/duplicates among the provided values. In the case above, iris has a precision of only one decimal place, which increases that chance. In the following snippet, you can use any dataset you like and it will cause the error (useful for reproducing the issue in any data in order to fix it):

# any data you have at hand
data |>
  scale() |>           # scale so the values are bounded and round() behaves as expected
  round(digits = 1) |> # of course you can even reduce the digits to zero to really bump up the collision rate
  Rtsne::Rtsne()

I am willing to work this out and provide a PR addressing both of the aforementioned issues, but at the moment I don't know what the true intention behind this check is.

SamGG commented 4 months ago

Hi, I think you should open a new issue since this one is closed, even if your topic is clearly related to it. I read the SO link, but nothing strikes me. IMHO, if one wants to ignore duplicates, just set check_duplicates = FALSE. The error message could be improved, but the default cannot be changed, as much existing code may depend on the default values. IMHO, if there are a few identical points within a dataset, not much trouble will occur. If we imagine there are a lot of identical points for one profile, the NN algorithm may always select the same nearest neighbors and ignore the next nearest neighbors. This may result in misplaced points. In my experience, when there are a handful of such profiles, the map shows coils. Thanks for your point.

mmahmoudian commented 4 months ago

@SamGG

Thanks for your points. The following are my responses to your remarks:

I read the SO link, but nothing strikes me.

My point is even more fundamental than anything discussed in that SO question. I suggest ignoring the SO post and focusing on the simple examples I provided.

IMHO, if one wants to ignore duplicates, just set check_duplicates = FALSE.

As I explained, this is logically wrong, because there is always a chance that two values in the same table are identical. That should not throw an error, especially because the C++ tSNE works just fine with those identical values.

the default cannot be changed, as much existing code may depend on the default values.

Allow me to disagree, simply because this is not a breaking change. We have three options:

  1. One can keep the check_duplicates argument and simply ignore it in the body of the function (a common practice in some circles). This way, code for which the function was already working will keep working, and code for which it was breaking will now work correctly and without any issues.
  2. Software gets updated all the time, and changes happen. This is why we have versioning systems: major versions are incremented when a breaking change is introduced. Fear of change (in this case change = improvement) is extremely unconstructive, to say the least.
  3. I suggest taking a look at the lifecycle package, which is literally designed to handle such changes. Many packages, including all Tidyverse packages, use it. There is a very nice explanation of this package here: https://lifecycle.r-lib.org/articles/stages.html (see the sketch after this list).
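To make option 3 concrete, here is a hedged sketch of how lifecycle could be used for such a transition (the function name, version number, and message are purely illustrative, not actual Rtsne code):

my_tsne <- function(X, check_duplicates = lifecycle::deprecated()) {
  if (lifecycle::is_present(check_duplicates)) {
    # tell the caller the old argument no longer has an effect (illustration only)
    lifecycle::deprecate_warn(
      when = "1.0.0",
      what = "my_tsne(check_duplicates)",
      details = "Duplicate rows are now reported with a warning instead of an error."
    )
  }
  # ... the embedding itself would happen here ...
  invisible(X)
}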

IMHO, if there are a few identical points within a dataset, not much trouble will occur.

As I showed with the iris example, it will cause an error. Even a single "duplicate" causes the error.

If we imagine there are a lot of identical points for one profile, the NN algorithm may always select the same nearest neighbors and ignore the next nearest neighbors. This may result in misplaced points. In my experience, when there are a handful of such profiles, the map shows coils.

I agree that the underlying algorithm should handle this situation, but even if we want to alert the user about certain issues, we should put a threshold on it, for example more than 10% duplication or something similar. Being sensitive to a single duplicate value just hurts the end user.
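To illustrate, a rough sketch of such a threshold-based check (the 10% cut-off and the wording are only examples, not a proposal for the exact implementation):

X <- as.matrix(iris[, 1:4])
dup_frac <- mean(duplicated(X))   # fraction of rows that repeat an earlier row
if (dup_frac > 0.10) {            # 10% threshold, purely for illustration
  warning(sprintf("%.1f%% of the rows are duplicates; the embedding may be affected.", 100 * dup_frac))
}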

Also, when a user sees such a cryptic error (or any error with their data), they will just move on to using another package instead.

I think you should open a new issue as this one is closed, even if your topic is clearly related to this issue.

Thanks. I might ultimately do that if this issue doesn't get traction or a response from the maintainers.

SamGG commented 4 months ago

I think we will not agree on how to manage changes, so I will not argue about that. I agree that the messages should be less cryptic, even in the documentation ("check_duplicates logical; Checks whether duplicates are present. It is best to make sure there are no duplicates present and set this option to FALSE, especially for large datasets (default: TRUE)"). But I think that a user should know what an algorithm is doing when using it for a task in an analysis. Just switching to another algorithm because there is an error in the initially selected one is something I don't understand. It's very easy to make an algorithm fail, or, even worse, to badly format its input and get wrong outputs. Finally, we are not on the same page about what we call an error.

IMHO, if there are a few identical points within a dataset, not much trouble will occur.

As I showed with the iris example, it will cause an error. Even a single "duplicate" causes the error.

My point is about the numerical result, and I consider that the result is OK when duplicated points are ignored, which leads to my "no trouble/error" once one agrees to ignore duplicates. Your point is about the fact that the function call does not produce a result. You are right, strictly speaking this is an error, but I am looking further ahead.

It was interesting to exchange, but I will spend my time on things I can improve. Best.

jkrijthe commented 4 months ago

I don't think I have much to add to my earlier response in this thread: the reason the check is there is that it is a check in the original C++ implementation this package was based on, and I did not think deeply about how to handle duplicates (there are multiple reasonable ways, I suppose), or check how the loss, gradients, or approximations might be affected. I therefore leave it to the user to decide how to handle duplicates, or whether they are willing to use the implementation for this unsupported use. In short, my contention is that it is not clear whether the "C++ tSNE works just fine" in these cases, so it is up to the user to decide if they want to use it in that way.

I do not consider this an "error", but rather a warning that the user explicitly has to make a choice about how to handle duplicates. Even in the basic examples we explicitly show removing duplicates, because indeed, you often have them in the data.
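For reference, the basic example handles duplicates along these lines (paraphrased; the exact code in the package documentation may differ slightly):

library(Rtsne)
iris_unique <- unique(iris)                        # remove duplicate rows first
tsne_out <- Rtsne(as.matrix(iris_unique[, 1:4]))   # then run the embedding
plot(tsne_out$Y, col = iris_unique$Species)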

Regarding your first point about the vagueness of the error: the motivation for the current message is that I assumed the term "duplicate" is clear in R due to the naming of the duplicated() function. But if you have better suggestions, I would of course be happy to hear them.

mmahmoudian commented 4 months ago

@jkrijthe the main reason it is vague is that the user provides a table (a data frame or matrix) and what they hope for is a reduction of the dimensions (i.e. the columns). So when I, as a user, read that there are duplicates, the first thing that comes to mind is duplicated columns, the next is duplicated rows, and the last is duplicated values.

I believe the better way of informing the user is via a warning rather than an error (i.e. using warning() rather than stop()). A warning also lets the rest of the code run and gives the user the result, albeit with a note about the potential issue.

Regarding the text, it can be much more informative and guide the user to what the issue is. Something like: "There are two or more values in the provided dataframe that are identical. This can potentially cause issues while calculating the tSNE."

If you are on board with these suggestions, I can make a PR, and then the implementation and the text can be fine-tuned. Of course, it is your software/project, and my comments are merely suggestions towards user-friendliness.

jkrijthe commented 3 months ago

Sorry for the late reply and thanks for the suggestions. Given that the embedding of duplicate points is currently non-supported behaviour, I would prefer to keep it as an error rather than a warning, so the user has to explicitly decide they want this behaviour. I agree the error text can be more informative. Perhaps something like: "There are duplicate rows in the input. Please remove these duplicate rows or explicitly allow running the embedding in this (non-supported) setting using check_duplicates=FALSE."
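Applied to the existing check, that would look roughly like this (a sketch of the wording above, not a committed change):

if (check_duplicates && any(duplicated(X))) {
  stop("There are duplicate rows in the input. Please remove these duplicate rows ",
       "or explicitly allow running the embedding in this (non-supported) setting ",
       "using check_duplicates=FALSE.")
}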