haifengl / smile

Statistical Machine Intelligence & Learning Engine
https://haifengl.github.io
Other
6.04k stars 1.13k forks source link

Handling of identical objects in DBScan #54

Closed badgersow closed 8 years ago

badgersow commented 8 years ago

Hi,

First let me thank you for the great library, it is pleasure to use it.

In my work project I use DBScan class to cluster a bag of strings. I have noticed that by default identical objects are ignored and not clustered together.

Please consider the following test:

@Test
public void testIdentical() {
    String[] dataset = new String[]{"a", "a", "a", "a", "b"};
    Distance<String> distance = new Distance<String>() {
        @Override
        public double d(String x, String y) {
            return x.equals(y) ? 0.0 : 1.0;
        }
    };

    DBScan<String> clusterer = new DBScan<>(dataset, distance, 2, 0.5);
    int[] labels = clusterer.getClusterLabel();
    assertTrue(labels[0] != DBScan.OUTLIER);
}

It is failing because strings "a" are not clustered together and are considered as outliers. I believe this behavior is not intuitive. The problem is not critical and I managed to workaround this with the following lines:

LinearSearch<String> search = new LinearSearch<>(allNames, JARO_WINKLER_DISTANCE);
search.setIdenticalExcluded(false);
DBScan<String> clusterer = new DBScan<>(allNames, search, MIN_PTS, RADIUS);

But nevertheless, do you think the default behavior is ok? Should we at least provide constructor to DBScan to allow to create model with identical objects not excluded?

Thanks for your time.

haifengl commented 8 years ago

Thanks! As you may figure it out yourself, DBSCan use RNNSearch inside. When a Distance or Metric is supplied, we wrap it with LinearSearch or CoverTree. setIdenticalExcluded(true) is important for some uses including DBScan. But for your use case, it is different due to the meaning of JAR_WINKLER_DISTANCE. I will check if I should update the semantical meaning of setIdenticalExcluded, for example it means exactly same object (not just distance 0). I have to double check before making this change. Thanks!

haifengl commented 8 years ago

BTW, can you please contribute your JARO_WINKLER_DISTANCE implementation to SMILE? Thank you very much in advance! If you agree, please put it in package smile.math.distance. Thanks again!

haifengl commented 8 years ago

Nearest neighbor search is used in many density estimation and clustering algorithms. In most situations, distance 0 means that two objects are the same in the feature space. Without excluding identical data points, we may get into troubles in some cases. It is why we do setIdenticalExcluded(true) by default. I think that it is reasonable and also more efficient. But in your use case, many objects have 0 distance even if they are different. Your approach with customized LinearSearch is correct and I am glad it works. The problem comes from the distance definition not DBSCan. I won't change the default behavior otherwise many other algorithms will not work correctly. Thanks!

badgersow commented 8 years ago

Thank a lot you for fast and detailed reply.

I think I was not clear last time with my explanation, please let me rephrase it. My task is to cluster the set of strings into groups. In my case, distance 0 is possible only with equal or identical (equal by reference in Java terms) strings.

The problem occurs when I have two identical Java strings in dataset. They have distance 0 and they are reference-equal. I expect them to be in one cluster, but because of identicalExcluded = false, objects are labeled as noise.

You can check that with dataset {"123", "123"} strings are labeled as noise because they are the same Java objects (thanks, Java string pool), but with dataset {new String("123"), new String("123")} objects are clustered together because they have different references and distance 0.

In my opinion this example exposes inconsistent behavior because the algorithm depends on reference equality in Java language.

Anyway, I am not a specialist at machine learning, I just want to be sure you understand me right because of my terrible English. Thanks again for reviewing this issue!

About Jaro-Winkler distance implementation In my case I used implementation provided by Apache commons library, please see code below:

public static final Distance<String> JARO_WINKLER_DISTANCE = new Distance<String>() {
    @Override
    public double d(String from, String to) {
        return 1.0 - StringUtils.getJaroWinklerDistance(from, to);
    }
};

Do you think I should add dependency below in SMILE and add distance to package smile.math.distance?

<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-lang3</artifactId>
    <version>3.3.2</version>
</dependency>

Many thanks for your time.

haifengl commented 8 years ago

Thanks for explanation. Value equality and reference equality are treated differently in almost all programming languages. I feel that our behavior is correct in this sense. In DBScan, we need to query every object's neighborhood. If we return the query object in the neighborhood, we will get into a self-reference loop. Although our code avoids getting into dead loop, it is still not a desired behavior to including the query object in the neighborhood. It is the purpose of setIdenticalExcluded.

So it is clear that Java pool string literal, which is the root cause. As you said, new String() is probably the right solution. With other data, I observe that the result is worse if I do setIdenticalExclude(false). Even though it may work for you, be careful with the clustering results.

Since you use apache commons for Jar-Winkler distance, it is not necessary to include it here. Thanks!

badgersow commented 8 years ago

Thanks for clear and detailed explanation. With your permission I'd like to add couple of lines to documentation of RNNSearch class, explaining default behavior.

If you are OK with this, I'll open pull request.

Also, I will follow your advice and rewrite my code to use new String() approach

haifengl commented 8 years ago

Sure. Go ahead adding comments and make a pull request. Thanks!

haifengl commented 8 years ago

Btw, not sure your use case. But usually BKTree and CoverTree are much faster than linear search. We do have a very efficient implementation of edit distance. If your data is large, maybe worth of a try.

badgersow commented 8 years ago

Thanks for advice, but according to this article, Jaro-Winkler is not a metric because it does not satisfy triangle equality. It seems I can use BKTree and CoverTree only with metric. Do we have better approach than LinearSearch when distance does not satisfy triangle inequality? BTW, my dataset should not be more than 10K objects.

haifengl commented 8 years ago

Since your data is not big, it is okay with linear search. I don't know your application, which may need that string distance. Just suggest to try edit distance if it makes sense in your case. It really depends on the use case. Speed is only a secondary consideration for small data.

badgersow commented 8 years ago

Many thanks for help. I am closing the issue.