tdebatty / java-string-similarity

Implementation of various string similarity and distance algorithms: Levenshtein, Jaro-winkler, n-Gram, Q-Gram, Jaccard index, Longest Common Subsequence edit distance, cosine similarity ...
Other
2.69k stars 413 forks source link

Resolves #28: null and empty value handling #29

Closed paulirwin closed 7 years ago

paulirwin commented 7 years ago

This PR resolves #28 for handling empty strings, but also handles null values as well, treating null values effectively as empty strings. This should avoid any unexpected NullPointerExceptions from users of the library. Additional tests have been added for these cases.

In the case of normalized similarity/distance: two null or empty values are treated as similar/zero-distance, while if one of them is null or empty it is treated as dissimilar/completely distant.

For metric and non-metric distance algorithms, all of them appeared to use the max length in the case of empty strings already, so I took that approach for all of them. I'm not 100% confident that this is correct, so let me know if any of the algorithms don't work with that approach.

Had to modify .gitignore to exclude the IntelliJ-generated files.

coveralls commented 7 years ago

Coverage Status

Coverage increased (+3.7%) to 73.324% when pulling d8ef94f06b18efa8bd9037b1a93ec3ad7ee23d17 on paulirwin:issue/28 into 471682eaa78915ad5912fd7c85b045e4e15f3dbe on tdebatty:master.

paulirwin commented 7 years ago

I just discovered a bug that "" and null together are not treated correctly. Will update soon with additional tests for that case.

coveralls commented 7 years ago

Coverage Status

Coverage increased (+6.6%) to 76.235% when pulling b517f003239c98a848895d761f65727ce0c14213 on paulirwin:issue/28 into 471682eaa78915ad5912fd7c85b045e4e15f3dbe on tdebatty:master.

paulirwin commented 7 years ago

Fixed the case of passing (null, ""). Also refactored my changes to the unit tests to reduce the amount of test code, and also added tests for WeightedLevenshtein.

mpkorstanje commented 7 years ago

I regret providing this comment this late. However I don't think that treating null and empty string is a good thing. The difference between the empty string "" and any other string "hello world" can be measured. However the difference between the empty string and no string at all is of a different category.

Null pointers aren't an annoyance, rather they let you know that your input is not what you expect it to be.

paulirwin commented 7 years ago

@mpkorstanje The existing code wasn't handling null values consistently or even caring about them at all - it was getting as far as it could into the algorithm before letting an NPE happen. Either the methods should enforce it via an IllegalArgumentException up front at the beginning of the method (or NPE, there is some disagreement about this) and either mention that in the JavaDoc or be checked (ugh), or coalesce it to empty and do a comparison on them. I'm fine with modifying this PR to throw one of those exceptions and list it in the JavaDoc instead of coalescing it to empty if that's the consensus here.

mpkorstanje commented 7 years ago

Agreed. The handling should be consistent for all metrics.

I would suggest using NPE. I agree it is inconsistent with the semantics of NPE but it is consistent with Objects.requireNonNull from Java 7 and Guava's Precondition.checkNotNull.

So I would consider the following behavior to be correct:

distance(null,"") // throws nullpointer exception
distance("",null); // throws nullpointer exception
distance(null,null); // throws nullpointer exception
distance("","") == 0.0

I've enforced the same behavior in Simmetrics. Most of it through the early out checks for empty strings. This avoids the overhead of actually dedicating a specific check to the nullity of the arguments and works nicely with java's confusion of NPE and IllegalArgumentException.

I can't contributed to consensus. @tdebatty is the owner, I'm merely giving my advice as to the best practices.

paulirwin commented 7 years ago

On my trip home I convinced myself that throwing an NPE with a meaningful message and listed in the JavaDoc would be best.

My specific use case coalesces null to empty, as I'm calculating the similarity of an optional "Address Line 2" field against a provided value, but that's my specific case - there's no good reason to impose that on others. Also null != "" so that shouldn't evaluate to a normalized similarity of 1.

@tdebatty If you're in agreement let me know and I'll update the PR.

tdebatty commented 7 years ago

Hi,

Thank you all for the contributions! Based on these comments I think indeed that throwing an NPE is the best way to go...

Le ven. 20 janv. 2017 à 00:14, Paul Irwin notifications@github.com a écrit :

On my trip home I convinced myself that throwing an NPE with a meaningful message and listed in the JavaDoc would be best.

My specific use case coalesces null to empty, as I'm calculating the similarity of an optional "Address Line 2" field against a provided value, but that's my specific case - there's no good reason to impose that on others. Also null != "" so that shouldn't evaluate to a normalized similarity of 1.

@tdebatty https://github.com/tdebatty If you're in agreement let me know and I'll update the PR.

— You are receiving this because you were mentioned.

Reply to this email directly, view it on GitHub https://github.com/tdebatty/java-string-similarity/pull/29#issuecomment-273928876, or mute the thread https://github.com/notifications/unsubscribe-auth/AA1SDIT-avyyuTn0aajYq8cIwzCny2c5ks5rT-5GgaJpZM4Lnp9s .

coveralls commented 7 years ago

Coverage Status

Coverage increased (+8.5%) to 78.17% when pulling 6135d392515b3b48db2d024316b2e3b4a30f8d35 on paulirwin:issue/28 into 471682eaa78915ad5912fd7c85b045e4e15f3dbe on tdebatty:master.

coveralls commented 7 years ago

Coverage Status

Coverage increased (+8.9%) to 78.507% when pulling 75142c0ce40a72dadb447bd92c3ebd80b57127c8 on paulirwin:issue/28 into 471682eaa78915ad5912fd7c85b045e4e15f3dbe on tdebatty:master.

paulirwin commented 7 years ago

The code to throw NPEs is now complete, I haven't yet updated the JavaDocs. I will do that soon.

coveralls commented 7 years ago

Coverage Status

Coverage increased (+8.7%) to 78.361% when pulling 3406d2cfd853ca385090d144eed117f636ebd304 on paulirwin:issue/28 into 471682eaa78915ad5912fd7c85b045e4e15f3dbe on tdebatty:master.

tdebatty commented 7 years ago

Just created a new release. Should be available on maven within a few hours...

Thanks!