Closed paulirwin closed 7 years ago
I just discovered a bug that "" and null together are not treated correctly. Will update soon with additional tests for that case.
Fixed the case of passing (null, ""). Also refactored my changes to the unit tests to reduce the amount of test code, and also added tests for WeightedLevenshtein.
I regret providing this comment this late. However I don't think that treating null and empty string is a good thing. The difference between the empty string "" and any other string "hello world" can be measured. However the difference between the empty string and no string at all is of a different category.
Null pointers aren't an annoyance, rather they let you know that your input is not what you expect it to be.
@mpkorstanje The existing code wasn't handling null values consistently or even caring about them at all - it was getting as far as it could into the algorithm before letting an NPE happen. Either the methods should enforce it via an IllegalArgumentException
up front at the beginning of the method (or NPE, there is some disagreement about this) and either mention that in the JavaDoc or be checked (ugh), or coalesce it to empty and do a comparison on them. I'm fine with modifying this PR to throw one of those exceptions and list it in the JavaDoc instead of coalescing it to empty if that's the consensus here.
Agreed. The handling should be consistent for all metrics.
I would suggest using NPE. I agree it is inconsistent with the semantics of NPE but it is consistent with Objects.requireNonNull from Java 7 and Guava's Precondition.checkNotNull.
So I would consider the following behavior to be correct:
distance(null,"") // throws nullpointer exception
distance("",null); // throws nullpointer exception
distance(null,null); // throws nullpointer exception
distance("","") == 0.0
I've enforced the same behavior in Simmetrics. Most of it through the early out checks for empty strings. This avoids the overhead of actually dedicating a specific check to the nullity of the arguments and works nicely with java's confusion of NPE and IllegalArgumentException.
I can't contributed to consensus. @tdebatty is the owner, I'm merely giving my advice as to the best practices.
On my trip home I convinced myself that throwing an NPE with a meaningful message and listed in the JavaDoc would be best.
My specific use case coalesces null to empty, as I'm calculating the similarity of an optional "Address Line 2" field against a provided value, but that's my specific case - there's no good reason to impose that on others. Also null != ""
so that shouldn't evaluate to a normalized similarity of 1.
@tdebatty If you're in agreement let me know and I'll update the PR.
Hi,
Thank you all for the contributions! Based on these comments I think indeed that throwing an NPE is the best way to go...
Le ven. 20 janv. 2017 à 00:14, Paul Irwin notifications@github.com a écrit :
On my trip home I convinced myself that throwing an NPE with a meaningful message and listed in the JavaDoc would be best.
My specific use case coalesces null to empty, as I'm calculating the similarity of an optional "Address Line 2" field against a provided value, but that's my specific case - there's no good reason to impose that on others. Also null != "" so that shouldn't evaluate to a normalized similarity of 1.
@tdebatty https://github.com/tdebatty If you're in agreement let me know and I'll update the PR.
— You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub https://github.com/tdebatty/java-string-similarity/pull/29#issuecomment-273928876, or mute the thread https://github.com/notifications/unsubscribe-auth/AA1SDIT-avyyuTn0aajYq8cIwzCny2c5ks5rT-5GgaJpZM4Lnp9s .
The code to throw NPEs is now complete, I haven't yet updated the JavaDocs. I will do that soon.
Just created a new release. Should be available on maven within a few hours...
Thanks!
This PR resolves #28 for handling empty strings, but also handles null values as well, treating null values effectively as empty strings. This should avoid any unexpected NullPointerExceptions from users of the library. Additional tests have been added for these cases.
In the case of normalized similarity/distance: two null or empty values are treated as similar/zero-distance, while if one of them is null or empty it is treated as dissimilar/completely distant.
For metric and non-metric distance algorithms, all of them appeared to use the max length in the case of empty strings already, so I took that approach for all of them. I'm not 100% confident that this is correct, so let me know if any of the algorithms don't work with that approach.
Had to modify .gitignore to exclude the IntelliJ-generated files.