sourish-rygbee / dkpro-similarity-asl

Automatically exported from code.google.com/p/dkpro-similarity-asl

CosineSimilarity: Round-off error for identical sentences #11

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
 1. Modify the example class (de.tudarmstadt.ukp.similarity.example.WithoutDKPro) in the following way:

  // ### BEGIN OF SNIPPET ###
  CosineSimilarity measure = new CosineSimilarity();

  // Example 1: two identical sentences with seven tokens each
  List<String> lemmas1 = getTokens("This is a sentence with seven tokens");
  List<String> lemmas2 = getTokens("This is a sentence with seven tokens");

  System.out.println(measure.getSimilarity(lemmas1, lemmas2));

  // Example 2: two identical, longer sentences
  lemmas1 = getTokens("This is a sentence which results in an invalid cosine similarity score .");
  lemmas2 = getTokens("This is a sentence which results in an invalid cosine similarity score .");

  System.out.println(measure.getSimilarity(lemmas1, lemmas2));
  // ### END OF SNIPPET ###

 2. Execute the code and check the results.

What is the expected output? What do you see instead?
 Example 1:
 - Expected: 1.0
 - Actual: 0.9999999999999999

 Example 2: 
 - Expected: 1.0
 - Actual: 1.0000000000000002
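
 One plausible source of both deviations, sketched under the assumption that the measure divides the dot product by the product of two square-root norms (the library's actual code is not shown in this report, and the class `PrecisionDemo` is hypothetical): for identical vectors the norm is `sqrt(dot)`, and `sqrt(dot) * sqrt(dot)` does not always reproduce `dot` exactly in double arithmetic, so the quotient can land just below or just above 1.0.

```java
import java.util.Arrays;

// Hypothetical sketch; NOT the library's implementation.
final class PrecisionDemo {

    /** Cosine of a vector with itself, computed as dot / (norm * norm). */
    static double selfCosine(double[] v) {
        double dot = 0.0;
        for (double x : v) {
            dot += x * x;
        }
        double norm = Math.sqrt(dot);
        // norm * norm is not guaranteed to equal dot exactly in double arithmetic.
        return dot / (norm * norm);
    }

    /** A vector of n ones, i.e. n distinct tokens that each occur once. */
    static double[] ones(int n) {
        double[] v = new double[n];
        Arrays.fill(v, 1.0);
        return v;
    }

    public static void main(String[] args) {
        // 7 and 13 are the numbers of distinct tokens in the two example sentences.
        System.out.println(selfCosine(ones(7)));
        System.out.println(selfCosine(ones(13)));
    }
}
```

 Both sentences above consist of distinct tokens only (7 and 13 of them), so plain term-frequency vectors for them would look like the all-ones vectors used here.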

What version of the product are you using? On what operating system?
 Latest version from trunk (18.07.2013).

Please provide any additional information below.
 Java SE 1.6
 Mac OS X 10.7.5

Original issue reported on code.google.com by rouven.r...@gmail.com on 19 Jul 2013 at 7:26

GoogleCodeExporter commented 9 years ago
Sorry, I can't edit the title: this is a precision problem rather than a 
round-off error.

Original comment by rouven.r...@gmail.com on 19 Jul 2013 at 8:23

GoogleCodeExporter commented 9 years ago
Well, we could add a check for identical sentences and return 1.0 in that case, 
but it will increase the run-time in all other cases.

Is it affecting your application?
In most cases 0.9999999999999999 should be close enough to 1.0 to pass an 
equality check with a reasonable epsilon.
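
A minimal sketch of such an epsilon-based equality check on the caller's side. The class name `ApproxEquals`, the helper `approxEquals` and the tolerance `1e-9` are assumptions chosen for illustration; they are not part of the library.

```java
// Hypothetical caller-side helper; not part of dkpro-similarity.
final class ApproxEquals {

    /** Assumed tolerance; pick one that suits the application. */
    static final double EPSILON = 1e-9;

    /** Returns true if a and b differ by at most epsilon. */
    static boolean approxEquals(double a, double b, double epsilon) {
        return Math.abs(a - b) <= epsilon;
    }

    public static void main(String[] args) {
        // Both scores reported in this issue pass the check against 1.0.
        System.out.println(approxEquals(0.9999999999999999, 1.0, EPSILON));
        System.out.println(approxEquals(1.0000000000000002, 1.0, EPSILON));
    }
}
```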

Original comment by torsten....@gmail.com on 19 Jul 2013 at 9:04

GoogleCodeExporter commented 9 years ago
Hi, no, "0.9999999999999999" doesn't affect my application at all! I just added 
this example for completeness.
However, I assume most users (including myself) expect the score to lie strictly 
within [0, 1], so 1.0000000000000002 was a problem for my application.
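
A caller-side workaround for the range problem could be to clamp the returned score to [0, 1]. This is only a sketch: `ScoreClamp` and `clampToUnitInterval` are hypothetical names, and this is not necessarily how the library handles it.

```java
// Hypothetical caller-side workaround; not part of dkpro-similarity.
final class ScoreClamp {

    /** Forces a score into [0, 1]; NaN is passed through unchanged. */
    static double clampToUnitInterval(double score) {
        if (Double.isNaN(score)) {
            return score;
        }
        return Math.max(0.0, Math.min(1.0, score));
    }

    public static void main(String[] args) {
        // The out-of-range value from Example 2 is pulled back to 1.0.
        System.out.println(clampToUnitInterval(1.0000000000000002));
    }
}
```

Clamping only corrects scores that overshoot the interval by a tiny amount; it leaves scores inside the interval untouched.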

Original comment by rouven.r...@gmail.com on 19 Jul 2013 at 1:25

GoogleCodeExporter commented 9 years ago
Should be fixed now.

Original comment by torsten....@gmail.com on 22 Jul 2013 at 1:04

GoogleCodeExporter commented 9 years ago
The fix introduces another bug.
See the following (anonymized) example:

String textA = "1 3 4 5 6 7 8 9 3 10 7 11 .";
String textB = "2 3 12 13 5 3 7 11 14 15 3 7 .";
CosineSimilarity cosineSimilarityMeasure = new CosineSimilarity();
List<String> tokensA = getTokens(textA);
List<String> tokensB = getTokens(textB);
double similarity = cosineSimilarityMeasure.getSimilarity(tokensA, tokensB);

System.out.println(similarity); // Prints 1.0, although the two texts differ

The incorrect score 1.0 is returned because of the change in line 198 ff.
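
For comparison, here is a self-contained sketch of what a plain term-frequency cosine would give for these two texts. It assumes whitespace tokenization and raw term counts without any further weighting; `TfCosineSketch` and its methods are hypothetical and independent of the library's `CosineSimilarity` and `getTokens`. Under these assumptions the score is 13 / sqrt(17 * 21), about 0.69, so well below 1.0.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical reference computation; NOT the library's implementation.
final class TfCosineSketch {

    /** Assumed tokenization: split on whitespace. */
    static List<String> tokenize(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    /** Counts how often each token occurs. */
    static Map<String, Integer> termFrequencies(List<String> tokens) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : tokens) {
            tf.merge(token, 1, Integer::sum);
        }
        return tf;
    }

    /** Euclidean length of a term-frequency vector. */
    static double norm(Map<String, Integer> tf) {
        double sum = 0.0;
        for (int count : tf.values()) {
            sum += (double) count * count;
        }
        return Math.sqrt(sum);
    }

    /** Cosine of the two term-frequency vectors; 0.0 if either is empty. */
    static double cosine(List<String> tokensA, List<String> tokensB) {
        Map<String, Integer> tfA = termFrequencies(tokensA);
        Map<String, Integer> tfB = termFrequencies(tokensB);
        double dot = 0.0;
        for (Map.Entry<String, Integer> entry : tfA.entrySet()) {
            Integer other = tfB.get(entry.getKey());
            if (other != null) {
                dot += (double) entry.getValue() * other;
            }
        }
        double denominator = norm(tfA) * norm(tfB);
        return denominator == 0.0 ? 0.0 : dot / denominator;
    }

    public static void main(String[] args) {
        String textA = "1 3 4 5 6 7 8 9 3 10 7 11 .";
        String textB = "2 3 12 13 5 3 7 11 14 15 3 7 .";
        System.out.println(cosine(tokenize(textA), tokenize(textB)));
    }
}
```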

Original comment by rouven.r...@gmail.com on 19 Aug 2013 at 2:23