The diffing algorithm is not accurate

GoogleCodeExporter commented 8 years ago

While it *does* report differences, it tends to pick the largest diff that
it can (rather than more, smaller, diffs).  

Choose a large (10MB or so) text file with lots of carriage returns -- an
XML file formatted for readability will do -- and permute it with the
following algorithm (this is Ruby code; $. is the current line in the
input, and gsub! alters the current string):

  for x in STDIN
    if $. % 1000 == 0
      deleted = x 
    elsif $. % 100 == 0          
      puts deleted if deleted
    elsif $. % 10 == 0             
      x.gsub!(/>.*</, ">XY<")
      puts x
    else
      puts x
    end
  end                              

This deletes every 1000'th line, inserts the previously deleted line every
100'th line, and alters every 10'th line.   

diff_match_patch reported the following changes:

  Operation     String size
  EQUAL         339
  DELETE        13934676
  INSERT        13860601
  EQUAL         92

So, basically, it reported that the whole file had changed.  The same two
files run through GNU diff resulted in 36766 differences.  A corresponding
patch made from diff_match_patch would have resulted in a patch file almost
as big as the original file (nearly 14MB); the patch file from GNU diff was
around 3MB, or 1/4 the size. This means that that the algorithm used by
diff_match_patch produces extremely inefficient patches.

Incidentally, setting the "checklines" flag to false actually results in a
*faster*, not slower, diff, although the resulting reported differences
don't vary greatly.

This is with 20090501 with Java 1.6.0 on Ubuntu Intrepid.

Original issue reported on code.google.com by seanerussell on 7 Jun 2009 at 12:21

Merged into: #26

GoogleCodeExporter commented 8 years ago

Incidentally, I did play around with the parameters at the beginning of the 
file;
aside from causing the differencing operation to take much longer, they had no
significant change to the results.  The final settings I used were:

  // Number of seconds to map a diff before giving up.  (0 for infinity)
  public float Diff_Timeout = 40.0f;
  // Cost of an empty edit operation in terms of edit characters.
  public short Diff_EditCost = 4;
  // The size beyond which the double-ended diff activates.
  // Double-ending is twice as fast, but less accurate.
  public short Diff_DualThreshold = 32;
  // Tweak the relative importance (0.0 = accuracy, 1.0 = proximity)
  public float Match_Balance = 0.0f;
  // At what point is no match declared (0.0 = perfection, 1.0 = very loose)
  public float Match_Threshold = 0.0f;
  // The min and max cutoffs used when computing text lengths.
  public int Match_MinLength = 10;
  public int Match_MaxLength = 1000;
  // Chunk size for context length.
  public short Patch_Margin = 4;

At this point, the differencing operation (diff_main) took much longer than the 
GNU
diff algorithm in Java (http://www.bmsi.com/java/#diff); the latter produces 
results
similar to GNU diff in 26 seconds; the former produced the usual, inaccurate, 
"the
whole file changed" results in 41 seconds.

Original comment by seanerussell on 7 Jun 2009 at 12:38

GoogleCodeExporter commented 8 years ago

The accuracy of the diff is not in question (it's simply reaching the timeout 
and correctly returning a non-optimal solution).  Merging this bug with the 
performance bug since that's the real issue.

Original comment by neil.fra...@gmail.com on 26 Aug 2010 at 4:12

Changed state: Duplicate

peruginni / google-diff-match-patch

The diffing algorithm is not accurate #20