ihoglen / google-diff-match-patch

Automatically exported from code.google.com/p/google-diff-match-patch
Apache License 2.0
0 stars 0 forks source link

Sub-optimal output from diff #57

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?
1. Ran the following Python code:  

  from diff_match_patch import diff_match_patch
  dmp = diff_match_patch()

  text1="A.A.F.z.z"
  text2="x.x.F.A.A.F"

  diff = dmp.diff_main( text1, text2, False )
  print diff

What is the expected output? What do you see instead?

Expected:
  [(1, 'x.x.F.'), (0, 'A.A.F'), (-1, '.z.z')]

Actual:
  [(-1, 'A'), (1, 'x'), (0, '.'), (-1, 'A'), (1, 'x'), (0, '.F.'), (-1, 'z'), (1, 'A.A'), (0, '.'), (-1, 'z'), (1, 'F')]

What version of the product are you using? On what operating system?
Latest diff-match-patch checked out from SVN (rev 97). Mac OS X.

Please provide any additional information below.
The expected output seems easier to read. The actual output has the same length 
D-path (5) as the expected output and Myers' algorithm seems to be finding that 
one first. So I'm not sure if this is a bug in diff-match-patch or a 
performance trade-off?

Original issue reported on code.google.com by m...@dixon.se on 3 Nov 2011 at 1:19

GoogleCodeExporter commented 8 years ago
Behaviour confirmed.  As you pointed out, the two options are tied as the 
shortest possible solution.  In this case it happens it finds the ugly one 
first.

If you want an output that is easy to read, the intent is to pass the raw diff 
to diff_cleanupSemantic() which has some reasonably effective heuristics to 
clean up the results.  However in this case it fails to detect the commonality. 
 It is worth noting that if one reverses text1 and text2 it behaves perfectly.

The issue appears to be an oversight in diff_cleanupSemantic().  Currently it 
checks for this case:
# Find any overlaps between deletions and insertions.
# e.g: <del>abcxxx</del><ins>xxxdef</ins>
#   -> <del>abc</del>xxx<ins>def</ins>

The case of
<del>xxxabc</del><ins>xxxdef</ins>
and
<del>abcxxx</del><ins>defxxx</ins>
is already handled by the diff algorithm itself.

What's missing is this heuristic:
<del>xxxabc</del><ins>defxxx</ins>
 -> <ins>def</ins>xxx<del>abc</del>

Working on it...
[Nice bug report BTW.  You know your stuff.]

Original comment by neil.fra...@gmail.com on 3 Nov 2011 at 6:13

GoogleCodeExporter commented 8 years ago
Code complete.  Out for review...

Original comment by neil.fra...@gmail.com on 4 Nov 2011 at 2:06

GoogleCodeExporter commented 8 years ago
...complete.  Thanks for pointing out this sub-optimal output!

Original comment by neil.fra...@gmail.com on 9 Nov 2011 at 9:33