tsboh / google-diff-match-patch

Automatically exported from code.google.com/p/google-diff-match-patch
Apache License 2.0
0 stars 0 forks source link

slightly different results of cleanupSemantic/cleanupEfficiency for C++/Java vs. Python impl. #71

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
What version of the product are you using? On what operating system?
SVN r103.

Please provide any additional information below.

During my port of diff_match_patch to another language I
did a comparison of the output of the other implementations,
esp. Java, C++, and Python. I ran diff_main on speedtest1.txt
and speedtest2.txt files, then various cleanup functions.

For functions diff_cleanupSemantic and diff_cleanupEfficiency
I noticed slight differences between C++/Java and Python
implementations.

I suppose the Python version is correct, because the output of
the other implementations contained some non-optimal Diffs
that were expected to get eliminated.

A patch is appended that contains fixes for C++ and Java
implementations, so that they create identical output as the
Python version. I've provided new Java test cases.
(I did not check the other languages.)

I have found two reasons for the differences:

1. In diff_cleanupSemantic (and also in diff_cleanupEfficiency),
    there is a situation where the previous equality is
    thrown away, then the pointer walks back to the preceding
    equality. A "pointer.next()" follows, then the loop
    starts over processing the equality. I think the pointer
    should actually be positioned *behind* the equality, e.g. an
    additional "pointer.next()" should be inserted after the
    walking back (Note: the existing "pointer.next()" does not
    step over the equality, it only moves the iterator's internal cursor.)

    With the current implementation, in some cases, an equality
    that has been returned to may be marked as safe, which is
    not desired, because such an equality can't be reevaluated
    anymore.

    The patch for the C++ and Java implementation inserts
    an additional pointer.next() after the while loop in
    cleanupSemantic/cleanupEfficiency to make sure pointer
    _steps over the equality_, so that it is positioned at the
    next insertion/deletion.

    Two new test cases: "Double backpass elimination".

    The reason for C++ and Java being affected in nearly
    the same way, is the similarity of Qt's QMutableListIterator
    and Java's ListIterator.

2. The second problem has something to do with the "safe"
    position handling in diff_cleanupEfficiency.

    If a replacement of an equality occurs, the equality is
    replaced by a deletion and an insertion.  If there has been
    found both an insertion and a deletion before, the position
    will be marked as safe.  Later there might be a condition
    so that the process walks back to the safe position, to
    start reevaluating Diffs. The problem here with the C++
    and Java implementations is that the safe position does
    not point to the deletion part of the replacement, but to
    the insertion part. This means, the deletion part won't
    go into the calculation of the next replacement conditions;
    thus, a replacable equality may get overlooked.

    See the test case "Safe backpass elimination".

Original issue reported on code.google.com by mt4...@googlemail.com on 6 Jun 2012 at 2:33

Attachments: