google / diff-match-patch

Diff Match Patch is a high-performance library in multiple languages that manipulates plain text.
Apache License 2.0
7.42k stars 1.1k forks source link

MS Word v/s diff-match-patch comparison #6

Open bhosalevinayak opened 6 years ago

bhosalevinayak commented 6 years ago

Hello, I’m trying to compare two pieces of strings using this library. However, in some cases the returned diff is not as expected. Here’s an illustration: I compared two strings using MS Word. I created string 2 by removing a sentence from string 1 and adding another sentence to it. Comparison in MS Word correctly reflects the change as strike through/underline for removed/added text. See Fig 1 in attachment

Using diff_match_patch, I got a different comparison. See Fig. 2. Here the deletion is reflected as two separate changes and addition as another two changes. This was when I used semantic cleanup. I could get the expected result by using efficiency based cleanup (edit cost = 25) as shown in Fig. 3. However, I believe setting edit cost to 25 will also not give me desired results when there are large changes between two strings. Is there a way to achieve results like MS word for any length of string with any number of changes made in string 2? Will it be by setting a large value to edit cost? What is the max value supported ?

Diff Issue.docx [text used for comparison is included in attachment]

Thanks VinayakB

NeilFraser commented 6 years ago

That's an interesting case. Thanks for reporting it. Semantic cleanup went from 11 trival commonalities down to two, but those last two wedged the algorithm. I will look at this deeper. However it won't be fixed soon.

Keep in mind that MS Word (last time I checked) is cheating. It is recording you edit the document, which means it's not doing a diff at all. Whereas DMP is being presented with two texts 'tabula rasa'.

bhosalevinayak commented 6 years ago

NeilFraser: Thanks for the update. Did you get a chance to fix this?

jhgbrt commented 6 years ago

In fact, this is a case where MS Word does not "cheat", as @bhosalevinayak is using the 'compare functionality'. To verify:

If you want to fix this, here are some of the things to consider (warning: this may be incomplete)

(1) the diff_cleanupEfficiency method does not consider 'trivial equalities' at the end of the string (one could argue that an equality < edit cost that occurs at the end should be added to the insert/delete pair before it). (2) even if it would, the diff_cleanupMerge factors out common suffixes again, without considering 'trivial equalities' or 'sentence boundaries' at the beginning or end of the string.

I'm not sure what the proper solution would be... Another variant of a cleanup method that promotes whole sentences seems the cleanest. Would this then be something that you have to call after diff_cleanupEfficiency? Or could the diff_cleanupEfficiency somehow be extended with a parameter to consider sentence boundaries?