While using the icdiff command line interface to difflib, I ran into an interesting issue where difflib took 47 seconds to compare two simple text documents (a PHP source code file that had been refactored via phptidy).
On subsequent analysis, it turned out to be some sort of pathological behavior triggered by the presence of mixed line endings. Normalizing the line endings in both files to \r\n via unix2dos and then comparing (making no other changes) resulted in the diff calculation completing in under 2 seconds.
I have attached the documents in question (file1 and file2) to this bug report.
Attaching file2
I'm not familiar with icdiff - it's not part of the Python distribution, right? If so, you should really talk to its author(s).
If two files have different line endings, then no pair of lines between them can be equal, and the difference engine will struggle mightily to find the "best matching" pair of lines between them, over & over & over ... again. That's expected, and is expensive (roughly cubic in the number of lines).
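To make the failure mode concrete, here is a small sketch (the line contents are invented for illustration) showing that once the terminators differ, difflib's SequenceMatcher finds no line-level matches at all:

```python
import difflib

# Hypothetical file contents: identical text, but one side uses \n
# and the other uses \r\n line terminators.
a = ["first line\n", "second line\n", "third line\n"]
b = ["first line\r\n", "second line\r\n", "third line\r\n"]

sm = difflib.SequenceMatcher(None, a, b)

# No pair of lines compares equal, so the only "match" reported is the
# zero-length sentinel block at the end of both sequences.
print(sm.get_matching_blocks())  # [Match(a=3, b=3, size=0)]
print(sm.ratio())                # 0.0
```

With zero line-level matches, every line is in a "mismatching block", which is exactly what triggers the expensive per-pair scoring described above.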
So "don't do that" is the obvious answer. If you think it's something that comes up often enough that icdiff should do magic to work around it, they could - with enough pain - normalize line endings to a canonical form, and then post-process the diff engine's output to say that some lines it thought were equal actually weren't. Or maybe you want them treated as equal despite that the line endings don't match? Nobody can guess, so at best another option would need to be added to specify which. That's messy all around, so they'll probably resist.
Just like I'd mightily resist mucking up difflib to do that itself ;-) Seriously, if you're messing with files containing a mix of line endings, you're begging for surprises from all sorts of tools.
@tim.peters
No, icdiff is not part of core and probably should be omitted from the remainder of this discussion.
I just checked, and it's actually not a mix of line endings within each file; it's that one file uses \n and the other uses \r\n.
You can actually duplicate this bug by taking any file and copying it, then executing unix2dos file1; dos2unix file2 - you'll then have two perfectly "correct" files that difflib will struggle to handle.
(as a preface to what follows, I've written a binary diff and incremental backup utility, so I'm familiar with the intricacies and pitfalls when it comes to diffing. I have not looked at difflib's source code, however. Looking at the documentation for difflib, it's not clear whether or not it should be considered a naive binary diffing utility, since it does seem to have the concept of "lines".)
Given that _both_ input files are "correct" without line-ending errors, I think the correct optimization here would be for difflib to "realize" that two chunks are "identical" but with different line endings (aka just plain different - not asking for this to be treated as a special case). Instead of going on to search for a match for either buffer, it should assume that no better match will be found later and simply move on to the next block/chunk.
Of course, in the event where file2 has a line from file1 that is first present with a different line ending and then repeated with the same line ending, difflib will not choose the correct line... but that's probably not something worth fretting over (as you said, mixed line endings are a recipe for disaster).
Of course I can understand if all this is out of the scope of difflib and not an endeavor worth taking up.
> Of course I can understand if all this is out of the scope of difflib and not an endeavor worth taking up.
I agree with that sentiment. Data normalization for comparability belongs upstream from difflib (i.e. normalizing line-endings, unicode normalization, case folding, etc). Difflib's job is to compute a diff.
The text/binary distinction you have in mind doesn't particularly apply to difflib: it compares sequences of hashable objects. "Text files" are typically converted by front ends to lists of strings, but, e.g., the engine is just as happy comparing tuples of floats.
File comparison interfaces typically do this at _two_ levels: first, viewing files as lists of strings (one string per file line). Then, when two blocks of mismatching lines are encountered, viewing the lines as sequences of characters. The only role "line endings" play in any of this is in how the _input_ to the difference engine is created: all decisions about how a file is broken into strings are made before the difference engine is consulted. This preprocessing can choose to normalize line endings, leave them exactly as-is (typical), or remove them entirely from the strings it presents to the difference engine - or anything else it feels like doing. The engine itself has no concept of "line termination sequences" - if there happen to be \r\n, \n, \r, or \0 substrings in strings passed to it, they're treated exactly the same as any other characters.
If the input processing creates lists of lines A and B for two files, where the files have different line-end terminators which are left in the strings, then no exact match whatsoever is possible between any line of A and any line of B. You suggest just skipping over both in that case, but the main text-file-comparison "front end" in difflib works hard to try to do better than that. That's "a feature", albeit a potentially expensive one. Viewing the file lines as sequences of characters, it computes a "similarity score" for _every_ line in A compared to _every_ line in B. So len(A)*len(B) such scores are computed. The pair with the highest score (assuming it exceeds a cutoff value) is taken as the synch point, and then it can go on to show the _intra_line differences between those two lines.
That's why, e.g., given the lists of "lines":
A = ["eggrolls", "a a a", "b bb"]
B = ["ccc", "dd d", "egg rolls"]
it can (and does) tell you that the "egg rolls" in B was plausibly obtained from the "eggrolls" in A by inserting a blank. This is often much more helpful than just giving up, saying "well, no line in A matched any line in B, so we'll just say A was entirely replaced by B". That would be "correct" too - and much faster - but not really helpful.
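For anyone who wants to see this behavior directly, a quick sketch running difflib.ndiff over those same lists shows the pairing and the intraline hint:

```python
import difflib

A = ["eggrolls", "a a a", "b bb"]
B = ["ccc", "dd d", "egg rolls"]

# ndiff's similarity scoring pairs "eggrolls" with "egg rolls" even though
# no lines compare exactly equal; the "? " guide line that follows the
# "+ egg rolls" line marks the position of the inserted blank.
for line in difflib.ndiff(A, B):
    print(line)
```

Note that ndiff didn't simply report "delete all of A, insert all of B" - it synched on the most similar cross-file pair, which is the feature being described.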
Of course there's nothing special about the blank character in that. Exactly the same applies if the line terminators differ between the files, and input processing leaves them in the strings. difflib doesn't give up just because there are no exact line-level matches, and the same expensive "similarity score" algorithm kicks in to find the "most similar" lines despite the lack of exact matches.
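As an illustration of where the cost goes (the line contents here are invented), the scoring pass amounts to computing a character-level ratio for every cross-file pair of lines - len(A) * len(B) of them:

```python
import difflib

# Identical text, differing only in line terminators: no exact matches.
A = ["alpha\n", "beta\n", "gamma\n"]
B = ["alpha\r\n", "beta\r\n", "gamma\r\n"]

# One character-level similarity score per (A-line, B-line) pair:
# len(A) * len(B) SequenceMatcher.ratio() computations in total.
scores = {
    (i, j): difflib.SequenceMatcher(None, la, lb).ratio()
    for i, la in enumerate(A)
    for j, lb in enumerate(B)
}
print(len(scores))  # 9 scores for these 3x3 lines
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

Every score is high but none reaches 1.0, so the engine pays the full quadratic scoring cost just to rediscover near-duplicates - which is the pathological case reported in this issue.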
Since that's a feature (albeit potentially expensive), I agree with Raymond closing this. You can, of course, avoid the expense by ensuring your files all use the same line terminator sequence to begin with. Which is the one obvious & thoroughly sane approach ;-) Alternatively, talk to the icdiff author(s). I noticed it opens files for reading in binary mode, guaranteeing that different line-end conventions will be visible. It's possible they could be talked into opening text files (or add an option to do so) using Python's "universal newline" mode, which converts all instances of \n, \r\n, and \r to \n on input. Then lines that are identical except for line-end convention would in fact appear identical to difflib, and so skip the expensive similarity computations whenever that's so.
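A minimal sketch of that last suggestion (the files here are temporary ones created just for the demo): opening in text mode with Python's default newline=None enables universal-newline translation, so both line-ending variants read back identically and difflib sees exact matches:

```python
import difflib
import os
import tempfile

# Write the "same" file twice: once with Unix and once with DOS line endings.
paths = []
for ending in (b"\n", b"\r\n"):
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"line one" + ending + b"line two" + ending)
        paths.append(f.name)

# Text mode with the default newline=None ("universal newlines") translates
# \r\n and \r to \n on read, so both files yield identical line lists.
with open(paths[0]) as f1, open(paths[1]) as f2:
    a, b = f1.readlines(), f2.readlines()

print(a == b)  # True
print(difflib.SequenceMatcher(None, a, b).ratio())  # 1.0

for p in paths:
    os.remove(p)
```

Had the files been opened with "rb" instead, the lists would share no equal lines and the expensive similarity scoring would kick in.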
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
GitHub fields:
```python
assignee = None
closed_at =
created_at =
labels = ['invalid', 'library']
title = 'difflib pathological behavior with mixed line endings'
updated_at =
user = 'https://bugs.python.org/MahmoudAl-Qudsi'
```
bugs.python.org fields:
```python
activity =
actor = 'tim.peters'
assignee = 'none'
closed = True
closed_date =
closer = 'rhettinger'
components = ['Library (Lib)']
creation =
creator = 'Mahmoud Al-Qudsi'
dependencies = []
files = ['47164', '47165']
hgrepos = []
issue_num = 31561
keywords = []
message_count = 6.0
messages = ['302788', '302789', '302828', '302881', '302896', '302897']
nosy_count = 3.0
nosy_names = ['tim.peters', 'rhettinger', 'Mahmoud Al-Qudsi']
pr_nums = []
priority = 'normal'
resolution = 'not a bug'
stage = 'resolved'
status = 'closed'
superseder = None
type = None
url = 'https://bugs.python.org/issue31561'
versions = ['Python 3.6']
```