python / cpython

The Python programming language
https://www.python.org/

Difflib creates unreasonably large diffs #118150

Open Sxderp opened 2 months ago

Sxderp commented 2 months ago

Bug report

Bug description:

#!/usr/bin/python3

import difflib

def get_lines(filename):
    with open(filename, 'r', encoding='utf8') as fd:
        return fd.readlines()

old_new = list(difflib.unified_diff(
    get_lines('external.old.json'),
    get_lines('external.new.json'),
))
new_old = list(difflib.unified_diff(
    get_lines('external.new.json'),
    get_lines('external.old.json'),
))
print('diff -u external.old.json external.new.json | wc -l')
print(len(old_new))
print('diff -u external.new.json external.old.json | wc -l')
print(len(new_old))

$ ./difftest.py
diff -u external.old.json external.new.json | wc -l
30854
diff -u external.new.json external.old.json | wc -l
26
$ diff -u external.old.json external.new.json | wc -l
26
$ diff -u external.new.json external.old.json | wc -l
26

Running on RHEL9 with Python 3.9.18.

I have a JSON file that I'm diffing and I tend to get very large diffs when lines are removed from the file. When lines are added the produced diffs are simple. The JSON file is a list of DNS records.

external.old.json external.new.json

Here are smaller files that reproduce the effect. While trying to create them I noticed that if I go too small the diffs work fine. If you remove "abcluster.eas.gatech.edu" from both files the diffs also work, so I could not get the files any smaller.

small.external.old.json small.external.new.json

CPython versions tested on:

3.9

Operating systems tested on:

Linux

terryjreedy commented 2 months ago

Can you illustrate the issue with much smaller strings, perhaps 20 or fewer lines included directly in the code, instead of the attached 340 KB files?

Sxderp commented 2 months ago

I updated the first post with smaller JSON. Unfortunately I couldn't reproduce the issue when I tried to make even smaller files. A size issue?

blhsing commented 2 months ago

This is because of the autojunk option, which is True by default. With this option enabled, SequenceMatcher automatically treats "popular" elements (those occurring more often than one plus 1% of the second sequence's length) as junk whenever the second sequence has 200 or more elements. The intention is to speed up the diff by ignoring elements that repeat too often, at the cost of accuracy.
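A minimal, self-contained illustration of that heuristic (synthetic data rather than the attached files, which are too large to inline; this only shows the classification, not the blow-up itself):

import difflib

# Synthetic example: the second sequence has well over 200 lines, and the
# line 'x\n' occurs more than 1 + 1% of len(b) times, so the default
# autojunk heuristic classifies it as "popular" and ignores it.
a = [f'line {i}\n' for i in range(300)]
b = a[:150] + ['x\n'] * 10 + a[150:]

with_junk = difflib.SequenceMatcher(None, a, b)                   # autojunk=True (default)
without_junk = difflib.SequenceMatcher(None, a, b, autojunk=False)

print(with_junk.bpopular)     # {'x\n'} -- discarded by the heuristic
print(without_junk.bpopular)  # set()   -- heuristic disabled

(bpopular is the documented attribute holding the non-junk elements the heuristic considered popular.)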

Your code would work as intended if you patch SequenceMatcher to use autojunk=False:

import difflib
from unittest.mock import patch
from functools import partialmethod

# get_lines() as defined in the original report
with patch('difflib.SequenceMatcher.__init__',
           partialmethod(difflib.SequenceMatcher.__init__, autojunk=False)):
    old_new = list(difflib.unified_diff(
        get_lines('small.external.old.json'),
        get_lines('small.external.new.json'),
    ))
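For reference, an equivalent workaround that avoids unittest.mock is to temporarily swap in a subclass with the default flipped (a sketch; no_autojunk is a hypothetical helper, not part of difflib):

import difflib
from contextlib import contextmanager

@contextmanager
def no_autojunk():
    # unified_diff() looks up SequenceMatcher in the difflib module namespace,
    # so replacing that name is enough for the duration of the block.
    original = difflib.SequenceMatcher

    class _NoAutojunk(original):
        def __init__(self, isjunk=None, a='', b='', autojunk=False):
            super().__init__(isjunk, a, b, autojunk)

    difflib.SequenceMatcher = _NoAutojunk
    try:
        yield
    finally:
        difflib.SequenceMatcher = original

with no_autojunk():
    old_new = list(difflib.unified_diff(
        get_lines('small.external.old.json'),   # get_lines() from the first post
        get_lines('small.external.new.json'),
    ))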

IMHO the autojunk option should be made available to all of difflib's utility functions including unified_diff, diff_bytes, ndiff, etc., or be made into a global setting in the difflib module, so one does not need to resort to patching SequenceMatcher to disable autojunk.

Sxderp commented 2 months ago

IMHO the autojunk option should be made available to all of difflib's utility functions including unified_diff, diff_bytes, ndiff, etc., or be made into a global setting in the difflib module, so one does not need to resort to patching SequenceMatcher to disable autojunk.

This sounds like a reasonable ask. I concur.

terryjreedy commented 2 months ago

Did either of you verify that autojunk=False actually solves the issue for the posted data?

@tim-one Proposal in https://github.com/python/cpython/issues/118150#issuecomment-2071713604: make SequenceMatcher's autojunk option available to the utility functions that use SequenceMatcher. Or make the option a module global. Reasonable? Which is better? Better alternative?

Sxderp commented 2 months ago
  1. It does solve the issue (and I'm using it for the time being).
     1.a It is a fair bit slower.
     1.b I find it strange that the autojunk behavior only causes problems when lines are removed, but it behaves when lines are added (see the sketch after this list).
  2. I'm not sure about a global option. It is slower, so I feel it could be situational: if someone knows their data isn't going to trigger the weird output, they can use the speedup.
  3. I don't know much about how "junk" is determined, and I know algorithms are hard, but the ideal situation would be to speed up difflib without producing large diffs. I know Python is going to be slower than C, but when writing new files, running diff via subprocess, and deleting the files is so much faster, I'm almost tempted to do that.
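Regarding 1.b: the popularity heuristic is computed from the second sequence only, so the two argument orders can junk different lines. One way to see what each direction discards is the documented bpopular attribute (a sketch using the small attached files):

import difflib

old = get_lines('small.external.old.json')   # get_lines() from the first post
new = get_lines('small.external.new.json')

# The heuristic is applied to the *second* sequence only, so the two
# argument orders can classify different lines as "popular" junk.
print(difflib.SequenceMatcher(None, old, new).bpopular)   # b = new file
print(difflib.SequenceMatcher(None, new, old).bpopular)   # b = old file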

Some not-very-scientific timings (seconds):

  ~5.46 with autojunk=False (unsure how much of that is overhead from patch itself)
  ~0.02 with the defaults
  ~0.01 writing new files, running diff on them, then deleting them
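For completeness, the "write new files, run diff, delete them" variant timed above looks roughly like this (a sketch using tempfile and an external diff binary on PATH; the helper name and file handling are illustrative):

import subprocess
import tempfile

def external_diff(old_lines, new_lines):
    # Write both sides to temp files, shell out to diff, clean up automatically.
    with tempfile.NamedTemporaryFile('w', suffix='.old') as fold, \
         tempfile.NamedTemporaryFile('w', suffix='.new') as fnew:
        fold.writelines(old_lines)
        fnew.writelines(new_lines)
        fold.flush()
        fnew.flush()
        # diff exits 1 when the files differ, so don't use check=True.
        result = subprocess.run(
            ['diff', '-u', fold.name, fnew.name],
            capture_output=True, text=True,
        )
        return result.stdout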

I don't expect to get anywhere close to diff speed and for my use-case 5 seconds doesn't matter (cron job).

terryjreedy commented 2 months ago

For experiments, depending on the exact install type and location, you should be able to (temporarily) modify difflib.py itself.

Sxderp commented 2 months ago

I copied difflib.py from the system directory into my local directory and edited it. I'm still getting ~5 s with autojunk=False, so the mock.patch wrapper was not adding noticeable overhead.

The speed, at least for my use case, isn't too big of a concern. Being able to specify autojunk=False, by some means, to difflib is the most important short-term goal.