Open Sxderp opened 2 months ago
Can you illustrate the issue with much smaller strings, perhaps 20 or fewer lines included in the code, instead of the 340 KB files attached?
I updated the first post with smaller JSON. Unfortunately I couldn't reproduce the issue when I tried to make even smaller files. A size issue?
This is because of the `autojunk` option, which is `True` by default. With this option enabled, `SequenceMatcher` automatically considers "popular" elements (those that account for more than 1% + 1 of the total elements) as junk when the second sequence is 200 elements or longer. The intention is to speed up the diff by ignoring elements that repeat too much, at the cost of accuracy.
Your code would work as intended if you patch `SequenceMatcher` with `autojunk=False`:
```python
import difflib
from unittest.mock import patch
from functools import partialmethod

with patch('difflib.SequenceMatcher.__init__',
           partialmethod(difflib.SequenceMatcher.__init__, autojunk=False)):
    old_new = list(difflib.unified_diff(
        get_lines('small.external.old.json'),
        get_lines('small.external.new.json')
    ))
```
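An alternative sketch that avoids `unittest.mock`: temporarily swap `difflib.SequenceMatcher` for a subclass that forces `autojunk=False`. This works because `unified_diff` looks the class up through the `difflib` module's namespace. The `NoAutojunkMatcher` and `unified_diff_nojunk` names are my own:

```python
import difflib

class NoAutojunkMatcher(difflib.SequenceMatcher):
    """SequenceMatcher that always disables the autojunk heuristic."""
    def __init__(self, isjunk=None, a='', b='', autojunk=True):
        super().__init__(isjunk, a, b, autojunk=False)

def unified_diff_nojunk(a, b, **kwargs):
    # unified_diff resolves SequenceMatcher via the difflib module, so
    # swapping the attribute and restoring it afterwards is enough.
    original = difflib.SequenceMatcher
    difflib.SequenceMatcher = NoAutojunkMatcher
    try:
        return list(difflib.unified_diff(a, b, **kwargs))
    finally:
        difflib.SequenceMatcher = original
```

Usage would then be `unified_diff_nojunk(get_lines('small.external.old.json'), get_lines('small.external.new.json'))`.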
IMHO the `autojunk` option should be made available to all of `difflib`'s utility functions, including `unified_diff`, `diff_bytes`, `ndiff`, etc., or be made into a global setting in the `difflib` module, so one does not need to resort to patching `SequenceMatcher` to disable `autojunk`.
> IMHO the autojunk option should be made available to all of difflib's utility functions including unified_diff, diff_bytes, ndiff, etc., or be made into a global setting in the difflib module, so one does not need to resort to patching SequenceMatcher to disable autojunk.
This sounds like a reasonable ask. I concur.
Did either of you verify that autojunk=False actually solves the issue for the posted data?
@tim-one Proposal in https://github.com/python/cpython/issues/118150#issuecomment-2071713604: make the `SequenceMatcher` `autojunk` option available to the utility functions that use `SequenceMatcher`, or make the option a module global. Reasonable? Which is better? Is there a better alternative?
With my not-very-scientific tests:

- ~5.46 s with `autojunk=False` (unsure how much of that is overhead from the patch itself)
- ~0.02 s with the default
- ~0.01 s writing new files, diffing them, and deleting them
I don't expect to get anywhere close to diff speed and for my use-case 5 seconds doesn't matter (cron job).
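A rough, self-contained way to reproduce that kind of comparison (the synthetic data and sizes are made up; real numbers will vary by machine and input):

```python
import difflib
import timeit
from functools import partialmethod
from unittest.mock import patch

# Synthetic input with many repeated ("popular") lines, loosely modeled
# on a JSON list of DNS records: repetition triggers the autojunk heuristic.
old = [f"record-{i % 50}\n" for i in range(2000)]
new = old[:1000] + ["inserted\n"] + old[1000:]

def run_default():
    return list(difflib.unified_diff(old, new))

def run_nojunk():
    with patch('difflib.SequenceMatcher.__init__',
               partialmethod(difflib.SequenceMatcher.__init__,
                             autojunk=False)):
        return list(difflib.unified_diff(old, new))

print("default:       ", timeit.timeit(run_default, number=1))
print("autojunk=False:", timeit.timeit(run_nojunk, number=1))
```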
For experiments, you should, depending on exact install type and location, be able to (temporarily) modify difflib.py itself.
I copied `difflib.py` from the system directory into my local directory and patched it. I'm still getting ~5 s with `autojunk=False`, so the `patch` call itself was not adding noticeable overhead.
The speed, at least for my use case, isn't too big of a concern. Being able to specify `autojunk=False` to `difflib`, by some means, is the most important short-term goal.
Bug report
Bug description:
Running on RHEL9 with Python 3.9.18.
I have a JSON file that I'm diffing and I tend to get very large diffs when lines are removed from the file. When lines are added the produced diffs are simple. The JSON file is a list of DNS records.
external.old.json external.new.json
Here are smaller files that reproduce the effect. While trying to create them I noticed that if I go too small, the diffs work fine. If you remove "abcluster.eas.gatech.edu" from both files, the diffs also work. I could not make the files any smaller.
small.external.old.json small.external.new.json
CPython versions tested on:
3.9
Operating systems tested on:
Linux