inveniosoftware / dictdiffer

Dictdiffer is a module that helps you to diff and patch dictionaries.
https://dictdiffer.readthedocs.io
Other
838 stars 93 forks source link

possible to diff lists of dicts? #131

Open netllama opened 4 years ago

netllama commented 4 years ago

I'm trying to use dictdiffer to get the diff of two lists of dictionaries, but its not working well. It seems to get confused whenever the ordering of the dicts inside the list is not identical between the two lists, or when the number of dictionaries in one list is (slightly) different from the other list.

Is this functionality something that is currently possible and expected to work, or am I trying to accomplish something unsupported?

In case it matters, this is an example of a list of dictionaries that I'm trying to work with:

[{'prefix': '162.245.48.104/29',
  'effective_as_path_length': 1,
  'med': 0,
  'destination': ['abcd.jln001.norc'],
  'origin_asn': ['393467'],
  'next_hop_asn': ['393467']},
 {'prefix': '162.245.48.112/29',
  'effective_as_path_length': 1,
  'med': 0,
  'destination': ['abcd.jln001.norc'],
  'origin_asn': ['393467'],
  'next_hop_asn': ['393467']},
 {'prefix': '162.245.48.120/29',
  'effective_as_path_length': 1,
  'med': 0,
  'destination': ['abcd.jln001.norc'],
  'origin_asn': ['393467'],
  'next_hop_asn': ['393467']},
 {'prefix': '162.245.48.128/27',
  'effective_as_path_length': 1,
  'med': 0,
  'destination': ['abcd.jln001.norc'],
  'origin_asn': ['393467'],
  'next_hop_asn': ['393467']},
 {'prefix': '162.245.48.16/28',
  'effective_as_path_length': 1,
  'med': 0,
  'destination': ['abcd.jln001.norc'],
  'origin_asn': ['393467'],
  'next_hop_asn': ['393467']},
 {'prefix': '162.245.48.160/29',
  'effective_as_path_length': 1,
  'med': 0,
  'destination': ['abcd.jln001.norc'],
  'origin_asn': ['393467'],
  'next_hop_asn': ['393467']},
 {'prefix': '162.245.48.168/29',
  'effective_as_path_length': 1,
  'med': 15,
  'destination': ['abcd.jln001.norc'],
  'origin_asn': ['393467'],
  'next_hop_asn': ['393467']},
 {'prefix': '162.245.48.176/28',
  'effective_as_path_length': 1,
  'med': 0,
  'destination': ['abcd.jln001.norc'],
  'origin_asn': ['393467'],
  'next_hop_asn': ['393467']},
 {'prefix': '162.245.48.192/30',
  'effective_as_path_length': 1,
  'med': 0,
  'destination': ['abcd.jln001.norc'],
  'origin_asn': ['393467'],
  'next_hop_asn': ['393467']},
 {'prefix': '162.245.48.200/29',
  'effective_as_path_length': 1,
  'med': 0,
  'destination': ['abcd.jln001.norc'],
  'origin_asn': ['393467'],
  'next_hop_asn': ['393467']},
 {'prefix': '162.245.48.208/30',
  'effective_as_path_length': 1,
  'med': 0,
  'destination': ['abcd.jln001.norc'],
  'origin_asn': ['393467'],
  'next_hop_asn': ['393467']},
 {'prefix': '162.245.48.216/29',
  'effective_as_path_length': 1,
  'med': 0,
  'destination': ['abcd.jln001.norc'],
  'origin_asn': ['393467'],
  'next_hop_asn': ['393467']},
 {'prefix': '162.245.48.224/29',
  'effective_as_path_length': 1,
  'med': 0,
  'destination': ['abcd.jln001.norc'],
  'origin_asn': ['393467'],
  'next_hop_asn': ['393467']},
 {'prefix': '162.245.48.232/29',
  'effective_as_path_length': 1,
  'med': 0,
  'destination': ['abcd.jln001.norc'],
  'origin_asn': ['393467'],
  'next_hop_asn': ['393467']},
 {'prefix': '162.245.48.240/29',
  'effective_as_path_length': 1,
  'med': 0,
  'destination': ['abcd.jln001.norc'],
  'origin_asn': ['393467'],
  'next_hop_asn': ['393467']},
 {'prefix': '162.245.48.248/29',
  'effective_as_path_length': 1,
  'med': 0,
  'destination': ['abcd.jln001.norc'],
  'origin_asn': ['393467'],
  'next_hop_asn': ['393467']},
 {'prefix': '162.245.48.32/29',
  'effective_as_path_length': 1,
  'med': 15,
  'destination': ['abcd.jln001.norc'],
  'origin_asn': ['393467'],
  'next_hop_asn': ['393467']},
 {'prefix': '162.245.48.40/29',
  'effective_as_path_length': 1,
  'med': 15,
  'destination': ['abcd.jln001.norc'],
  'origin_asn': ['393467'],
  'next_hop_asn': ['393467']},
 {'prefix': '162.245.48.48/29',
  'effective_as_path_length': 1,
  'med': 15,
  'destination': ['abcd.jln001.norc'],
  'origin_asn': ['393467'],
  'next_hop_asn': ['393467']},
 {'prefix': '162.245.48.56/29',
  'effective_as_path_length': 1,
  'med': 15,
  'destination': ['abcd.jln001.norc'],
  'origin_asn': ['393467'],
  'next_hop_asn': ['393467']},
 {'prefix': '162.245.48.64/29',
  'effective_as_path_length': 1,
  'med': 0,
  'destination': ['abcd.jln001.norc'],
  'origin_asn': ['393467'],
  'next_hop_asn': ['393467']},
 {'prefix': '162.245.48.72/29',
  'effective_as_path_length': 1,
  'med': 0,
  'destination': ['abcd.jln001.norc'],
  'origin_asn': ['393467'],
  'next_hop_asn': ['393467']},
 {'prefix': '162.245.48.96/29',
  'effective_as_path_length': 1,
  'med': 0,
  'destination': ['abcd.jln001.norc'],
  'origin_asn': ['393467'],
  'next_hop_asn': ['393467']},
 {'prefix': '162.245.50.0/24',
  'effective_as_path_length': 1,
  'med': 6,
  'destination': ['abcd.jln001.norc'],
  'origin_asn': ['393467'],
  'next_hop_asn': ['393467']},
 {'prefix': '162.245.51.112/29',
  'effective_as_path_length': 1,
  'med': 0,
  'destination': ['abcd.jln001.norc'],
  'origin_asn': ['393467'],
  'next_hop_asn': ['393467']},
 {'prefix': '162.245.51.120/29',
  'effective_as_path_length': 1,
  'med': 0,
  'destination': ['abcd.jln001.norc'],
  'origin_asn': ['393467'],
  'next_hop_asn': ['393467']},
 {'prefix': '162.245.51.128/28',
  'effective_as_path_length': 1,
  'med': 15,
  'destination': ['abcd.jln001.norc'],
  'origin_asn': ['393467'],
  'next_hop_asn': ['393467']},
 {'prefix': '162.245.51.144/29',
  'effective_as_path_length': 1,
  'med': 0,
  'destination': ['abcd.jln001.norc'],
  'origin_asn': ['393467'],
  'next_hop_asn': ['393467']}]
jirikuncar commented 4 years ago

Can you provide a smaller example with current and expected results?

There are some limitations to list comparison:

netllama commented 4 years ago

Thanks. The limitations that you noted are definitely present in the data that I'm working with. It that's expected behavior right now, that's fine. I'll need to use something other than dictdiffer for the project.

mikaelho commented 4 years ago

I am confused by the statement: ”it can not detect changes in the order of elements”.

Since:

list(dictdiffer.diff([1,2],[2,1])) [('change', [0], (1, 2)), ('change', [1], (2, 1))]

Can you expand a bit?

mikaelho commented 4 years ago

Now I explained it to myself. The ideal result, at least for some use cases, would be some kind of a ”swap” operation, instead of the two ”changes”.

netllama commented 4 years ago

Yes, exactly. Sometimes, the order of dicts inside the list doesn't matter.

You should be able to reproduce my original issue by creating a second list of dicts (based on the one that I provided), and making a few minor changes:

mikaelho commented 4 years ago

I think your first issue is covered by @jirikuncar’s explanation of list diffs - everything is ”different” from the addition onward. Your second issue I think stems from the fact that dictdiffer does not compare dicts by identity, but by content - if you have shuffled the list, changing the first element, dictdiffer detects the difference in specific content within the dict, not in its identity.

If you find a better solution for your needs, please share your findings here as well.

jirikuncar commented 4 years ago

I have been experimenting with storing JSON structures in Git to do diffs on big nested objects.

You could do something like:

$ pip install git_json_tree
$ ipython
import subprocess
import git_json_tree

repo = git_json_tree.Repo('/tmp/git_json_tree')
sha_original = git_json_tree.encode(repo, ["your example goes here"])
sha_modified = git_json_tree.encode(repo, ["your modified example goes here"])

subprocess.call(['git', 'diff', sha_original, sha_modified])
danielduhh commented 4 years ago

@jirikuncar I'm running into a similar issue with comparing both lists and sets... if I have two lists A,B

list_a = ["changeA", "changeB"]
list_b = ["changeA", "changeC", "changeB"]

Dictdiff correctly finds the ADD "changeC", but is also reporting a change in index 1 from changeB to changeC. Is there a way to ignore this?

I tried converting the list into sets (since they are unordered and unindexed):

set1 = set(["changeA", "changeB"])
set2 = set(["changeA", "changeC", "changeB"])

But dictdiff is still reporting a CHANGE as

<class 'list'>: [
('add', '', [(0, {'changeC'})]), 
('change', '', ({'changeA', 'changeB'}, {'changeA', 'changeC', 'changeB'}))
]
jirikuncar commented 4 years ago

@danielduhh I have tried with lastest dictdiffer version.

In [1]: import dictdiffer

In [2]: list_a = ["changeA", "changeB"]
   ...: list_b = ["changeA", "changeC", "changeB"]

In [3]: set1 = set(["changeA", "changeB"])
   ...: set2 = set(["changeA", "changeC", "changeB"])

In [4]: list(dictdiffer.diff(list_a, list_b))
Out[4]: [('change', [1], ('changeB', 'changeC')), ('add', '', [(2, 'changeB')])]

In [5]: list(dictdiffer.diff(set1, set2))
Out[5]:
[('add', '', [(0, {'changeC'})]),
 ('change', '', ({'changeA', 'changeB'}, {'changeA', 'changeB', 'changeC'}))]

I can confirm that the result for sets looks wrong. Can you submit a PR with test case and expected results?

jirikuncar commented 4 years ago

@danielduhh can you check if https://github.com/inveniosoftware/dictdiffer/pull/133 is solving your problem?

Datamance commented 4 years ago

FYI, this will be useful for those of us wishing to implement Strategic Three Way Merge Patch for the Kubernetes API :) for things like deployments, where you can have big lists of environment variables that look like so:

{
  "name": "MEDIA_URL",
  "valueFrom": {
    "configMapKeyRef": {
      "key": "MEDIA_URL",
      "name": "my-secret"
    }
  }
}

AFAIK the kubectl cli tool uses a schema that has pre-specified "patch merge keys" to essentially treat each element of the list like a value that is keyed by a specified property within (in the case above, it'd be "name")

as of dictdiffer 0.8.1, I'm still getting the wrong results for lists of dicts like this. I don't think you could solve this in a truly generalized way, but allowing schema-based diff could be a good approach.

jirikuncar commented 4 years ago

@Datamance basically you would like to provide a custom diff function for certain keys?

Datamance commented 4 years ago

That could work, and I think you could use that as a building block for the common case of schema-based (particularly JSONSchema) diffing/patching.

jirikuncar commented 4 years ago

So basically one shoud provide mapping from dotted_node to a function with a same signature as diff:

def diff(..., key=None):
    # after dotted_node ~ :158
    if dotted_node in key:
        for item in key[dotted_node](first, second, node=node, ignore=ignore, path_limit=path_limit, expand=expand, tolerance=tolerance, dot_notation=dot_notation):
            yield item
        return
aorumbayev commented 4 years ago

Any updates on this?

jirikuncar commented 4 years ago

@aorumbayev feel free to send a PR with a fix discussed above.