Differences in objects in sets

drjonnicholson commented 5 years ago

We have a case where we have objects in arrays, which we treat as sets.

Each of these objects has an identifier property, in our case this is source_id, but it could easily be id or identifier, etc. It'd be really nice to be able to specify such a property name so that diffs of arrays treated as sets end up looking cleaner.

Thoughts?

josephburnett commented 5 years ago

A diff like this would be nice:

A:

[{"x":1,"y":2},{"x":3,"z":4}]

B:

[{"x":3,"z":4,"a",5},{"x":1,"y":2}]

Diff:

@ [{"x":3} "a"]
+ 5

(Adding "a":5 to object "x":3 in the array-set)

But how do you tell jd which key to use when determining identity in the set? And what do you do when you have multiple sets in a complex structure with different ids?

This is where metadata on the structure would be helpful for encoding these object semantics. Like dispatch notation on EDN or type annotations in Ion. Unfortunately JSON is the lingua-franca of the Internet so such metadata has to be carried as a sidecar.

One way to handle this is to pass an option to jd which is a list of paths and how to treat them. Like this:

[
[{"x":""}]
]

(Treat the top-level array as a set whose id key is "x")

Or

[
["a",[],"b",{}]
]

(Treat the array at "b" as a set. Which is found inside a list at "a".)

E.g.

{"a":[{"b":[1,2]}]}

is equal to

{"a":[{"b":[2,1]}]}

josephburnett commented 5 years ago

We have a case where we have objects in arrays, which we treat as sets.

Can you give me an example?

josephburnett commented 5 years ago

A while ago I opened a feature request for this: https://github.com/josephburnett/jd/issues/4

drjonnicholson commented 5 years ago

In our case the structure of data is of a form like:

{
  "object": {
    "inserts": [{"source_id": "abcd", ...}],
    "updates": [{"source_id": "efgh", ...}],
    "deletes": [{"source_id": "ijkl", ...}]
  },
  ... more like above ...
}

Where source_id is the identity field.

I like what you're suggesting, and I think ultimately there is a need for paths. However, in this case I think a global approach is enough, and a quick win - that is, an additional setting to compliment the -set parameter that specified the property used to identify objects.

For example, if we had a separate option like -set_ident=source_id, or allow it to be specified after the set param itself like -set=source_id (I have no strong feelings in either direction), then a comparison might look like this:

A:

{
  "subject": {
    "inserts": [{"source_id": "abcd", "name":"English", "code":"ENG"}]
  }
}

B: A:

{
  "subject": {
    "inserts": [{"source_id": "abcd", "name":"English Language", "code":"ENG"}]
  }
}

Diff:

@ ["subject", "inserts", {"source_id":"abcd"} "name"]
- English
+ English Language

Or, since we specified the identifier as a global thing rather than path specific we don't need to remind people of the field name, we could present a simplified diff path like ["subject", "inserts", {"abcd"} "name"].

josephburnett commented 5 years ago

I prefer the format which includes the id because it make the diff self-describing:

@ ["subject", "inserts", {"source_id":"abcd"} "name"]
- English
+ English Language

I would also prefer to keep every line (except the first charater) as valid JSON so it's easy to parse. So I would avoid {"abcd"}.

I like your suggestion to provide identifier keys with the set parameter: -set=source_id. I would extend the suggestion to include a comma-separated list of keys, all of which will be used to identify structures in a list if they are present.

Example:

jd -set=source_id,other_id a.json b.json

{
  "by source id": [{"source_id": "abcd", "name":"Polish", "code":"POL"}, {"source_id": "abcd", "name":"English", "code":"ENG"}],
  "by other id": [{"other_id": "abcd", "name":"Polish", "code":"POL"}, {"other_id": "abcd", "name":"English", "code":"ENG"}],
  "by both ids": [{"source_id": "abcd", "other_id": "abcd", "name":"Polish", "code":"POL"}, {"other_id": "abcd", "name":"English", "code":"ENG"}],
}

{
  "by source id": [{"source_id": "abcd", "name":"English Language", "code":"ENG"}, {"source_id": "abcd", "name":"Polish", "code":"POL"}],
  "by other id": [{"other_id": "abcd", "name":"English Language", "code":"ENG"}, {"other_id": "abcd", "name":"Polish", "code":"POL"}],
  "by both ids": [{"source_id": "abcd", "other_id": "abcd", "name":"Polish", "code":"POL"}, {"other_id": "abcd", "name":"English Language", "code":"ENG"}],

}

@ ["by source id", {"source_id", "abcd"}]
- "English"
+ "English Language"
@ ["by other id", {"other_id", "abcd"}]
- "English"
+ "English Language"
@ ["by both ids", {"source_id", "abcd", "other_id": "abcd"}]
- "English"
+ "English Language"

drjonnicholson commented 5 years ago

That all sounds good to me 👍 Your argument about the diff being self describing makes sense, especially in allowing multiple identifiers.

I'd do a PR for this but I'm unfamiliar with Go, is this something I can throw back your way for implementation?

josephburnett commented 5 years ago

Sure.

drjonnicholson commented 5 years ago

Thank you :)

josephburnett commented 5 years ago

I've been doing some design and initial coding on this. Somethings that make this an interesting change:

Object identity is no longer the same as content equivalence. Because we are identifying structs by ids now, two structs can be the same object but have different content.
Because you can change an object without changing its identity, you can now "patch through" an object in a set. Previously sets were always terminal because changing the object made it an entirely new object.
In a multiset there might be multiple of the same object. So it is possible to "fork" and patch multiple objects at once with a single path.

This is all just fine. It just changes a few assumptions in the implementation so it's taking a little while to code.

Adding the set keys feature takes us a little further down the path of layering metadata semantics onto the data. So I've also been thinking about how to express metadata in general. Some invariants I would like to preserve:

Diffs should be self-describing. You shouldn't have to remember exactly what flags you used to create the diff in order to apply it.
Each diff hunk should be self-describing. You should be able concatenate diffs, even if they were created with different parameters.
Diffs should be human-friendly. That's the whole reason for this library to exist. There are different JSON diff solutions out there, but I created this one to be easy to read. And familiar like a unified diff.
Every line should be valid JSON after the first character, so any metadata must be valid JSON.

I've settled on a metadata representation that I think will satisfy these. Path elements are numbers, strings or objects. Each path element may be proceeded by a list of metadata. Each metadata element is a string "key" followed by an optional "=" and string value.

E.g.:

@ ["here","there",["multiset"],{"id":"foo"},"bar"]
+ "a"
- "b"

This will read into "here", "there" and then patch all objects with "id":"foo" at "bar".

It's important to know all the possible keys when patching through a set, even if the objects don't have them. E.g. if the flag -setkeys=id1,id2 is provided and a patch finds an object with just id1 then it will emit a path with {"id1":"foo"}. But if that set later gets an object with id1 and id2 you must know that id2 is part of the identity.

I see two ways to handle this. I can either 1) list all possible keys in metadata, or 2) list all possible keys in the path element mapping to null when not present.

@ [["key=id1","key=id2"],{"id1":"foo"}]

or

@ [{"id1":"foo","id2":null}]

But there is a difference between a key with a value of null and a key which is simply not present. For that reason, I'm thinking to go with explicit metadata.

josephburnett commented 5 years ago

Another possibility is to list explicitly in metadata all the keys that are relevant but were not present in the path element:

@ [["key=id2"],{"id1":"foo"}]

This would produce the simple diff you want for the class structure:

@ ["classes", {"source_id", "abcd"}]
- "English"
+ "English Language"

The diff engine can infer that it's a set because of the struct path element. And it can infer that all relevant keys were present because there is no key metadata.

drjonnicholson commented 5 years ago

Hmm, very interesting questions.

I definitely don't like the second option for the reason you give. I'm never a fan of field presence/null value ambiguities.

Out of the other two, I do like the compact syntax proposed in the 3rd option, and would result in the cleanest to read diff in my use case...... and yet I see the use of being explicit. Are we limited to one option? I mean it appears to me that we could default to output a diff using the 1st (explicit) approach, but provide a --compact option or something to reduce the paths to the form of the 3rd option.

If we did that reading in the diff would have to assume all diffs are presented as in the 3rd case - that is always union the metadata keys and the used keys to get the full list rather than expect them all to be defined in metadata.

josephburnett commented 5 years ago

I've pushed some partial work to master. @drjonnicholson give it a try.

a.json:

{
  "subject": {
    "inserts": [{"source_id": "abcd", "name":"English", "code":"ENG"}]
  }
}

b.json:

{
  "subject": {
    "inserts": [{"source_id": "abcd", "name":"English Language", "code":"ENG"}]
  }
}

Diff with defaults:

@ ["subject","inserts",0,"name"]
- "English"
+ "English Language"

Diff with -set parameter:

@ ["subject","inserts",{}]
- {"code":"ENG","name":"English","source_id":"abcd"}
+ {"code":"ENG","name":"English Language","source_id":"abcd"}

Diff with -set -setkeys=source_id parameters (this is what you want):

@ ["subject","inserts",{"source_id":"abcd"},"name"] 
- "English"
+ "English Language"

This will work from the commandline jd. The library usage has varying degrees of support. Patch is not yet implemented. But I did some work along the way to make it easier to add this kind of functionality, such as converting set and multset tests to table test, and supporting metadata at diff and patch time (not just at read time). I'll finish the patch feature so setkeys will work round trip. And then I'm going to remove read time application of metadata. I want all metadata to applied per hunk, not globally.

josephburnett commented 5 years ago

@drjonnicholson are you using jd as a binary or as a golang library?

drjonnicholson commented 5 years ago

Great stuff! I’ll give it a try on Tuesday (today’s a holiday). I’m using the jd binary, assume I’ll need to compile from source etc?

drjonnicholson commented 5 years ago

Ahh, I did go get -u github.com/josephburnett/jd and up it popped 👍

And serious thanks on this, that works like a charm! I can now see the wood from the trees when looking at my diffs! Well done! 👍 👍

josephburnett commented 5 years ago

Cool! I just finished adding path semantics for path metadata. Reading JSON files and patching JsonNodes no longer accepts metadata (e.g. SET, MULTISET, SetKeys). Only the Diff method accepts metadata and it is added to the Path. Patch dispatches to list, set or multiset semantics based on path metadata when encountered.

So your diff will now look like this:

@ ["subject","inserts",["set","setkeys=source_id"],{"code":"ENG","name":"English","source_id":"abcd"},"name"]
- "English
+ "English Language"

Although Patch is a little forgiving and will imply SET when encountering {} with no metadata. I could go a little further and imply SetKeys based on the key in the index object.

drjonnicholson commented 5 years ago

Sounds great, does that change the CLI at all out of curiosity?

josephburnett commented 5 years ago

The CLI should behave exactly the same. Except there is a new flag, --setkeys, which accepts a comma-delimited list of keys by which to identify objects globally: https://github.com/josephburnett/jd/blob/1f9071c800e7572a6844c86b272c71415c2e083e/main.go#L19

There isn't a way (yet) to target a specific path wtih keys when getting a diff. Set keys are applied to all sets. I also haven't implemented set keys for multisets (bags) yet, but I don't think tha matters to you.

The diff format itself is more powerful now and does describe the exact semantics every step along the way. E.g. you can express using different keys for different sets in a diff file. This was part of the work to make diffs self-describing. You don't need to provide set keys or even set parameter when patching.

drjonnicholson commented 5 years ago

That all sounds great to me :) Thanks again @josephburnett

josephburnett / jd

Differences in objects in sets #12