Closed drjonnicholson closed 3 years ago
A diff like this would be nice:
A:
[{"x":1,"y":2},{"x":3,"z":4}]
B:
[{"x":3,"z":4,"a",5},{"x":1,"y":2}]
Diff:
@ [{"x":3} "a"]
+ 5
(Adding "a":5 to object "x":3 in the array-set)
But how do you tell jd which key to use when determining identity in the set? And what do you do when you have multiple sets in a complex structure with different ids?
This is where metadata on the structure would be helpful for encoding these object semantics. Like dispatch notation on EDN or type annotations in Ion. Unfortunately JSON is the lingua-franca of the Internet so such metadata has to be carried as a sidecar.
One way to handle this is to pass an option to jd which is a list of paths and how to treat them. Like this:
[
[{"x":""}]
]
(Treat the top-level array as a set whose id key is "x")
Or
[
["a",[],"b",{}]
]
(Treat the array at "b" as a set. Which is found inside a list at "a".)
E.g.
{"a":[{"b":[1,2]}]}
is equal to
{"a":[{"b":[2,1]}]}
We have a case where we have objects in arrays, which we treat as sets.
Can you give me an example?
A while ago I opened a feature request for this: https://github.com/josephburnett/jd/issues/4
In our case the structure of data is of a form like:
{
"object": {
"inserts": [{"source_id": "abcd", ...}],
"updates": [{"source_id": "efgh", ...}],
"deletes": [{"source_id": "ijkl", ...}]
},
... more like above ...
}
Where source_id
is the identity field.
I like what you're suggesting, and I think ultimately there is a need for paths. However, in this case I think a global approach is enough, and a quick win - that is, an additional setting to compliment the -set
parameter that specified the property used to identify objects.
For example, if we had a separate option like -set_ident=source_id
, or allow it to be specified after the set param itself like -set=source_id
(I have no strong feelings in either direction), then a comparison might look like this:
A:
{
"subject": {
"inserts": [{"source_id": "abcd", "name":"English", "code":"ENG"}]
}
}
B: A:
{
"subject": {
"inserts": [{"source_id": "abcd", "name":"English Language", "code":"ENG"}]
}
}
Diff:
@ ["subject", "inserts", {"source_id":"abcd"} "name"]
- English
+ English Language
Or, since we specified the identifier as a global thing rather than path specific we don't need to remind people of the field name, we could present a simplified diff path like ["subject", "inserts", {"abcd"} "name"]
.
I prefer the format which includes the id because it make the diff self-describing:
@ ["subject", "inserts", {"source_id":"abcd"} "name"]
- English
+ English Language
I would also prefer to keep every line (except the first charater) as valid JSON so it's easy to parse. So I would avoid {"abcd"}
.
I like your suggestion to provide identifier keys with the set parameter: -set=source_id
. I would extend the suggestion to include a comma-separated list of keys, all of which will be used to identify structures in a list if they are present.
Example:
jd -set=source_id,other_id a.json b.json
{
"by source id": [{"source_id": "abcd", "name":"Polish", "code":"POL"}, {"source_id": "abcd", "name":"English", "code":"ENG"}],
"by other id": [{"other_id": "abcd", "name":"Polish", "code":"POL"}, {"other_id": "abcd", "name":"English", "code":"ENG"}],
"by both ids": [{"source_id": "abcd", "other_id": "abcd", "name":"Polish", "code":"POL"}, {"other_id": "abcd", "name":"English", "code":"ENG"}],
}
{
"by source id": [{"source_id": "abcd", "name":"English Language", "code":"ENG"}, {"source_id": "abcd", "name":"Polish", "code":"POL"}],
"by other id": [{"other_id": "abcd", "name":"English Language", "code":"ENG"}, {"other_id": "abcd", "name":"Polish", "code":"POL"}],
"by both ids": [{"source_id": "abcd", "other_id": "abcd", "name":"Polish", "code":"POL"}, {"other_id": "abcd", "name":"English Language", "code":"ENG"}],
}
@ ["by source id", {"source_id", "abcd"}]
- "English"
+ "English Language"
@ ["by other id", {"other_id", "abcd"}]
- "English"
+ "English Language"
@ ["by both ids", {"source_id", "abcd", "other_id": "abcd"}]
- "English"
+ "English Language"
That all sounds good to me 👍 Your argument about the diff being self describing makes sense, especially in allowing multiple identifiers.
I'd do a PR for this but I'm unfamiliar with Go, is this something I can throw back your way for implementation?
Sure.
Thank you :)
I've been doing some design and initial coding on this. Somethings that make this an interesting change:
This is all just fine. It just changes a few assumptions in the implementation so it's taking a little while to code.
Adding the set keys feature takes us a little further down the path of layering metadata semantics onto the data. So I've also been thinking about how to express metadata in general. Some invariants I would like to preserve:
I've settled on a metadata representation that I think will satisfy these. Path elements are numbers, strings or objects. Each path element may be proceeded by a list of metadata. Each metadata element is a string "key" followed by an optional "=" and string value.
E.g.:
@ ["here","there",["multiset"],{"id":"foo"},"bar"]
+ "a"
- "b"
This will read into "here", "there" and then patch all objects with "id":"foo" at "bar".
It's important to know all the possible keys when patching through a set, even if the objects don't have them. E.g. if the flag -setkeys=id1,id2
is provided and a patch finds an object with just id1
then it will emit a path with {"id1":"foo"}
. But if that set later gets an object with id1
and id2
you must know that id2
is part of the identity.
I see two ways to handle this. I can either 1) list all possible keys in metadata, or 2) list all possible keys in the path element mapping to null
when not present.
@ [["key=id1","key=id2"],{"id1":"foo"}]
or
@ [{"id1":"foo","id2":null}]
But there is a difference between a key with a value of null
and a key which is simply not present. For that reason, I'm thinking to go with explicit metadata.
Another possibility is to list explicitly in metadata all the keys that are relevant but were not present in the path element:
@ [["key=id2"],{"id1":"foo"}]
This would produce the simple diff you want for the class structure:
@ ["classes", {"source_id", "abcd"}]
- "English"
+ "English Language"
The diff engine can infer that it's a set because of the struct path element. And it can infer that all relevant keys were present because there is no key metadata.
Hmm, very interesting questions.
I definitely don't like the second option for the reason you give. I'm never a fan of field presence/null value ambiguities.
Out of the other two, I do like the compact syntax proposed in the 3rd option, and would result in the cleanest to read diff in my use case...... and yet I see the use of being explicit. Are we limited to one option? I mean it appears to me that we could default to output a diff using the 1st (explicit) approach, but provide a --compact
option or something to reduce the paths to the form of the 3rd option.
If we did that reading in the diff would have to assume all diffs are presented as in the 3rd case - that is always union the metadata keys and the used keys to get the full list rather than expect them all to be defined in metadata.
I've pushed some partial work to master. @drjonnicholson give it a try.
a.json
:
{
"subject": {
"inserts": [{"source_id": "abcd", "name":"English", "code":"ENG"}]
}
}
b.json
:
{
"subject": {
"inserts": [{"source_id": "abcd", "name":"English Language", "code":"ENG"}]
}
}
Diff with defaults:
@ ["subject","inserts",0,"name"]
- "English"
+ "English Language"
Diff with -set
parameter:
@ ["subject","inserts",{}]
- {"code":"ENG","name":"English","source_id":"abcd"}
+ {"code":"ENG","name":"English Language","source_id":"abcd"}
Diff with -set -setkeys=source_id
parameters (this is what you want):
@ ["subject","inserts",{"source_id":"abcd"},"name"]
- "English"
+ "English Language"
This will work from the commandline jd
. The library usage has varying degrees of support. Patch is not yet implemented. But I did some work along the way to make it easier to add this kind of functionality, such as converting set and multset tests to table test, and supporting metadata at diff and patch time (not just at read time). I'll finish the patch feature so setkeys
will work round trip. And then I'm going to remove read time application of metadata. I want all metadata to applied per hunk, not globally.
@drjonnicholson are you using jd
as a binary or as a golang library?
Great stuff! I’ll give it a try on Tuesday (today’s a holiday). I’m using the jd binary, assume I’ll need to compile from source etc?
Ahh, I did go get -u github.com/josephburnett/jd
and up it popped 👍
And serious thanks on this, that works like a charm! I can now see the wood from the trees when looking at my diffs! Well done! 👍 👍
Cool! I just finished adding path semantics for path metadata. Reading JSON files and patching JsonNodes no longer accepts metadata (e.g. SET, MULTISET, SetKeys). Only the Diff method accepts metadata and it is added to the Path. Patch dispatches to list, set or multiset semantics based on path metadata when encountered.
So your diff will now look like this:
@ ["subject","inserts",["set","setkeys=source_id"],{"code":"ENG","name":"English","source_id":"abcd"},"name"]
- "English
+ "English Language"
Although Patch is a little forgiving and will imply SET when encountering {}
with no metadata. I could go a little further and imply SetKeys based on the key in the index object.
Sounds great, does that change the CLI at all out of curiosity?
The CLI should behave exactly the same. Except there is a new flag, --setkeys
, which accepts a comma-delimited list of keys by which to identify objects globally: https://github.com/josephburnett/jd/blob/1f9071c800e7572a6844c86b272c71415c2e083e/main.go#L19
There isn't a way (yet) to target a specific path wtih keys when getting a diff. Set keys are applied to all sets. I also haven't implemented set keys for multisets (bags) yet, but I don't think tha matters to you.
The diff format itself is more powerful now and does describe the exact semantics every step along the way. E.g. you can express using different keys for different sets in a diff file. This was part of the work to make diffs self-describing. You don't need to provide set keys or even set parameter when patching.
That all sounds great to me :) Thanks again @josephburnett
We have a case where we have objects in arrays, which we treat as sets.
Each of these objects has an identifier property, in our case this is
source_id
, but it could easily beid
oridentifier
, etc. It'd be really nice to be able to specify such a property name so that diffs of arrays treated as sets end up looking cleaner.Thoughts?