Closed nqzero closed 6 years ago
jq should return both pairs
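To make the report concrete (this sketch is mine, not part of the original issue): the "last key wins" collapse can be reproduced with Python's json module, which behaves like jq's parser here.

```python
import json

# RFC 8259 says object names "SHOULD be unique" but does not forbid
# duplicates, so this is a syntactically valid JSON text.
text = '{"a": 1, "a": 2}'

# The default parse silently keeps only the last occurrence of "a",
# which is also what jq does.
print(json.loads(text))  # {'a': 2}
```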
What jq does is certainly permissible according to the JSON standard, and in fact it does what Douglas Crockford and others recommend -- see for example https://esdiscuss.org/topic/json-duplicate-keys
See also #1509
However, the jq streaming parser can be used to handle "duplicate keys": see https://stackoverflow.com/questions/36956590/json-fields-have-the-same-name
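The idea behind the streaming approach can be sketched outside jq as well (Python as a stand-in; my example): read the object as a raw list of key/value pairs before any collapsing happens.

```python
import json

text = '{"a": 1, "a": 2}'

# object_pairs_hook is called with every key/value pair of the object,
# duplicates included, before the parser would collapse them into a dict.
# This is roughly the stage at which jq's --stream parser operates.
pairs = json.loads(text, object_pairs_hook=lambda p: p)
print(pairs)  # [('a', 1), ('a', 2)]
```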
certainly for many commands what jq does with duplicates seems ideal, but for to_entries it's really not
to_entries, from_entries, with_entries These functions convert between an object and an array of key-value pairs. If to_entries is passed an object, then for each k: v entry in the input, the output array includes {"key": k, "value": v}.
the other command that seems wrong is
Identity: . The absolute simplest filter is . . This is a filter that takes its input and produces it unchanged as output. That is, this is the identity operator.
though you could argue that the output formatting is stripping the fields (not the command). but that's not involved with to_entries
It's also important to point out that what it's doing in both of those cases is relative to how jq understood the object in the first place. While from your perspective . is modifying the input by removing the duplicates, to jq the input never had the duplicates. The parser does something RFC-compliant, by taking only one of the input keys (I think it's the last, but I'd have to confirm), and then the jq program itself never sees duplicates.
And in to_entries and from_entries the same thing applies. Those both operate on JSON values as jq understands them (which is that objects do not have duplicate keys).
All this said, I see your point, but I don't see jq adding support for duplicate keys in this way. As pkoppstein said, though, depending on your use case, the streaming parser might solve your problem.
In any case, it's the parser stripping the duplicate fields (and then jq's understanding of a JSON object preventing subsequent duplicates), and the parser is RFC-compliant in its behavior. I'm not inclined to change the parser's behavior, since every JSON implementation I've ever seen does the same. If the complaint is more about using jq as a JSON validator/something that doesn't touch the actual JSON bytes, then I'd point out that jq isn't really meant to be used that way.
the other command that seems wrong is Identity: .
I would agree that the documentation is misleading (though it must be said that from a certain point of view it can be defended). In any case, the topic has been addressed in the first Q: in the "Caveats" section of the jq FAQ
even if this was "fixed" in master right now, i wouldn't be able to use it (i need tools installed on ubuntu 16.04, so i'm using grep and sed instead), so i don't have a horse in the race. that said:
most parsers are converting a json object to a native object in some other language (eg, java, javascript or python) that doesn't support duplicate fields, so they're forced to make a reduction. your internal rep is a json object, which does support duplicate fields
while you do currently comply with the 8259 standard, you don't appear to (ianasl) comply with ecma-404, and this "fix" would comply with both standards, ie the proposed behavior is also allowed by 8259
jackson and gson (the two most popular java tools) can support duplicate keys, eg: https://stackoverflow.com/questions/44886795/represent-json-with-duplicate-keys-as-a-multimap
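The multimap representation mentioned in that answer can be sketched like this (in Python rather than Java, and the helper name is mine):

```python
import json
from collections import defaultdict

def as_multimap(pairs):
    # Keep every value seen for a key instead of overwriting,
    # mirroring the Jackson/Gson multimap approach in the linked answer.
    out = defaultdict(list)
    for key, value in pairs:
        out[key].append(value)
    return dict(out)

doc = json.loads('{"a": 1, "a": 2, "b": 3}', object_pairs_hook=as_multimap)
print(doc)  # {'a': [1, 2], 'b': [3]}
```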
wtlangford wrote "I'm not inclined to change the parser's behavior, since every JSON implementation I've ever seen does the same". even if this is true (i don't think it is, see jackson above) i can't see any disadvantage to supporting this behavior - if the user provides duplicate fields and then calls to_entries, it seems likely that they actually want the duplicates. worst case you could leave to_entries unchanged and add an all_entries
the current behavior is unintuitive, opinionated and cannot be overridden, quite the opposite of what i'd expect from "a lightweight and flexible command-line JSON processor"
the stack overflow answer that you've linked to (earlier in this issue and again through the link to the faq) states "For more complex outputs, that would require actually understanding how --stream is supposed to be used, which is beyond me", which doesn't inspire much confidence
i'm not using jq as a validator
if implementing this would be hard, i can understand not doing it. but none of the arguments that you've made against it being the desired behavior hold water for me
most parsers are converting a json object to a native object in some other language
It might help to think of jq's parser in the same way: that is, it converts (syntactically valid) JSON texts to "native" (jq) entities that have a mapping back to JSON.
Consider, for example, equality '=='. jq defines equality on JSON objects based in part on the "last key wins" principle. Thus, the proper way to interpret:
echo '{"a":0, "a":1}' | jq -c .
{"a":1}
is that '{"a":1}' is the JSON representation of the equivalence class of all syntactically-valid JSON objects that map to the same jq object.
Perhaps another way of saying this is that jq has an "opinionated" view of JSON that includes some semantic elements (notably equality) that are not (yet?) in the JSON standards.
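The equivalence-class reading can be checked directly (Python as a stand-in; my example): two different JSON texts that collapse to the same value compare equal after parsing.

```python
import json

# Distinct JSON texts...
a = json.loads('{"a": 0, "a": 1}')   # duplicate key, last one wins
b = json.loads('{"a": 1}')           # no duplicates

# ...but the same parsed value: both texts belong to the same
# equivalence class under "last key wins".
print(a == b)  # True
```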
Please note that adding a command-line switch to force jq to preserve duplicate keys might not be so straightforward conceptually as one might think. For example, consider this sentence from the json.org specification: An object is an unordered set of name/value pairs. One reading of this would be that {"a":0, "a":0} is invalid (as it has two occurrences of the same name/value pair); another is that it should map to {"a":0}.
Incidentally, jsonlint.com now reports duplicate keys as erroneous (SyntaxError), which suggests that what we have here resembles a storm in a teacup.
It might help to think of jq's parser in the same way: that is, it converts (syntactically valid) JSON texts to "native" (jq) entities that have a mapping back to JSON
i'm arguing that this is an implementation detail that the user shouldn't need to know
neither json.org nor jsonlint.com is canon (or associated in any public way that i've seen with either the ecma or ietf). that said, i agree that a command line switch seems counter-productive and changing . is almost certainly a bad idea since it's a widely used command and its output gets consumed by other tools
being opinionated isn't inherently bad, so long as you're also flexible
changing to_entries, or adding an all_entries, full_output, or @foo equivalent, would give you and your users the best of both worlds
@nqzero - Unfortunately you don't seem to have understood what @wtlangford and I wrote about the architecture of jq. Even if one added all_entries as you suggest, it wouldn't make any difference! The only way to have the best of both worlds would be to add a command-line switch that would change the parser, and much else besides.
My take on this is that (a) it's not very important to most JSON users; (b) the current jq maintainers won't support it in the foreseeable future (for one thing, they are averse to adding command-line switches and to anything that smacks of bloatware -- jq already has a secondary JSON parser that does not squish duplicate keys); (c) if it's important to you, then by all means fork jq; (d) maybe your fork will provide the needed impetus for the changes you want to be incorporated.
i understand that your architecture is broken. i'm suggesting you fix that architecture, or at least acknowledge that it's not easily fixable
@nqzero - As a matter of fact, there is widespread agreement that one aspect of the current architecture should be changed, because currently the jq parser maps JSON numbers to IEEE 754 64-bit values. Needless to say, this creates mayhem, but even so, the maintainers seem to be daunted by the major surgery that would be required for jq . to be a true pretty-printer of JSON numbers.
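To illustrate why the IEEE 754 mapping creates trouble (my example, not from the thread): a 64-bit double has a 53-bit mantissa, so large JSON integers cannot round-trip through it.

```python
# A 19-digit integer is wider than a double's 53-bit mantissa.
n = 12345678901234567890

# Forcing it through a 64-bit float, as jq's parser effectively does
# with all JSON numbers, changes the value.
as_double = float(n)
print(int(as_double) == n)  # False: the round trip loses precision
```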
thanks @pkoppstein - that explanation makes a lot of sense
Sorry if this is seen as hijacking, but I wonder if it would at least be possible to produce a warning/error when the parser detects duplicate keys? I spent more time than I would care to admit trying to understand why data appeared to be missing from a JSON file because I was viewing it with jq. I think not dealing with duplicate keys is a reasonable design decision, but it would be really helpful to have some kind of an alert that all the data in the file is not going to be accessible with jq.
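Such an alert is easy to sketch outside jq (Python as a stand-in; the hook name is mine): fail loudly instead of silently dropping data.

```python
import json

def reject_duplicates(pairs):
    # Raise instead of silently keeping the last value --
    # the kind of alert requested above.
    keys = [k for k, _ in pairs]
    dupes = sorted({k for k in keys if keys.count(k) > 1})
    if dupes:
        raise ValueError(f"duplicate keys: {dupes}")
    return dict(pairs)

print(json.loads('{"a": 1, "b": 2}', object_pairs_hook=reject_duplicates))

try:
    json.loads('{"a": 1, "a": 2}', object_pairs_hook=reject_duplicates)
except ValueError as err:
    print(err)  # duplicate keys: ['a']
```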
Just adding my two cents that a warning/error on duplicate keys would be super nice, after having run into this issue. In my case, it was a bit more complicated, as the error happened during a slurp:
ip --json addr | jq '.[] | { (.ifname): .link_type }' | jq -s 'add'
Here, with duplicate ifname values all is well until they are moved to keys and slurped, but then they get silently dropped in the second part.
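The same silent drop can be sketched in Python (interface names here are illustrative, not taken from any real machine): merging objects with add-like semantics keeps only the last value per key.

```python
from functools import reduce

# Two interfaces that ended up with the same ifname become colliding keys.
objects = [
    {"eth0": "ether"},
    {"eth0": "ether"},      # duplicate ifname
    {"lo": "loopback"},
]

# jq's `add` on a slurped array merges objects left to right; on a key
# collision the later value wins, just like dict update here.
merged = reduce(lambda acc, obj: {**acc, **obj}, objects, {})
print(merged)                       # {'eth0': 'ether', 'lo': 'loopback'}
print(len(objects), len(merged))    # 3 2: one entry silently dropped
```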
json fields are not required to be unique, but jq ignores duplicates: jq should return both pairs