jqlang / jq

Command-line JSON processor
https://jqlang.github.io/jq/

duplicate fields are dropped #1636

Closed nqzero closed 6 years ago

nqzero commented 6 years ago

json fields are not required to be unique, but jq ignores duplicates, e.g.:

echo '{"x":0,"x":1}' | jq 'to_entries'
[
  {
    "key": "x",
    "value": 1
  }
]

jq should return both pairs

pkoppstein commented 6 years ago

jq should return both pairs

What jq does is certainly permissible according to the JSON standard, and in fact it does what Douglas Crockford and others recommend -- see for example https://esdiscuss.org/topic/json-duplicate-keys

See also #1509

However, the jq streaming parser can be used to handle "duplicate keys": see https://stackoverflow.com/questions/36956590/json-fields-have-the-same-name
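
For instance, the streaming parser emits [path, value] events before any object gets built, so both pairs survive (a sketch; the exact event stream may vary slightly by jq version):

echo '{"x":0,"x":1}' | jq -c --stream .
[["x"],0]
[["x"],1]
[["x"]]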

nqzero commented 6 years ago

certainly for many commands what jq does with duplicates seems ideal, but for to_entries it's really not

to_entries, from_entries, with_entries
These functions convert between an object and an array of key-value pairs. If to_entries is passed an object, then for each k: v entry in the input, the output array includes {"key": k, "value": v}.

the other command that seems wrong is

Identity: .
The absolute simplest filter is . . This is a filter that takes its input and produces it unchanged as output. That is, this is the identity operator.

though you could argue that it's the output formatting that strips the fields (not the command itself). but that argument doesn't apply to to_entries

wtlangford commented 6 years ago

It's also important to point out that what it's doing in both of those cases is relative to how jq understood the object in the first place. While from your perspective . is modifying the input by removing the duplicates, to jq the input never had the duplicates. The parser does something RFC-compliant, by taking only one of the input keys (I think it's the last, but I'd have to confirm), and then the jq program itself never sees duplicates. And in to_entries and from_entries the same thing applies. Those both operate on JSON values as jq understands them (which is that objects do not have duplicate keys).
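
For what it's worth, a quick informal check suggests it is indeed the last key that wins (worth confirming on your own build):

echo '{"x":0,"x":1}' | jq '.x'
1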

All this said, I see your point, but I don't see jq adding support for duplicate keys in this way. As pkoppstein said, though, depending on your use case, the streaming parser might solve your problem.

In any case, it's the parser stripping the duplicate fields (and then jq's understanding of a JSON object preventing subsequent duplicates), and the parser is RFC-compliant in its behavior. I'm not inclined to change the parser's behavior, since every JSON implementation I've ever seen does the same. If the complaint is more about using jq as a JSON validator/something that doesn't touch the actual JSON bytes, then I'd point out that jq isn't really meant to be used that way.


pkoppstein commented 6 years ago

the other command that seems wrong is Identity: .

I would agree that the documentation is misleading (though it must be said that, from a certain point of view, it can be defended). In any case, the topic has been addressed by the first Q in the "Caveats" section of the jq FAQ.

nqzero commented 6 years ago

even if this were "fixed" in master right now, i wouldn't be able to use it (i need tools installed on ubuntu 16.04, so i'm using grep and sed instead), so i don't have a horse in the race. that said: most parsers are converting a json object to a native object in some other language

pkoppstein commented 6 years ago

most parsers are converting a json object to a native object in some other language

It might help to think of jq's parser in the same way: that is, it converts (syntactically valid) JSON texts to "native" (jq) entities that have a mapping back to JSON.

Consider, for example, equality '=='. jq defines equality on JSON objects based in part on the "last key wins" principle. Thus, the proper way to interpret:

echo '{"a":0, "a":1}' | jq -c . 
{"a":1}

is that '{"a":1}' is the JSON representation of the equivalence class of all syntactically-valid JSON objects that map to the same jq object.

Perhaps another way of saying this is that jq has an "opinionated" view of JSON that includes some semantic elements (notably equality) that are not (yet?) in the JSON standards.

Please note that adding a command-line switch to force jq to preserve duplicate keys might not be so straightforward conceptually as one might think. For example, consider this sentence from the json.org specification: "An object is an unordered set of name/value pairs."

One reading of this would be that {"a":0, "a":0} is invalid (as it has two occurrences of the same name/value pair); another is that it should map to {"a":0}.

Incidentally, jsonlint.com now reports duplicate keys as erroneous (SyntaxError), which suggests that what we have here resembles a storm in a teacup.

nqzero commented 6 years ago

It might help to think of jq's parser in the same way: that is, it converts (syntactically valid) JSON texts to "native" (jq) entities that have a mapping back to JSON

i'm arguing that this is an implementation detail that the user shouldn't need to know

neither json.org nor jsonlint.com is canon (or associated in any public way that i've seen with either the ecma or the ietf). that said, i agree that a command line switch seems counter-productive, and changing . is almost certainly a bad idea since it's a widely used command and its output gets consumed by other tools

being opinionated isn't inherently bad, so long as you're also flexible. adding something like an all_entries builtin that preserves duplicates would give you and your users the best of both worlds

pkoppstein commented 6 years ago

@nqzero - Unfortunately you don't seem to have understood what @wtlangford and I wrote about the architecture of jq. Even if one added all_entries as you suggest, it wouldn't make any difference! The only way to have the best of both worlds would be to add a command-line switch that would change the parser, and much else besides.

My take on this is that (a) it's not very important to most JSON users; (b) the current jq maintainers won't support it in the foreseeable future (for one thing, they are averse to adding command-line switches and to anything that smacks of bloatware -- jq already has a secondary JSON parser that does not squish duplicate keys); (c) if it's important to you, then by all means fork jq; (d) maybe your fork will provide the needed impetus for the changes you want to be incorporated.

nqzero commented 6 years ago

i understand that your architecture is broken. i'm suggesting you fix that architecture, or at least acknowledge that it's not easily fixable

pkoppstein commented 6 years ago

@nqzero - As a matter of fact, there is widespread agreement that one aspect of the current architecture should be changed, because currently the jq parser maps JSON numbers to IEEE 754 64-bit values. Needless to say, this creates mayhem, but even so, the maintainers seem to be daunted by the major surgery that would be required for jq . to be a true pretty-printer of JSON numbers.
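
For example, with an older jq (1.5/1.6-era behavior; newer releases have improved literal preservation, so your output may differ):

echo '10000000000000000001' | jq .
10000000000000000000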

nqzero commented 6 years ago

thanks @pkoppstein - that explanation makes a lot of sense

michaelmior commented 2 years ago

Sorry if this is seen as hijacking, but I wonder if it would at least be possible to produce a warning/error when the parser detects duplicate keys? I spent more time than I would care to admit trying to understand why data appeared to be missing from a JSON file because I was viewing it with jq. I think not dealing with duplicate keys is a reasonable design decision, but it would be really helpful to have some kind of an alert that all the data in the file is not going to be accessible with jq.
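
In the meantime, one rough way to run such a check yourself is via the streaming parser (a sketch; file.json is a placeholder, and this only catches duplicates whose leaf paths collide, e.g. repeated keys with scalar values):

jq -c --stream 'select(length == 2) | .[0]' file.json | sort | uniq -d

Any path printed by uniq -d occurs more than once in the input, which for an object means a repeated key.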

torfason commented 1 month ago

Just adding my two cents that a warning/error on duplicate keys would be super nice, after having run into this issue. In my case, it was a bit more complicated, as the error happened during a slurp:

ip --json addr | jq '.[] | { (.ifname): .link_type }' | jq -s 'add'

Here, with duplicate ifname values, all is well until they are moved to keys and slurped; then they get silently dropped in the second part.
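
The same drop can be reproduced without ip (the interface data below is made up for illustration):

echo '[{"ifname":"eth0","link_type":"ether"},{"ifname":"eth0","link_type":"loopback"}]' |
  jq '.[] | { (.ifname): .link_type }' | jq -s 'add'
{
  "eth0": "loopback"
}

The first jq emits two {"eth0": ...} objects; add then merges them, and for objects + keeps the right-hand value of each repeated key, so the first link_type is silently lost.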