jqlang / jq

Command-line JSON processor
https://jqlang.github.io/jq/

Documentation/Feature request: Clarify expressions/scope #1326

Open lylemoffitt opened 7 years ago

lylemoffitt commented 7 years ago

Exigent Question:

At any point in a jq script, what does the filter . return? It may be easy for an experienced user, but it's not clear from the documentation. Put another way: what defines an expression? What delimits scope? The answers to these questions are implied by the documentation, but never stated explicitly. It's ironic that the dot filter is referred to as the "least interesting filter", because it is the key to understanding the transformation of data through the script.

To clarify, a script is the string passed as the command-line argument filter or loaded with --from-file. This is to disambiguate from the pragmatic units chained together within it, which are also called 'filters'.

Problems:

The man page doesn't really say a whole lot about parentheses. They pretty much only show up in function signatures and in examples. Yet they have a fundamental relationship with the dot filter, and thus a critical role in the functioning of the script. Their usage should be clarified. It would also be helpful to clarify their relationship with the object constructors, [] and {}, as all three are used to create sub-expressions and return objects.

The easy thing to do here would be to just create a section where you define () as an expression operator or scope operator, and then stick all the missing explanation there. This might solve the immediate issue, but you could do a lot better. I'm trying to stick with one problem here, but in general the manual could be a lot clearer. I don't know if you're trying to intentionally hide that jq is a full-blown language, but it would certainly be a lot cleaner if you approached explaining the query language like it was the pure-function programming language it is.

Suggested Solutions:

  1. Define the operator () as a Value Constructor and put it in the Types and Values section. It constructs a value from the output of the contained expression. The only change needed to bring its existing functionality in line with the other constructor operators is that it must also work when the expression is empty. Analogous to [] and {}, the empty form should construct a null value.

    Example:

    For each serialized JSON input, the type construction operator should return the minimum viable value of same type when the contained expression is empty, or the result of composing the expression over the input otherwise.

    echo '<json>' | jq -c '[]'              #=> []
    echo '<json>' | jq -c '[<expr>]'        #=> [result of <expr> applied to <json>]
    echo '<json>' | jq -c '{}'              #=> {}
    echo '<json>' | jq -c '{<key>:<expr>}'  #=> {<key>: result of <expr> applied to <json>}
    echo '<json>' | jq -c '()'              #=> null
    echo '<json>' | jq -c '(<expr>)'        #=> result of <expr> applied to <json>
  2. Add a section Operator Precedence and Expression Evaluation (or something to that effect) with the following:

    1. Define how filters and operators are composed into expressions and how the expressions are applied to the input JSON string to create the output JSON string. An explicitly codified type-transform like the following (written in pseudo-Haskell) would be one way to do it and be enormously helpful in terms of reasoning about a jq script.

      -- A filter is a function that accepts JSON and returns JSON
      filter      :: ( JSON ) -> JSON
      -- An operator is a function that accepts 2 filters and returns a filter
      operator    :: ( filter , filter ) -> filter
      -- A closure is a function that accepts JSON and a filter and returns JSON
      closure     :: ( JSON , filter ) -> JSON
    2. Define operator precedence. I know it's basically just left to right and parentheses first, but it's important to state these things explicitly. This is where the type-transform will come in handy again, because it helps elucidate why different sets of operators have different semantics. For example, constructors (like [] and {}), which are called operators, are actually closures. This explains why they have totally different semantics.

    3. Define scoping rules. The effect of () on scope is briefly mentioned in the Variables sub-section, but never talked about directly. The relationship between constructors and scope is never mentioned at all. The relationship between . and the concept of scope should also be discussed. Again, closures will help here.
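
To make the questions about constructors and scope concrete, here is a minimal sketch of how current jq behaves (it does not cover the proposed empty `()` form, which jq does not accept). It assumes jq 1.5 or later is installed; the sample input is made up for illustration.

```shell
# Hypothetical sample input; every expression below sees the same "." value.
input='{"a":1,"b":2}'

# Parentheses only group: "." inside them is still the whole input. Prints 3
echo "$input" | jq -c '(.a + .b)'

# The array constructor collects the outputs of its expression. Prints [1,2]
echo "$input" | jq -c '[.a, .b]'

# The object constructor evaluates each value against the same input. Prints {"sum":3}
echo "$input" | jq -c '{sum: (.a + .b)}'

# $x is visible only inside the parentheses; after them "." is the result, 3. Prints 30
echo "$input" | jq -c '(.a as $x | $x + .b) | . * 10'
```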

nicowilliams commented 7 years ago

Hmm, OK. I suppose the docs do need some clarification. I do think that @stedolan was trying to go for an intuitive description of an intuitive language. However, jq is a rather powerful language with aspects that are not obvious at first glance.

. is always "the current input value". Always. You can add |.| in most places and... it changes nothing, because it means "produce the current input value (from the expression on the left of the pipe) to the expression on the right of the pipe".
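
A quick check of the claim that an extra `| . |` changes nothing, assuming jq is installed (the input document is invented for the example):

```shell
# Both commands print 7: the extra "." just passes its input along unchanged.
echo '{"a":{"b":7}}' | jq -c '.a | .b'
echo '{"a":{"b":7}}' | jq -c '.a | . | .b'
```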

Function arguments might be particularly confusing. It's best to think of functions as having ONE (and only one) value argument and zero, one, or more function arguments. E.g., def foo: . + .; has one value argument, while def foo(bar): . + bar; has one value argument and one function argument (bar), and outputs . + <bar applied to .>.
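
The two definitions above can be tried directly. This is only a sketch, assuming jq is installed; the input value 3 and the argument `. * 10` are arbitrary choices for illustration.

```shell
# One value argument (the input "."): 3 + 3. Prints 6
jq -n 'def foo: . + .; 3 | foo'

# One value argument plus one function argument: bar is applied to "." inside foo,
# so this computes 3 + (3 * 10). Prints 33
jq -n 'def foo(bar): . + bar; 3 | foo(. * 10)'
```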

Parentheses can also be used to group expressions. E.g., (1 + 2) * 3. I think this is fairly obvious, but it's true and surprising that the manual does not mention this!

Parentheses can be important to deal with precedence issues. E.g., 1, 2 | . * 3, . * 5 could be interpreted in a number of different ways (though in only one way by jq) -- it's better to use parentheses to avoid confusion. E.g., (1, 2) | ((. * 3), (. * 5)) or 1, (2 | ((. * 3), (. * 5))).
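
To see which reading jq actually picks, the three expressions can be run side by side. A small sketch, assuming jq is installed; each expression is wrapped in `[...]` only to collect its outputs on one line.

```shell
# Unparenthesized form: the pipe binds loosest, so this is the first grouping below.
jq -n -c '[1, 2 | . * 3, . * 5]'              # prints [3,5,6,10]

# Explicit grouping that matches jq's own parse.
jq -n -c '[(1, 2) | ((. * 3), (. * 5))]'      # prints [3,5,6,10]

# The other plausible reading, which needs parentheses to express.
jq -n -c '[1, (2 | ((. * 3), (. * 5)))]'      # prints [1,6,10]
```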

nicowilliams commented 7 years ago

Thanks for your input! Keep it coming. It will make jq better.

lylemoffitt commented 7 years ago

@nicowilliams Thanks for the quick response. I get what they were going for, I just felt like it kinda tripped over itself a little to get there. The language is intuitive and simple, I just wish it had been explained better. I ignored jq in favor of the less powerful, but easier to grok jo, months ago exactly because of the documentation.

Keep it coming.

I definitely have more ideas, but they are more focused around enhancing the programming language aspects. I wanted to see how receptive the community is first, before going further.

pkoppstein commented 7 years ago

@lylemoffitt - I'm not sure this is relevant, but since you wrote:

I definitely have more ideas, but they are more focused around enhancing the programming language aspects.

I thought I'd mention that a jq documentation effort has just started at stackoverflow.com. Maybe it could be justified by adopting a "programming language" approach?

An entry point: https://stackoverflow.com/documentation/jq/topics

lylemoffitt commented 7 years ago

@pkoppstein - Thanks for mentioning that. Wasn't aware of that feature on stackoverflow. It isn't really what I had in mind though.

Maybe it could be justified by adopting a "programming language" approach?

I think it would be better than the current one, but that's not really my call. I'm also not saying the current approach is bad either; I just don't think it's as effective as it could be. Like sed, jq is a great CLI tool with an embedded DSL. In sed's documentation (its man page), they took the approach of emphasizing the DSL over the CLI. This (IMO) is probably what led to the long-term success of sed as a tool, but it also has the downside of making it harder to approach. I myself only recently understood the deeper nature of sed beyond its sed -e 's///' usage in part because I found the documentation so dense. But, now that I'm over the hump, I wouldn't have it any other way.

TLDR - It's a tradeoff.

pkoppstein commented 7 years ago

@lylemoffitt - I have no idea how the jq documentation at stackoverflow.com will pan out, but I like the combination of brevity and accessibility that characterizes the current "manual", so in a way it would make sense for the more "programming language" orientation that you have in mind to have a home at stackoverflow.com, if there is to be additional documentation there.

(Currently, as you may know, the home for the more technical aspects and details is the jq wiki. Maybe you'd like to start a "jq for Programmers" page there? The potential downside of that is the risk that things could get confusing with an official tutorial, an official manual, another manual on the jq wiki, and still another manual of sorts on stackoverflow ...)

My orientation is heavily influenced by the documentation I worked on for a large proprietary language. There were three distinct volumes:

  1. Tutorial
  2. Manual (i.e. reference manual)
  3. User's Guide
lylemoffitt commented 7 years ago

@pkoppstein

modulo a few tweaks [...] brevity and accessibility

I'm inclined to agree with you here. I'm not 100% sure what the right approach is given that each has its own set of trade-offs.

the jq wiki

I hadn't actually seen the wiki before. Like most projects on GitHub, I had assumed it was empty or full of incomplete/outdated information. This one has some good information that is appropriately placed there. A "jq for Programmers" page there would probably be better than stackoverflow. Either way, it's always second-class to the reference material provided with a distribution.


Ideally, there should be a quick-reference that's just as accessible as the current man page, but aimed at more experienced users. Perhaps a good solution would be to have two separate man pages? The current man jq could stay focused on the quick-n-dirty CLI usage, while man jq-lang could be focused on the language and .jq module documentation.

nicowilliams commented 7 years ago

@pkoppstein What's the copyright licensing associated with SO docs?

pkoppstein commented 7 years ago

@nicowilliams - As best I can tell, the rules are elaborated in Section 3 ("Subscriber Content") of http://stackexchange.com/legal. The key point seems to be "all Subscriber Content that You contribute to the Network is perpetually and irrevocably licensed to Stack Exchange under the Creative Commons Attribution Share Alike license."

My (somewhat cursory) reading is that the contributor retains copyright and is not expected to grant an exclusive license.

nicowilliams commented 7 years ago

@pkoppstein Excellent. Thanks.

nicowilliams commented 7 years ago

I've pushed a partial fix for this, 6f9646a.

fadado commented 7 years ago

In relation to operator precedence, I found this table at Rosetta Code:

Precedence  Operator                  Associativity  Description
lowest      |                         %right         pipe
            ,                         %left          generator
            //                        %right         specialized "or" for detecting empty streams
            = |= += -= *= /= %= //=   %nonassoc      set component
            or                        %left          short-circuit "or"
            and                       %left          short-circuit "and"
            != == < > <= >=           %nonassoc      boolean tests
            + -                       %left          polymorphic plus and minus
            * / %                     %left          polymorphic multiply, divide; mod
highest     ?                         (none)         post-fix operator for suppressing errors

fadado commented 7 years ago

@lylemoffitt

It's ironic that the dot filter is referred to the "least interesting filter", because it is the key to understanding the transformation of data through the script.

Yes, I will replace "least interesting filter" with

Two important predefined filters are "." (pass), the filter that does nothing, and "empty", the filter that never produces values. The main laws for those filters and the | (bind) and , (then) operators are:

. | a  ≡  a
a | .  ≡  a

empty , a    ≡  a
a , empty    ≡  a

empty | a    ≡  empty  
a | empty    ≡  empty

a , (b , c)  ≡  (a , b) , c
(a , b) | c  ≡  (a | c) , (b | c)
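
A few of these laws can be spot-checked on concrete filters. This is only an illustration under stated assumptions, not a proof: jq is assumed to be installed, a is taken to be `(1, 2)`, b is `3`, and c is `. * 10`, all chosen arbitrarily. Each side is collected into an array so the two output streams can be compared with `==`.

```shell
# . | a  ≡  a
jq -n -c '[. | (1, 2)] == [(1, 2)]'          # prints true

# empty , a  ≡  a
jq -n -c '[empty, (1, 2)] == [(1, 2)]'       # prints true

# a | empty  ≡  empty
jq -n -c '[(1, 2) | empty] == []'            # prints true

# (a , b) | c  ≡  (a | c) , (b | c)
jq -n -c '[((1, 2), 3) | . * 10] == [((1, 2) | . * 10), (3 | . * 10)]'   # prints true
```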

By the way, for my sanity I decided to put names to all filters and operators

Filter/Op.  Name
.           pass
|           bind
,           then
[ ]         values
?           protect
//          alternative

The manual seems to deliberately avoid naming all things!

JJOR

pkoppstein commented 7 years ago

@fadado wrote:

The manual seems to deliberately avoid naming all things!

Yes, that's one way the manual achieves a brilliant economy of expression and avoids the "cognitive burden" that comes with naming, especially if the names are potentially misleading, as is the case with "pass" for ".".

Readers can be encouraged to pronounce the single-character punctuation operators in accordance with their preferences for pronouncing the punctuation characters themselves (e.g. "dot" for ".", "pipe" for "|", and "comma" for ",").

Please note that [] is not an "operator" in the usual sense. Fundamentally, [] is the empty JSON array. The postfix use of [] is, in my opinion, best understood as a shorthand, i.e., under certain circumstances, expr | .[] can be contracted to expr[] and/or (expr)[].
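
The shorthand can be checked on a small example. A sketch assuming jq is installed, with an invented input object:

```shell
# Long form: pipe the array into ".[]". Prints 1 and 2 on separate lines.
echo '{"a":[1,2]}' | jq -c '.a | .[]'

# Contracted form: same two outputs.
echo '{"a":[1,2]}' | jq -c '.a[]'
```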

The name "alternative" for "//" is appropriate as it is a two-character operator with a meaning that is unrelated to "/".

lylemoffitt commented 7 years ago

@fadado

operator precedence

That's interesting, and helpful, thanks. I was surprised to see that the alternative operator (//) is right-associative. Isn't it defined to evaluate left to right?

The main laws

This. This is more of the kind of thing I was talking about. Helpful, clear, concise. Even if this is alien to a normal user, it's still worth putting in, because of how innocuous it is.

lylemoffitt commented 7 years ago

@pkoppstein

one way the manual achieves a brilliant economy of expression and avoids the "cognitive burden" that comes with naming

Generally, easing cognitive burden goes hand in hand with low expressive power. The man page may come off as an easy read, but it does so at the cost of length and verbosity. If you're set on reading it, the length may not be important, but it's certainly off-putting. Part of the trade-off of writing to a low bar is that, while it makes on-boarding easier, it dampens the long-term effectiveness. Now that I understand the language better, I would much rather have a normal function reference, but my only choice is to scroll through a lot of text trying to remember which section the function I'm looking for is under.

pronounce the single-character punctuation operators in accordance with their preferences

The problem with the "call it whatever you want" mentality is that you lack community agreement. Especially if you want people to be able to find reference materials on Stack Overflow, they are going to need a common name to google. Searching for "jq slash-slash" is going to end in a bad user experience. Moreover, all of this is done in the name of bowing to the fear that users will flee because you made them learn the names for things. If you structure the man page uniformly, they won't even notice the names. Once they get the formatting, their eyes will just jump to the section they care about.

Please note that [] is not an "operator" in the usual sense.

I believe we are all in agreement. The man page uses the terms operator, filter, and function somewhat interchangeably. I believe the general rule it follows is that filters have word-names, functions have word-names and explicit arguments in parens, and operators are symbols.


When it comes to learning how to use a tool, none of this complexity really matters. All you really want to know is how to grep the fields out of the stupid json. But when it comes to learning how to use a language, it's all very important. As I said before, jq's problem is that it's both. I remain with my estimation that the best approach is to split the two aspects into their own pages.

fadado commented 7 years ago

The manual seems to deliberately avoid naming all things!

Yes, that's one way the manual achieves a brilliant economy of expression and avoids...

Ok, if it is a feature and not a bug I will reframe my mind, and I can say the dot operator is like an all-pass filter...

pkoppstein commented 7 years ago

@fadado wrote:

if it is a feature and not a bug I will reframe my mind

Thanks for the willingness to see it from another perspective.

... and I can say the dot operator is like an all-pass filter...

Yes, readers of the English-language edition of the jq documentation will have no trouble understanding references of the form "the X operator", where X is "dot", "comma", "pipe", or "query", and writing "the dot operator" rather than "the . operator" is undoubtedly sometimes easier on the eyes.

As for describing "." as an all-pass filter --- I am wondering whether the audience who will benefit from such a description is largely the same audience who will understand https://en.wikipedia.org/wiki/All-pass_filter ?

fadado commented 7 years ago

As for describing "." as an all-pass filter --- I am wondering whether the audience who will benefit from such a description is largely the same audience who will understand https://en.wikipedia.org/wiki/All-pass_filter ?

You are right, but while in XSLT we say ". is the current node", or in the shell we say ". is the current working directory", what should I say in jq?

The phrase ". is the null filter" will be ok, but null is also a type name and value; this will be a source of confusion. In SNOBOL the null string is a pattern that always matches, and has the same role as the dot filter. For example, in the following code the dot filter helps to emulate SNOBOL fence or Prolog cut:

label $fence
| F
| (. , break $fence)   # like SNOBOL fence or Prolog cut
| G

Can I say null filter? Or perhaps input value?
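
For readers who want to try the fence idiom above, here is a minimal sketch with concrete stand-ins, assuming jq 1.5 or later (which provides `label` and `break`). F is taken to be the generator `(1, 2, 3)` and G to be `. * 10`; both are placeholders chosen for illustration.

```shell
# Without the fence, G sees every output of F. Prints [10,20,30]
jq -n -c '[(1, 2, 3) | . * 10]'

# With the fence, "." lets the first value through; on backtracking,
# "break $fence" ends the labelled expression, so no further values are tried.
# Prints [10]
jq -n -c '[label $fence | (1, 2, 3) | (., break $fence) | . * 10]'
```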

ghost commented 7 years ago

Have you considered "Identity filter"? It is, after all, an identity function.

wtlangford commented 7 years ago

I think identity filter is a great idea here, though, admittedly, I'm also in the group that would understand "all pass filter".

As for the manual, I like that the main manual ignores some of the language aspects in favor of brevity and clarity. It makes it easy to jump into using jq. However, I think a second man page that focuses on the language as a language is a great idea. I know there's a lot of work going on at stack overflow, but to use that in the manual likely requires us to gain licensing from the individual authors.

Anyways, I'd like to see us split out the manual into two parts, one on jq(1) (the binary, and some basic usage), and another on jq.lang(8?), (the language, maybe also builtins)

On Sun, Feb 12, 2017, 07:24 Santiago Lapresta notifications@github.com wrote:

Have you considered "Identity filter"? It is, after all, an identity function https://en.wikipedia.org/wiki/Identity_function.


pkoppstein commented 7 years ago

@fadado - In explaining the identity filter, ".", it would be helpful to mention that it echoes each JSON value presented as input in turn. Indeed, in my opinion, the main area for improvement of the manual is explaining the stream-oriented aspect of jq. (See https://github.com/stedolan/jq/wiki/Advanced-Topics#streams)

lylemoffitt commented 7 years ago

I like "all pass filter" in part because of its intuitive interpretation, but it has a lot of namespace conflict, and should someone google it they'd be given a lot of misdirection.

Using "identity function" is a good choice, on par with "current working directory", and "current node". The problem is more of which analogy you want to go with. Analogizing with the shell would be the best choice IMO here, because of the synergy with explaining the pipe operator, similarity in formatting and operation.

I agree that "null filter" would probably be a source of confusion.

Using "input value" probably works, but then you also need to explain that it's largely unnecessary to provide an input value, since it's automatically interpreted/provided for you most of the time.

pkoppstein commented 7 years ago

@lylemoffitt - Obviously "all-pass filter" is clear to some, but even for those with a signal-processing background, might not the bit about phase change be a potential source of confusion? More importantly, two of the primary meanings of "pass" are:

To come to an end.
To decline one's turn to bid, draw, bet, compete, or play.

(Source: https://www.ahdictionary.com/word/search.html?q=pass)

lylemoffitt commented 7 years ago

@pkoppstein -- We are in agreement. That's pretty much what I was getting at. Though, your point about "pass" is important, too. I was thinking from a more common understanding, e.g. "all things pass through it". Either way, it's probably not a good way to go.

Drawing analogies to the shell and imperative/functional languages are probably the safest bets.

nicowilliams commented 7 years ago

@fadado | is more like "call". It's actually how you call functions: .foo | bar calls bar with the output of .foo as its "." (its input). "Bind" is more appropriate for EXP as $name | ..., since that creates a symbolic binding for the output value(s) of EXP (that is, $name refers to each value output by EXP, successively, but only one value at a time, and it is visible only to the expression to the right of the |).
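
Both points can be seen in a short run. A sketch assuming jq is installed; the builtin `sort` stands in for bar, and the input object is invented.

```shell
# "call": sort receives the value of .foo as its input. Prints [1,2,3]
echo '{"foo":[3,1,2]}' | jq -c '.foo | sort'

# "bind": $x takes each output of (1, 2) in turn, one value at a time. Prints [10,20]
jq -n -c '[(1, 2) as $x | $x * 10]'
```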

nicowilliams commented 7 years ago

I do like some of the suggestions here. Certainly a table of operator precedence would be nice, and some of the "laws" that @fadado proposes would be useful to include.

I too would rather not "name everything". For now anyways.

lylemoffitt commented 7 years ago

@nicowilliams

I too would rather not "name everything". For now anyways.

I'm inclined to agree with this, as there are more important issues IMO, but the push for Stack Overflow kinda necessitates that we have common pronounceable names for all the fundamental operations in jq. The rest of the discussion about what to call them should be focused on how to explain them first, and then suggest alternative operator names only by way of analogy.

Currently, all of the functions are easily searchable, and most of the operators actually have explicit names given. But, a (perhaps) surprising number are without names. Instead, they are repeatedly referred to as "the _ operator/filter", or are given no noun at all and simply referred to by their bare symbol! This latter point is really unacceptable, and places a burden on both the manual writer and the reader. Speech is the fundamental basis of reading and understanding; if you can't pronounce a thing, then you can't leverage the language processing ability in your brain towards understanding that thing, which is effectively an inhibitor since learning is all about neural activation. To wit, what do you expect people to say when they read the symbol .[] in the following quote from the manual?

Running .[] with the input [1,2,3] will produce the numbers as three separate results, rather than as a single array.

I digress...


The following are all taken from the man page. Type denotes what noun is used with a given symbol. The suggested name attempts to find something close to what people colloquially call the given operation, while also avoiding name conflicts and providing a minimum of specificity.

Type      Current Name  Suggested Name
operator  ?             try operator
filter    .             dot operator
filter    .foo          member operator
syntax    .[<string>]   index operator
syntax    .[2]          index operator
syntax    .[10:15]      slice operator
N/A       .[]           stream operator
N/A       ,             comma operator
operator  |             pipe operator

The unnamed .[]? and .foo? are not named here and should remain so, because they are really just common applications of the now so named "try operator".

nicowilliams commented 7 years ago

Type      Current Name     Suggested Name
operator  ?                try operator
special   .                identity operator
operator  .foo             object identifier-index operator
operator  ."foo"           object index operator
operator  .[EXP]           array or object index operator
operator  .[EXP:EXP]       array and string slice operator
operator  .[]              array and object value iterator operator
operator  ,                comma, or output concatenation operator
operator  |                pipe or call or apply operator
syntax    EXP as $name |   variable or symbol value binding operator
syntax    [EXP]            array constructor
syntax    {EXP:EXP, ...}   object constructor

I'm not sure that classification as "syntax" vs. "operator" makes sense. It's all syntactic. Some of these things are "operators" in the mathematical sense, but maybe all of them are (except for ., which can be thought of as the identity function). Even the binding syntax can be thought of as an operator, one that establishes a symbolic binding.
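
As a quick reference for several rows of the table, here is what each form produces on one made-up input, assuming jq is installed. The suggested names appear only in the comments; they are proposals from this thread, not official terminology.

```shell
# Hypothetical sample input used for every command below.
input='{"foo":[10,20,30,40]}'

echo "$input" | jq -c '.foo'               # identifier-index: prints [10,20,30,40]
echo "$input" | jq -c '."foo"'             # string index: prints [10,20,30,40]
echo "$input" | jq -c '.foo[1]'            # index with an expression: prints 20
echo "$input" | jq -c '.foo[1:3]'          # slice: prints [20,30]
echo "$input" | jq -c '[.foo[]]'           # iterate, then collect again: prints [10,20,30,40]
echo "$input" | jq -c '[.foo[] | .bar?]'   # "?" suppresses the indexing errors: prints []
```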

nicowilliams commented 7 years ago

There's also ."string", and a variety of other operators. Certainly a table or two would be nice.

lylemoffitt commented 7 years ago

@nicowilliams

There's also ."string" [...]

Yup. Totally missed ."string", because it's not in the header.

[...] and a variety of other operators.

Which? I believe, all remaining operators have explicit names already provided in the manual. It's not super obvious, but it is there or in the context. Double checking, these are the exceptions:

Certainly a table or two would be nice.

A table would be nice, but I think these names should also be put in the section labels. This is clear and consistent with the other operators that are named, like Addition and Array construction.

nicowilliams commented 7 years ago

@lylemoffitt Well, there's also the array-collect operator ([EXP]), the object construction operator ({<EXP>: <EXP>, ...}).

I'm going to have to learn whether the doc system supports tables...

nicowilliams commented 7 years ago

The ; indeed is not an operator. It is a separator/terminator of sorts, as follows

lylemoffitt commented 7 years ago

@nicowilliams

I'm going to have to learn whether the doc system supports tables...

Looks like the answer is no (see ronn-format). Maybe another form would do? You could change the entry format for building the manpage to something like:

f.puts "### #{entry['title']['symbol']} -- #{entry['title']['name']}\n"

And change the yaml to match:

entries:
  - title:
      name: "Index Operator"
      symbol: "`.[EXP]`"
    body: |
      You can also look up fields of an object using syntax like...

lylemoffitt commented 7 years ago

@nicowilliams

Well, there's also the array-collect operator ([EXP]), the object construction operator ({<EXP>: <EXP>, ...}).

I thought those were already named well enough by context, but thanks for adding them.


Side note: The rules for what constitutes an acceptable EXP in each of the above are different. For example, if it's 3/2, then [3/2] is fine, even though there is no such thing as a fractional index, while { a: 3/2 } will fail to compile (citing shell quoting issues of course).
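
A portable way around the restriction is to parenthesize the value. A small sketch, assuming jq is installed; it deliberately does not run the unparenthesized `{ a: 3/2 }` form, since the thread reports it as a compile error and parser behavior may differ between jq versions.

```shell
# Any expression is accepted inside the array constructor. Prints [1.5]
jq -n -c '[3/2]'

# Inside the object constructor, wrap a compound value in parentheses. Prints {"a":1.5}
jq -n -c '{a: (3/2)}'
```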

nicowilliams commented 7 years ago

Yes, there are places where not the full range of expressions is permitted, most notably the object constructor, for the subtle reason that it's impossible to avoid ambiguities in the grammar.

Thanks for checking doc support for tables. Adding that is going to be a low priority for me for now, unless someone offers a PR.

lylemoffitt commented 7 years ago

@nicowilliams -- If you're fine with my solution to the tables (or something like it), I can certainly put in that PR for you. I don't think we're set on the content yet, though.

nicowilliams commented 7 years ago

@lylemoffitt Can we get a preview of what a rendered manpage would look like?

nicowilliams commented 7 years ago

@lylemoffitt Er, actually, ronn-format does seem to support tables, since it claims that "[a]ll markdown(7) linking features are supported."

lylemoffitt commented 7 years ago

@nicowilliams

All markdown(7) linking features are supported.

That looks like they only have support for markdown's [ link text ]( link url ) and [ link text ]( #section-link ) features.

nicowilliams commented 7 years ago

@lylemoffitt Oy, yes, I misread that. But elsewhere it says:

The ronn(1) command converts text in a simple markup to UNIX manual pages.
The syntax includes all Markdown formatting features, plus conventions for
expressing the structure and various notations present in standard UNIX manpages.
nicowilliams commented 7 years ago

I tried it, and... no dice, ronn does not seem to support tables. https://github.com/rtomayko/ronn/issues/99

nicowilliams commented 7 years ago

Also, whatever is done for manpages has to work for the HTML-rendered manual as well.

nicowilliams commented 7 years ago

@lylemoffitt #1340 is a PR with some modest enhancements based on this issue and #1337.