Closed nicowilliams closed 9 years ago
The `handles` branch of my github clone of jq has `_read`, `read`, `_fopen`, and `_popen` builtins; `write` is still missing. Also in that branch are some new command-line options.

Now that `while` and `foreach` are in, and that the C-coded builtins take a `jq_state *`, I have my I/O code rebased and working. Questions for all lurkers and active participants:
- `eof`?
- `read`?
- `fopen`?
- `popen`?

Note that for `write` the only sensible thing is for the file handle to be passed as an argument, with the values to write being inputs.
I tend to dislike not using inputs in builtins, partly because arguments are closures that generate values, while input is always just a value, and partly because it's a bit strange to ignore inputs (which is what all but `write` must do if the handle is an argument). Also, since most uses of `read` and `eof` will be in expressions where the file handle is produced by an earlier expression, it seems silly to make the caller pass it as an argument consisting of the expression `.`... But the asymmetry relative to `write` does bother me... But then, maybe `write` should take a generator expression whose outputs are to be written to the handle specified as `write`'s input. Both forms are useful, but handle-as-argument is more useful, I think.
Also, if `read` and `eof` take the file handle as an input then I can make `read/0` and `eof/0` operate on the stdin file handle, which is nice, and I could even make `write/0` use stdout. Food for thought.

One compromise would be to pass the handle and the flags (see below) as one argument object, something like `read({"handle": stdin, "raw": true})`, and if the handle is not in the flags then use the input value as the handle. Too much polymorphism can be confusing though.
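A minimal sketch of that compromise, assuming a hypothetical C-coded `_read/2` primitive and a `stdin` handle builtin (neither exists in released jq):

```jq
# Hypothetical: dispatch on whether the options object carries its own handle.
# _read/2 and stdin/0 are assumed primitives, not real jq builtins.
def read(opts):
  if (opts | has("handle"))
  then _read(opts.handle; opts)   # handle supplied in the options object
  else _read(.; opts)             # fall back to the input value as the handle
  end;

# read({"handle": stdin, "raw": true})   # explicit handle
# stdin | read({"raw": true})            # handle as input
```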
FYI, in my branch the way things work is:

- `main()` adds whatever I/O handles it wants to the jq VM, and it will always add the real stdin, stdout, and stderr as such; unless `main()` says so, the jq program won't get to open any files or execute any programs
- new command-line options: `--allow-open`, `--allow-write`, `--allow-exec`, and `--args` (remaining arguments are strings, not input files); I might also add `--open-files`, meaning all the remaining arguments are files to be made available as open file handles
- `read` and `write` will take flags for dealing with whether to do "raw" or cooked (JSON text) I/O
- `write` will also take flags for dealing with whether to use color, whether to use compact encoding, ...
- `read` has a flag for "slurp", but that seems silly: callers could just as well collect the outputs into an array (or string) using `[]` and/or `reduce`
- `read` is a generator

@stedolan @slapresta @wtlangford @pkoppstein Thoughts?
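The point about "slurp" being redundant can be shown with stock jq (assuming a jq 1.5+ binary on `PATH`, where `inputs` eventually landed): collecting a generator's outputs needs no special flag.

```shell
# Collect all inputs into an array, two equivalent ways: [...] and reduce.
printf '1\n2\n3\n' | jq -nc '[inputs]'                            # prints [1,2,3]
printf '1\n2\n3\n' | jq -nc 'reduce inputs as $x ([]; . + [$x])'  # prints [1,2,3]
```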
@nicowilliams asked:

> Thoughts?
The case for write(handle) is very strong, and I think the symmetry argument that write(handle) should be matched with read(handle) is persuasive, especially when coupled with the fact that `read` is very much like `range`. There is also something to be said for familiarity. The fact that `read` is not strictly functional strikes me as another reason for preferring read(handle).

Does choosing read(handle) and write(handle) have any implications for `eof`? I think it does, if only for consistency and familiarity, but I realize those may not be compelling. In fact, without not/1, it's difficult to make a compelling case for eof/1. (So there's another reason for adding not/1 to the builtins :-)

Your points about being able to define read/0 as read(stdin), etc., clinch the argument for read(handle), write(handle), and eof(handle).

Since popen can act like a generator, and since its results are not strictly a function of the command string, I think the case for popen(COMMAND) is also very strong.
That just leaves `fopen`. Maybe that is an indicator that some thinking-outside-the-box is needed here.
`popen` and `fopen` are not generators. They output a handle for use with read/write/eof.
Whether read is functional or not depends on whether the thing being read from is part of jq's functional world -- we could be reading from a co-routine someday. I don't think that has anything to do with whether the handle is an argument...
Suppose all file handles had to be created by `main()`; then we could give each a name, as if it were a function. We could even pre-define N such functions which, if not bound to an open file, are like /dev/null, or raise errors when used -- we could even arrange to have as many such functions as needed. This takes file handles out of the equation as an argument or input. Internally these functions might be closures closing over handle values that then call internal C-coded builtins with those handles. Or they might be implemented in some way such that there's no way to use those handles in any naughty way.
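A sketch of that idea, assuming hypothetical `_handle/1` and `_read/1` C primitives (none of these names exist in jq):

```jq
# Hypothetical: main() pre-binds N handle-valued functions; ones not bound
# to an open file behave like /dev/null. The raw handle value never appears
# in the jq program, only these function names.
def fh0: _handle(0);
def fh1: _handle(1);

# A program then names handles like functions:
# [fh0 | _read], [fh1 | _read]
```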
@nicowilliams wrote:

> `popen` and `fopen` are not generators. They output a handle for use with read/write/eof.
So the resolution of all these issues is staring us in the face: jq does not need to have handles! Or rather, since jq now has `foreach`/`break`, it seems to me there is no longer a sufficiently compelling reason to have them in jq. If I've overlooked something, then let's try to address the issue without introducing a new jq type.

If handles are introduced, then something like jqtype would also be needed, as I expect `fopen|type` will always just be null.
@pkoppstein

`popen` and `open` (or `fopen`) could generate for reading, and consume generators' outputs for writing. This is true. However, that greatly limits the power of the system, as it's then not possible to do things like what the Unix `paste(1)` command does.

Having handles as defs would allow that, but it'd still be limiting, as choosing a handle to read from / write to in a data-dependent way then requires using `if ... then ... else ... end`.
Having handles as names indexing a set of open handles (like POSIX, for example) works much better, but then we have to decide whether handles are inputs or arguments and all that jazz.
We could do a combination of some or all of these: `readf` -> generate outputs by reading the file named by the input, while `read` -> reads from the handle -- or we could say that handles are non-string types, so that `read` opens a file if the input is a string and else interprets the input as a handle.

There are many options. I think I'd rather not close the door to any of it yet. But I want to avoid confusion too.
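The string-or-handle polymorphism could look something like this sketch, assuming hypothetical `_fopen/0` and `_read/0` primitives and an assumed non-string "handle" type (none of these exist in jq):

```jq
# Hypothetical: a string input names a file; anything else is an open handle.
def read:
  if type == "string"
  then _fopen | _read   # open the named file, then read from it
  else _read            # already a handle: read from it directly
  end;
```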
@nicowilliams wrote:

> .... it's then not possible to do things like what the Unix paste(1) command does.
`paste(1)`??? That just takes file names as arguments. jq's paste could take both file names and streams as inputs, so I'm afraid I don't understand the point you're making.

Anyway, I'm glad you're not closing the door on alternatives to introducing a new jq type.
@pkoppstein I want to be able to write:
```jq
while(($handle_0 | eof | not) and ($handle_1 | eof | not)) |
  [($handle_0 | read), ($handle_1 | read)] | write($stdout)
```
This must be online. None of this slurp one file, then the other, then paste.
EDIT: Here I'm assuming `read` is not a generator.

EDIT: If `read` is a generator, there's always `limit(1; read)`.
I'm leaning towards: handles are arguments, always, except that they are outputs for handle-creating functions, of course.
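The `limit(1; ...)` escape hatch mentioned in the edits above works with any generator in stock jq (assuming a jq binary on `PATH`):

```shell
# limit/2 takes at most the first N outputs of a generator.
jq -nc '[limit(1; 1,2,3)]'    # prints [1]
jq -nc 'limit(1; range(10))'  # prints 0
```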
@nicowilliams -- How about read( ARRAY_OF_RESOURCES ) as a generator of [ line1, line2, ... ] until all resources have been exhausted? Alternating this way is not the only way that one might want to do it.

Ergo, in the venerable tradition of awk: `getline(RESOURCE)`.
I'm not really sure that read and write and filehandles are the correct abstractions. Almost all usage I've seen in simple scripts (the sort of thing that you might do with jq) are about reading in an entire file (line by line or JSON-object by JSON-object) or dumping out a pile of data.
Low-level read/write calls might be useful in some special scenarios, but the functions that I think would be useful most of the time would be something that takes a filename and produces its contents line by line, and something that takes a filename and writes its input to that file. Explicit opening/reading/writing/closing seems a bit too much effort for a one-line jq script.
@stedolan I want to be able to read values in an online fashion from different file handles, particularly stdin. And I want to be able to read raw and cooked from the same handle even.
@stedolan - Thank you for chiming in. And thank you for jq!

As you must have gathered from recent developments and discussions, there are a bunch of us out here in the Cloud who really like jq and would like it to continue to evolve in the direction of more functionality. My own view is that jq should (at least) be able to do whatever people usually do with sed/awk/grep -- both because that's more or less what the jq splash page implies, but more importantly because the world needs a JSON-oriented tool that's as useful as sed/awk/grep and as sleek and elegant as jq.

That's why I'm quite comfortable with having the equivalent of awk's `getline`, which gives @nicowilliams the functionality he's looking for but without the C-style handles.

(If there were a proposal for JSON-object-style handles, I'd be all ears :-)
@stedolan I agree that high-level generators that take file names as input and which then read and output their contents are a desirable goal. I'll add those.
This, then, is the design:

- `stdin`, `stdout`, and `stderr` will be builtins that do the obvious thing, with `stdin` generating results from stdin. All will have a form with an argument modifying how the I/O is done: raw vs JSON text, and for output whether to colorize, ... I trust this is uncontroversial.
- `read`/`write`/`eof` will take a handle as argument (and ignore their inputs); `read` will be a generator
- `fopen`/`popen` will generate open file handles; on backtrack they will freopen /dev/null for them (these will only work if `--allow-read`/`--allow-write`/`--allow-exec` are given on the command-line)
- variants of `fopen`/`popen` that generate data instead of handles when opening for read (these are the utilities that @stedolan indicates he'd like to see)
- `--open-read NAME FILE`, `--open-write NAME FILE`, `--open-append NAME FILE`, and `--fdopen NAME FD` command-line arguments which will create the handles as builtins (e.g., if you don't want to `--allow-*`). E.g., `jq -n --open-read foo /tmp/foo 'foo'` will read everything in `/tmp/foo`, as if one had run `jq . /tmp/foo`.
- `stdout`, `stderr`, and a `write` that takes a generator as argument may be added as well; these will write a generator argument's outputs

In all cases the readers will generate. Callers that want one item should call `first(stdin)` or the like, where `first` will be defined much like `limit`.

(@stedolan is not a fan of adding more command-line arguments, but we need some if we're going to allow opening files from within jq programs, as today jq is a very nice little sandbox -- I don't want to surprise anyone by making it not a sandbox by default! In any case, we'll be able to rationalize other options away with I/O primitives, though we might only do that in a jq2 executable.)
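`first` can indeed be defined much like `limit`; in later jq releases it is written with `label`/`break`. A quick check, assuming a jq 1.5+ binary on `PATH` (using the name `my_first` to avoid shadowing the eventual builtin):

```shell
# first/1 via label/break: yield the generator's first output, then bail out.
jq -nc 'def my_first(f): label $out | f | ., break $out; [my_first(10,20,30)]'
# prints [10]
```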
In the meantime, could we please have `system/1` (like awk)? This would resolve (or at least mitigate) several outstanding issues (notably #147, #492, #503). Whether it is a generator or returns an array does not much matter to me; and if it is reimplemented later using a more generic framework, so be it. Thank you.
@pkoppstein I'd really like to see something like awk's `system`. That would at least solve a range of formatting problems for me, where jq's built-in format strings fall short.
With a streaming parser option on deck, a way to read from `stdin` in a jq program is becoming urgent.
A thought occurs: reading from stdin (or even a file handle passed by the C program) is not really appropriate, in part because in an invocation like `curl ... | jq . file0 file1 .. fileN` there are N+1 inputs, not a single file handle. And interleaving with jq's reading from them can even be desirable.
A first step might be to add `input` and `inputs` builtins that use a C caller-provided callback to read one more input (`input`) or stream all inputs (`inputs`).
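These did land (in jq 1.5) as `input` and `inputs`; with `-n` the first input is not consumed implicitly, so the program controls all reading. A quick check, assuming a jq 1.5+ binary on `PATH`:

```shell
# input reads one more value; each call consumes the next one.
printf '1\n2\n3\n' | jq -nc 'input'           # prints 1
printf '1\n2\n3\n' | jq -nc '[input, input]'  # prints [1,2]
```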
Superseded by #925.
`read` fits in well enough (see the `handles` branch of my github clone of jq). `open` and `write` should probably require jq_compile*() to provide more control over whether a jq program can call such builtins... because one very nice property of jq so far is that its programs are strictly filters, referentially transparent, with no side-effects -- nice little sandboxes. It should remain that way by default, IMO.