jqlang / jq

Command-line JSON processor
https://jqlang.github.io/jq/

I/O primitives #255

Closed nicowilliams closed 9 years ago

nicowilliams commented 10 years ago

read fits in well enough (see the handles branch of my github clone of jq).

open and write should probably require jq_compile*() to provide more control over whether a jq program can call such builtins, because one very nice property of jq so far is that its programs are strictly filters: referentially transparent, with no side effects -- nice little sandboxes. It should remain that way by default, IMO.

nicowilliams commented 10 years ago

The handles branch of my github clone of jq has _read, read, _fopen, and _popen builtins; write is still missing. Also in that branch are some new command-line options.

nicowilliams commented 10 years ago

Now that while and foreach are in, and the C-coded builtins take a jq_state *, I have my I/O code rebased and working. Questions for all lurkers and active participants:

Note that for write the only sensible thing is for the file handle to be passed as an argument, with the values to write being inputs.

I tend to dislike builtins that don't use their inputs, partly because arguments are closures that generate values while an input is always just a value, and partly because it's a bit strange to ignore inputs (which is what all but write must do if the handle is an argument). Also, since most uses of read and eof will be in expressions where the file handle is produced by an earlier expression, it seems silly to make the caller pass it as an argument consisting of the expression .... But the asymmetry relative to write does bother me. Then again, maybe write should take a generator expression whose outputs are to be written to the handle specified as write's input. Both forms are useful, but handle-as-argument is more useful, I think.

Also, if read and eof take the file handle as an input then I can make read/0 and eof/0 operate on the stdin file handle, which is nice, and I could even make write/0 use stdout. Food for thought.

One compromise would be to pass the handle and the flags (see below) as one argument object, something like read({"handle":stdin, "raw":true}), and if the handle is not in the flags then use the input value as the handle. Too much polymorphism can be confusing though.
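
To make the shapes concrete, these are the three candidate call forms (all of these builtins, including stdin as a handle-producing builtin, are hypothetical at this point):

    read($handle)                          # handle as argument
    $handle | read                         # handle as input
    read({"handle": stdin, "raw": true})   # handle and flags in one argument object;
                                           # if "handle" is absent, the input is the handle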

FYI, in my branch the way things work is:

I might also add --open-files meaning all the remaining arguments are files to be made available as open file handles.

nicowilliams commented 10 years ago

@stedolan @slapresta @wtlangford @pkoppstein Thoughts?

pkoppstein commented 10 years ago

@nicowilliams asked:

Thoughts?

The case for write(handle) is very strong, and I think the symmetry argument that write(handle) should be matched with read(handle) is persuasive, especially when coupled with the fact that read is very much like range. There is also something to be said for familiarity. The fact that read is not strictly functional strikes me as another reason for preferring read(handle).

Does choosing read(handle) and write(handle) have any implications for eof? I think it does, if only for consistency and familiarity, but I realize those may not be compelling. In fact, without not/1, it's difficult to make a compelling case for eof/1. (So there's another reason for adding not/1 to the builtins :0)

Your points about being able to define read/0 as read(stdin), etc, clinch the argument for read(handle), write(handle), and eof(handle).
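
For instance, the zero-arity forms would then just be defs (treating stdin and stdout as hypothetical handle-producing builtins):

    def read: read(stdin);
    def eof: eof(stdin);
    def write: write(stdout);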

Since popen can act like a generator, and since its results are not strictly a function of the command string, I think the case for popen(COMMAND) is also very strong.

That just leaves fopen. Maybe that is an indicator that some thinking-outside-the-box is needed here.

nicowilliams commented 10 years ago

popen and fopen are not generators. They output a handle for use with read/write/eof.

nicowilliams commented 10 years ago

Whether read is functional or not depends on whether the thing being read from is part of jq's functional world -- we could be reading from a co-routine someday. I don't think that has anything to do with whether the handle is an argument...

Suppose all file handles had to be created by main(); then we could give each a name, as if it were a function. We could even pre-define N such functions which, if not bound to an open file, act like /dev/null or raise errors when used -- we could even arrange to have as many such functions as needed. This takes file handles out of the equation as an argument or input. Internally these functions might be closures that close over handle values and call internal C-coded builtins with those handles. Or they might be implemented in some way such that there's no way to use those handles in any naughty way.
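
A rough sketch of how that could look (everything here is hypothetical; file0 and file1 stand for names that main() would bind to open files, and read is the proposed builtin):

    # file0 and file1 act like zero-argument functions yielding handles bound at
    # startup; the program never constructs or inspects a raw handle value.
    [ (file0 | read), (file1 | read) ]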

pkoppstein commented 10 years ago

@nicowilliams wrote:

popen and fopen are not generators. They output a handle for use with read/write/eof.

So the resolution of all these issues is staring us in the face: jq does not need to have handles! Or rather, since jq now has foreach/break, it seems to me there is no longer a sufficiently compelling reason to have them in jq. If I've overlooked something, then let's try to address the issue without introducing a new jq type.

If handles are introduced, then something like jqtype would also be needed, as I expect fopen|type will always just be null.

nicowilliams commented 10 years ago

@pkoppstein

popen and open (or fopen) could generate for reading, and consume generators' outputs for writing. This is true. However that greatly limits the power of the system as it's then not possible to do things like what the Unix paste(1) command does.

Having handles as defs would allow that, but it'd still be limiting, as choosing a handle to read from or write to in a data-dependent way then requires using if ... then ... else ... end.

Having handles as names indexing a set of open handles (like POSIX, for example) works much better, but then we have to decide whether handles are inputs or arguments and all that jazz.

We could do a combination of some or all of these: readf would generate outputs by reading the file named by its input, while read would read from the handle -- or we could say that handles are a non-string type, so that read opens a file if its input is a string and otherwise interprets the input as a handle.
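
Sketched out (readf and read as used here are still hypothetical):

    "data.json" | readf    # input is a file name: open it and generate its contents
    $handle | read         # input is an already-open handle: read from it
    # or, if handles are a non-string type, a single read that opens its input
    # when it's a string and otherwise treats it as a handle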

There are many options. I think I'd rather not close the door on any of them yet. But I want to avoid confusion too.

pkoppstein commented 10 years ago

@nicowilliams wrote:

.... it's then not possible to do things like what the Unix paste(1) command does.

paste(1)??? That just takes file names as arguments. jq's paste could take both file names and streams as inputs, so I'm afraid I don't understand the point you're making.

Anyway, I'm glad you're not closing the door on alternatives to introducing a new jq type.

nicowilliams commented 10 years ago

@pkoppstein I want to be able to write:

while(($handle_0 | eof | not) and ($handle_1 | eof | not)) |
    [($handle_0|read), ($handle_1|read)] | write($stdout)

This must be online. None of this slurp one file, then the other, then paste.

EDIT: Here I'm assuming read is not a generator. EDIT: If read is a generator, there's always limit(1; read).

nicowilliams commented 10 years ago

I'm leaning towards: handles are arguments, always, except that they are outputs for handle-creating functions, of course.

pkoppstein commented 10 years ago

@nicowilliams -- How about read( ARRAY_OF_RESOURCES ) as a generator of [ line1, line2, ... ] until all resources have been exhausted?
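
For example (hypothetical, with the resources named by file name here):

    read(["a.txt", "b.txt"])   # emits ["a line 1", "b line 1"], ["a line 2", "b line 2"], ...
                               # until both resources are exhausted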

nicowilliams commented 10 years ago

Alternating this way is not the only way one might want to do it.

pkoppstein commented 10 years ago

Ergo, in the venerable tradition of awk: getline(RESOURCE).

stedolan commented 10 years ago

I'm not really sure that read and write and file handles are the correct abstractions. Almost all usage I've seen in simple scripts (the sort of thing you might do with jq) is about reading in an entire file (line by line or JSON object by JSON object) or dumping out a pile of data.

Low-level read/write calls might be useful in some special scenarios, but the functions that I think would be useful most of the time would be something that takes a filename and produces its contents line by line, and something that takes a filename and writes its input to that file. Explicit opening/reading/writing/closing seems a bit too much effort for a one-line jq script.
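
Something in that spirit might look like this (readfile and writefile are purely illustrative names, not real builtins):

    readfile("requests.json")     # generate the file's contents value by value
    | select(.status >= 500)
    | writefile("errors.json")    # write each input value to the named file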

nicowilliams commented 10 years ago

@stedolan I want to be able to read values in an online fashion from different file handles, particularly stdin. And I want to be able to read raw and cooked from the same handle even.

pkoppstein commented 10 years ago

@stedolan - Thank you for chiming in. And thank you for jq!

As you must have gathered from recent developments and discussions, there are a bunch of us out here in the Cloud who really like jq and would like it to continue to evolve in the direction of more functionality. My own view is that jq should (at least) be able to do whatever people usually do with sed/awk/grep -- both because that's more or less what the jq splash page implies, but more importantly because the world needs a JSON-oriented tool that's as useful as sed/awk/grep and as sleek and elegant as jq.

That's why I'm quite comfortable with having the equivalent of awk's getline, which gives @nicowilliams the functionality he's looking for but without the C-style handles.

(If there were a proposal for JSON-object-style handles, I'd be all ears :-)

nicowilliams commented 10 years ago

@stedolan I agree that high-level generators that take file names as input and then read and output their contents are a desirable goal. I'll add those.

This, then, is the design:

(@stedolan is not a fan of adding more command-line arguments, but we need some if we're going to allow opening files from within jq programs, as today jq is a very nice little sandbox -- I don't want to surprise anyone by making it not a sandbox by default! In any case, we'll be able to rationalize other options away with I/O primitives, though we might only do that in a jq2 executable.)

In all cases the readers will generate. Callers that want one item should call first(stdin) or the like, where first will be defined much like limit.
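
One way to write first is with label/break, much as limit is defined:

    def first(f): label $out | f | ., break $out;

    # e.g. first(stdin) would take exactly one value from the stdin reader.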

pkoppstein commented 10 years ago

In the meantime, could we please have system/1 (like awk)? This would resolve (or at least mitigate) several outstanding issues (notably #147, #492, #503).

Whether it is a generator or returns an array does not much matter to me; and if it is reimplemented later using a more generic framework, so be it. Thank you.

zmwangx commented 10 years ago

@pkoppstein I'd really like to see something like awk's system. That would at least solve a range of formatting problems for me, where jq's built-in format strings fall short.

nicowilliams commented 9 years ago

With a streaming parser option on deck, a way to read from stdin in a jq program is becoming urgent.

A thought occurs: reading from stdin (or even a file handle passed by the C program) is not really appropriate, in part because in an invocation like curl ... | jq . file0 file1 .. fileN there are N+1 inputs, not a single file handle. And interleaving with jq's reading from them can even be desirable.

A first step might be to add input and inputs builtins that use a C caller-provided callback to read one more input (input) or stream all remaining inputs (inputs).
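
Assuming semantics along those lines, usage might look like:

    jq -n '[inputs]' file0.json file1.json        # stream every input into one array
    jq -n 'input, inputs' file0.json file1.json   # read one input, then the rest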

nicowilliams commented 9 years ago

Superseded by #925.