jqlang / jq

Command-line JSON processor
https://jqlang.github.io/jq/
Other
30.23k stars 1.57k forks source link

Add exec/2: posix_spawn #3133

Open CosmicToast opened 4 months ago

CosmicToast commented 4 months ago

This is an initial MVP with quite a few things still missing (such as: better error messages, documentation, tests). Despite this, it is already feature-complete on POSIX platforms (on Windows it currently reports an "unsupported on this platform" error).

The signature is anything | exec(path; [args…]). path and all args must be strings. anything will be converted to a string if it isn't a string in memory, then piped into the process' stdin. The output is all stdout of the process. The exit code is not reported.

Technically, "path" can be a simple name and $PATH will be searched. This is because the underlying function is posix_spawnp. This can bec hanged easily.

The process does not have access to environment variables. This can be changed as well.

Piping between programs works. Here's an example to try it out: tostring | exec("seq"; [.]) | exec("wc"; "-l") Expected output when inputting numbers is that number, but it notably goes through seq, then line-counting.

CosmicToast commented 4 months ago

There are some cases of prior-art, but from what I can tell no one's implemented posix_spawn. Additionally, this should be significantly more robust. Finally, it's not bundled with any other functionality, can be merged as is, and should be trivial to gate behind a feature flag or similar.

I would appreciate testing on additional platforms (I've verified functionality + asan + ubsan + leaksan on amd64 linux and aarch64 macos). I don't have any amd64 macos or aarch64 linux machines that I'd be comfortable testing on. Windows is not yet implemented as I do not have any Windows machines. I plan to look into tests shortly (probably Friday or Sunday), and will add documentation afterwards.

If there are any issues with this PR as it is, please let me know!

CosmicToast commented 4 months ago

Forgot I should probably tag this. Updates #1614 Updates #147

1614: this is a more robust implementation of the request 147: 147 is possible using this. The specific example requested there looks like this on macos (as an example):

.releases |= map(.date = (exec("date"; ["-jf%F", .date, "+%s"]) | trim))
pkoppstein commented 4 months ago

There has historically been strong support for a "shell-out" feature for jq, but it hasn't happened for a variety of reasons, one of which is security. In particular, see PR #3092.

To avoid confusion about this, I would recommend adding the ability to make it trivially easy to enable or disable the shell-out feature as soon as possible. One option, of course, would be to conform with the spirit of PR #3092 and respect the "--sandbox" flag envisioned in the PR.

Since that PR has not been accepted, and since there has been an aversion to creating any situation that might lead to the addition of multiple command-line options, I would like to mention the possible advantages of an "--options" flag that would expect a JSON object specifying an (open-ended) set of jq options. (The --options flag could conceivably also accept a filepath pointing to a file holding such a JSON object.) Assuming that the default would be for the shell-out feature to be enabled on platforms on which it is supported, to disable it one would write something like: jq --options '{"exec": false}' ...

Btw, from these brief notes, I'm not sure whether the intent is for the output of exec to be a single JSON string, or a stream of such strings, one for each line produced by the command. In either case, I believe some justification would be helpful. Thanks.

CosmicToast commented 4 months ago

To avoid confusion about this, I would recommend adding the ability to make it trivially easy to enable or disable the shell-out feature as soon as possible.

As of currently, there is no sandbox. I can't depend on an unmerged PR… as it's not merged yet. This PR is also significantly simpler, and so may be easier to merge before, and add support for --sandbox afterward. It's trivial to add any sort of validation step at the head of the function, and exit with an error if it doesn't pass, and as such I'd consider this requirement passed.

I'm not sure whether the intent is for the output of exec to be a single JSON string, or a stream of such strings, one for each line produced by the command. In either case, I believe some justification would be helpful.

The output of exec is a single JSON string. Since this type of feature is intended for duct-taping together external filters, a single aggregated string has the most compatibility. Additionally, I expect JSON outputs to be a thing (e.g. by way of making HTTP requests). It's easier to simply exec(…) | fromjson this way. Finally, it keeps the code simple: the part responsible for aggregating the output (rather than figuring out how to buffer it by-line) is fast and spans a mere ~8 lines of code (starting with jv output =).

This is a limitation, since now, the entire output must fit in memory, but once this becomes a use-case, it feels like jq is no longer the correct tool for the job (why are you shelling in gigabytes upon gigabytes of data?).

Another design I had considered was returning an object like {stdout: "string", stderr: "string", exitCode: 0}. This could then be wrapped in builtins.jq to be equivalent to the behavior seen here. I ultimately decided against it, since (once again), the intention is more for a simpler, duct-tape style solution rather than treating this like a full FFI. This is ultimately backwards-compatible though, as a more thorough function can be introduced later with an exec/2 compatibility layer.

pkoppstein commented 4 months ago

As of currently, there is no sandbox. I can't depend on an unmerged PR… as it's not merged yet.

Yes, there is a bit of a chicken-and-egg problem, but if you were (for example) to add --sandbox first, then the onus of integration would fall on the other side.

But perhaps you missed my main point, which is that without some way to toggle the feature easily, there is (in my estimation) no chance of your hard work being merged. This would probably mean your PR will languish, or worse. I've seen it happen far too often.

Thanks for explaining the rationale for your handling of STDOUT. I think it would be a (fairly grievous) mistake not to provide a convenient way for the output to be presented as a stream (of JSON strings). jq is, after all, strongly stream-oriented (much like *nix :-), and the stream-oriented approach has the big advantage of being economical with memory, whereas the "slurping" approach has no real advantage as it's always easy to "slurp". Perhaps the resolution here is to pick one of the three main options, and then allow provide exec/3, where the third argument would allow the required variant to be specified.

CosmicToast commented 4 months ago

Yes, there is a bit of a chicken-and-egg problem, but if you were (for example) to add --sandbox first, then the onus of integration would fall on the other side.

"Onus"? Again, it's possible to merge this before --sandbox as it is now. It is not possible to merge this without first merging --sandbox if I was to integrate it. From my reckoning, --sandbox is not anywhere close to merging. I'd rather keep this PR focused on the feature, rather than integrating some completely unrelated thing.

But perhaps you missed my main point, which is that without some way to toggle the feature easily, there is (in my estimation) no chance of your hard work being merged. This would probably mean your PR will languish, or worse. I've seen it happen far too often.

Frankly, this comes off as extremely condescending. This isn't my first open source project contribution, and far from the biggest. I'd maybe be more receptive to this from a maintainer. As I mentioned, adding this is trivial, but to my reckoning, the specific manner of toggling the feature is going to be a much bigger source of bikeshedding than anything else (you already mentioned two options!). Github has a feature allowing maintainers to add commits to PRs and this PR has it enabled: I actively encourage a maintainer to add the 3 LOC corresponding to their preferred toggling mechanic to the start of the function.

In the meanwhile, the sandbox feature you claim to be a requirement for getting the PR merged is not receiving much interest from actual maintainers in that thread. So again, I find myself hard-pressed to imagine that it's going to become a strict requirement for merging.

Thanks for explaining the rationale for your handling of STDOUT. I think it would be a (fairly grievous) mistake not to provide a convenient way for the output to be presented as a stream (of JSON strings).

I disagree. Picking an arbitrary character to separate a stream (in your case, a newline) would make the interface asymmetric. To make it symmetric, I'd also have to accept a stream of tokens. This, however, makes the interface much clunkier to use (you now need a | tostring | add on the left and an | add on the right). If you look at every feature request for this type of filter, all of them show fairly small values going both in and out – that's the main utility. What's more, newlines aren't in every output, so for some use-cases this is strictly overhead. If jq supported an actual unix-style stream (i.e. it gave me an fd or something) that's a very different story, but that's just not how jq works internally.

To wit, this is another subject where I'd be much more receptive to feedback from a maintainer. However, as it is, you're coming off as pushy – I looked at the linked sandbox PR and saw similar requests that don't seem to be consistent with the existing codebase.

I think there is value in implementing feature requests that have been outstanding for over 10 years, and I think that the security implications of running untrusted code (i.e. downloaded .jq files, which is the risk here) are the same for all PLs.

pkoppstein commented 4 months ago

@cosmictoast - I'm sorry my comments came across as condescending and pushy. Since you indicated that your (mis)interpretation was partly based on my not being a maintainer, forgive me for pointing out that I am listed as such at https://github.com/jqlang/jq/blob/master/AUTHORS

However, my comments were offered not from the perspective of a maintainer, but as a jq enthusiast who has also been a long-time observer of jq's progress, and who has seen excellent contributions wither on the vine, as well as some excellent contributors becoming disenchanted.

As for the specific issues, I still think you may have misunderstood some of my points. For example, in suggesting that there be a trivial mechanism for toggling the feature on or off, I was not suggesting that you should somehow integrate the proposed --sandbox feature with yours. I was just saying that in my opinion, if you wanted to use the name "--sandbox" for the flag, you should feel free to do so.

Anyway, please don't feel obliged to respond to this post. I've made my main points and, as you say, it would be good to hear from others (such as @emanuele6 @itchyny @nicowilliams @wader ...)

CosmicToast commented 4 months ago

if you wanted to use the name "--sandbox" for the flag, you should feel free to do so

I still feel like this muddies up the PR. However, just as a demonstration, this is what integrating the --sandbox PR would look like.

Instead of this:

static jv f_exec(jq_state *jq, jv input, jv path, jv args) {
    int ret = 0;
    // …
}

You would have this:

static jv f_exec(jq_state *jq, jv input, jv path, jv args) {
    if (jq_is_sandbox(jq)) {
        jv_free(input), jv_free(path), jv_free(args);
        return jv_invalid_with_msg(jv_string("exec/2 is unavailable when jq is sandboxed"));
    }

    int ret = 0;
    // …
}

(Pardon the tabs, I tend to do style checkups as the last steps before de-drafting.) As such, I maintain that it's trivial to add. If it's really needed, I can add it after the sandbox PR is merged. However, adding a --sandbox flag right now does muddy the PR, and would create conflicts in the actual sandbox PR. If anything, I would rather it be added immediately post-merge, and then the sandbox PR can be rebased on top of main with the --sandbox flag already existing (i.e. it merely applied the policy to other entries).

I'm working on error reporting / tests / docs right now, so within a couple of hours you should have that be available here.

CosmicToast commented 4 months ago

I have initial docs and tests up, including valgrind. Tested are inputs, outputs, args. I'm still considering making exec output an object (out, err, exitcode), and adding system() as a convenience alias for exec() | .out | trim… I have some other things to do right now, but if there's interest, I can probably bang that out Sunday.

CosmicToast commented 4 months ago

I took a look at outputting a stream of tokens, but I'm not seeing how it would be done from the C side. Most generators ultimately output separate computations to produce streams, but that's not what's going on here. It could be added on top by doing an | rtrim | split("\n") | tostream or something, but that's hardly any faster.

Additionally, as expected, splitting the tokens on any input means it would become required to actually look for it in the input rather than doing buffering, and avoiding buffering (buf size of 1) to avoid consuming any input that's intended for the next token.

I.e. unless I'm mistaken, with the current design of jq, it doesn't seem to be possible.

CosmicToast commented 4 months ago

Alright, exec now returns an object like: {"out": "stdout goes here\n", "err": "stderr goes here\n", "status": 0} If the process is terminated by a signal, status is -1 and there's an additional key "signal" that gives the digit of the signal.

I also added exec/1 (empty args array) and a system alias that simply does | .out | rtrim (i.e. what most people seem to want and what I expect people to mostly use).

At this point, I think there's more or less nothing left to improve in a way that would be reasonable, so once this gets reviewed I'll do a code style pass, maybe add more tests, add myself to AUTHORS as a contributor, and recommend a squash commit.

pkoppstein commented 4 months ago

I took a look at outputting a stream ...

Thanks. Just to be clear, I don't think that not having the option to stream the output will or should be a "show-stopper"; the fate of PR #1005 corroborates your observations about the difficulty of doing so.

And for the record, I'd like to mention a use-case for such an option: random sampling from very large populations (e.g. reservoir sampling). Since jq does not currently have a PRNG, this is at best tricky in jq at present. The addition of a PRNG would certainly address the issue, so I find it intriguing that previous attempts to add a PRNG as a "built-in" have also foundered.

CosmicToast commented 4 months ago

I could probably take a look at adding a PRNG (namely PCG32) in a separate PR if there's interest. The main challenge (as mentioned) would be outputting a stream… but due to how PCG32 functions, there are things that could be done about it (I have two potential ones in mind, though I must admit I'm insufficiently familiar with the jq codebase to give an exact estimate of feasibility). Maybe sometime next week if I end up having time :)

pkoppstein commented 4 months ago

I could probably take a look at adding a PRNG (namely PCG32) in a separate PR if there's interest.

There is and has been strong interest in this. See e.g. the open ER at #677

Please also see https://github.com/jqlang/jq/pull/1260 -- both as potential inspiration and as a warning.

pkoppstein commented 4 months ago

The main challenge (as mentioned) would be outputting a stream…

If you use input as your model, you won’t have that problem. (Notice how inputs is defined in builtin.jq.)

CosmicToast commented 4 months ago

The main challenge (as mentioned) would be outputting a stream… If you use input as your model, you won’t have that problem. (Notice how inputs is defined in builtin.jq.)

This is off-topic for this PR. We can discuss it later in the PR in question if you'd like. Is there any chance we could get this one looked at first? While I do have some interest in getting a PRNG in via a focused PR, I don't want to commit the time for a feature I would get significantly less usage out of before the path to having PRs merged becomes clear.