jqlang / jq

Command-line JSON processor
https://jqlang.github.io/jq/

ER: curl #650

Closed pkoppstein closed 9 years ago

pkoppstein commented 9 years ago

There has been some discussion about endowing jq with curl-like support for retrieving information (and especially JSON) from remote resources. In order to expedite enhancements in this direction (as well as to provide support for reading from LOCAL files), I would like to propose that jq use the "easy interface" of libcurl. To this end, I am appending a stand-alone C program, parts of which could be used for integrating jq and libcurl to support synchronous retrieval.

I realize that libcurl may be overkill, so if there are better alternatives that would enable similar support for synchronous retrieval of JSON documents (both locally and remotely) to be provided expeditiously, that is fine. However, I'd also like to point out that libcurl could be used in the short run without making a long-term commitment to it: stability is only needed for the jq filters.

Still, libcurl has its advantages. It is widely used and will presumably be supported indefinitely. It is sophisticated and will provide a path for jq to follow as its own capabilities become more sophisticated.

Motivation

One of the reasons for requesting this enhancement is that I am using a RESTful collection of tens of thousands of JSON documents, linked together by ids in a kind of "graph database". For example, an id field in one document might be "1622", referring to another JSON document at

http://api.legiscan.com/?key=InsertKeyHere&op=getSponsor&id=1622

Similarly, there are references within this "graph database" to entities available elsewhere on the web as JSON objects.

The ability to query local files is also a major motivation.

Non-JSON resources

For the sake of simplicity, the following assumes that the resource being queried will return a single JSON entity.

Specification

In order to accommodate future enhancements, I would tentatively propose that all parameters except the URL and timeout be passed in via a JSON object, e.g.

 { "username": "jqUser", "password": "secret", 
   "headers": { "User-Agent": "jq" } }

One possibility along these lines would be for jq initially to support two jq filters:

def curl(obj; timeout):   # timeout in seconds; input is a string specifying the URL, with query parameters

def curl(obj): curl(obj; 10);

These would of course either fail (or return null) or return a JSON entity.
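
To make the intended usage concrete, here is a sketch of how the proposed filters might be invoked on the "graph database" example above. (The `curl` builtin is hypothetical, and the `sponsor_id` field name is invented for illustration; the URL follows the LegiScan example.)

```jq
# Hypothetical usage of the proposed curl/1 filter.
# Input: a JSON document containing an id that links to another document.
.sponsor_id as $id
| "http://api.legiscan.com/?key=InsertKeyHere&op=getSponsor&id=\($id)"
| curl({"headers": {"User-Agent": "jq"}})
```

This would emit the linked JSON document, or fail (or return null), as described above.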

Questions

What should the name of the jq filter for retrieving JSON entities be?

How should non-JSON resources be supported?

First Steps

The following is a standalone C program with functions that could be used to integrate jq with libcurl. Please feel free to use it.

// The following is based on http://curl.haxx.se/libcurl/c/getinmemory.html

// gcc curl.c -lcurl    (libraries after sources, for linkers that are order-sensitive)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <curl/curl.h>

struct MemoryStruct {
  char *memory;
  size_t size;
};

static size_t
WriteMemoryCallback(void *contents, size_t size, size_t nmemb, void *userp)
{
  size_t realsize = size * nmemb;
  struct MemoryStruct *mem = (struct MemoryStruct *)userp;

  /* assign realloc's result to a temporary so the original pointer
   * is not leaked if the allocation fails */
  char *grown = realloc(mem->memory, mem->size + realsize + 1);
  if (grown == NULL) {
    /* out of memory! */
    fprintf(stderr, "WriteMemoryCallback: not enough memory (realloc returned NULL)\n");
    return 0;
  }
  mem->memory = grown;

  memcpy(&(mem->memory[mem->size]), contents, realsize);
  mem->size += realsize;
  mem->memory[mem->size] = 0;

  // To verify mem->memory has the string:
  // printf("%s\n", mem->memory); 
  return realsize;
}

// timeout in seconds
long jv_curl(char *url, long timeout, void *userp) {
  struct MemoryStruct *chunk = (struct MemoryStruct *)userp;

  struct curl_slist *headers = NULL;

  CURL *curl = curl_easy_init();
  if (curl) {
    CURLcode res;
    curl_easy_setopt(curl, CURLOPT_URL, url);
    /* follow redirection */ 
    curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);

    curl_easy_setopt(curl, CURLOPT_TIMEOUT, timeout); 

    /* github requires User-Agent so for now ...*/
    headers = curl_slist_append(headers, "User-Agent: jq");
    curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);

    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, WriteMemoryCallback);
    /* pass our 'chunk' struct to the callback function */ 
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, (void *)chunk);

    /* Perform the request, res will get the return code */ 
    res = curl_easy_perform(curl);

    /* Check for errors */ 
    if(res != CURLE_OK) {
      fprintf(stderr, "curl_easy_perform() failed: %s\n",
              curl_easy_strerror(res));
    } else {
      /* chunk->memory now points to a memory block that is chunk->size
       * bytes big and contains the remote file. */
      printf("TEST: %zu bytes retrieved\n", chunk->size);
      printf("TEST: %s\n", chunk->memory);

    }
    /* always cleanup */ 
    curl_easy_cleanup(curl);
    curl_slist_free_all(headers);
  }
  return (long)chunk->size;
}

int main() {
  struct MemoryStruct chunk;
  chunk.memory = malloc(1);  /* will be grown as needed by the realloc above */ 
  chunk.memory[0] = '\0';    /* so printing an empty result is safe */
  chunk.size = 0;            /* no data at this point */ 

  curl_global_init(CURL_GLOBAL_DEFAULT);

  char *url = 
    // "http://api.legiscan.com/?key=InsertKeyHere&op=getSponsor&id=1622";
       "http://apicommons.org/api-commons-manifest.json";
  printf("%ld bytes retrieved\n", jv_curl(url, 10L, &chunk));
  printf("%s\n", chunk.memory); 

  free(chunk.memory);
  curl_global_cleanup();
  return 0;
}
nicowilliams commented 9 years ago

Hey, this is very cool! Thanks! I'll do my best to integrate this soon. I can probably do this a bit over break, and if not the week after. It's also a kick in the pants to finish the module system.

Also, I've started on a streaming parser, which will help with huge JSON text inputs...

nicowilliams commented 9 years ago

I'm thinking that for GET everything should be passed as an input, URL, timeout, everything, otherwise we get nasty cross-product behaviors... A /1 builtin could take a stream of URLs and GET them all. We could call this GET/0 and GET/1.

For POST/PUT the interface design gets trickier.

Also, these have to be builtins, but I really want them in their own module namespace, partly so we can do something about authorizing programs to use modules, so that we can retain the current behavior of sandboxing by default. I'm not sure how to do this yet, but I really like the idea that jq programs are filters with no more harmful side-effects than local resource consumption (of course, a malicious jq program could do more, such as observe timing effects to steal secrets, for example, but let's leave that aside for now), so I'd like that to continue to be the default.

But again, I'm not sure how best to express sandbox vs. not-sandbox. What do you think?

pkoppstein commented 9 years ago

@nicowilliams asked:

What do you think?

It seems to me that it might be best to introduce support for these enhancements in (well-thought-out) phases, with Phase 1, for example, being confined to GET.

To me, GET is similar to env -- that is, no special flag is required to use env and I don't see any real need to add a special flag for GET in general, or for GET with non-"file:///" requests. If the clamor for such a flag (or flags) becomes deafening, it can always be added after Phase 1.

As for factorization -- I don't understand the concern about cross-product behaviors. Anyway, my main hope is that traversing a "graph database" will be straightforward. For this and other scenarios, the only thing that typically changes is the URL (by which I mean to include all the forms allowed by curl). That's why I suggested it be the input. (However, as I mentioned, I am not advocating conformance to libcurl for its own sake. For example, I could see the input being an array composed of bits of a curl URL.)

As for the timeout, that seems like something that might need to be tweaked independently of everything else, and possibly even dynamically.

nicowilliams commented 9 years ago

What do you think?

[...] To me, GET is similar to env -- that is, no special flag is required to use env and I don't see any real need to add a special flag for GET in general, or for GET with non-"file:///" requests. If the clamor for such a flag (or flags) becomes deafening, it can always be added after Phase 1.

Some unfortunate servers use GET inappropriately...

Also, for the I/O builtins I have a variety of options: by default you get to read from stdin, write to stdout/stderr, but the jq program can be given the right to open files for reading, to open them for writing, and to popen() (execute stuff). HEAD/GET would be like opening arbitrary files, so that falls into the open-files-for-reading permission, and POST and friends fall into the open-files-for-writing permission.

I think perhaps that is as fine-grained as we can hope to get.

As for factorization -- I don't understand the concern about cross-product behaviors. [...]

It's just that jq function-call argument lists generally result in cartesian-product behavior that often surprises users. A GET/0 that takes URL, headers, and options in an input object (or perhaps an array of headers, options, and URL) would be easy to use. A GET/1 that takes as an argument a stream of URLs would GET each of them with the headers and options from `.`.

Basically, it's best to think of jq def function arguments as streams/futures and of . as the "plain" arguments of a jq def function, with many inputs == as many calls. We should design interfaces with this pattern in mind, and perhaps some syntactic sugar might help.
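
The surprise alluded to above can be seen with a toy definition in standard jq (no new builtins needed):

```jq
def pair(a): [., a];

# Both the input and the argument are streams, so the call
# produces their cartesian product:
#   (1,2) | pair(10,20)
# emits [1,10], [1,20], [2,10], [2,20] -- four outputs, not two.
```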

pkoppstein commented 9 years ago

Python "requests" is rightly well-regarded, and looking over the documentation (http://docs.python-requests.org/en/latest), a few thoughts occur to me.

First, in a nutshell, Python requests work like this:

>>> r = requests.get('https://api.github.com/user', auth=('user', 'pass'))
>>> r.status_code
200
>>> r.headers['content-type']
'application/json; charset=utf8'
>>> r.encoding
'utf-8'
>>> r.text
u'{"type":"User"...'
>>> r.json()
{u'private_gists': 419, u'total_private_repos': 77, ...}

Notice that the "what" (the URL) is kept separate from the "how" (here, the authorization information). More significantly, the returned value neatly encapsulates a ton of information.

Perhaps jq should have something like Requests::get as well as a slightly higher-level function (Requests::json) for URLs that are supposed to return JSON. For Requests::json, jq's try/catch mechanism could be used to provide information about errors.

Something like:

# input is the URL
def Requests::json(obj):
  Requests::get(obj) as $r
  | $r.status_code
  | if . == 200 then $r.json
    elif . == 400 then error("HTTP ERROR 400: Bad request")
    elif . == 401 then error("HTTP ERROR 401: Unauthorized")
    else error("HTTP STATUS \(.)")
    end;
nicowilliams commented 9 years ago

@pkoppstein

Given that in jq the best way to think about function arguments is as follows:

I think I'd prefer something like this:

["<URL>", {<headers>}, {<options>}] | HTTP::GET

or something like that. Among the available options would be whether to treat the response body as raw, raw and line-oriented (raw and slurped), or as JSON (and whether to stream and/or slurp), though perhaps we should separate jq from curl options, so we'd have:

["<URL>", {<headers>}, {<curl options>}, {<jq options>}] | HTTP::GET

If we make anything a closure argument, it should be the URL, in which case we'd have a "get all of these URLs" function:

[{<headers>}, {<curl options>}, {<jq options>}] | HTTP::GET(a_list_of_URLs)

Now, how to output the response??

The caller might not want to "slurp" the entire response body, instead outputting a value per-item in the response (e.g., if the response is a jq-like JSON text sequence like 0\n1\n2\n you might want HTTP::GET to output a stream of these values: 0, 1, 2).

But there's more than the response body: there's also the headers and status code, and even things like the server's certificate and cert chain, trust anchor to which the server cert was validated, ...

One option would be to output headers and other response metadata as the first value, then the response body as zero, one, or more values as appropriate. Another is to slurp the response body and then output [<response metadata>, <response body>]. Another is to output a stream of [<URL>, <response metadata>, <value from response body>], one per-value in the response body. These could all be options to the jq function, or we could have a different function for each of these options.
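
As a sketch of the first option (metadata as the first output), a caller might consume it like this, assuming a hypothetical HTTP::GET with a metadata_first option and a metadata object carrying a status field:

```jq
# Hypothetical: collect all outputs, treat the first as metadata
# and the rest as body values.
[ HTTP::GET ] as $out
| $out[0] as $meta
| if $meta.status == 200 then $out[1:][]
  else error("HTTP STATUS \($meta.status)")
  end
```

Note that collecting with `[...]` defeats the streaming benefit, which is one argument for checking the metadata in a closure instead.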

nicowilliams commented 9 years ago

To flesh that out a bit more, we might have:

def simple_GET:
    {slurp: true, stream: false, raw: false, metadata_first: false, metadata_always: false} as $jq_opts |
    {timeout: 3} as $curl_opts |
    {Accept: "application/json"} as $request_headers |
    [., $request_headers, $curl_opts, $jq_opts] | HTTP::GET;

# Get /this, /that, and /other relative to ., whatever that is, with one output per-resource
. + ("/this", "/that", "/other") | simple_GET
nicowilliams commented 9 years ago

Also, we might have a form where a closure is passed that decides whether the metadata is acceptable and returns true or false (or calls error):

def simple_GET:
    def check_it:
        ...;
    {check_metadata: true, slurp: true, stream: false, raw: false, metadata_first: false, metadata_always: false} as $jq_opts |
    {timeout: 3} as $curl_opts |
    {Accept: "application/json"} as $request_headers |
    [., $request_headers, $curl_opts, $jq_opts] | HTTP::GET(check_it);

Now we see that we have an ambiguity w.r.t. the other HTTP::GET/1 mentioned earlier, though the input array's length and the $jq_opts resolves it (though that feels like a hack).

pkoppstein commented 9 years ago

@nicowilliams observed:

Now we see that we have an ambiguity w.r.t. the other HTTP::GET/1 mentioned earlier ...

Exactly! HTTP::GET(a_list_of_URLs) in my opinion was not a good idea to begin with.

Your GET/0 and simple_GET/0 have it exactly right w.r.t. URLs.

nicowilliams commented 9 years ago

@pkoppstein This line of argument also leads one to apply the same design to the regexp builtins, no? Have we made a mistake with those?

But for 1.6 my hope is to move all builtins into appropriate builtin modules, with all the ones that people expect from 1.5 in a "jq" module that could be imported as import jq {version:1.5};. So there's no harm to our builtin design mistakes. We'll be able to fix them later.

pkoppstein commented 9 years ago

@nicowilliams asked:

This line of argument also leads one to apply the same design to the regexp builtins, no? Have we made a mistake with those?

We have not made a mistake with regexp, which can be thought of as adhering to a "substrate-as-input" design, which calls for URLs-as-input in the case of filters supporting HTTP GET. That is, "STRING | regexp(RE)" is entirely analogous to "URL | get( OBJ )".

By the way, since libcurl is not restricted to HTTP/HTTPS, and since I'm hoping that the new functionality will support the "file URI scheme" (file:///), I'm not sure that emphasizing HTTP and GET in the module or function names is such a good idea.

nicowilliams commented 9 years ago

In <string> | regexp(<re>), <re> isn't an input but a stream of regexps -- it looks like an argument, but it can become a cartesian product. OTOH, {subject: <string>, re: <re>} | regexp has the same problem anyway, so it really all comes down to: what should be the input ., and which other things should be arguments. That requires thinking about what filters one might want to build, but in practice we can flip things around pretty easily anyway (i.e., given def foo(a): ...; we can write def bar(b): . as $dot | b | foo($dot); to define the "flipped" version of foo). OK, good.

EDIT: As I remember, @pkoppstein first pointed out the ability to "flip" functions quite a while back.
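
A minimal, runnable illustration of the flip, with a toy foo (standard jq):

```jq
def foo(a): [a, .];

# The flipped version swaps the roles of input and argument:
def bar(b): . as $dot | b | foo($dot);

# 1 | foo(2)  and  2 | bar(1)  both produce [2, 1].
```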

As to all the protocols that curl supports, you're quite right, but there will be cases where a specific HTTP verb is desired, and while the name of the module and builtin might not say "HTTP", the details of HTTP will probably leak (e.g., headers).

How about:

# we shouldn't call it "url", as we'll probably want a module just for URI/URN/URL manipulations
module curl;

# The jack of all trades, takes a description of what to do on input.
# 
# Inputs are of the form:
# {url: <URL>, verb: <verb>, headers: {<headers>}, curlopts: ..., jqopts: ...}
#
# (verb being scheme-specific, and optional; if absent it
# defaults to GET or scheme-specific equivalent)
def perform: ...;

# Only fetches resources:
def get: ...;

# Like get, but inputs are strictly URLs:
def get(headers; curlopts; jqopts): ...;

# Only fetches resource metadata:
# def head: ...;
# def headers(headers; curlopts; jqopts): ...;

and so on.

pkoppstein commented 9 years ago

@nicowilliams wrote:

.. isn't an input but a stream of regexps ...

Yes -- you may have forgotten that we explored this together around the time of #524.

Regarding the jack-of-all-trades function, you most recently suggested that the input (.) have the form:

{url: <URL>, verb: <verb>, headers: {<headers>}, curlopts: ..., jqopts: ...}

Previously you had suggested an array. To me, the most important considerations would be (1) efficiency, and (2) ease of error-checking and minimizing the likelihood of errors in the first place. (Maybe the array-format would make it less likely that a URL would be missing?)

I was also wondering whether it mightn't be better to avoid nesting, except for "headers"; for example:

nicowilliams commented 9 years ago

Yes, I remember (see my edit above, alluding to that).

I agree that an array will perform better than an object, but if we're going over the wire it might not matter. OTOH, an array with just 4 things will have a memorable form, so, sure, but I do want to separate curl options from jq options.

jq options here would be the equivalent of the command-line --slurp, --raw-input, --stream, --seq, and so on, both for input processing (GET and such) and for output (POST/PUT/PATCH and such). Curl and jq options shouldn't get mixed up, as preventing future collisions would be difficult. The two need distinct namespaces.

(E.g., if you're GETting a text file, you probably want raw input, and possibly slurp (if the text is line-oriented). If you're GETting an application/json resource, then you probably don't want raw, and if it's huge then you'll want streaming. And so on.)
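
Concretely, the separation might look like this (the option names and URLs are illustrative, not settled):

```jq
# Fetching a line-oriented text file: raw input, slurped.
{url: "http://example.com/notes.txt",
 jqopts: {raw: true, slurp: true}}

# Fetching a huge application/json resource: parsed and streamed.
{url: "http://example.com/big.json",
 jqopts: {raw: false, stream: true},
 curlopts: {timeout: 30}}
```

Keeping the two namespaces in separate sub-objects means a future curl option can never collide with a jq option of the same name.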

ModalUrsine commented 9 years ago

I just discovered jq (d'oh!). How long has this been going on, i.e. when was jq first made available to the world at large? thanx

nicowilliams commented 9 years ago

@stedolan's first commit on the public repo was on July 18, 2012.

ghost commented 9 years ago

Hey!

I'm very interested on this discussion. This jq+shell sample application we made for Typeform I/O would be way cleaner if it lived entirely inside jq: https://github.com/TypeformIO/JQ-FormCreation

nicowilliams commented 9 years ago

I think I'd rather see file/popen I/O builtins than have jq link with libcurl and OpenSSL and such. A module system extension for C-coded modules would also work.

dtolnay commented 8 years ago

I agree with Nico that popen builtin or C-coded module are the best ways to implement this.

As for design, I think we should focus on a general, low-level API analogous to Go's (*http.Client) Do. My weak preference is curl/0 with an input map similar to http.Request and an output map similar to http.Response. Then higher-level helpers can be built on top as we figure out which ones would be most useful.

{   url: "http://api.legiscan.com/?key=InsertKeyHere&op=getSponsor&id=1622",
    method: http::GET,
    header: {
        Accept: ["application/json"]
    },
    timeout: 5 * time::second}
| curl
| [.statuscode, .header["Content-Type"], (.body | fromjson)]
nicowilliams commented 8 years ago

@dtolnay Welllll, if you wanted to call jq from Java, say, then you'd be unhappy with popen on systems where the kernel/libc don't support vfork() and use it in posix_spawn(). Also, a curl jq interface must not expose the mess of CLI options that is curl. That said, we'll get a lot of mileage out of a shell-out, so we should do it.

As for a libcurl interface, if we ever do it at all then I'd like to do that via dynamically loaded jq functions. It's reasonable to have a build dependency on Oniguruma (or descendant) for regexp as that brings in no further dependencies, but once we're talking about curl we also get OpenSSL and/or friends and things begin to get ugly. Also, we'd have to finish the C-coded generators business if we're going to talk to libcurl in any way other than through curl(1).

nicowilliams commented 8 years ago

glibc might support vfork() on some kernels nowadays since about a year ago, IIUC, but I'm probably not looking in the right place, and I'm just wasting my time. We should hope popen() uses posix_spawn(), and that the latter uses a non-COW vfork(), and if anyone is unhappy with the lack of a true vfork() then we can tell them to complain to their OS vendor/distro.