splitgraph / seafowl

Analytical database for data-driven Web applications 🪶
https://seafowl.io
Apache License 2.0
392 stars 9 forks source link

Return result set type information in the HTTP response #356

Closed mildbyte closed 1 year ago

mildbyte commented 1 year ago

We currently don't know the exact types of each column in the query result. We should encode the the information as JSON in a response header.

milesrichardson commented 1 year ago

tl;dr Idea: use parameters of content-type response header, e.g.:

content-type: application/octet-stream; total_volume=FLOAT; year=INT;
...
{"total_volume":78.0,"year":2018}

If implementing this feature, remember to also make sure the header is included in Access-Control-Expose-Headers (not to be confused with Access-Control-Allow-Headers, which is about which headers are allowed in the request) so that cross-origin client code can read it from the response.

This is not necessary for CORS-safelisted response headers, which include only Cache-Control, Content-Language, Content-Length, Content-Type, Expires, Last-Modified, and Pragma (deprecated).

Perhaps it would be appropriate to use the "parameters" of the Content-Type response header, which is actually kind of intended for this purpose (but could have unknown effects on proxies like Cloudflare):

The indicated media type defines both the data format and how that data is intended to be processed by a recipient

The type/subtype MAY be followed by semicolon-delimited parameters (Section 5.6.6) in the form of name/value pairs. The presence or absence of a parameter might be significant to the processing of a media type, depending on its definition within the media type registry. Parameter values might or might not be case-sensitive, depending on the semantics of the parameter name.

For example, for this request:

curl 'https://demo.seafowl.io/q/4be9110c4fe720dae8310fde08f028b49d01a5c41b70df21668a143265f6a4bc.csv' \
  -H $'x-seafowl-query: SELECT%20year%3A%3Ainteger%20AS%20year%2C%0ASUM(volume)%20AS%20total_volume%0AFROM%20supply_chains%0AWHERE%20country_of_production%20%3D%20\'ARGENTINA\'%0AAND%20country_of_import%20%3D%20\'AFGHANISTAN\'%0AAND%20commodity%20%3D%20\'CORN\'%0AGROUP%20BY%201%20ORDER%20BY%201%20ASC'

which outputs:

{"total_volume":78.0,"year":2018}

maybe we could return the content-type header like:

content-type: application/octet-stream; total_volume=FLOAT; year=INT;
gruuya commented 1 year ago

Another approach might be to include that info in the output itself, like snowflake does, and then nest the actual results under some specific key, e.g. data.

However this would be a breaking change, and possibly hurt readability. Also the header approach is simpler, so I went with that for now, check it out: https://github.com/splitgraph/seafowl/pull/367

milesrichardson commented 1 year ago

Nesting the results under a specific key is how we do it in the web DDN, but that means abandoning jsonlines. I'm not a fan of that approach, because it means we (as a client) cannot support streaming result sets, and must buffer the entire result into memory before we can parse it.

In practice, at least in the madatdata JS client, we still buffer the entire result set into memory right now, but that's only because I haven't implemented readable streams.

The header approach seems hacky at first glance, but I think it's actually quite elegant - I mean after all, that's what a header is for, isn't it?