splitgraph / seafowl

Analytical database for data-driven Web applications 🪶
https://seafowl.io
Apache License 2.0

Return result type info on GET/POST endpoints #367

Closed · gruuya closed this 1 year ago

gruuya commented 1 year ago

Provide this info via the Content-Type response header.

Closes #356.
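
For illustration, a minimal sketch (in TypeScript) of how a client might consume this; the URL and the types= parameter name are hypothetical placeholders, not necessarily what this PR ships:

// Minimal sketch: the server attaches type info as a parameter on the
// Content-Type response header, and the client splits it back out.
// The URL and the `types=` parameter name are assumptions.
const res = await fetch("https://example.com/q/QUERY_HASH");
const contentType = res.headers.get("content-type") ?? "";
// e.g. "application/json; types=..." (hypothetical shape)
const [mediaType, ...params] = contentType.split(";").map((s) => s.trim());
console.log(mediaType, params);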

milesrichardson commented 1 year ago

The implementation looks good to me, but I don't know how browsers/caches will handle it. AFAICT, this is to spec.

According to MDN, when Content-Type is used as a response header, there is no restriction on the character set of the parameters. However, MDN does mention that when it's used as a request header, there are some characters that will cause it not to be a CORS-safelisted request header when they are included:

yes, with the additional restriction that values can't contain a CORS-unsafe request header byte: 0x00-0x1F (except 0x09 (HT)), "():<>?@[\]{}, and 0x7F (DEL).

So while we're using it as a response header here, I do worry that some browser/proxy implementations might see { and } and (incorrectly) interpret it as an invalid header.
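
For concreteness, here is a sketch of that restriction as I read the quoted spec text (my own paraphrase, not a vetted implementation):

// True if the value contains a "CORS-unsafe request header byte":
// 0x00-0x1F (except 0x09 HT), 0x7F (DEL), or one of "():<>?@[\]{}.
const UNSAFE = new Set([...'"():<>?@[\\]{}']);

function hasCorsUnsafeByte(value: string): boolean {
  for (const ch of value) {
    const code = ch.charCodeAt(0);
    if (code === 0x7f || (code <= 0x1f && code !== 0x09)) return true;
    if (UNSAFE.has(ch)) return true;
  }
  return false;
}

console.log(hasCorsUnsafeByte('{"name":"utf8"}')); // true: {, }, : and " all trip it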

As far as I can see, the spec for Parameters does not specify any illegal characters (other than, presumably, the separators = and ;).

milesrichardson commented 1 year ago

The spec does say this, but (ironically) I'm unsure how to parse the English:

Note: Parameters do not allow whitespace (not even "bad" whitespace) around the "=" character.

Is this saying that parameters do not allow whitespace around the = character, or that parameters do not allow whitespace anywhere inside their value?

EDIT: I guess whitespace inside the value is okay, based on this example of a parameter:

Accept: audio/*; q=0.2, audio/basic

milesrichardson commented 1 year ago

One thing I'd be particularly curious to test is whether Cloudflare's caching behavior changes when the Content-Type header is included, and whether we can still use the trick of appending .csv to make sure the response is cached.
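
A sketch of how one might probe that (the URL shape is a placeholder; cf-cache-status is the response header Cloudflare uses to report its cache verdict):

// Hypothetical probe: fetch the same query with and without the .csv
// suffix and compare Cloudflare's cache verdict. URL shape is assumed.
for (const url of [
  "https://example.com/q/QUERY_HASH",
  "https://example.com/q/QUERY_HASH.csv",
]) {
  const res = await fetch(url);
  console.log(url, res.headers.get("cf-cache-status"), res.headers.get("content-type"));
}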

milesrichardson commented 1 year ago

I guess how we serialize it depends on what we want to do with the data. It would be nice to use some standardized format, like a serialization of Arrow types, because then we could use (somewhat) standardized parsers if we want to coerce types (e.g. turning a DATETIME into a Date() object in JavaScript). However, if we're returning responses as JSON rather than a Parquet file, then what is the point of parsing them with Arrow?

I found arrow-js, which has code for coercing values to JavaScript types: https://github.com/apache/arrow/blob/main/js/src/type.ts

We could implement something like that in our client(s) (it's ultimately up to each client what it does with the info), or maybe somehow use arrow-js itself. Either way, I think the important thing is that we would want the types to match the keys of this enum used by arrow:

https://github.com/apache/arrow/blob/main/js/src/enum.ts#L158-L203

Can we get those somehow? Is that what the serde serialization effectively does?
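
For reference, assuming the apache-arrow npm package (which publishes the linked enum.ts), those keys are reachable as a plain TypeScript numeric enum:

// Members of the arrow-js Type enum, via the apache-arrow npm package.
import { Type } from "apache-arrow";

console.log(Type.Utf8, Type.Timestamp, Type.Int); // numeric type ids
console.log(Type[Type.Utf8]); // "Utf8" (numeric enums reverse-map)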

milesrichardson commented 1 year ago

It looks like arrow-js can maybe read a record batch from JSON? https://github.com/apache/arrow/blob/7ea2d9867cb90f9d0ca1db9b3bffa2fe8426e8ed/js/src/ipc/reader.ts#L153-L166

which calls

// From the arrow-js source: a loose structural check for Arrow JSON input.
export const isArrowJSON = (x: any): x is ArrowJSONLike => {
    return isObject(x) && isObject(x['schema']);
};

but the schema type is just any, and I'm having trouble following how this works. Based on the arrow-python docs, it defaults to inferring types from values, but you can also pass it an explicit schema: https://arrow.apache.org/docs/python/json.html#automatic-type-inference

Interesting how it coerces the string type: it first tries to convert the value to a date, and only then falls back to a string.

This makes me think it might be best to just use simple SQL types, since the motivating use case for this feature is displaying types to the user. There is no avoiding the fact that some coercion code has to specify how to map values to special types (e.g. timestamp to new Date()), so in that case it doesn't really matter what serialization format the header uses: the user needs to write some code to define which fields need to be coerced anyway (or a library can try to guess it based on the SQL type).
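
To make that concrete, here is a sketch of the kind of coercion map a client might hand-roll over SQL-looking type names (the names here are illustrative assumptions, not Seafowl's actual output):

// Per-column coercion keyed on SQL-looking type names (illustrative).
const coercers: Record<string, (raw: string) => unknown> = {
  TIMESTAMP: (raw) => new Date(raw),
  BIGINT: (raw) => BigInt(raw),
  DOUBLE: (raw) => Number(raw),
  TEXT: (raw) => raw,
};

// schema maps column name -> SQL-looking type, e.g. { ts: "TIMESTAMP" }.
function coerceRow(
  row: Record<string, string>,
  schema: Record<string, string>
): Record<string, unknown> {
  return Object.fromEntries(
    Object.entries(row).map(([col, raw]) => [
      col,
      (coercers[schema[col]] ?? ((v: string) => v))(raw),
    ])
  );
}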

mildbyte commented 1 year ago

https://github.com/apache/arrow/blob/main/js/src/enum.ts#L158-L203

Can we get those somehow? Is that what the serde serialization effectively does?

Not sure -- I wouldn't want us to pull in arrow-js (or copy-paste code from it) just to map these magic numbers back to enums. I was thinking about using this JSON format.

The options we seem to have are:

milesrichardson commented 1 year ago

I wouldn't want us to pull in arrow-js (or copypaste code from it)

Yeah ofc. I was wondering if those types come from some arrow standard that we're already using. arrow-js wouldn't be involved on the server side, but it would be nice if the types were somehow compatible with arrow-js code, so that a client using arrow-js could load the HTTP jsonlines response into arrow from JSON and provide it with the schema from the header.
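
e.g. a rough sketch of that flow, with type inference standing in for the schema handoff (tableFromJSON infers types from the values; actually feeding it the schema from the header is the open question):

// Load a jsonlines response into arrow-js. The URL is a placeholder.
import { tableFromJSON } from "apache-arrow";

const res = await fetch("https://example.com/q/QUERY_HASH");
const rows = (await res.text())
  .split("\n")
  .filter(Boolean)
  .map((line) => JSON.parse(line));
const table = tableFromJSON(rows);
console.log(table.schema.fields.map((f) => `${f.name}: ${f.type}`));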

milesrichardson commented 1 year ago

I think returning SQL-looking data types might be the simplest solution, but I worry it's non-standard (ofc SQL is standard, but it's the "-looking" part that concerns me).

When loading into arrow in JS, it seems unavoidable that the user needs to provide coercion functions: https://arrow.apache.org/docs/js/#create-a-table-from-javascript-arrays

The only exception would be if we were loading a .arrow file (Arrow IPC format) from a URL. So maybe in the future Seafowl could support returning data in Arrow IPC format. But as long as we're returning JSON, if we want to go back to arrow, it doesn't matter whether the schema is provided in "SQL-looking types" or arrow types, because we need to provide the coercion functions in either case. And it's easier to parse SQL-looking types, so I'd prefer those, I guess.

🤷

milesrichardson commented 1 year ago

And we could always support returning both, defaulting to "SQL-looking types", but the user could specify a preference in some request header (maybe Content-Type).

milesrichardson commented 1 year ago

Ok, so I found the arrow-js code that can parse schema from JSON:

https://github.com/apache/arrow/blob/7ea2d9867cb90f9d0ca1db9b3bffa2fe8426e8ed/js/src/ipc/metadata/json.ts#L144-L206

Based on the fact that it's accessing ["type"]["name"], I assume it expects the same serialization format that we see here in the nested types.

It would be cool to support this, because it means the JS client can use arrow-js and doesn't need to write its own parsers for the schema. But I don't intend to add that soon, it would just be a nice-to-have.

AFAICT, the "verbose serialization format" that @gruuya linked is what we would need for that: https://github.com/splitgraph/seafowl/blob/main/src/schema.rs#L72-L85. But it's of course quite verbose (are there length limits on this header?).
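
For a sense of scale, one column in that verbose format looks roughly like this (shape per the Arrow JSON integration format; the exact serde output may differ). As for length limits: HTTP itself imposes none, but proxies commonly cap individual headers somewhere around 8-16 KB, so a very wide schema could plausibly hit that.

// Roughly what one column costs in the verbose format (shape follows
// the Arrow JSON integration format; Seafowl's exact output may differ).
const field = {
  name: "created_at",
  nullable: true,
  type: { name: "timestamp", unit: "MICROSECOND", timezone: null },
  children: [],
};
console.log(JSON.stringify(field).length); // on the order of 100+ bytes
// vs. the SQL-looking alternative: "created_at TIMESTAMP"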

Meanwhile, for basic use cases like showing the user the type of the column in a shell, the SQL-looking types seem sufficient.

milesrichardson commented 1 year ago

LGTM