Closed gruuya closed 1 year ago
The implementation looks good to me, but I don't know how browsers/caches will handle it. AFAICT, this is to spec.
According to MDN, when `Content-Type` is used as a Response header, there is no restriction on the character set of the parameters. However, MDN does mention that when it's used as a Request header, there are some characters that will cause it not to be a CORS-safelisted request header when they are included:
> yes, with the additional restriction that values can't contain a CORS-unsafe request header byte: 0x00-0x1F (except 0x09 (HT)), `"():<>?@[\]{}`, and 0x7F (DEL).
So while we're using it as a response header here, I do worry that some browser/proxy implementations might see `{` and `}` and (incorrectly) interpret it as an invalid header.
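To make that worry concrete, here's a quick sketch (hypothetical helper names, not code from this PR) of a check for the CORS-unsafe request header bytes quoted above:

```typescript
// Hypothetical check (not from this PR) for the CORS-unsafe request header
// bytes listed in the fetch spec: 0x00-0x1F (except 0x09/HT), "():<>?@[\]{}, 0x7F.
const CORS_UNSAFE_CHARS = new Set('"():<>?@[\\]{}'.split(""));

function hasCorsUnsafeByte(value: string): boolean {
  for (const ch of value) {
    const code = ch.charCodeAt(0);
    if (code <= 0x1f && code !== 0x09) return true; // control chars except HT
    if (code === 0x7f) return true; // DEL
    if (CORS_UNSAFE_CHARS.has(ch)) return true;
  }
  return false;
}
```

A schema serialization containing `{` or `}` would trip this check if the same value ever appeared in a request header.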
As far as I can see, the spec for Parameters does not specify any illegal characters (other than, presumably, the separators `=` and `;`).
The spec does say this, but (ironically) I'm unsure how to parse the English:
> Note: Parameters do not allow whitespace (not even "bad" whitespace) around the "=" character.
Is this saying that parameters do not allow whitespace around the `=` character, or that parameters do not allow whitespace anywhere inside their value?
EDIT: I guess whitespace inside the value is okay, based on this example of a parameter:

`Accept: audio/*; q=0.2, audio/basic`
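A naive parse of that example (a rough sketch, not an RFC-grade parser) shows where the media-range and parameter boundaries fall:

```typescript
// Rough split of an Accept header into media ranges and their parameters
// (illustrative only; real parsing should follow RFC 9110 quoting rules).
function parseAccept(header: string): { range: string; params: Record<string, string> }[] {
  return header.split(",").map((part) => {
    const [range, ...rest] = part.trim().split(";");
    const params: Record<string, string> = {};
    for (const p of rest) {
      const [k, v] = p.trim().split("=");
      if (k && v) params[k] = v;
    }
    return { range: range.trim(), params };
  });
}

// parseAccept("audio/*; q=0.2, audio/basic")
//   → [{ range: "audio/*", params: { q: "0.2" } },
//      { range: "audio/basic", params: {} }]
```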
One thing I'd be particularly curious to test is whether Cloudflare caching behavior changes when including the `Content-Type` header, and if we can still use the trick of appending `.csv` to make sure it's cached.
I guess how we serialize it depends on what we want to do with the data. It would be nice to use some standardized format, like a serialization of Arrow types, because then we could use (somewhat) standardized parsers if we want to coerce types (e.g. turning a DATETIME into a `Date()` object in JavaScript). However, if we're returning responses as JSON, and not a Parquet file, then what is the point of parsing it with Arrow?
I found arrow-js, which has code for coercing values to JavaScript types: https://github.com/apache/arrow/blob/main/js/src/type.ts
We could implement something like that in our client(s) (it's kind of up to the client what it does with it), or maybe somehow use arrow-js itself, but I think the important thing is that we would want the types to match the keys of this enum used by arrow:
https://github.com/apache/arrow/blob/main/js/src/enum.ts#L158-L203
Can we get those somehow? Is that what the serde serialization effectively does?
It looks like arrow-js can maybe read a record batch from JSON? https://github.com/apache/arrow/blob/7ea2d9867cb90f9d0ca1db9b3bffa2fe8426e8ed/js/src/ipc/reader.ts#L153-L166
which calls
```typescript
export const isArrowJSON = (x: any): x is ArrowJSONLike => {
    return isObject(x) && isObject(x['schema']);
};
```
but the `schema` type is just `any`, and I'm having trouble following how this works. Based on the arrow-python docs, it defaults to inferring types from values, but you can also pass it an explicit schema: https://arrow.apache.org/docs/python/json.html#automatic-type-inference
Interesting how it coerces type string by trying to convert it to a date first, and then a string.
This makes me think that it might be best to just use simple SQL types, since the motivating use case for this feature is displaying types to the user. And there is no avoiding that some coercion code needs to specify how to coerce values to special types (e.g. timestamp to `new Date()`), so it doesn't really matter what serialization format the header uses in that case, since the user needs to write some code to define which fields need to be coerced anyway (or a library can try to guess it based on the SQL type).
> https://github.com/apache/arrow/blob/main/js/src/enum.ts#L158-L203
> Can we get those somehow? Is that what the serde serialization effectively does?
Not sure -- I wouldn't want us to pull in arrow-js (or copypaste code from it) just to map these magic numbers back to enums. I was thinking about using this JSON format.
The options we seem to have are SQL-looking types (`BIGINT`, `TEXT[]`, ...).

> I wouldn't want us to pull in arrow-js (or copypaste code from it)
Yeah ofc. I was wondering if those types come from some Arrow standard that we're already using. arrow-js wouldn't be involved on the server side, but it would be nice if the types were somehow compatible with arrow-js, so that a client using arrow-js could load the HTTP jsonlines response into Arrow from JSON and provide it with the schema from the header.
I think returning SQL-looking data types might be the simplest solution, but I worry it's non-standard (ofc SQL is standard, but it's the "-looking" part that concerns me).
When loading into arrow in JS, it seems unavoidable that the user needs to provide coercion functions: https://arrow.apache.org/docs/js/#create-a-table-from-javascript-arrays
The only exception would be if we were loading a `.arrow` file (Arrow IPC format) from a URL. So maybe in the future Seafowl could support returning data in Arrow IPC format. But as long as we're returning JSON, if we want to go back to Arrow, it doesn't matter whether the schema is provided in "SQL-looking types" or Arrow types, because we need to provide the coercion functions in either case. And it's easier to parse SQL-looking types, so I'd prefer those, I guess.
:shrug:
And we could always support returning both, defaulting to "SQL-looking types," but the user could specify a preference in some request header (maybe `Content-Type`).
Ok, so I found the arrow-js code that can parse schema from JSON:
Based on the fact that it's accessing `["type"]["name"]`, I assume it expects the same serialization format that we see here in the nested types.
It would be cool to support this, because it means the JS client can use arrow-js and doesn't need to write its own parsers for the schema. But I don't intend to add that soon; it would just be a nice-to-have.
AFAICT, the "verbose serialization format" that @gruuya linked is what we would need for that: https://github.com/splitgraph/seafowl/blob/main/src/schema.rs#L72-L85

But it's of course quite verbose (are there length limits on this header?).
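For reference, the rough shape that arrow-js's `isArrowJSON` check and the `["type"]["name"]` access appear to expect looks something like this (modeled on the Arrow JSON integration-test format; the field details here are an assumption, not Seafowl's actual output):

```typescript
// Approximate shape of an Arrow JSON payload with a schema; details hedged.
const exampleArrowJson = {
  schema: {
    fields: [
      { name: "id", type: { name: "int", bitWidth: 64, isSigned: true }, nullable: false, children: [] },
      { name: "ts", type: { name: "timestamp", unit: "MILLISECOND" }, nullable: true, children: [] },
    ],
  },
  batches: [],
};

// The isArrowJSON check quoted earlier only requires an object with an
// object-valued "schema" key, so this payload would pass it.
const isObject = (x: unknown): x is Record<string, unknown> =>
  typeof x === "object" && x !== null;
const looksLikeArrowJson = isObject(exampleArrowJson) && isObject(exampleArrowJson.schema);
```

This also shows why the verbosity concern is real: even a two-column schema is already a sizeable header value.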
Meanwhile, for basic use cases like showing the user the type of the column in a shell, the SQL-looking types seem sufficient.
LGTM
Provide this info via the content-type response header.
Closes #356.