Closed onpaws closed 1 year ago
Fwiw, I don't think this should be configurable (not for now at least); there are more pertinent knobs that still aren't customizable (e.g. see #50), and I'm wary of bloating up the config with less important items.
I think just extending the existing tests should be sufficient to merge this.
Fwiw, I don't think this should be configurable (not for now at least); there are more pertinent knobs that still aren't customizable (e.g. see #50), and I'm wary of bloating up the config with less important items.
I think just extending the existing tests should be sufficient to merge this.
Thanks Marko, yes probably too soon to add new config options. In the future depending how we want to allow user customization e.g. once we return >1 timing measurement, it will presumably be time to consider.
I will switch to X-Seafowl-Query-Time
, unless anyone objects, thanks!
In terms of answering "when should Seafowl return this header", it seems to me we have two options:
Seafowl-Trace: true
in the requestIIUC I think @mildbyte prefers option 1, and @milesrichardson prefers option 2. (at least, I had that impression based on MM chats.)
What do you think @gruuya?
I would prefer option 1 (always) too. But just note that it won't always be available to client code in a cross-origin browsing context unless the header is included in the CSV value of Access-Control-Expose-Headers
.
But I'm hesitant to always return Access-Control-Expose-Headers
. Perhaps that could be opt-in via a request flag. Or perhaps it's not a problem to always include it anyway, since any problem it would cause for someone running a proxy in front of Seafowl, would only be relevant when they wanted the header anyway - at which point they could add it to their own code where they specify exposed headers (that is, the relatively harmless failure case would be their own headers overriding a Seafowl header that's meant to be opt-in anyway).
I had originally said that the timing data shouldn't go into the Content-Type
header alongside the schema. And it certainly feels like it would be overloading the semantic purpose of that header. But, after thinking more on it, I'm not convinced it's a problem to include timing data in addition to the schema. At first I thought it would be wrong because you don't want to cache it, but actually it makes sense to cache it, because it's timing how long it took for the origin Seafowl instance to execute the original query and return the result that was ultimately saved to the cache. So it's totally sensible for the timing data to be included in the cached response that's served by Cloudflare, or even from your browser, since neither of those are re-executing the query, and the time Seafowl originally spent executing the query is still a relevant metric.
So another question would be: what response headers does Cloudflare save in its cached objects? We know that it saves Content-Type
, which is maybe a point in favor of including timing data there too. But I don't know what it does with other response headers (I assume it defaults to saving them?)
@gruuya @mildbyte What do you think about including the timing data in the Content-Type
header along with the schema?
@gruuya
I think just extending the existing tests should be sufficient to merge this.
Is this what you had in mind? Wasn't sure if you meant literally extending some already existing test, or adding to the existing test suite.
If there's some recommended way to hit both cached_
.../uncached_
paths from one single test, that seems nice to avoid what would likely otherwise be code duplication.
I noticed certain macros e.g. #[case::uncached_post("POST", "/q")]
that suggest maybe we could use uncached_get
and cached_get
to test both paths from a single test? Not sure if I'm understanding that correctly though.
One thing I just learned is that apparently even default CORS safelisted headers have some prerequisites:
For Content-Type: needs to have a MIME type of its parsed value (ignoring parameters) of either application/x-www-form-urlencoded, multipart/form-data, or text/plain.
In our case we have application/json
so it seems we won't gain any benefit from shoehorning the timing data into content-type
anyway.
@milesrichardson given this, and the fact that @onpaws discovered some handy DataFusion-provided internal metrics (which we may want to supplement along with the request-to-response time as is done currently), I'd vote for going with a separate dedicated header (assuming that it is cached by CF).
@onpaws you could also simply extend the last test in that module; it hits both GET and POSTS endpoints. In addition to presence, I'd test that the value of the header is as expected (i.e. something that can be parsed as a float, between 0 and 1).
Cloudflare seems to cache a lot of response headers, if not all of them. This screenshot is from one of our pages with the cache disabled in Firefox and with a Cloudflare cache hit, showing X-Cache and X-Varnish headers (that are added by our own Varnish) being returned by Cloudflare in its cached response:
@gruuya
One thing I just learned is that apparently even default CORS safelisted headers have some prerequisites:
Read that a bit more carefully: those restrictions are for the request content-type
header and do not apply to the response. (I made the same mistake the first two times I read it in the original issue 😸)
Oh damn, good catch :D
@onpaws you could also simply extend the last test in that module; it hits both GET and POSTS endpoints. In addition to presence, I'd test that the value of the header is as expected (i.e. something that can be parsed as a float, between 0 and 1).
Ended up doing this approach, cheers.
Now we append it to the existing test hopefully in the way you desired. While I was at it I renamed it to test_http_type_conversion_and_timing_header
which is perhaps a bit long but hopefully makes things clear for future us/contributors.
Agreed checking the header is "as expected" is worthwhile and added a second test for that. I wasn't initially sure the lines of responsibility/best ways to handle weird edge cases when parsing floats from strings. Along the way learned Rust core now uses Eisel-Lemire. Infinity and NaN scenarios were first a concern but IIUC it seems like .is_finite() handles both, which I believe should cover us. If I missed any cases please lmk.
Verified the tests work by temporarily commenting out the header injection code, and re-running, and got the expected result.
CI passed and it seems we've probably reviewed this PR sufficiently. Any objections to merging?
I haven't reviewed the code (and don't intend to), but once you merge it, can you update the docs in the marketing repo to reference the header?
I haven't reviewed the code (and don't intend to), but once you merge it, can you update the docs in the marketing repo to reference the header?
Sure! How do you feel about e.g. this
with some basic X-Seafowl-Query-Time
example similar to the other ones on this page?
Thanks Marko! Merging now
Seeks to implement #357
X-Seafowl-Runtime
(currently hardcoded)seafowl.toml
, or at theINFO
/DEBUG
level, or some other way[1] https://www.w3.org/TR/server-timing/ [2] https://arrow.apache.org/datafusion/user-guide/sql/explain.html