Closed: jameshadfield closed this issue 5 months ago
This is certainly odd behaviour. Off the bat, I can't say with certainty what the cause might be. It's certainly understandable, but it's the kind of situation where being able to reproduce and poke at the behaviour is immensely useful for gaining that understanding. It would also be useful to have seen the server's logs resulting from running your test script, as they'll show caching information.
> I initially uploaded them using `aws s3 cp` (so the data wasn't zipped), and then re-uploaded them using `nextstrain remote upload ...` (which magically zips the data).
Between these two uploads, did you access the dataset via the nextstrain.org server?
In any case, unless you preserved the full context in which you saw the behaviour or we can reproduce it (I can't off hand), then I suspect it's not worth digging into more as we'll be missing access to important information.
> Between these two uploads, did you access the dataset via the nextstrain.org server?
Yes (although maybe only for one of the APIs, see below). That's how the cache became invalid, although it should have been detected as such during revalidation by one of the many layers involved here (`make-fetch-happen`, `minipass-fetch`, the HTTP cache, etc.).
> It would also be useful to have seen the server's logs resulting from running your test script, as they'll show caching information.
Server logs (while running the test script whose output is shown above, with some line breaks added for clarity):

```
[verbose] [fetch] GET https://nextstrain-staging.s3.amazonaws.com/seasonal-flu_h3n2_ha_2y.json (cache: no-cache)
[verbose] [fetch] 200 OK https://nextstrain-staging.s3.amazonaws.com/seasonal-flu_h3n2_ha_2y.json (cache revalidated, timestamp Fri, 19 Apr 2024 04:53:46 GMT)
[verbose] [fetch] GET https://nextstrain-staging.s3.amazonaws.com/zika.json (cache: no-cache)
[verbose] [fetch] 200 OK https://nextstrain-staging.s3.amazonaws.com/zika.json (cache revalidated, timestamp Fri, 19 Apr 2024 04:53:46 GMT)

Getting (nextstrain) datasets for: prefix=staging/seasonal-flu/h3n2/ha/2y
[verbose] [fetch] GET https://nextstrain-staging.s3.amazonaws.com/seasonal-flu_h3n2_ha_2y.json (cache: no-cache)
[verbose] [fetch] 200 OK https://nextstrain-staging.s3.amazonaws.com/seasonal-flu_h3n2_ha_2y.json (cache revalidated, timestamp Sun, 21 Apr 2024 21:01:17 GMT)

Getting (nextstrain) datasets for: prefix=staging/zika
[verbose] [fetch] GET https://nextstrain-staging.s3.amazonaws.com/zika.json (cache: no-cache)
[verbose] [fetch] 200 OK https://nextstrain-staging.s3.amazonaws.com/zika.json (cache revalidated, timestamp Sun, 21 Apr 2024 21:01:17 GMT)
```
~~This highlights that the (server-side) fetch for `...seasonal-flu_h3n2_ha_2y.json` is subtly different if it comes through the RESTful API vs charon, as indicated by different revalidation times. This probably matches up with the different content-encodings passed through to the client (script). This points to a difference in how we are handling the routes, and it'd be good to remove this difference.~~ See more below.
> unless you preserved the full context in which you saw the behaviour or we can reproduce it (I can't off hand), then I suspect it's not worth digging into more as we'll be missing access to important information.
We may decide it's not worth digging into this one, but it's important for the wider team to realise that the server has caching bugs in it. We've talked about server logging improvements, and this would be a good example of where debug-level logging (to a file?) would have been useful; we could use this in tests to better debug the many CI failures we are getting.
Node.js v18 (we were using v16 until a few weeks ago) has a browser-compatible `fetch` implementation, so switching to that would remove some dependencies and reduce the number of layers involved; however, I can't find a good doc about the cache it uses at present.
A little more digging and logging of the upstream response (S3 + cache) in the server:
RESTful API:

```
proxyFromUpstream
request url: https://nextstrain-staging.s3.amazonaws.com/seasonal-flu_h3n2_ha_2y.json
request accept-encoding: gzip
[verbose] [fetch] GET https://nextstrain-staging.s3.amazonaws.com/seasonal-flu_h3n2_ha_2y.json (cache: no-cache)
[verbose] [fetch] 200 OK https://nextstrain-staging.s3.amazonaws.com/seasonal-flu_h3n2_ha_2y.json (cache revalidated, timestamp Fri, 19 Apr 2024 04:53:46 GMT)
proxyResponseBodyFromUpstream
(upstream response) headers:
  Content-Encoding null        **** zika shows "gzip" here *****
  Content-Type application/json
  etag "a6fb19653970598e7cb514f01122cf73"
  last-modified Tue, 16 Apr 2024 04:11:22 GMT
cache key make-fetch-happen%3Arequest-cache%3Ahttps%3A%2F%2Fnextstrain-staging.s3.amazonaws.com%2Fseasonal-flu_h3n2_ha_2y.json
cache hash sha512-AxkNCcsms9SmJPycZsJaZPX2n7dGH8NCQ5px9t4opKF4eL8CFUmHsCE8%2BrX9VxKrKYje0pigr6IA8FsPNr4TfQ%3D%3D
cache time Fri, 19 Apr 2024 04:53:46 GMT
(upstream request) compress? false
(_our_ response) Content-Encoding undefined        **** zika shows "gzip" here *****
proxyResponseBodyFromUpstream streaming to client (without decompressing)
```
Charon API:

```
Getting (nextstrain) datasets for: prefix=staging/seasonal-flu/h3n2/ha/2y
proxyFromUpstream
request url: https://nextstrain-staging.s3.amazonaws.com/seasonal-flu_h3n2_ha_2y.json
request accept-encoding: gzip
[verbose] [fetch] GET https://nextstrain-staging.s3.amazonaws.com/seasonal-flu_h3n2_ha_2y.json (cache: no-cache)
[verbose] [fetch] 200 OK https://nextstrain-staging.s3.amazonaws.com/seasonal-flu_h3n2_ha_2y.json (cache revalidated, timestamp Mon, 22 Apr 2024 01:01:47 GMT)
proxyResponseBodyFromUpstream
(upstream response) headers:
  Content-Encoding null        **** zika shows "gzip" here *****
  Content-Type application/json
  etag "a6fb19653970598e7cb514f01122cf73"
  last-modified Tue, 16 Apr 2024 04:11:22 GMT
cache key make-fetch-happen%3Arequest-cache%3Ahttps%3A%2F%2Fnextstrain-staging.s3.amazonaws.com%2Fseasonal-flu_h3n2_ha_2y.json
cache hash sha512-AxkNCcsms9SmJPycZsJaZPX2n7dGH8NCQ5px9t4opKF4eL8CFUmHsCE8%2BrX9VxKrKYje0pigr6IA8FsPNr4TfQ%3D%3D
cache time Mon, 22 Apr 2024 01:01:47 GMT
(upstream request) compress? false
(_our_ response) Content-Encoding undefined        **** zika shows "gzip" here *****
proxyResponseBodyFromUpstream streaming to client (without decompressing)
```
Looking at the server, it seems clear that the upstream response headers (S3 via a cache layer) don't have a Content-Encoding set. When you clear the cache this is set to gzip (as expected), so I think it's pretty clear this is a cache-revalidation bug in one of the fetch layers.
It's probably not worth spending too much more time here, but there is one question I'd like to answer to better understand the system:
From my logging (above) it seems our server does not set the content-encoding header, but the test script ("client") shows that it is set (to "gzip") for the RESTful API but not the charon API. I can't explain this difference, and would really like to know where in the chain `content-encoding: gzip` is being set. @tsibley do you have advice on what to look for here?
P.S. I'm pretty sure the cache times change between requests because the first request updates the cache before the second request arrives, so it's not as strange as I originally thought. It happens for zika as well.
@jameshadfield Do you have a reproduction for this locally starting from a fresh cache? Because try as I might, I can't reproduce this with a test dataset that I manipulate on S3 between requests to a local server. I see the cache updated (not revalidated) on the first request after switching the dataset in S3 from no compression to compressed.
The `Content-Encoding: gzip` added after your logging is not a surprise to me. The whole server has the `compression` middleware enabled. You can see its logs by running the server under `DEBUG=compression`, for example.
Looking back at what you said you did and the logs you experienced…
> As part of https://github.com/nextstrain/nextstrain.org/pull/811 I uploaded new datasets to our staging bucket, by downloading then uploading the flu/seasonal/h3n2/ha/2y JSONs and renaming them to seasonal-flu, so we could update our tests to use the latter. I initially uploaded them using `aws s3 cp` (so the data wasn't zipped), and then re-uploaded them using `nextstrain remote upload ...` (which magically zips the data).
How did you download the `flu/seasonal/h3n2/ha/2y` JSONs? If you downloaded them as-is without decompression, then they'd be gzipped on your local disk. And if you then uploaded them back to S3 using `aws s3 cp` without specifying `--content-encoding gzip`, then they'd still be gzipped but nothing would know about it (not S3, not our server). And I believe that would lead to the logs you saw.
> Node.js v18 (we were using v16 until a few weeks ago) has a browser-compatible `fetch` implementation so switching to that would remove some dependencies and reduce the number of layers involved, however I can't find a good doc about the cache it uses at present.
I'm often for moving from third-party libraries to stdlibs when possible, but I'm skeptical about this. It's marked "Stability 1 - Experimental" up until v20 LTS, and only just stable in v21. The experimental stability level means "may have backwards-incompatible breakages at any time; do not use in production". `make-fetch-happen` is widely used as part of `npm` and has caching behaviour that's pretty well described. (I also put a bunch of work into making its caching more robust re: encodings/compression.) The core `fetch()` right now strikes me as not fewer layers, just different layers.
Node's stdlib `fetch()` appears to be from Undici. It is also light on cache documentation, but does expose this: https://undici.nodejs.org/#/docs/api/CacheStorage

In general, Undici looks useful, but I'm still not sold on switching. Using its `fetch()` effectively would basically require using the rest of the library too, which does not appear to be part of the stdlib. So a lot of the benefit goes out the window.
Ok, I'm gonna stop looking at this for now, but what I found re: caching in Undici is not promising. Its `fetch()` implementation is littered with `TODO: cache` comments, the open issues are scattered and ongoing, and I'm pretty sure that it's confusing `Request.mode` for `Request.cache` in two places in the stubbed implementation.
> The `Content-Encoding: gzip` added after your logging is not a surprise to me. The whole server has the `compression` middleware enabled.
Thanks! This was the layer I had forgotten about. I will find some time to check whether the `upstreamResponse` body is uncompressed and we're compressing it in the compression middleware, or whether the body is already gzipped and we're gzipping it again. The script output in the original issue here points to the latter, especially if the content-encoding is being added for the RESTful API but not Charon, but I'll try to confirm this.
> Do you have a reproduction for this locally starting from a fresh cache?
Nope (and I haven't tried to recreate it). I'm sure it's not a common problem we will run into, but it's certainly a bug and it's been a useful exercise for me to learn more about caching.
Thanks for the pointers!
Caching bugs are certainly possible and in general I wouldn't bet against them, but in this case I don't think there's a bug here in our server.
With some more testing, all signs I can see point to this sequence of events:
The last step is a cache revalidation, not a cache update, because S3 does not change the ETag when the object's Content-Encoding header changes. In my prior testing, the S3 object's bytes were changing, so the ETag was changing, and I couldn't reproduce what you reported. But once I ensure the bytes don't change and only the Content-Encoding changes, I can reproduce it.
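The mechanism can be sketched with a toy in-memory cache (illustrative only; the names and structure are mine, not `make-fetch-happen`'s actual code): under `cache: no-cache` semantics every reuse is revalidated, but a 304 with a matching ETag means "keep what you have", headers included, so a metadata-only change upstream never reaches the cache.

```javascript
// Toy model of HTTP cache revalidation keyed by URL.
const cache = new Map();

// Simulated origin (S3): replies 304 when the client's ETag still matches,
// otherwise the full response. Crucially, the ETag covers the body bytes
// only, so a metadata-only change leaves it untouched.
function origin(state, ifNoneMatch) {
  if (ifNoneMatch === state.etag) return { status: 304 };
  return { status: 200, etag: state.etag, headers: { ...state.headers }, body: state.body };
}

// "cache: no-cache" semantics: always revalidate with the origin before reuse.
function fetchNoCache(url, state) {
  const cached = cache.get(url);
  const res = origin(state, cached && cached.etag);
  if (res.status === 304) return cached;  // 304: reuse stored body AND stored headers
  cache.set(url, res);                    // 200: store/refresh the entry
  return res;
}

// Upload 1: gzipped bytes but no Content-Encoding metadata (aws s3 cp).
let s3 = { etag: '"abc"', headers: {}, body: '<gzip bytes>' };
const first = fetchNoCache('/dataset.json', s3);

// Upload 2: same bytes (same ETag), Content-Encoding now set (nextstrain remote upload).
s3 = { etag: '"abc"', headers: { 'content-encoding': 'gzip' }, body: '<gzip bytes>' };
const second = fetchNoCache('/dataset.json', s3);

console.log(first.headers['content-encoding']);  // undefined
console.log(second.headers['content-encoding']); // undefined; the new header never reaches the cache
```

The cache is behaving per HTTP semantics here; the surprise comes from the origin reporting "unchanged" for an object whose representation metadata did change.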
```console
$ echo '{"hello":"gzip"}' | gzip > gzipped
$ aws s3api put-object --bucket nextstrain-staging --key trs_test.json --body gzipped --content-type application/json
{
    "ETag": "\"3c099fc7a460e777422ae4445291de38\"",
    "ServerSideEncryption": "AES256"
}
$ curl -si --compressed https://nextstrain-staging.s3.amazonaws.com/trs_test.json | (while read -r line; do echo "$line"; [[ "$line" == $'\r' ]] && exec hexdump -C; done)
HTTP/1.1 200 OK
x-amz-id-2: qzk1gCZiVQtAjWeBe3jQZuDywixkXKcQFk0FyxiV4BmRag9bFUZXmei4UjDyB2ovhMiOZQKxJak=
x-amz-request-id: EEDMETHC1F1R9JTE
Date: Mon, 22 Apr 2024 21:24:11 GMT
Last-Modified: Mon, 22 Apr 2024 21:23:56 GMT
ETag: "3c099fc7a460e777422ae4445291de38"
x-amz-server-side-encryption: AES256
x-amz-version-id: null
Accept-Ranges: bytes
Content-Type: application/json
Server: AmazonS3
Content-Length: 37

00000000  1f 8b 08 00 60 d5 26 66  00 03 ab 56 ca 48 cd c9  |....`.&f...V.H..|
00000010  c9 57 b2 52 4a af ca 2c  50 aa e5 02 00 f5 54 66  |.W.RJ..,P.....Tf|
00000020  66 11 00 00 00                                    |f....|
00000025

$ aws s3api put-object --bucket nextstrain-staging --key trs_test.json --body gzipped --content-type application/json --content-encoding gzip
{
    "ETag": "\"3c099fc7a460e777422ae4445291de38\"",
    "ServerSideEncryption": "AES256"
}
$ curl -si --compressed https://nextstrain-staging.s3.amazonaws.com/trs_test.json | (while read -r line; do echo "$line"; [[ "$line" == $'\r' ]] && exec hexdump -C; done)
HTTP/1.1 200 OK
x-amz-id-2: AMoa8yHwNoWimnCMV+uckIfl94btqoWZGV21vlcpcT2wY8WVfnZV3qINkvqu+wLdk8lz0/VN1sU=
x-amz-request-id: ZERP234SJ600759G
Date: Mon, 22 Apr 2024 21:24:24 GMT
Last-Modified: Mon, 22 Apr 2024 21:24:21 GMT
ETag: "3c099fc7a460e777422ae4445291de38"
x-amz-server-side-encryption: AES256
Content-Encoding: gzip
x-amz-version-id: null
Accept-Ranges: bytes
Content-Type: application/json
Server: AmazonS3
Content-Length: 37

00000000  7b 22 68 65 6c 6c 6f 22  3a 22 67 7a 69 70 22 7d  |{"hello":"gzip"}|
00000010  0a                                                |.|
00000011
```
Start our server with a fresh cache and then run your test script (pointed at `staging/trs/test`) after each S3 upload in the example above to see the behaviour you report reproduced.
ETags are supposed to be aware of content-encoding, but S3's apparently aren't.
> ... this sequence of events
Yeah, that sounds about right. And in reality that's a perfectly plausible series of events that we should expect to happen every now and again when uploading datasets.
To summarise: the upstream content-encoding has changed and we are not updating our cache. We may say the problem is upstream of our server, and it may well be, but at the end of the day this is a bug which manifests in our server. And maybe it's not worth solving (I'm guessing it's not), but it is something we should be aware of.
The compression middleware, and more specifically its different behaviour across our two APIs, was the piece of the puzzle that made the output very hard to understand. Thanks for looking into it for me; the pointers helped me figure out what's happening.
> The compression middleware, and more specifically its different behaviour across our two APIs, was the piece of the puzzle that made the output very hard to understand.
Yeah. This automatic compression wasn't being applied to the `/charon/getDataset` endpoint because that endpoint wasn't setting a Content-Type. When I noticed that in the debug logs from `compression`, I opened #835.
As part of writing up this issue I fixed it, but it's still really weird to me, and I'd like to understand why this happened.
Description of bug
As part of #811 I uploaded new datasets to our staging bucket, by downloading then uploading the flu/seasonal/h3n2/ha/2y JSONs and renaming them to seasonal-flu, so we could update our tests to use the latter. I initially uploaded them using `aws s3 cp` (so the data wasn't zipped), and then re-uploaded them using `nextstrain remote upload ...` (which magically zips the data). The strangeness is to do with the associated content-encoding metadata and/or the actual encoding of the file. The file is gzipped (as it's 1.1MB, and uncompressed it's 37MB), and the content encoding is gzip (I can see this in the AWS console). Using a small script to access this file (cf. the staging/zika dataset) via various APIs gives me weird results.
Test script
Run the server locally at the default port 5000 and then run the following script:

```js
import fetch from 'node-fetch';

async function main() {
  const paths = [
    "staging/seasonal-flu/h3n2/ha/2y",
    "staging/zika",
  ]

  /* Make a GET request to our REST API asking for JSON data */
  console.log("\nREST API")
  for (const path of paths) {
    const mediaType = "application/json"
    const url = `http://localhost:5000/${path}`
    console.log("\tfetch:", url, mediaType)
    const req = fetch(url, accept(mediaType));
    const res = await req;
    console.log("\t\tresponse content encoding:", res.headers.get('content-encoding'))
    /* print out the start of the response. Note that this is after
    decompression, if any, and I don't know how to truly know if decompression
    is happening. If content-encoding is gzip then does this mean the response
    has for sure been decompressed? */
    const text = await res.text();
    console.log("\t\tResponse:", text.slice(0,50));
  }

  /* now do the same but hitting the Charon endpoint */
  console.log("\nCharon API")
  for (const path of paths) {
    const url = `http://localhost:5000/charon/getDataset?prefix=${path}`
    console.log("\tfetch:", url)
    const req = fetch(url)
    const res = await req;
    console.log("\t\tresponse content encoding:", res.headers.get('content-encoding'))
    const text = await res.text();
    console.log("\t\tResponse:", text.slice(0,50));
  }

  /* for good measure hit the HTTPS S3 endpoints */
  console.log("\nS3 HTTPS ")
  for (const path of paths) {
    const key = path.replace(/^staging\//, "").replace(/\//g, '_') + ".json"
    const url = `https://nextstrain-staging.s3.amazonaws.com/${key}`;
    console.log("\tfetch:", url)
    const req = fetch(url)
    const res = await req;
    console.log("\t\tresponse content encoding:", res.headers.get('content-encoding'))
    const text = await res.text();
    console.log("\t\tResponse:", text.slice(0,50).replace(/\n/g, "\\n"));
  }
}

main()
```

When I run this script locally I get:
The weirdness (for me) is:

`requests.test.js` and this test passes on GitHub actions but not locally. This is what led me down this rabbit hole, and hinted at a caching issue. I have a mad hunch that the server's reported `content-encoding: gzip` for the REST API response is somehow corrupted, and the script's `fetch` call is not respecting it, but I don't understand the mechanics well enough.

The fix
The following patch, which creates a new (empty) cache for the server's `fetch` machinery, fixed it (response content encoding: gzip for all responses, and correct JSON contents):

Note: keeping the old cache but changing to `cache: "no-store"`, or `cache: "no-cache"`, or `cache: "reload"` did not fix the bug.

Questions
1. Why was the cache we are using not detecting that the cached data was now invalid? We set `cache: no-cache` on the request so the response must be validated with the origin server before each reuse, but it seems we didn't pick up that the contents/metadata had changed. (I mean, is this what happened? Clearly the caching used by `fetch` (`make-fetch-happen`) is at fault here somehow.)

2. Why was the content-encoding different on the charon (`null`) vs RESTful API (`gzip`) response, and why did this seemingly not affect decompression (or lack thereof) of the response?

P.S. Looks like macOS will remove `/tmp` files after 3 days, so there were a few different things that lined up in order for me to crash into this bug.