microbiomedata / nmdc-edge

Web-based interface to the NMDC EDGE platform
https://nmdc-edge.org
4 stars 0 forks source link

File uploads time out after 5 minutes (Jetstream2) #232

Closed eecavanna closed 4 weeks ago

eecavanna commented 3 months ago

History

In #218, @mflynn-lanl reported that users could not upload large files. The root cause was that Cloudflare — which, at the time, was configured in "Proxied" mode — was imposing its 500 MiB (Mebibyte) request size limit. Uploads of files smaller than 500 MiB would succeed, while uploads of files larger than 500 MiB would fail. To fix that, we reconfigured Cloudflare to be in "DNS Only" mode instead of "Proxied" mode. As a result, we were able to upload files larger than 500 MiB.

Problem

Although we can now upload files larger than 500 MiB; if the upload takes longer than 5 minutes, the upload fails and the client receives an HTTP 408 Request Timeout response.

image

image

Task 📋

Appendix

You can convert a number of Mebibytes into Bytes at: https://www.dr-lex.be/info-stuff/bytecalc.html

You can generate "dummy" files (of a given size in Bytes) by running this command:

$ head -c 523239424 /dev/urandom > my-fake-499-mebibyte.fastq.gz
$ head -c 524288000 /dev/urandom > my-fake-500-mebibyte.fastq.gz
$ head -c 525336576 /dev/urandom > my-fake-501-mebibyte.fastq.gz
eecavanna commented 3 months ago

At the time the upload fails (i.e. approximately 5 minutes after starting the upload), I see these lines in the caddy container's log:

exouser-caddy-1  | {"level":"error","ts":1720044809.4940464,"logger":"http.handlers.reverse_proxy","msg":"reading from backend","error":"read tcp 192.168.0.2:60702->192.168.0.4:5000: read: connection reset by peer"}
exouser-caddy-1  | {"level":"error","ts":1720044809.4941583,"logger":"http.handlers.reverse_proxy","msg":"aborting with incomplete response","upstream":"app:5000","duration":309.991915694,"request":{"remote_ip":"__REDACTED__","remote_port":"61441","client_ip":"__REDACTED__","proto":"HTTP/2.0","method":"POST","host":"edge-dev.microbiomedata.org","uri":"/auth-api/user/upload/add","headers":{"Priority":["u=1, i"],"Accept-Language":["en-US,en;q=0.9"],"X-Forwarded-Proto":["https"],"Cookie":[],"Accept":["application/json, text/plain, */*"],"Sec-Ch-Ua-Mobile":["?0"],"Content-Length":["1000000531"],"Cache-Control":["no-cache"],"Sec-Ch-Ua":["\"Brave\";v=\"125\", \"Chromium\";v=\"125\", \"Not.A/Brand\";v=\"24\""],"Referer":["https://edge-dev.microbiomedata.org/user/files"],"Sec-Fetch-Mode":["cors"],"Sec-Fetch-Dest":["empty"],"Sec-Gpc":["1"],"Sec-Fetch-Site":["same-origin"],"Pragma":["no-cache"],"X-Forwarded-Host":["edge-dev.microbiomedata.org"],"Authorization":[],"Accept-Encoding":["gzip, deflate, br, zstd"],"X-Forwarded-For":["__REDACTED__"],"User-Agent":["Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36"],"Origin":["https://edge-dev.microbiomedata.org"],"Content-Type":["multipart/form-data; boundary=----WebKitFormBoundaryzCs3N3kdNMcOfvdy"],"Sec-Ch-Ua-Platform":["\"macOS\""]},"tls":{"resumed":true,"version":772,"cipher_suite":4865,"proto":"h2","server_name":"edge-dev.microbiomedata.org"}},"error":"reading: read tcp 192.168.0.2:60702->192.168.0.4:5000: read: connection reset by peer"}
exouser-caddy-1  | {"level":"error","ts":1720044809.4942892,"logger":"http.log.access","msg":"handled request","request":{"remote_ip":"__REDACTED__","remote_port":"61441","client_ip":"__REDACTED__","proto":"HTTP/2.0","method":"POST","host":"edge-dev.microbiomedata.org","uri":"/auth-api/user/upload/add","headers":{"Content-Length":["1000000531"],"Authorization":[],"Referer":["https://edge-dev.microbiomedata.org/user/files"],"Accept-Language":["en-US,en;q=0.9"],"Sec-Fetch-Mode":["cors"],"Accept-Encoding":["gzip, deflate, br, zstd"],"Accept":["application/json, text/plain, */*"],"User-Agent":["Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36"],"Origin":["https://edge-dev.microbiomedata.org"],"Sec-Fetch-Dest":["empty"],"Cookie":[],"Sec-Fetch-Site":["same-origin"],"Pragma":["no-cache"],"Cache-Control":["no-cache"],"Sec-Ch-Ua":["\"Brave\";v=\"125\", \"Chromium\";v=\"125\", \"Not.A/Brand\";v=\"24\""],"Content-Type":["multipart/form-data; boundary=----WebKitFormBoundaryzCs3N3kdNMcOfvdy"],"Sec-Ch-Ua-Mobile":["?0"],"Sec-Ch-Ua-Platform":["\"macOS\""],"Sec-Gpc":["1"],"Priority":["u=1, i"]},"tls":{"resumed":true,"version":772,"cipher_suite":4865,"proto":"h2","server_name":"edge-dev.microbiomedata.org"}},"bytes_read":878797125,"user_id":"","duration":0,"size":0,"status":408,"resp_headers":{"Server":["Caddy"],"Alt-Svc":["h3=\":443\"; ma=2592000"]}}

Highlights (from those lines):

The timestamp (1720044809.4940464) translates to:


I also see this in the app container's log, but the timestamps seem to be 1 minute off from the ones above:

exouser-app-1    | 2024-07-03 22:14:29 debug:   /auth-api/user/upload/add: {}
exouser-app-1    | 2024-07-03 22:14:29 error:   [object Object]
eecavanna commented 3 months ago

I do see one timeout mentioned in the Caddy documentation, whose default duration is 5 minutes. It's the idle timeout. I don't know whether it affects us, since we're using Caddy in reverse-proxy mode; but here it is just in case.

eecavanna commented 3 months ago

I don't see any 5-minute (by default) timeouts in the documentation of the express-fileupload NPM package used by the React app, although this React app is using old versions of lots of packages, so the docs might not be accurate for the version of the package in use.

ssarrafan commented 2 months ago

@eecavanna @mflynn-lanl is this slated for this sprint? Can this be slotted for the next sprint?

eecavanna commented 2 months ago

I think @mflynn-lanl and I will discuss this at the squad meeting on Tuesday (I missed last week's meeting) and then update this ticket based on that discussion (with one potential update being to move it to next sprint).

eecavanna commented 4 weeks ago

Closing in favor of #289, which mentions this symptom and links to this ticket. I think resolving #289 will eliminate this symptom (and the underlying issue) also.