oxidecomputer / buildomat

a software build labour-saving device
Mozilla Public License 2.0

Downloading > 1 GiB files from buildomat fails some (most?) of the time #36

Open jgallagher opened 11 months ago

jgallagher commented 11 months ago

Attempting to curl this URL from atrium https://buildomat.eng.oxide.computer/wg/0/artefact/01H8QBDTBWH4XS1ZSXBP9N47V7/Telf7HF0eNoZK443nD4ykEv6gbv3tqvhvPvFZXEae2g7yxZp/01H8QBFKMQ9VYJDS1M45Q66ZAY/01H8QJ02TAB8APZPVY7262W2Z4/repo-pvt1.zip fails some of the time at almost exactly 1 GiB:

% curl -OL https://buildomat.eng.oxide.computer/wg/0/artefact/01H8QBDTBWH4XS1ZSXBP9N47V7/Telf7HF0eNoZK443nD4ykEv6gbv3tqvhvPvFZXEae2g7yxZp/01H8QBFKMQ9VYJDS1M45Q66ZAY/01H8QJ02TAB8APZPVY7262W2Z4/repo-pvt1.zip
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
 90 1146M   90 1034M    0     0  4079k      0  0:04:47  0:04:19  0:00:28 5366k
curl: (18) HTTP/2 stream 1 was reset

It doesn't always fail, particularly from the lab, but it seems to usually fail from outside the lab. From @jmpesp:

$ curl -OL https://buildomat.eng.oxide.computer/wg/0/artefact/01H8QBDTBWH4XS1ZSXBP9N47V7/Telf7HF0eNoZK443nD4ykEv6gbv3tqvhvPvFZXEae2g7yxZp/01H8QBFKMQ9VYJDS1M45Q66ZAY/01H8QJ02TAB8APZPVY7262W2Z4/repo-pvt1.zip
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
 89 1146M   89 1030M    0     0  1477k      0  0:13:14  0:11:54  0:01:20  796k
curl: (92) HTTP/2 stream 0 was not closed cleanly: INTERNAL_ERROR (err 2)

@augustuswm (on this exact URL) and @leftwo (on different buildomat URLs pointing to artifacts of similar size) also reported similar errors.

jordanhendricks commented 11 months ago

I believe I might've hit this on a job today?

https://buildomat.eng.oxide.computer/wg/0/details/01HAWHVPHGHVX0S9TBA8NVN03V/RweIODoZXuGlIiqGSUnsKUmnExB7SujD7GE8UF00QgpXrmwM/01HAWHX9J1PZK629EK4EXP4TSB#S2159

jgallagher commented 11 months ago

That might have a different underlying cause; dendrite-asic.tar.gz is much less than 1 GiB (I think??), and in the original issue every reproduction failed at just over 1 GiB. I agree the error message looks like the symptom is the same though, so maybe I'm overindexing on the 1 GiB thing.

jclulow commented 4 months ago

I have (finally) redone the downloading logic a bit here for two reasons:

We'll get log messages like this, now, if there's an interrupted connection:

06:20:05.188Z ERRO github-server: download failed: published file: owner oxidecomputer/omicron series rot-all version b4e1a285ef812bc0376959e177c7ab3f90893e73 name repo.zip.parta: interrupted on client side
    bytes_expected = 1023973427
    bytes_transferred = 11763402
    download = url
    hdr_x_forwarded_for = 66.117.152.2:12520
    local_addr = 0.0.0.0:4021
    method = GET
    msec = 2362
    offset = 49768397
    rate_mb = 4.748702113165932
    remote_addr = 172.31.43.126:46187
    req_id = 59608fab-62d2-446b-b5ec-cf2eab1845c3
    uri = /public/file/oxidecomputer/omicron/rot-all/b4e1a285ef812bc0376959e177c7ab3f90893e73/repo.zip.parta
06:20:05.191Z ERRO buildomat: download failed: published file: user 01FV089DQ9F11ETVWFXWW3GYAD series rot-all version b4e1a285ef812bc0376959e177c7ab3f90893e73 name repo.zip.parta: interrupted on client side
    bytes_expected = 1023973427
    bytes_transferred = 17287276
    download = s3
    hdr_x_forwarded_for = 44.227.183.26:21131
    local_addr = 0.0.0.0:9979
    method = GET
    msec = 2368
    offset = 49768397
    rate_mb = 6.961349156131005
    remote_addr = 172.31.43.126:50081
    req_id = 1bf03bcf-bf68-4bae-ba9e-a915709fba10
    uri = /0/public/file/gong-238580629/rot-all/b4e1a285ef812bc0376959e177c7ab3f90893e73/repo.zip.parta

I have tested interrupting and resuming downloads fairly extensively using curl -C - ... and I think it's all pretty solid. I also haven't been able to reproduce this issue, at least today while screwing around, so maybe the underlying cause is not as prominent anymore (touch wood!).

Either way, I think we need to start trying to do this routinely again now that the instrumentation is in place so that we can catch it again if it's still occurring.
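
For readers unfamiliar with what `curl -C -` does on the wire, here is a minimal Python sketch of the standard HTTP range-request exchange it relies on: the client asks for an open-ended byte range starting at the number of bytes it already has, and a server that honours the request answers `206 Partial Content` with a matching `Content-Range` header. The function names are hypothetical and are not part of buildomat or curl. The sketch also assumes the server reports the total size, so it does not handle the `bytes N-M/*` form.

```python
import re

# Matches the common "bytes first-last/total" form of Content-Range.
_CONTENT_RANGE = re.compile(r"^bytes (\d+)-(\d+)/(\d+)$")


def resume_request_headers(bytes_on_disk: int) -> dict:
    """Headers a client would send to continue a partial download."""
    if bytes_on_disk < 0:
        raise ValueError("offset must be non-negative")
    if bytes_on_disk == 0:
        return {}  # nothing saved yet: an ordinary GET is enough
    # Open-ended range: everything from this byte to the end of the file.
    return {"Range": f"bytes={bytes_on_disk}-"}


def check_partial_response(status: int, content_range: str, bytes_on_disk: int) -> int:
    """Validate a reply to a resume request; return the body length to expect."""
    if status != 206:
        # A 200 here would mean the server ignored the Range header and is
        # sending the whole file again from byte zero.
        raise ValueError(f"expected 206 Partial Content, got {status}")
    match = _CONTENT_RANGE.match(content_range)
    if match is None:
        raise ValueError(f"unsupported Content-Range: {content_range!r}")
    first, last, total = (int(g) for g in match.groups())
    if first != bytes_on_disk or last != total - 1:
        raise ValueError("server returned a different range than requested")
    return last - first + 1


if __name__ == "__main__":
    # The offset below is the one shown in the log excerpt above.
    print(resume_request_headers(49768397))
    # Hypothetical 1000-byte file resumed after 100 bytes.
    print(check_partial_response(206, "bytes 100-999/1000", 100))
```

Checking the status code matters because appending a full `200` response to a partial file would silently corrupt it.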

labbott commented 4 months ago

Running into issues downloading repo-all.zip from https://github.com/oxidecomputer/omicron/pull/5368/checks?check_run_id=23407750304, specifically https://buildomat.eng.oxide.computer/wg/0/artefact/01HTJJWV8XMMMB5VRKR4TFP1Z2/wjep8cQCErAvNV4tQjcFYPUWl6qf2W5fbmuWTfyXuVPqn1Uc/01HTJJXHBM2XYSKA0WY1SRYFFY/01HTJS79K68DJFBDBJ0MSMR4PJ/repo-rot-all.zip using Firefox.

jgallagher commented 4 months ago

This happened to me downloading from jeeves; unfortunately I was running curl -fsSL so all I got error-wise was:

curl: (92) HTTP/2 stream 0 was not closed cleanly: INTERNAL_ERROR (err 2)

Just wanted to note this here in case having a timestamp is useful (this error is from a few minutes before I posted this).

jclulow commented 4 months ago

@jgallagher What URL would you have been downloading at the time?

jclulow commented 4 months ago

Huh, this is exciting!

2024/04/15 16:45:20 [error] 20162#1: *71970046 upstream prematurely closed connection while reading upstream, client: 66.117.152.2, server: buildomat.eng.oxide.computer, request: "GET /public/file/oxidecomputer/omicron/rot-all/bb7b1eb4b63833e7a8614b6947b3bdef5336cfbd/repo.zip HTTP/2.0", upstream: "http://172.31.100.193:4021/public/file/oxidecomputer/omicron/rot-all/bb7b1eb4b63833e7a8614b6947b3bdef5336cfbd/repo.zip", host: "buildomat.eng.oxide.computer"
16:45:19.646Z ERRO github-server: download failed: published file: owner oxidecomputer/omicron series rot-all version bb7b1eb4b63833e7a8614b6947b3bdef5336cfbd name repo.zip: backend error: request or response body error: error reading a body from connection: end of file before message length reached
    bytes_expected = 1688983946
    bytes_transferred = 1084817164
    download = url
    hdr_x_forwarded_for = 66.117.152.2:61273
    local_addr = 0.0.0.0:4021
    method = GET
    msec = 245996
    offset = 0
    rate_mb = 4.205605238039445
    remote_addr = 172.31.43.126:36004
    req_id = ec1b9fd3-b25f-4e07-bb92-063af2084a49
    uri = /public/file/oxidecomputer/omicron/rot-all/bb7b1eb4b63833e7a8614b6947b3bdef5336cfbd/repo.zip
16:45:01.287Z ERRO buildomat: download failed: published file: user 01FV089DQ9F11ETVWFXWW3GYAD series rot-all version bb7b1eb4b63833e7a8614b6947b3bdef5336cfbd name repo.zip: interrupted on client side
    bytes_expected = 1688983946
    bytes_transferred = 1090458095
    download = s3
    hdr_x_forwarded_for = 44.227.183.26:43318
    local_addr = 0.0.0.0:9979
    method = GET
    msec = 227644
    offset = 0
    rate_mb = 4.568272464098826
    remote_addr = 172.31.43.126:57618
    req_id = 58e7f56e-38d7-4e08-a2da-29d8deb8fcfe
    uri = /0/public/file/gong-238580629/rot-all/bb7b1eb4b63833e7a8614b6947b3bdef5336cfbd/repo.zip

So the interruption appears to have been somewhere in here:

you -----> buildomat-web0 ----> buildomat0 ---------> buildomat-web0 -----> buildomat0 -------> s3
             nginx        buildomat-github-server        nginx            buildomat-server
                                               \~~~~~~~~~~~~~~~~~~~~~~~/
                                                     interruption

I have some things I can tweak, at least, to improve things here. Thanks for the heads up!
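
As a quick sanity check on the figures in the two log records above, the following Python sketch converts the logged byte counts into binary units and recomputes the transfer rate. It uses only numbers that appear in the logs. Reading `rate_mb` as MiB per second is an inference from the arithmetic, not something the logs state, and the `summarize` helper is hypothetical.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def summarize(bytes_expected: int, bytes_transferred: int, msec: int) -> dict:
    """Derive human-scale figures from one 'download failed' log record."""
    return {
        "transferred_mib": bytes_transferred / MIB,
        "transferred_gib": bytes_transferred / GIB,
        "fraction": bytes_transferred / bytes_expected,
        # Assumption: rate_mb in the log means MiB per second.
        "rate_mib_s": bytes_transferred / MIB / (msec / 1000),
    }


# Values copied from the github-server and buildomat records above.
front = summarize(1688983946, 1084817164, 245996)
back = summarize(1688983946, 1090458095, 227644)

for name, s in (("github-server", front), ("buildomat", back)):
    print(
        f"{name}: {s['transferred_mib']:.1f} MiB "
        f"({s['transferred_gib']:.3f} GiB, {s['fraction']:.1%} of the file) "
        f"at {s['rate_mib_s']:.3f} MiB/s"
    )
```

Under that reading, the recomputed rates agree with the logged `rate_mb` values to about three decimal places, and both records show the transfer ending a little past 1 GiB. The github-server figure of roughly 1034 MiB also lines up with the `1034M` that curl printed in the original report, assuming curl's `M` is 1024-based.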

labbott commented 4 months ago

Hit the same issue:

laura@jeeves /staff/lab/madrid/laura-update-2024-04-16 $ OMICRON_COMMIT="69ed7ad871969912e44d02620430ed2e3e7c2fdd" /staff/lab/madrid/download-tuf-repo.sh 
Grabbing tuf repo artifacts (omicron@69ed7ad871969912e44d02620430ed2e3e7c2fdd)
  Downloading TUF manifest...
  Downloading TUF repo (resumably)...
curl: (92) HTTP/2 stream 0 was not closed cleanly: INTERNAL_ERROR (err 2)

jclulow commented 4 months ago

Were you able to resume the download with range requests?

jgallagher commented 4 months ago

I was yesterday, yeah. Looks like Laura was using the same script I was:

  Downloading TUF repo (resumably)...

labbott commented 4 months ago

Yes, I ran the script again and it completed
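
The workaround described in the last few comments (re-run the download and let it pick up where it stopped) can be sketched as a retry loop. This is an illustration only, not the contents of download-tuf-repo.sh, which are not shown in this thread. `Interrupted`, `download_with_resume`, and `make_flaky_fetch` are hypothetical names, and the "server" is a fake that drops every connection after a fixed number of bytes so the example runs without a network.

```python
from typing import Callable, Iterable


class Interrupted(Exception):
    """Hypothetical stand-in for a transport error such as curl's (92)."""


def download_with_resume(
    fetch: Callable[[int], Iterable[bytes]],
    total_size: int,
    max_attempts: int = 5,
) -> bytes:
    """Call fetch(offset) repeatedly, resuming from the bytes already received."""
    received = bytearray()
    for _ in range(max_attempts):
        try:
            for chunk in fetch(len(received)):
                received.extend(chunk)
        except Interrupted:
            pass  # keep what arrived and retry from the new offset
        if len(received) >= total_size:
            return bytes(received)
    raise Interrupted(
        f"gave up after {max_attempts} attempts at byte {len(received)}"
    )


def make_flaky_fetch(payload: bytes, fail_after: int, chunk: int = 4):
    """Fake server: each request is cut off after `fail_after` bytes."""
    def fetch(offset: int):
        sent = 0
        while offset + sent < len(payload):
            if sent >= fail_after:
                raise Interrupted("stream reset")
            piece = payload[offset + sent: offset + sent + chunk]
            sent += len(piece)
            yield piece
    return fetch


if __name__ == "__main__":
    payload = bytes(range(50))
    fetch = make_flaky_fetch(payload, fail_after=16)
    result = download_with_resume(fetch, len(payload))
    print("complete:", result == payload)
```

Each attempt keeps the bytes that arrived before the interruption, so repeated failures still make forward progress as long as every connection delivers at least some data.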