oxidecomputer / omicron

Omicron: Oxide control plane
Mozilla Public License 2.0

wicketd requests to MGS fail with HTTP 400 and a 0-length response body #3103

Open jgallagher opened 1 year ago

jgallagher commented 1 year ago

Today, while trying to update 6 gimlets in the dogfood rack simultaneously, three of the updates failed partway through because wicketd received an HTTP 400 response with an empty body from MGS. There was no corresponding entry in the MGS logs, which makes it very likely that hyper itself sent the 400, as it does when it receives a non-HTTP request. https://github.com/hyperium/hyper/issues/3225 is a request for better after-the-fact debugging support from hyper. In the meantime, the next time we mupdate the dogfood rack, we should snoop the localhost traffic between wicketd and MGS in hopes of catching details on what's causing these 400s:

snoop -o mgs-traffic.snoop -x0 -d lo0 port 12225

We should (probably?) also add some kind of retries around at least some of the requests wicketd makes of MGS.
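
As a rough illustration of what such retries could look like, here is a minimal sketch using only the Rust standard library. It is not wicketd's actual code: the `MgsError` type, `is_retryable`, and `retry_with_backoff` are hypothetical names invented for this example, the real client is asynchronous rather than blocking, and the rule "retry only on a 400 with an empty body" is an assumption based on the symptom described above.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical, simplified view of a failed request to MGS.
#[derive(Debug, Clone, PartialEq)]
enum MgsError {
    /// The server answered with an error status and a (possibly empty) body.
    Response { status: u16, body: Vec<u8> },
    /// Anything else (connection refused, timeout, ...).
    Other(String),
}

/// Assumption: a 400 with an empty body did not come from an MGS request
/// handler, so the request itself may be fine and is worth retrying.
fn is_retryable(err: &MgsError) -> bool {
    matches!(err, MgsError::Response { status: 400, body } if body.is_empty())
}

/// Calls `op` at least once and at most `max_attempts` times, doubling the
/// delay after each retryable failure. Non-retryable errors, and the error
/// from the final attempt, are returned to the caller unchanged.
fn retry_with_backoff<T, F>(
    max_attempts: u32,
    initial_delay: Duration,
    mut op: F,
) -> Result<T, MgsError>
where
    F: FnMut() -> Result<T, MgsError>,
{
    let mut delay = initial_delay;
    let mut attempt = 1;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(err) if attempt < max_attempts && is_retryable(&err) => {
                sleep(delay);
                delay = delay.saturating_mul(2);
                attempt += 1;
            }
            Err(err) => return Err(err),
        }
    }
}
```

A wrapper like this would only be safe around requests that are idempotent, which is one reason to apply it to "at least some" requests rather than all of them.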

davepacheco commented 1 year ago

Could this have been caused by oxidecomputer/stlouis#454?

jgallagher commented 1 year ago

> Could this have been caused by oxidecomputer/stlouis#454?

It's certainly possible. This occurred when installinator was incorrectly sending extremely large progress reports to wicketd, and we never saw it again after we fixed that issue. I could believe that those reports pushed wicketd's heap up into the bad VA range and caused it to send a few malformed HTTP requests, and that fixing the large reports has kept us out of the bad VA range since then.