Open jgallagher opened 1 year ago
Could this have been caused by oxidecomputer/stlouis#454?
It's certainly possible. This occurred when installinator was incorrectly sending extremely large progress reports to wicketd, and we never saw it again after we fixed that issue. I could believe that those reports pushed wicketd's heap up into the bad VA range and caused it to send a few malformed HTTP requests, and that fixing the large reports has kept us out of the bad VA range since then.
Today when trying to update 6 gimlets in the dogfood rack simultaneously, three of the updates failed partway through because wicketd got an HTTP 400 response with no body from MGS. There was no corresponding entry in the MGS logs, which makes it very likely that hyper itself was sending the 400, which it does if it receives a non-HTTP request. https://github.com/hyperium/hyper/issues/3225 is a request for better after-the-fact debugging support from hyper, but in the meantime, the next time we try to mupdate the dogfood rack we should snoop the localhost traffic between wicketd and MGS in hopes of catching details on what's causing these 400s.

We should (probably?) also add some kind of retries around at least some of the requests wicketd makes of MGS.
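
For the retry idea, here is a rough sketch of what a wrapper could look like. It is not based on wicketd's actual client code: `RequestError`, `is_retryable`, and `retry_with_backoff` are hypothetical names, the error type is a stand-in for whatever the real MGS client returns, and treating a 400 as retryable is an assumption that only makes sense because these particular 400s appear to be spurious. It would presumably only be safe around requests that are idempotent.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical error type standing in for whatever the real MGS client returns.
#[derive(Debug, Clone, PartialEq)]
pub enum RequestError {
    /// The server replied with a failing HTTP status code.
    Status(u16),
    /// The request failed below the HTTP layer (connection reset, etc.).
    Transport(String),
}

/// Assumption: the spurious 400s, 5xx responses, and transport failures are
/// worth retrying; any other status is treated as a real answer.
pub fn is_retryable(err: &RequestError) -> bool {
    match err {
        RequestError::Status(code) => *code == 400 || *code >= 500,
        RequestError::Transport(_) => true,
    }
}

/// Calls `op` until it succeeds, returns a non-retryable error, or has been
/// attempted `max_attempts` times (it is always attempted at least once).
/// The delay between attempts starts at `initial_delay` and doubles each time.
/// `op` receives the 1-based attempt number, which is handy for logging.
pub fn retry_with_backoff<T, F>(
    max_attempts: u32,
    initial_delay: Duration,
    mut op: F,
) -> Result<T, RequestError>
where
    F: FnMut(u32) -> Result<T, RequestError>,
{
    let mut delay = initial_delay;
    let mut attempt = 1;
    loop {
        match op(attempt) {
            Ok(value) => return Ok(value),
            Err(err) if attempt < max_attempts && is_retryable(&err) => {
                sleep(delay);
                delay = delay.saturating_mul(2);
                attempt += 1;
            }
            // Either out of attempts or not worth retrying: give up.
            Err(err) => return Err(err),
        }
    }
}
```

This version is synchronous to keep it self-contained; the real thing would need an async sleep instead of `std::thread::sleep`, and we would want to log each failed attempt so the retries don't hide the underlying problem while we are still trying to debug it.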