oxidecomputer / omicron

Omicron: Oxide control plane

Image upload sometimes stalls with HTTP/2 #3559

Open luqmana opened 1 year ago

luqmana commented 1 year ago

This was the image upload failing on Firefox/macOS bug that @david-crespo was running into.

I've looked into it some more, and from what I can tell, at some point the browser stalls while uploading some chunks. On the console side we split the file into 384 KiB chunks which we try to upload 6 at a time (to stay under browser concurrency limits). It doesn't happen every time, but every so often there will be a chunk or two where it seems like the browser made the request but never got a response (at least according to the browser's Web Developer Tools Network tab).
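For a concrete picture of that strategy (the real console code is TypeScript; the names and the per-chunk request here are made up for illustration), here's a rough Rust sketch of splitting an image into fixed-size chunks and keeping at most six uploads in flight at once:

```rust
use futures::stream::{self, StreamExt};

const CHUNK_SIZE: usize = 384 * 1024; // 384 KiB, the console's chunk size at the time
const CONCURRENCY: usize = 6;         // stay under browser concurrency limits

// Stand-in for the real per-chunk POST to the image upload endpoint.
async fn upload_chunk(offset: usize, chunk: &[u8]) -> Result<(), String> {
    println!("uploading {} bytes at offset {}", chunk.len(), offset);
    Ok(())
}

// Split the image into fixed-size chunks and upload them with bounded concurrency.
async fn upload_image(data: &[u8]) -> Result<(), String> {
    let results: Vec<Result<(), String>> = stream::iter(data.chunks(CHUNK_SIZE).enumerate())
        .map(|(i, chunk)| upload_chunk(i * CHUNK_SIZE, chunk))
        .buffer_unordered(CONCURRENCY) // at most 6 requests in flight
        .collect()
        .await;
    results.into_iter().collect::<Result<Vec<()>, String>>()?;
    Ok(())
}

#[tokio::main]
async fn main() {
    let fake_image = vec![0u8; 3 * 1024 * 1024]; // 3 MiB of zeroes, just for the demo
    upload_image(&fake_image).await.unwrap();
}
```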

I changed the console side to add a query parameter on each individual chunk upload, and for the stalled chunks I saw no mention of such a request in the Nexus logs. Next I tried a packet capture with Wireshark, and after setting up SSLKEYLOGFILE (because this only repros with https; we'll get back to that) I did see the browser making those requests. In fact, I could see it sending data until, at some point, it just stops, still with no response.

After not being able to repro without TLS and then seeing the packet cap, I realized we're using HTTP/2 for compatible clients. The browser is maintaining a single connection and using multiple HTTP/2 streams to make the different requests. Ok, so is something else blocking our image upload somehow? Cue some more reading about HTTP/2 and it definitely has a concept of flow control that each peer maintains separately.

Basically, each side maintains a connection-level flow control window as well as a per-stream window. Every DATA byte sent counts against both windows. If the sender has exhausted a window, it must not send any more until a WINDOW_UPDATE is received from the peer telling it there's more space.
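To make the window accounting concrete, here is a minimal model of the sender side (purely illustrative, not hyper's or h2's implementation; the 65,535-byte default comes from the spec, everything else is made up):

```rust
use std::collections::HashMap;

const DEFAULT_WINDOW: u32 = 65_535; // initial window size per RFC 7540

struct SendWindows {
    connection: u32,            // connection-level window
    streams: HashMap<u32, u32>, // stream id -> remaining per-stream window
}

impl SendWindows {
    fn new() -> Self {
        Self { connection: DEFAULT_WINDOW, streams: HashMap::new() }
    }

    /// How many DATA bytes we may send on `stream` right now.
    fn available(&self, stream: u32) -> u32 {
        let per_stream = *self.streams.get(&stream).unwrap_or(&DEFAULT_WINDOW);
        self.connection.min(per_stream)
    }

    /// Record that `n` DATA bytes were sent on `stream`; both windows shrink.
    fn consume(&mut self, stream: u32, n: u32) {
        assert!(n <= self.available(stream), "would violate flow control");
        self.connection -= n;
        *self.streams.entry(stream).or_insert(DEFAULT_WINDOW) -= n;
    }

    /// Apply a WINDOW_UPDATE from the peer; stream 0 means the connection window.
    fn window_update(&mut self, stream: u32, increment: u32) {
        if stream == 0 {
            self.connection += increment;
        } else {
            *self.streams.entry(stream).or_insert(DEFAULT_WINDOW) += increment;
        }
    }
}

fn main() {
    let mut win = SendWindows::new();
    // One stream sending a 384 KiB chunk exhausts the default 64 KiB windows
    // almost immediately, and the connection-level window then blocks every
    // other stream too, until WINDOW_UPDATE frames arrive.
    let n = win.available(1);
    win.consume(1, n);
    assert_eq!(win.available(3), 0);
    win.window_update(0, 65_535); // connection-level update
    win.window_update(1, 65_535); // stream-level update
    assert!(win.available(1) > 0);
}
```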

I need to look into it some more, but it seems like the browser might think it has exhausted the window while no window update ever comes from the Nexus side. I hacked in a tracing subscriber to see anything useful from hyper, and I do see mentions of the stalled streams, but I'm not familiar enough with hyper to decode them yet. (https://github.com/hyperium/hyper/issues/2899 seems relevant, maybe?)
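Something along these lines is enough to surface hyper's internal events (a sketch, not the exact hack; the filter targets and the tracing-subscriber `env-filter` feature are assumptions):

```rust
use tracing_subscriber::EnvFilter;

// Surface hyper's (and its h2 dependency's) internal trace events so stalled
// streams show up in the logs.
fn init_debug_tracing() {
    tracing_subscriber::fmt()
        .with_env_filter(EnvFilter::new("hyper=trace,h2=trace"))
        .init();
}
```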

luqmana commented 1 year ago

I've found that navigating around and doing other things on the console shortly before trying the upload makes it more likely to fail. HTTP/2 connection reuse might figure into that, so force-reloading the image upload page could help?

Anecdotally, Chrome and the cli haven't run into this, and the packet captures for those show Nexus sending WINDOW_UPDATE frames more regularly. So that can be noted as a workaround.

luqmana commented 1 year ago

I took a peek at how the cli does image uploads, compared it to what the console does, and noticed a difference in the chunk size each uses (cli, console).

As a quick test I changed the console to use the same chunk size:

```diff
index 879342f4..93210bc0 100644
--- a/app/forms/image-upload.tsx
+++ b/app/forms/image-upload.tsx
@@ -136,7 +136,8 @@ const ABORT_ERROR = new Error('Upload canceled')
  * line breaks can introduce up 4% more overhead, but that doesn't apply here
  * because we are only sending the encoded data itself.
  */
-const CHUNK_SIZE = Math.floor((512 * KiB * 3) / 4)
+const CHUNK_SIZE = 512 * KiB
```
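For reference on the numbers: the old value works out to (512 × 1024 × 3) / 4 = 393,216 bytes (384 KiB), presumably sized so that the base64-encoded chunk lands at 512 KiB after the 4/3 expansion, while the new value sends 524,288 raw bytes per chunk.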

With that, I now see regular WINDOW_UPDATE frames going to Firefox and I can't reproduce the bug anymore.

It's hard to say exactly why that works, but the HTTP/2 spec does not mandate any specifics in how a sender or receiver needs to manage these windows:

> HTTP/2 defines only the format and semantics of the WINDOW_UPDATE frame (Section 6.9). This document does not stipulate how a receiver decides when to send this frame or the value that it sends, nor does it specify how a sender chooses to send packets. Implementations are able to select any algorithm that suits their needs.[^1]

It could be that increasing CHUNK_SIZE is enough to hit hyper's threshold for sending WINDOW_UPDATEs.

Regardless, I think we want to consider adjusting the server-side initial window sizes instead of just relying on the defaults hyper selects. There's also a disabled-by-default "adaptive window size" feature that uses a bandwidth-delay product estimate to adjust the window size dynamically (https://github.com/hyperium/hyper/issues/1960).
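As a sketch of what that tuning could look like (assuming hyper 0.14 with the server, http2, and runtime features; how this would actually be threaded through dropshot/Nexus is left out):

```rust
use std::convert::Infallible;

use hyper::service::{make_service_fn, service_fn};
use hyper::{Body, Request, Response, Server};

const MIB: u32 = 1024 * 1024;

#[tokio::main]
async fn main() -> Result<(), hyper::Error> {
    // Trivial service that answers every request with "ok".
    let make_svc = make_service_fn(|_conn| async {
        Ok::<_, Infallible>(service_fn(|_req: Request<Body>| async {
            Ok::<_, Infallible>(Response::new(Body::from("ok")))
        }))
    });

    Server::bind(&([127, 0, 0, 1], 8080).into())
        // Raise the per-stream and connection windows well above the 64 KiB
        // default so a handful of 512 KiB chunk uploads don't stall waiting
        // on WINDOW_UPDATE frames.
        .http2_initial_stream_window_size(MIB)
        .http2_initial_connection_window_size(8 * MIB)
        // Or enable the adaptive (BDP-based) flow control instead, which
        // overrides the explicit sizes above:
        // .http2_adaptive_window(true)
        .serve(make_svc)
        .await
}
```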

[^1]: RFC 7540 § 5.2.1

augustuswm commented 8 months ago

We received a report of a user hitting this on Windows 11 + Chrome (latest versions as of this update) when uploading to the colo rack on release 5; the upload gets stuck at 100%. Specifically, they were attempting to upload this image: https://download.fedoraproject.org/pub/fedora/linux/releases/39/Cloud/x86_64/images/Fedora-Cloud-Base-39-1.5.x86_64.raw.xz

Going to check logs on the rack to see if I can find anything relevant.

jmpesp commented 8 months ago

Should we be attempting to prevent users from uploading compressed images? There's no way they can boot from a .raw.xz file, as nowhere in the stack do we decompress anything.

david-crespo commented 8 months ago

We can certainly check the extension client-side at the very least. Actually, it would have to be client-side because the API never sees the filename. We have an issue (https://github.com/oxidecomputer/console/issues/1846) to at least say something about this on the form, but I'll add a note about enforcement.

jmpesp commented 8 months ago

Maybe a warning instead of outright not allowing?

jmpesp commented 8 months ago

I just tried uploading Fedora-Cloud-Base-39-1.5.x86_64.raw.xz to a locally running Pantry (no control plane or anything, just some Python), and saw

```
16:18:28.009Z INFO crucible-pantry (dropshot): request completed
    error_message_external = Data length not block size multiple
    error_message_internal = Data length not block size multiple
    latency_us = 1386
    local_addr = [::]:12345
    method = POST
    remote_addr = [::1]:49244
    req_id = 4c866850-7e60-4d48-99f7-a632cefde750
    response_code = 400
    uri = //crucible/pantry/0/volume/914ad140-5f54-4480-891c-857a5be89864/bulk-write
```

This makes sense: that URL currently returns Content-Length: 446022784, which is not divisible by 512.
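(For the record: 446022784 = 512 × 871138 + 128, so the file runs 128 bytes past a 512-byte block boundary.)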

augustuswm commented 8 months ago

I just got an email back from them and confirmed that they were not extracting the image first. They also confirmed that they were actually getting an error during upload. I'll spin this off into a new ticket so that we can handle this error better.

[screenshot attached]

david-crespo commented 8 months ago

That's a relief. It likely means one of their chunk requests errored out and the built-in retries weren't enough. We could probably stand to pull the error message out of the API response and display it there. The fact that the progress still got up to 100% is weird, though. The chunk size is 512 KiB, so if one chunk failed we'd expect the amount uploaded to come up half a MiB short. It could be a mistake in our error handling: maybe the chunk failed, then a retry succeeded, and we still show it as an error.