nathanpeck / s3-upload-stream

A Node.js module for streaming data to Amazon S3 via the multipart upload API
MIT License

Uploading a part submits more content than s3-upload-stream believes it did #20

Closed. konklone closed this issue 10 years ago.

konklone commented 10 years ago

I'm attempting to upload a 20MB file, in the browser (tested on Chrome and Firefox), using the default values for part thresholds. I get this error back from Amazon on the first uploaded part:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<Error>
  <Code>BadDigest</Code>
  <Message>The Content-MD5 you specified did not match what we received.</Message>
  <ExpectedDigest>bYzjLLOiqBm0yMj3hTOffA==</ExpectedDigest>
  <CalculatedDigest>aECNRNzOibB0KEbQq20gnw==</CalculatedDigest>
  <RequestId>0C0A54C16B694C70</RequestId>
  <HostId>BU4UIzhjaRngEvI1KkgBbcX38PhxctUpPSNKJO480oWIUWtIOnj/96zjw8JX9/ceX/8jROMBlXo=</HostId>
</Error>
```

Capturing and inspecting the HTTP request object at AWS MD5 calculation time, the body is 6242880 bytes long, and the request header is correctly set to 6242880 bytes.

[screenshot: the captured request object at MD5 calculation time]

However, looking at the actual request that fired, in Chrome's network inspector, shows that the first part was POSTed with these headers:

```
Content-Length: 7282688
Content-MD5: bYzjLLOiqBm0yMj3hTOffA==
```

The MD5 that Amazon says it calculated, aECNRNzOibB0KEbQq20gnw==, is exactly what you get when MD5ing the first 7282688 bytes of the file (using the same MD5/base64 call to the crypto library that the AWS SDK uses).
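For reference, that check amounts to something like the following sketch, using Node's built-in crypto module the same way the AWS SDK does (the file name is a placeholder for the test file):

```js
// Sketch: recompute the digest Amazon reported by MD5/base64-ing the first
// 7282688 bytes of the file ('20mb-test.txt' is a placeholder name).
var fs = require('fs');
var crypto = require('crypto');

var firstBytes = fs.readFileSync('20mb-test.txt').slice(0, 7282688);
var digest = crypto.createHash('md5').update(firstBytes).digest('base64');
console.log(digest); // matches Amazon's CalculatedDigest, aECNRNzOibB0KEbQq20gnw==
```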

So it looks like the actual network request is posting 7282688 bytes of the file, with a correct Content-Length -- but a Content-MD5 corresponding to only 6242880 bytes of the file.

I'm reporting it here, rather than the AWS SDK, because the 6242880 number is what s3-upload-stream uses as its default partSizeThreshold. (Also, is that number a typo? The minimum part size s3-upload-stream enforces, 5242880, is exactly 1000000 bytes smaller.)

Any ideas?

For reference, this is the 20MB file, a text file with my handle repeated over and over.

nathanpeck commented 10 years ago

Interesting. I'll check into this as soon as I can.

konklone commented 10 years ago

Some more data points: This happens whether I set the partSizeThreshold to 5242880 or 6242880.

If I set it to 5242880, then I can upload a file of that size, but a file of even 1 byte more triggers the same problem. If I patch s3-upload-stream to remove the part threshold minimum, and set it to 200KiB, then a 200KiB file works fine, but a 200KiB + 1 byte file does not.

I also tried to capture the request at the moment before AWS calls xhr.send(), and the buffer is 5242880 bytes, as it should be, same as when the MD5 is calculated.

I can verify that the request body that is actually sent, as captured in the network inspector, matches the Content-Length that is actually sent (the too-large size), and that the extra data is not junk but just a larger-than-expected chunk of the original file. (Which would also seem to rule out encoding errors.)

It makes me think that there's some behavior I don't understand about passing files to streaming Xhr objects. I'm using filereader-stream, which selects sub-Blobs from File objects before instantiating a FileReader (which produces the underlying chunks of data that are then piped into s3-upload-stream).

I'm using a filereader chunk size of just under 1MiB. I get the same problem if I drop that down to just under 8KiB, or use 1MiB exactly. So the chunk size I choose on the input side seems unrelated.
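For context, the whole pipeline looks roughly like this (a sketch only, bundled with browserify; `fileInput`, the bucket name, and the exact option/event names are placeholders that should be checked against each library's README):

```js
// Rough sketch of the browser-side pipeline: File -> filereader-stream -> s3-upload-stream.
var AWS = require('aws-sdk');                  // credentials configured elsewhere
var fileReaderStream = require('filereader-stream');
var s3Stream = require('s3-upload-stream')(new AWS.S3());

var file = fileInput.files[0];                 // a File object from an <input type="file">
var read = fileReaderStream(file);             // emits chunks read via FileReader
var upload = s3Stream.upload({ Bucket: 'my-bucket', Key: file.name });

upload.on('error', function (err) { console.error('upload error', err); });
upload.on('uploaded', function (details) { console.log('upload complete', details); });

read.pipe(upload);                             // parts are accumulated and sent via the multipart API
```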

konklone commented 10 years ago

I was able to fix it!

The issue was that when slicing buffers around to produce the buffer that eventually gets passed on to AWS, the slice operation provides a "view" of the original buffer rather than a full copy. Apparently, if you pass a "view" of a Buffer into a streaming XHR in Firefox and Chrome, they will just greedily read the entire underlying buffer before sending it.

I verified this by changing these 2 lines from:

```js
// Return the original buffer.
return combinedBuffer.slice(0, partSizeThreshold);
```

To these:

```js
// Return a copy of the slice, not a view into the original buffer.
var bufCopy = new Buffer(partSizeThreshold);
var slice = combinedBuffer.slice(0, partSizeThreshold);
slice.copy(bufCopy);
return bufCopy;
```

Inefficient, but it successfully copies the buffer, drops the reference to the original, and Firefox and Chrome no longer read past the part boundary.
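The underlying behavior is easy to demonstrate in isolation: Buffer#slice returns a view over the same memory, not a copy. A quick sketch:

```js
// Sketch: Buffer#slice shares memory with the original buffer.
var combined = new Buffer('aaaaaaaaaa');   // stand-in for combinedBuffer
var view = combined.slice(0, 4);           // what the old code returned
var copy = new Buffer(4);
view.copy(copy);                           // what the patch above does instead

combined[0] = 0x62;                        // mutate the original (writes 'b')
console.log(view.toString());              // 'baaa' -- the view sees the change
console.log(copy.toString());              // 'aaaa' -- the copy does not
```

So anything that ends up reading from the slice's backing store, as the streaming XHR apparently does, can see bytes past the intended boundary.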

I understand you're not jazzed about the idea of supporting this library for browser use, but since it's so close to working, I really think it'd be a missed opportunity. I'm going to work up a pull request that solves the two issues I've spotted so far: setImmediate() and this buffer issue, and I hope you'll consider accepting it.

Victory screenshot:


konklone commented 10 years ago

So, interestingly/maddeningly, after forking the repo and patching it to produce #21, the bug has gone away, and I don't seem to need to re-apply the buffer-copying patch above.

I'm not sure why this is -- I tested it under a few conditions, including the original "paste all of setImmediate.js into s3-upload-stream.js" approach, and heavy uploads all worked, simply from having switched to pulling s3-upload-stream from npm rather than from a frozen file. It's possible I changed something else, but I don't think so.

I'm going to consider this closed until I see it pop up again, but -- I think if you merge #21, someone can use this library in the browser and it should work for multiple parts without a problem.

nathanpeck commented 10 years ago

Okay thanks for identifying this! I'll do some tests and if it works I'll merge it in.

konklone commented 10 years ago

> Okay thanks for identifying this! I'll do some tests and if it works I'll merge it in.

That'd be great! Because I just reproduced it, this time by trying to upload a larger file (200GB) with larger parts (~22MB). The browser actually uploaded more than was promised, even after calling xhr.send() with a Buffer sliced to what should have been the same block as what was MD5'd.

Applying the patch above fixes the issue.

nathanpeck commented 10 years ago

Can you test this locally using the same file?

```js
// Copy the first partSizeThreshold bytes straight into a fresh buffer,
// skipping the intermediate slice.
var uploadBuffer = new Buffer(partSizeThreshold);
combinedBuffer.copy(uploadBuffer, 0, 0, partSizeThreshold);
return uploadBuffer;
```

I think this should do the same thing as your fix as far as providing in-browser support, but be a bit more efficient, since it does one copy instead of a slice and a copy.
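For what it's worth, a quick local sanity check that the two versions produce the same bytes and that neither aliases the original (a sketch with stand-in sizes):

```js
// Sketch: both fixes should yield identical, independent copies of the part.
var combinedBuffer = new Buffer(10 * 1024 * 1024); // stand-in for the real part data
combinedBuffer.fill(7);
var partSizeThreshold = 6242880;

// slice (a view), then copy it out
var viaSlice = new Buffer(partSizeThreshold);
combinedBuffer.slice(0, partSizeThreshold).copy(viaSlice);

// copy the range directly, no intermediate slice
var viaCopy = new Buffer(partSizeThreshold);
combinedBuffer.copy(viaCopy, 0, 0, partSizeThreshold);

console.log(viaSlice.toString('hex') === viaCopy.toString('hex')); // true
combinedBuffer.fill(9);
console.log(viaSlice[0], viaCopy[0]); // still 7, 7 -- neither aliases the original
```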

I committed my proposed fix to the 1.0.5 branch as well if you just want to pull that and checkout for testing locally.

nathanpeck commented 10 years ago

Also, I wonder if this doesn't warrant opening an issue on feross/buffer, which I assume is what browserify is using? Perhaps there is a way to fix the underlying issue of the slice not being respected.

konklone commented 10 years ago

> I committed my proposed fix to the 1.0.5 branch as well if you just want to pull that and checkout for testing locally.

Just did, and it works perfectly. Thanks for this!

> Also, I wonder if this doesn't warrant opening an issue on feross/buffer, which I assume is what browserify is using? Perhaps there is a way to fix the underlying issue of the slice not being respected.

Good call, I did in https://github.com/feross/buffer/issues/46 -- though I think he's already detected and addressed this here: https://github.com/feross/buffer/commit/500be9ea3fd820f36e7db6f00bacaf2d86389e0d.

nathanpeck commented 10 years ago

Okay I'll publish this as version 1.0.5 on NPM then.