jchristn / Less3

Less3 is an S3-compatible object storage server that runs on your laptop, servers, just about anywhere!
MIT License · 54 stars · 9 forks

Chunked transfer issue #3

Closed · iain-cyborn closed this issue 2 years ago

iain-cyborn commented 2 years ago

I think this is really a WatsonWebserver issue but it manifests in Less3.

The AWS S3 SDK PutObjectRequest class has a boolean UseChunkEncoding attribute, and by default this is set to true. This influences the request that the S3 SDK makes to the S3 API, and the S3ServerLibrary (another dependency of Less3) correctly reads that and sets the Chunked attribute of the S3Request class: https://github.com/jchristn/S3Server/blob/b449b11b9159fa0f8183b8c360c995eeba41dc4f/S3Server/S3Request.cs#L517-L524

```csharp
if (HeaderExists("x-amz-content-sha256", false))
{
    string sha256content = RetrieveHeaderValue("x-amz-content-sha256");
    if (!String.IsNullOrEmpty(sha256content))
    {
        if (sha256content.Contains("STREAMING")) Chunked = true;
    }
}
```

Then the Less3 code will act on that attribute of the S3Request class to determine if ReadChunk should be called: https://github.com/jchristn/Less3/blob/bedb1dbb89b52e81052f19a592d8f68d5e5e58c4/Less3/Api/S3/ObjectHandler.cs#L868-L872

And that ReadChunk function is implemented in WatsonWebserver, where the first line of the function is:

```csharp
if (!ChunkedTransfer) throw new IOException("Request is not chunk transfer-encoded.");
```

Where that ChunkedTransfer attribute is set from the request headers: https://github.com/jchristn/WatsonWebserver/blob/b3833c2edc7bd27d4c686e49efa2b0b412905e83/WatsonWebserver/HttpRequest.cs#L220-L223

```csharp
if (curr.Key.ToLower().Equals("transfer-encoding"))
{
    if (curr.Value.ToLower().Contains("chunked")) ChunkedTransfer = true;
}
```

But the AWS S3 SDK does not seem to set the Transfer-Encoding header despite actually chunking the data, which resulted in my first attempt to use the AWS S3 SDK hitting that "Request is not chunk transfer-encoded" exception. So there's a conflict between the attribute that determines whether Less3 calls ReadChunk and the attribute that WatsonWebserver expects to be set when ReadChunk is called.
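To make the conflict concrete, here is a minimal Python sketch (illustrative only, not the actual C# code) applying both detection rules to the headers a default AWS SDK chunked PUT sends:

```python
# Headers from a default (UseChunkEncoding = true) AWS SDK PUT;
# note there is no Transfer-Encoding header at all.
headers = {
    "x-amz-content-sha256": "STREAMING-AWS4-HMAC-SHA256-PAYLOAD",
    "x-amz-decoded-content-length": "39",
    "content-type": "application/octet-stream",
}

# S3Server's rule: chunked if the Amazon-specific header mentions STREAMING.
s3server_chunked = "STREAMING" in headers.get("x-amz-content-sha256", "")

# WatsonWebserver's rule: chunked only if Transfer-Encoding says so.
watson_chunked = "chunked" in headers.get("transfer-encoding", "").lower()

# The two rules disagree, so Less3 calls ReadChunk and ReadChunk throws.
assert s3server_chunked and not watson_chunked
```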

Yes, there is a workaround of explicitly setting UseChunkEncoding to false in the PutObjectRequest class, but my preference would be to use the default behaviour of the AWS S3 SDK rather than having to set UseChunkEncoding to false explicitly.

jchristn commented 2 years ago

Hi @iain-cyborn thanks for your note. And sorry for the delay in getting back to you, for some reason I haven't received any notifications on issues filed against a small handful of repositories.

Thanks for letting me know about this issue. This is kind of tricky, because if I amend the platform, it would only ever allow one of either chunked or non-chunked requests. The solution I have in mind is a boolean in settings that forces chunked transfer-encoding; I'd have to make the setter for that property public in Watson Webserver. Another challenge is that most of the other client implementations (Cyberduck, S3 Browser) either 1) don't use chunked transfer-encoding, or 2) do set this header. Forcing the behavior in the platform creates something of a catch-22.

Are you aware of any other means by which to tell if the SDK is chunking?

iain-cyborn commented 2 years ago

Hi Joel, thanks for responding.

I guess you could say it's Amazon's fault for only using their own headers. This issue on their repo (which they closed) may be related: https://github.com/aws/aws-sdk-net/issues/678

Anyway, I put a breakpoint in the PreRequestHandler function in Less3 and noted the values in the array ctx.Http.Request.Headers.

First, with no explicit setting of PutObjectRequest.UseChunkEncoding (so, according to the docs, it defaults to true), I made the AWS S3 SDK PutObjectAsync call, and at the breakpoint these were the header values:

```
amz-sdk-invocation-id: 6167b961-1023-4cf1-b9a9-6d60e260d9c1
amz-sdk-request: attempt=1; max=5
X-Amz-Date: 20220322T102009Z
X-Amz-Decoded-Content-Length: 39
X-Amz-Content-SHA256: STREAMING-AWS4-HMAC-SHA256-PAYLOAD
Content-Length: 212
Content-Type: application/octet-stream
Authorization: AWS4-HMAC-SHA256 Credential=IBV3XGQNNPA7F5WCPW7V/20220322/us-east-1/s3/aws4_request, SignedHeaders=content-length;content-type;host;user-agent;x-amz-content-sha256;x-amz-date;x-amz-decoded-content-length, Signature=
Expect: 100-continue
Host: localhost:8000
User-Agent: aws-sdk-dotnet-coreclr/3.7.8.4 aws-sdk-dotnet-core/3.7.8.6 .NET_Core/6.0.3 OS/Microsoft_Windows_10.0.19044 ClientAsync
```

Then I set PutObjectRequest.UseChunkEncoding to false and repeated the PutObjectAsync call, and these were the header values:

```
amz-sdk-invocation-id: 334fd92c-c34f-4a0e-8ed6-6583ab148f78
amz-sdk-request: attempt=1; max=5
X-Amz-Date: 20220322T103606Z
X-Amz-Content-SHA256: afe400efee7b51b8f1a6621b12837a7bd9bb33061b8dd0024a6ab5309f94623f
Content-Length: 39
Content-Type: application/octet-stream
Authorization: AWS4-HMAC-SHA256 Credential=IBV3XGQNNPA7F5WCPW7V/20220322/us-east-1/s3/aws4_request, SignedHeaders=content-length;content-type;host;user-agent;x-amz-content-sha256;x-amz-date, Signature=
Expect: 100-continue
Host: localhost:8000
User-Agent: aws-sdk-dotnet-coreclr/3.7.8.4 aws-sdk-dotnet-core/3.7.8.6 .NET_Core/6.0.3 OS/Microsoft_Windows_10.0.19044 ClientAsync
```

So, yes, a tricky one. I assume WatsonWebServer is meant for generic operation, but Amazon seems intent on not using standard headers. I did check the chunked content generated by the AWS S3 SDK, and it would be correctly interpreted by the WatsonWebServer ReadChunk function, so at least they stick to the standard way sometimes.

I don't think there is a generic solution; you'd need some mechanism by which Watson was "aware" it was being used by Less3, for special-case handling. Maybe Less3 could provide it with a delegate function, null when anything else is using it, that would override the transfer-encoding header check.
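The delegate idea might look something like the following Python sketch (purely illustrative; names such as `chunked_override` and `aws_aware` are hypothetical, not anything in Watson's API):

```python
# Hypothetical sketch: the web server consults an optional override
# callback before falling back to the standard Transfer-Encoding check.
class HttpRequest:
    def __init__(self, headers, chunked_override=None):
        self.headers = {k.lower(): v for k, v in headers.items()}
        # None for generic consumers; Less3 would supply a delegate.
        self.chunked_override = chunked_override

    @property
    def chunked_transfer(self):
        if self.chunked_override is not None:
            return self.chunked_override(self.headers)
        return "chunked" in self.headers.get("transfer-encoding", "").lower()

# Less3's delegate would also recognise Amazon's streaming header.
def aws_aware(headers):
    return ("chunked" in headers.get("transfer-encoding", "").lower()
            or "STREAMING" in headers.get("x-amz-content-sha256", ""))

req = HttpRequest({"X-Amz-Content-SHA256": "STREAMING-AWS4-HMAC-SHA256-PAYLOAD"},
                  chunked_override=aws_aware)
assert req.chunked_transfer  # True via the delegate, despite no Transfer-Encoding
```

Generic consumers that never pass a delegate would see no behaviour change, which is the point of the idea.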

jchristn commented 2 years ago

As much as I hate to respond this way, I'm compelled to feel that the appropriate solution is to add a caveat to the documentation indicating that the AWS SDK usage should be amended. The primary reason is that the only embodiment I can see is a global setting that forces the assumption that the client is using chunked encoding, which has the drawback of potentially alienating a material portion of the ecosystem (3rd-party apps like Cyberduck and S3 Browser).

I know it's not ideal, but that's where my head is. Thoughts?

iain-cyborn commented 2 years ago

It's certainly an option, and a minor change anywhere that creates an AWS S3 SDK PutObjectRequest object.

One thought I did have that I don't think would impact 3rd-party apps (although there may be implications I'm missing) is that the WatsonWebServer function could become:

```csharp
public async Task<Chunk> ReadChunk(CancellationToken token = default, bool AssumeChunked = false)
{
    if (!ChunkedTransfer && !AssumeChunked) throw new IOException("Request is not chunk transfer-encoded.");
    // ...
}
```

Then it seems to me that nothing that already calls ReadChunk, even if passing a CancellationToken, would need to change. However, the Less3 use could become:

```csharp
if (ctx.Request.Chunked)
{
    while (true)
    {
        Chunk chunk = await ctx.Request.ReadChunk(AssumeChunked: ctx.Request.Chunked);
        // ...
    }
}
```

Since we do know that ctx.Request.Chunked is being correctly set from the AWS headers in the S3Server code. Actually, since the call is already inside an if clause checking that ctx.Request.Chunked is true, ctx.Request.ReadChunk(AssumeChunked: true) would be fine.

I'm operating with zero knowledge of 3rd-party interaction with Watson, but as I see it, only Less3 would need to provide the named parameter to ReadChunk. Although this might also be affected by C# version, because optional parameters have changed over the language's history, so retaining compatibility with much older versions of the language could be a constraint.

...a short while later: Ah, the ReadChunk call also goes through the S3Server code in the S3Request object, which might complicate that idea, because the optional parameter would also need to be added there, which could impact a whole lot of other things. If it can't be done then it can't be done. I have no real objection to you closing the issue if you feel that setting PutObjectRequest.UseChunkEncoding to false is a prerequisite to using the AWS S3 SDK with Less3.

iain-cyborn commented 2 years ago

Another thought that occurred to me is that a variant of ReadChunk() could be added to WatsonWebServer, say ReadLess3Chunk(), that only S3Server and Less3 would use. Although it's a bit of a kludge to duplicate a perfectly serviceable function for one special case, given the impact that has on future maintainability, when it seems only one person has run into the issue of the AWS S3 SDK using Amazon's own chunking header.

jchristn commented 2 years ago

Hi @iain-cyborn the more I think about it, the more I realize that A) HttpRequest.ReadChunk is a method that a developer's code will invoke, and it's their responsibility to either 1) check for chunked transfer or 2) use it with the appropriate understanding, and B) many properties in HttpRequest should probably be able to be modified by the developer to support their use cases.

I'll make the change to Watson and then get Less3 updated today. These guardrails were primarily to maintain safety, but, your example and some I've seen from others make it pretty clear that the guardrails are too restrictive. Will report back soon!

jchristn commented 2 years ago

Making this change in S3Server.S3Request

```csharp
if (HeaderExists("x-amz-content-sha256", false))
{
    ContentSha256 = RetrieveHeaderValue("x-amz-content-sha256");
    if (!String.IsNullOrEmpty(ContentSha256))
    {
        if (ContentSha256.ToLower().Contains("streaming"))
        {
            Chunked = true;
            _HttpRequest.ChunkedTransfer = true;
        }
    }
}
```

jchristn commented 2 years ago

Ok - made a bunch of changes:

Could you let me know if it behaves as expected now without the modification to the PutObjectRequest code? Cheers!

iain-cyborn commented 2 years ago

Mixed results. On the plus side, the PutObject call apparently works without having to set UseChunkEncoding to false in PutObjectRequest. On the negative side, the file that is stored on disk is zero length. If I revert to setting UseChunkEncoding to false, then the file stored on disk has content. So that would point to ReadChunk being a potential issue.

I put a breakpoint on `Chunk chunk = await ctx.Request.ReadChunk();` in ObjectHandler.Write, and after a single call the chunk attributes look like:

```
chunk.Data = null
chunk.IsFinalChunk = true
chunk.Length = 0
```

However, if I look in ctx.Request.DataAsString before running ReadChunk, it looks like this:

```
27;chunk-signature=c47420eedec1c2370bb5a16b80d282244790d3ae3369fb1a401c5cb49412f619\r\nCloudlet test data: created at 09:59:50\r\n0;chunk-signature=b338f2fd4775cd936fa244a1ba27470a2c58581c9882b471620ed802889a176e\r\n\r\n
```

So I would have expected the first ReadChunk call to return a result with chunk.Length = 27 and then a second ReadChunk call to return the zero length final chunk but that doesn't seem to be what is happening.

jchristn commented 2 years ago

Thanks for testing this @iain-cyborn - I will look at this today. Definitely problematic :(

jchristn commented 2 years ago

Problem appears to be in the S3Server library. WatsonWebserver is working fine. Working on it.

jchristn commented 2 years ago

Ok I think I have it working, could you try the latest version?

Commit: https://github.com/jchristn/Less3/commit/a636f02ac6f36d656d7d78a64aabc158cdd6f5cf

Thanks for your help @iain-cyborn

iain-cyborn commented 2 years ago

Better but still not there.

First PutObject apparently worked, but a second failed. So I dug deeper and noticed that the first put hadn't completely worked either. The data sent would have been like "Cloudlet test data: created at 09:59:50" (39 bytes long), but what ended up in the disk file was just "Cloudlet test data: created" (27 bytes long). So it looks like somewhere is reading the chunk length hex value 27 as a decimal integer instead of properly converting it to the decimal value 39. I put in a breakpoint to view ctx.Request.DataAsString before the stream got used on the second PutObject (the one that was completely failing), and it started with "2A;chunk-signature=", so that's definitely going to cause problems if treated as decimal.
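The discrepancy is easy to reproduce; this Python sketch (purely illustrative) mirrors the two possible readings of the chunk-length token:

```python
# "27" read as decimal gives 27, but read as hex it is 39 -- exactly
# the truncation observed in the file on disk.
assert int("27") == 27        # the buggy decimal reading
assert int("27", 16) == 39    # the correct hexadecimal reading

# "2A" is not valid decimal at all, so the second PutObject fails outright.
assert int("2A", 16) == 42
try:
    int("2A")                 # decimal parse raises
except ValueError:
    pass
```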

Now, Watson does have a `chunk.Length = int.Parse(lenStr, NumberStyles.HexNumber);` line in ReadChunk, but I don't think that's what's getting called; rather it's the `chunk.Length = Convert.ToInt32(lenStr);` line, used after the read string has been split around ";" into length and metadata.

I have no idea if there's any kind of defined standard to indicate chunk length in hexadecimal or decimal.

jchristn commented 2 years ago

Interesting. The standard is decimal: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Transfer-Encoding

It looks like the only way I'll be able to solve this is to put awareness of that specific header in Watson itself, and have the chunk reader be aware of the required conversion.

Thanks for all of your help in this, I'll report back soon. Cheers

jchristn commented 2 years ago

And I found this, appears they aren't using the standard encoding for chunk metadata :(

[image: chunk format example from the AWS SigV4 streaming documentation]

Taken from: https://docs.aws.amazon.com/AmazonS3/latest/API/sigv4-streaming.html

iain-cyborn commented 2 years ago

Amazon being non-standard again, seems to be a recurring theme.

jchristn commented 2 years ago

Actually I stand corrected. It is always a hex-encoded value. https://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.6.1

Total facepalm moment. I am fixing this now. So thankful you pointed this out.
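With the hex reading confirmed, the body captured earlier parses cleanly. Here is a Python sketch of the framing (illustrative only; the real logic lives in WatsonWebserver's ReadChunk):

```python
def parse_aws_chunks(body: str):
    """Parse AWS SigV4 streaming framing: hex-length;metadata CRLF data CRLF."""
    chunks, pos = [], 0
    while pos < len(body):
        header_end = body.index("\r\n", pos)
        meta = body[pos:header_end]
        length = int(meta.split(";")[0], 16)   # chunk size is hexadecimal
        data_start = header_end + 2
        chunks.append(body[data_start:data_start + length])
        pos = data_start + length + 2          # skip the trailing CRLF
        if length == 0:                        # zero-length final chunk
            break
    return chunks

body = ("27;chunk-signature=c47420eedec1c2370bb5a16b80d282244790d3ae3369fb1a401c5cb49412f619\r\n"
        "Cloudlet test data: created at 09:59:50\r\n"
        "0;chunk-signature=b338f2fd4775cd936fa244a1ba27470a2c58581c9882b471620ed802889a176e\r\n\r\n")
chunks = parse_aws_chunks(body)
assert len(chunks[0]) == 39 and chunks[-1] == ""   # 0x27 == 39 bytes
```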

jchristn commented 2 years ago

Update to Watson Webserver (v4.2.2.10, now aware of the Amazon streaming header directly)
- NuGet: https://www.nuget.org/packages/Watson/4.2.2.10
- Commit: https://github.com/jchristn/WatsonWebserver/commit/4c6f888dabaedfb8224d83c24ed7af40ed359fda

Update to S3Server (v4.0.1.12, Watson dependency update)
- NuGet: https://www.nuget.org/packages/S3Server/4.0.1.12
- Commit: https://github.com/jchristn/S3Server/commit/8f652d5e41bc7283b8002f3474eac51ef76efa66

Update to Less3 (v2.0.1.5, S3Server/Watson dependency update)
- NuGet: https://www.nuget.org/packages/Less3/2.0.1.5
- Commit: https://github.com/jchristn/Less3/commit/b26232dce896b47b60c9e15c82cfa62da986e4df

Please let me know if this fixes the issue!

jchristn commented 2 years ago

Actually, wait, one more issue. Sorry

jchristn commented 2 years ago

Watson Webserver v4.2.2.11
- NuGet: https://www.nuget.org/packages/Watson/4.2.2.11
- Commit: https://github.com/jchristn/WatsonWebserver/commit/f3fa12b2de6dd2d246192b5b0e70b42b427ed24e

S3Server v4.0.1.13
- NuGet: https://www.nuget.org/packages/S3Server/4.0.1.13
- Commit: https://github.com/jchristn/S3Server/commit/7d48d395b8c6bd534fba95c2930efe92f0ec7bca

Less3 v2.0.1.6
- NuGet: https://www.nuget.org/packages/Less3/2.0.1.6
- Commit: https://github.com/jchristn/Less3/commit/ed661da59d8bf6c309f73cf5fadc42526bd40694

Hopefully this will do it!

iain-cyborn commented 2 years ago

Hi Joel,

Yes, all looking good. I used my AWS S3 SDK client test code pointing at Less3 to put and get up to 250kB worth of data, and it all worked without having to set UseChunkEncoding to false, and all the files in the Objects folder on disk were of the expected size.

Thanks for your efforts.

I am happy for the issue to be closed.

jchristn commented 2 years ago

Thank you @iain-cyborn - I added you to the readme as a contributor in a few places. Really appreciate you taking the time to find, report, and help debug the issue. Have a great day!