s3gw-tech / s3gw

Container able to run on a Kubernetes cluster, providing S3-compatible endpoints to applications.
https://s3gw.tech
Apache License 2.0

Unicode object metadata #688

Closed irq0 closed 10 months ago

irq0 commented 1 year ago

S3 Test: test_object_set_get_unicode_metadata

        response = client.get_object(Bucket=bucket_name, Key='foo')
        got = response['Metadata']['meta1']
        print(got)
        print(u"Hello World\xe9")
>       assert got == u"Hello World\xe9"
E       AssertionError: assert 'Hello WorldÃ©' == 'Hello Worldé'
E         - Hello Worldé
E         ?            ^
E         + Hello WorldÃ©
E         ?            ^^

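For reference, the test injects the raw header via a before-call hook and reads it back, roughly like this (a reconstruction, not the verbatim s3-tests code; bucket name and endpoint are placeholders):

import boto3

bucket_name = 'sfstest-example'  # placeholder

# Inject a raw unicode value into the x-amz-meta-meta1 header via a
# before-call event hook, then read it back through get_object.
def set_unicode_metadata(**kwargs):
    kwargs['params']['headers']['x-amz-meta-meta1'] = u"Hello World\xe9"

client = boto3.client('s3', endpoint_url='http://localhost:7480')  # s3gw endpoint, assumed
client.meta.events.register('before-call.s3.PutObject', set_unicode_metadata)
client.put_object(Bucket=bucket_name, Key='foo', Body='bar')

response = client.get_object(Bucket=bucket_name, Key='foo')
got = response['Metadata']['meta1']
assert got == u"Hello World\xe9"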
We store metadata as part of the attrs:

2023-08-24T13:14:24.293+0000 7f6e19faa6c0 10 req 0 0.003333419s s3:put_obj > atomic_writer::complete accounted_size: 3, etag: 37b51d194a7513e45b56f6524f2d51f2, set_mtime: 1970-01-01T00:00:00.000000000Z, attrs: user.rgw.acl, user.rgw.etag, user.rgw.x-amz-content-sha256, user.rgw.x-amz-date, user.rgw.x-amz-meta-meta1, delete_at: 1970-01-01T00:00:00.000000000Z, if_match: NA, if_nomatch: NA

This might be an RGW issue. The test notes that RGW/RADOS fails this.

l-mb commented 11 months ago

Broken unicode processing is bound to lead to problems in the field; please consider reprioritizing this for the next release. @jecluis @vmoutoussamy

irq0 commented 11 months ago

This might also be a boto3 bug/feature; the following links suggest as much.

https://github.com/ceph/s3-tests/issues/316 https://github.com/boto/boto3/issues/478 https://github.com/boto/botocore/pull/861

The docs on whether object metadata supports unicode are a bit fuzzy: https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingMetadata.html. RGW's take on that: https://tracker.ceph.com/issues/908

irq0 commented 10 months ago

Next step: validate whether this is on our side. What does the HTTP payload look like? What do we store in the attrs map?

irq0 commented 10 months ago

Here is what is going over the wire (collected with a mitmproxy between s3 tests and s3gw).

Request:

2023-11-13 18:31:17 PUT http://localhost:7481/sfstest-7cznzosspj982n7zlu611-1/foo
                        ← 200 OK [no content] 19ms
                 Request                                 Response                                  Detail
Host:                   localhost:7480
Accept-Encoding:        identity
User-Agent:             Boto3/1.24.96 Python/3.11.5 Linux/6.5.6-1-default Botocore/1.27.96
Content-MD5:            N7UdGUp1E+RbVvZSTy1R8g==
x-amz-meta-meta1:       Hello World\xc3\xa9
X-Amz-Date:             20231113T173117Z
X-Amz-Content-SHA256:   fcde2b2edba56bf408601fb721fe9b5c338d10ee429ea04fae5511b68fbf8fb9
Authorization:          AWS4-HMAC-SHA256 Credential=test/20231113//s3/aws4_request,
                        SignedHeaders=content-md5;host;x-amz-content-sha256;x-amz-date;x-amz-meta-meta1,
                        Signature=938849e3f95269998e8ad3514fa6a3605f7ad05ea6799c5876e38ecee6236c2d
amz-sdk-invocation-id:  96fb4539-5e87-4e75-82b4-2363e651eb6c
amz-sdk-request:        attempt=1
Content-Length:         3

Response to a later GET

2023-11-13 18:31:17 GET http://localhost:7481/sfstest-7cznzosspj982n7zlu611-1/foo
                        ← 200 OK binary/octet-stream 3b 44ms
                 Request                                 Response                                  Detail
Content-Length:     3
Accept-Ranges:      bytes
Last-Modified:      Mon, 13 Nov 2023 17:31:17 GMT
x-rgw-object-type:  Normal
ETag:               "37b51d194a7513e45b56f6524f2d51f2"
x-amz-meta-meta1:   Hello World\xc3\xa9
Content-Type:       binary/octet-stream
Date:               Mon, 13 Nov 2023 17:31:17 GMT
Connection:         Keep-Alive
Raw                                                                                                               [m:auto]
bar

S3GW receives 'Hello World\xc3\xa9' via the header and sends that exact string back. The sfs side stores the string as part of the object attrs as a byte string. The wire data confirms that the wrong result in the test is not caused by our processing. Nice.

Still, where does the weird encoding come from?
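To see boto3's view of the request and response, botocore debug logging can be switched on; a minimal sketch of one way to get the output quoted below:

import logging
import boto3

# Stream botocore's DEBUG output (request params, prepared requests,
# parsed response headers) to the console.
boto3.set_stream_logger('botocore', level=logging.DEBUG)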

With boto3 debug logging we get:

DEBUG    botocore.endpoint:endpoint.py:114 Making request for OperationModel(name=PutObject) with params: {'url_path': '/sfstest-t4axpnukoczbc4tfy5tip-1/foo', 'query_string': {}, 'method': 'PUT', 'headers': {'User-Agent': 'Boto3/1.24.96 Python/3.11.5 Linux/6.5.6-1-default Botocore/1.27.96', 'Content-MD5': 'N7UdGUp1E+RbVvZSTy1R8g==', 'x-amz-meta-meta1': 'Hello Worldé', 'Expect': '100-continue'}, 'body': <_io.BytesIO object at 0x7f1a031b8770>, 'url': 'http://localhost:7480/sfstest-t4axpnukoczbc4tfy5tip-1/foo', 'context': {'client_region': '', 'client_config': <botocore.config.Config object at 0x7f19ffc8a1d0>, 'has_streaming_input': True, 'auth_type': None, 'signing': {'bucket': 'sfstest-t4axpnukoczbc4tfy5tip-1'}}}

DEBUG    botocore.endpoint:endpoint.py:265 Sending http request: <AWSPreparedRequest stream_output=False, method=PUT, url=http://localhost:7480/sfstest-t4axpnukoczbc4tfy5tip-1/foo, headers={'User-Agent': b'Boto3/1.24.96 Python/3.11.5 Linux/6.5.6-1-default Botocore/1.27.96', 'Content-MD5': b'N7UdGUp1E+RbVvZSTy1R8g==', 'x-amz-meta-meta1': b'Hello World\xc3\xa9', 'Expect': b'100-continue', 'X-Amz-Date': b'20231113T175837Z', 'X-Amz-Content-SHA256': b'fcde2b2edba56bf408601fb721fe9b5c338d10ee429ea04fae5511b68fbf8fb9', 'Authorization': b'AWS4-HMAC-SHA256 Credential=test/20231113//s3/aws4_request, SignedHeaders=content-md5;host;x-amz-content-sha256;x-amz-date;x-amz-meta-meta1, Signature=551d8a42e3246ce25a32b4960ef7d0149b8b9d3a24eee24156ee1bcad8285f4b', 'amz-sdk-invocation-id': b'856c155f-9999-4ecd-8bf0-45726de52bc5', 'amz-sdk-request': b'attempt=1', 'Content-Length': '3'}>

Something turns the unicode string 'Hello Worldé' into its UTF-8 encoding b'Hello World\xc3\xa9', which is also what we see on the wire. This means that the response decoder doesn't decode the UTF-8 back as we would expect.

The debug logs have:

DEBUG    botocore.parsers:parsers.py:239 Response headers: {'Content-Length': '3', 'Accept-Ranges': 'bytes', 'Last-Modified': 'Mon, 13 Nov 2023 17:58:37 GMT', 'x-rgw-object-type': 'Normal', 'ETag': '"37b51d194a7513e45b56f6524f2d51f2"', 'x-amz-meta-meta1': 'Hello WorldÃ©', 'Content-Type': 'binary/octet-stream', 'Date': 'Mon, 13 Nov 2023 17:58:37 GMT', 'Connection': 'Keep-Alive'}

Another data point: The weird decoding is what we get if we decode utf-8 with latin1:

>>> 'Hello Worldé'.encode('utf-8').decode("latin1")
'Hello WorldÃ©'
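Putting the pieces together, the suspected round trip can be reproduced locally; a minimal sketch, assuming the HTTP stack decodes response header bytes as latin1:

original = u"Hello World\xe9"           # what the test intends to send
on_the_wire = original.encode("utf-8")  # b'Hello World\xc3\xa9', as seen in mitmproxy
got = on_the_wire.decode("latin1")      # 'Hello WorldÃ©', what the test receives back
assert got != original
assert got.encode("latin1").decode("utf-8") == original  # a client could undo it this way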

According to https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingMetadata.html, non-ASCII metadata is supposed to be RFC 2047 encoded. Indeed, if we pass the encoded UTF-8 to Python's decoder we get:

>>> email.header.decode_header('Hello World\xc3\xa9')
[('Hello WorldÃ©', None)]

So, it seems to default to latin1 decode if there isn't any encoding information.
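For comparison, a proper RFC 2047 encoded word carries its charset with it and survives a round trip through the stdlib email module; a small sketch (the encode side is what attempt 2 below does):

import email.charset
import email.header

encoded = email.charset.Charset("utf-8").header_encode(u"Hello World\xe9")
print(encoded)                  # '=?utf-8?q?Hello_World=C3=A9?='
raw, charset = email.header.decode_header(encoded)[0]
print(raw.decode(charset))      # 'Hello Worldé'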

Is there an easy way to fix the test?

  1. Use put_object Metadata arg

If we change the test to use the put_object Metadata argument to send the unicode data,

# before
# def set_unicode_metadata(**kwargs):
#     kwargs['params']['headers']['x-amz-meta-meta1'] = u"Hello World\xe9"
#
# client.meta.events.register('before-call.s3.PutObject', set_unicode_metadata)

# after
client.put_object(Bucket=bucket_name, Key='foo', Body='bar', Metadata={'meta1': u"Hello World\xe9"})

we actually get a

E               botocore.exceptions.ParamValidationError: Parameter validation failed:
E               Non ascii characters found in S3 metadata for key "meta1", value: "Hello Worldé".
E               S3 metadata can only contain ASCII characters.

See also https://github.com/boto/botocore/issues/2552

  2. Send RFC 2047

def set_unicode_metadata(**kwargs):
    kwargs['params']['headers']['x-amz-meta-meta1'] = email.charset.Charset("utf-8").header_encode("Hello World\xe9")

Nope:

E       AssertionError: assert '=?utf-8?q?He...World=C3=A9?=' == 'Hello Worldé'

  3. Send latin1 encoded

This does not really work: the header dict expects a string and converts it internally to bytes, so we would run into a double-encoding problem.

def set_unicode_metadata(**kwargs):
    kwargs['params']['headers']['x-amz-meta-meta1'] = str("Hello World\xe9".encode("latin1"))

->

E       assert "b'Hello World\\xe9'" == b'Hello World\xe9'

I guess adding a special string type that doesn't decode would work, but this is getting silly.

All in all, I think we are doing the right thing by just storing the metadata bytes without further processing.

To double-check, here are the RFC 2047 encoded headers on the wire:

PUT
x-amz-meta-meta1:       =?utf-8?q?Hello_World=C3=A9?=
GET
x-amz-meta-meta1:   =?utf-8?q?Hello_World=C3=A9?=

Close?