redpanda-data / connect

Fancy stream processing made operationally mundane
https://docs.redpanda.com/redpanda-connect/about/

AWS S3 output fails to upload objects in partitioned path #2869

Open bkh-kl opened 1 month ago

bkh-kl commented 1 month ago

Hello!

I'm using the aws_s3 output and would like to use certain metadata in the path so that AWS Glue can identify the partition keys from the key=value format. (AWS doc)

This is the path example I'd like to upload my objects into: bucket/events/year=2024/month=09/object.gz

However, the moment I add the = character in the path, the output fails with the following error message:

Failed to send message to aws_s3: operation error S3: PutObject, https response error StatusCode: 403, RequestID: XYZ, HostID: XYZ, api error SignatureDoesNotMatch: The request signature we calculated does not match the signature you provided. Check your key and signing method.

Is this error caused by a misconfiguration on my side, or does the output not support this yet?

I also searched your documentation, but couldn't find out whether = must be escaped before it can be used.

Thank you!

mihaitodor commented 1 month ago

Hey @bkh-kl 👋 Thanks for reporting this issue! Unfortunately, I wasn't able to reproduce it using the Localstack Docker container, which seems to accept that path just fine. I also tried replacing = with %3D and the libraries don't attempt to decode it, so you'd get year%3D2024/month%3D09 in the path, which I guess isn't ideal.

I do wonder, though, if the issue might be caused by metadata instead (see docs here). Can you please add a log processor with message: ${! metadata() } to check if any of the metadata fields have invalid values? Also, you could try removing the metadata fields with a mapping processor with meta = deleted().
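In config terms, that suggestion might look something like this (a sketch only; processor placement depends on your pipeline):

```yaml
pipeline:
  processors:
    # Print all metadata fields so any invalid values show up in the logs.
    - log:
        level: INFO
        message: 'metadata: ${! metadata() }'
    # Alternatively, drop all metadata before the output to rule it out.
    - mapping: |
        meta = deleted()
```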

bkh-kl commented 1 month ago

Thanks @mihaitodor

I removed the metadata function from the path and tried a fixed value which contains the = character:

path: 'v1/events/year=55/stream_2-${! uuid_v4() }.parquet'

You are right! In a Localstack S3 bucket it works correctly when I use the above path, as you can see in the following screenshot:

Screenshot 2024-09-16 at 11 36 34

However, when I switch the same stream to an AWS S3 bucket, the same error appears:

{"@service":"redpanda-connect","label":"s3_output","level":"error","msg":"Failed to send message to aws_s3: operation error S3: PutObject, https response error StatusCode: 403, RequestID: XYZ, HostID: XYZ, api error SignatureDoesNotMatch: The request signature we calculated does not match the signature you provided. Check your key and signing method.","path":"root.output","stream":"stream-2","time":"2024-09-16T10:12:05Z"}

I have also set force_path_style_urls to false for both buckets and streams.

mihaitodor commented 1 month ago

Thanks for checking @bkh-kl! Dunno how to reproduce it without an AWS account, but I see some other projects do use percent encoding for paths (for example https://github.com/peak/s5cmd/pull/280). Maybe give it a shot and see what happens:

path: '${! ["v1", "events", "year=55", "stream_2-%s.parquet".format(uuid_v4())].map_each(e -> e.escape_url_query()).join("/") }'

bkh-kl commented 1 month ago

Unfortunately that took the encoding literally:

Screenshot 2024-09-18 at 15 38 16
mihaitodor commented 1 month ago

OK, thanks for checking! We'll have to try and reproduce it somehow and see what we can do to fix this. If you have experience with Go, please try and see if you can get a hello world example working.
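If it helps as a starting point, a minimal hello world along those lines might look like this (untested sketch using aws-sdk-go-v2; the bucket name is a placeholder and credentials come from the default chain):

```go
package main

import (
	"context"
	"log"
	"strings"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

func main() {
	// Credentials and region come from the default chain (env vars, ~/.aws, ...).
	cfg, err := config.LoadDefaultConfig(context.TODO())
	if err != nil {
		log.Fatalf("load config: %v", err)
	}
	client := s3.NewFromConfig(cfg)

	// Hypothetical bucket; the key deliberately contains '='.
	_, err = client.PutObject(context.TODO(), &s3.PutObjectInput{
		Bucket: aws.String("my-test-bucket"),
		Key:    aws.String("v1/events/year=55/hello.txt"),
		Body:   strings.NewReader("hello world"),
	})
	if err != nil {
		log.Fatalf("put object: %v", err)
	}
	log.Println("uploaded OK")
}
```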

bkh-kl commented 1 month ago

Thanks @mihaitodor! I don't have experience with Go, but I will definitely give it a try.