opensearch-project / logstash-output-opensearch

A Logstash plugin that sends event data to OpenSearch clusters and stores it as an index.
https://opensearch.org/docs/latest/clients/logstash/index/
Apache License 2.0

[BUG] Receiving signature mismatch issue with SigV4 #207

Closed. dlvenable closed this issue 1 year ago.

dlvenable commented 1 year ago

Describe the bug

In a setup where Logstash is configured to write to an Amazon OpenSearch Service domain, this plugin uses SigV4 signing.

When a large volume of data is being processed, the plugin will sometimes get an error indicating the request signature does not match the expected signature.

[2023-04-06T16:49:36,735][ERROR][logstash.outputs.opensearch][output_oss][oss_output_cluster] Encountered a retryable error (will retry with exponential backoff) {:code=>403, :url=>"https://vpc-domain.ca-central-1.es.amazonaws.com:443/_bulk ",
:content_length=>188317, :body=>"{"message":"The request signature we calculated does not match the signature you provided. Check your AWS Secret Access Key and signing method. Consult the service documentation for details.\n\nThe Canonical String for
this request should have been\n'POST\n/_bulk\n\ncontent-type:application/json\nhost:vpc-domain.ca-central-1.es.amazonaws.com \nx-amz-content-sha256:***\nx-amz-date:20230406T164936Z\nx-amz-security-token:TOKEN\ncontent-type;host;x-amz-content-sha256;x-amz-date;x-amz-security-token\nTOKEN'\n\nThe
String-to-Sign should have been\n'AWS4-HMAC-SHA256\n20230406T164936Z\n20230406/ca-central-1/es/aws4_request\nTOKEN'\n"}"}
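
For context, SigV4 covers a hash of the exact request body: the x-amz-content-sha256 header and the final signature are both derived from the SHA-256 of the body bytes, and the service rebuilds the same canonical string on its side. A rough sketch of the two strings the error refers to, following the /_bulk request shape in the message above (illustrative only, not the plugin's actual signing code):

require 'digest'

# Illustrative sketch of the SigV4 "Canonical String" and "String-to-Sign"
# shown in the 403 body above (not the plugin's actual signing code).
def string_to_sign(body, amz_date:, date:, region:, host:)
  payload_hash = Digest::SHA256.hexdigest(body)   # also sent as x-amz-content-sha256

  canonical_headers = [
    'content-type:application/json',
    "host:#{host}",
    "x-amz-content-sha256:#{payload_hash}",
    "x-amz-date:#{amz_date}"
  ].map { |h| "#{h}\n" }.join

  canonical_request = [
    'POST', '/_bulk', '',                                   # method, path, empty query string
    canonical_headers,
    'content-type;host;x-amz-content-sha256;x-amz-date',    # signed header names
    payload_hash
  ].join("\n")

  [
    'AWS4-HMAC-SHA256',
    amz_date,
    "#{date}/#{region}/es/aws4_request",
    Digest::SHA256.hexdigest(canonical_request)
  ].join("\n")
end

# If the body bytes the service receives differ from the bytes that were
# hashed here, it derives a different string-to-sign and returns this 403.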
dblock commented 1 year ago

I debugged a similar problem. Can't share logs, but I wrote a quick script to process logstash logs: https://gist.github.com/dblock/0454d44d45bae82837b9c444b5bb0580
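
The gist itself is linked above; a hypothetical sketch of the same kind of per-connection tally, assuming org.apache.http.wire DEBUG logging is enabled and the log is fed in on stdin:

#!/usr/bin/env ruby
# Hypothetical sketch (the actual script is in the gist above): tally
# org.apache.http.wire response status lines per connection, marking each
# request as success (.) or failure (x).
results = Hash.new { |h, k| h[k] = [] }

ARGF.each_line do |line|
  # e.g. ... http-outgoing-16 << "HTTP/1.1 403 Forbidden[\r][\n]"
  next unless line =~ /(http-outgoing-\d+) << "HTTP\/1\.1 (\d{3})/
  results[Regexp.last_match(1)] << (Regexp.last_match(2).to_i < 400)
end

results.each do |conn, statuses|
  marks = statuses.map { |ok| ok ? '.' : 'x' }.join
  puts format('%s: %s, total=%d, success=%d, failure=%d',
              conn, marks, statuses.size, statuses.count(true), statuses.count(false))
end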

I'm seeing a single connection that started failing, but it did recover for some subsequent requests.

http-outgoing-18: ........., total=9, success=9, failure=0
http-outgoing-15: ........, total=8, success=8, failure=0
http-outgoing-17: ......, total=6, success=6, failure=0
http-outgoing-14: ........., total=9, success=9, failure=0
http-outgoing-7: ., total=1, success=1, failure=0
http-outgoing-12: ........., total=9, success=9, failure=0
http-outgoing-16: xxx..xxxxx.., total=12, success=4, failure=8
http-outgoing-8: ...., total=4, success=4, failure=0
http-outgoing-13: ..., total=3, success=3, failure=0
http-outgoing-6: ......., total=7, success=7, failure=0
http-outgoing-9: ......., total=7, success=7, failure=0
http-outgoing-3: ....., total=5, success=5, failure=0
http-outgoing-4: ......, total=6, success=6, failure=0
http-outgoing-5: ......, total=6, success=6, failure=0
http-outgoing-10: ......, total=6, success=6, failure=0
http-outgoing-11: .., total=2, success=2, failure=0

Looking at that one connection, there's something interesting with the content-lengths.

2023-06-29T20:25:08-04:00   [2023-06-30T00:25:08,806][DEBUG][org.apache.http.wire     ][output_oss][oss_output_cluster] http-outgoing-16 >> "Content-Length: 141208[\r][\n]"
2023-06-29T20:25:08-04:00   [2023-06-30T00:25:08,848][DEBUG][org.apache.http.wire     ][output_oss][oss_output_cluster] http-outgoing-16 >> "Content-Length: 263738[\r][\n]"
2023-06-29T20:25:09-04:00   [2023-06-30T00:25:09,023][DEBUG][org.apache.http.wire     ][output_oss][oss_output_cluster] http-outgoing-16 >> "Content-Length: 219374[\r][\n]"
2023-06-29T20:25:09-04:00   [2023-06-30T00:25:09,148][DEBUG][org.apache.http.wire     ][output_oss][oss_output_cluster] http-outgoing-16 >> "Content-Length: 4901[\r][\n]"
2023-06-29T20:25:09-04:00   [2023-06-30T00:25:09,239][DEBUG][org.apache.http.wire     ][output_oss][oss_output_cluster] http-outgoing-16 >> "Content-Length: 23419[\r][\n]"
2023-06-29T20:25:10-04:00   [2023-06-30T00:25:10,949][DEBUG][org.apache.http.wire     ][output_oss][oss_output_cluster] http-outgoing-16 >> "Content-Length: 175362[\r][\n]"
2023-06-29T20:25:11-04:00   [2023-06-30T00:25:11,200][DEBUG][org.apache.http.wire     ][output_oss][oss_output_cluster] http-outgoing-16 >> "Content-Length: 171756[\r][\n]"
2023-06-29T20:25:11-04:00   [2023-06-30T00:25:11,217][DEBUG][org.apache.http.wire     ][output_oss][oss_output_cluster] http-outgoing-16 >> "Content-Length: 171912[\r][\n]"
2023-06-29T20:25:11-04:00   [2023-06-30T00:25:11,333][DEBUG][org.apache.http.wire     ][output_oss][oss_output_cluster] http-outgoing-16 >> "Content-Length: 191522[\r][\n]"
2023-06-29T20:25:11-04:00   [2023-06-30T00:25:11,699][DEBUG][org.apache.http.wire     ][output_oss][oss_output_cluster] http-outgoing-16 >> "Content-Length: 165799[\r][\n]"
2023-06-29T20:25:12-04:00   [2023-06-30T00:25:12,148][DEBUG][org.apache.http.wire     ][output_oss][oss_output_cluster] http-outgoing-16 >> "Content-Length: 3809[\r][\n]"
2023-06-29T20:25:12-04:00   [2023-06-30T00:25:12,402][DEBUG][org.apache.http.wire     ][output_oss][oss_output_cluster] http-outgoing-16 >> "Content-Length: 3674[\r][\n]"

This correlates really well: larger payloads fail and smaller payloads succeed, even on the same connection.

I do have plenty of other connections that succeed with larger payloads on a different host, so this isn't a size problem.

2023-06-29T20:02:47-04:00   [2023-06-30T00:02:47,304][DEBUG][org.apache.http.wire     ][output_oss][oss_output_cluster] http-outgoing-43 >> "Content-Length: 174475[\r][\n]"
2023-06-29T20:02:47-04:00   [2023-06-30T00:02:47,367][DEBUG][org.apache.http.wire     ][output_oss][oss_output_cluster] http-outgoing-36 >> "Content-Length: 183893[\r][\n]"
2023-06-29T20:02:47-04:00   [2023-06-30T00:02:47,535][DEBUG][org.apache.http.wire     ][output_oss][oss_output_cluster] http-outgoing-37 >> "Content-Length: 173939[\r][\n]"
2023-06-29T20:02:47-04:00   [2023-06-30T00:02:47,556][DEBUG][org.apache.http.wire     ][output_oss][oss_output_cluster] http-outgoing-43 >> "Content-Length: 177570[\r][\n]"
dblock commented 1 year ago

There's an interesting bug in opensearch-ruby that could be a root cause of behavior like this: https://github.com/opensearch-project/opensearch-ruby/issues/141

dblock commented 1 year ago

I got it. The culprit is something like 'boîte de réception'.encode("ISO-8859-1").

#!/usr/bin/env ruby
$LOAD_PATH.unshift(File.join(File.dirname(__FILE__), '../lib'))

require 'bundler/setup'
require 'logstash-core'
require 'logstash/plugin_mixins/ecs_compatibility_support'
require 'logstash/outputs/opensearch'
require 'logstash/codecs/plain'
require 'logstash/instrument/namespaced_null_metric'
require 'cabin'

java_import 'java.util.concurrent.Callable'
java_import 'java.util.concurrent.FutureTask'
java_import 'java.util.concurrent.LinkedBlockingQueue'
java_import 'java.util.concurrent.ThreadPoolExecutor'
java_import 'java.util.concurrent.TimeUnit'

puts "Running on #{RUBY_PLATFORM} ..."

host = 'https://....es.amazonaws.com:443'

auth = {
    "type" => 'aws_iam',
    "aws_access_key_id" => ENV['AWS_ACCESS_KEY_ID'],
    "aws_secret_access_key" => ENV['AWS_SECRET_ACCESS_KEY'],
    "session_token" => ENV['AWS_SESSION_TOKEN'],
    "region" => ENV['AWS_REGION']
}

index_name = "my-index"

settings = {
    "hosts" => host,
    "index" => index_name,
    "auth_type" => auth
}

# Register the output plugin, then push a single event whose message is
# ISO-8859-1 encoded rather than UTF-8.
config = LogStash::Outputs::OpenSearch.new(settings)
config.register

string = 'boîte de réception'.encode("ISO-8859-1")
events = [
    LogStash::Event.new("message" => string)
]

config.multi_receive(events)

causes

[[{"index"=>{:_id=>nil, :_index=>"my-index", :routing=>nil}}, {"message"=>"bo\xEEte de r\xE9ception", "@version"=>"1", "@timestamp"=>#<LogStash::Timestamp:0x28bc9749>}]]
[ERROR] 2023-08-22 17:20:09.330 [main] opensearch - Encountered a retryable error (will retry with exponential backoff) {:code=>403, :url=>".../_bulk", :content_length=>146, :body=>"{\"message\":\"The request signature we calculated does not match the signature you provided. Check your AWS Secret Access Key and signing method. Consult the service documentation for details.\\n\\nThe Canonical String for this request should have been\\n'POST\\n/_bulk\\n\\ncontent-type:application/json\\nhost:...\\nx-amz-content-sha256:9d7ad54b413cb0adf982b2b2ea28750f366d4440f90ca63e565ec4c18d0ebfb1\\nx-amz-date:20230822T172009Z\\nx-amz-security-token:IQoJb3JpZ2luX2VjEDkaCXVzLWVhc3QtMSJHMEUCIQC2dwXmqTklolL66V65gxR8c+vVjbs35hMwQ8QXqPJyYQIgRaUaBFMi/Dtmw7CBMyIXSd1dGOf1cbRJW+RBPjGlSocqpgII8f//////////ARAEGgwwMTMzMDYyNDI2NTEiDCrKXd2eLwyHbDEfuyr6AQwW9iMzEhXoc1Y5XSUBrnOz2p0z1n8QqQaHbcLebSNecTR8Qdb/NW+XyOYhpkybyjDekjyQtmxfYrjXWrgKYzlsrOYw3cGjYR65zF237BLoE0RtgzPTa8k6MvtYOeHRiiJYfVCP6mNcwC07rBqC/Kkbpr5UUa24gfX4bwnNZVzIGkTkBXacXdiZAvDjF6fafeOyCjQHukteYECkGu6IZ2S8zNo9WRHAPZTsufE0Q7Q5LnZ4aFK6xEm7T33sTgmW6OM2fb3HTaT/+/wv4FB03g8rURIVp1OglNMJImH9h9wfQII9fgUQr/O42dpTS8Va/8qDzOx0F0RZ+FwwhrmTpwY6nQFXLXrKLZuqBjIcBEIIjaEsqogE1ojjWM1mo98x/LmVb3LvuPnzJ6y/A75ETs82r0dh+DC0KAWmG5KARKv50fV1uR7sKCNl92bJmJtwPo5dmES19RCEld2WGm8IZyb5TXrWe3WZt4iUPeW6bfg1IRTOyI3kXEFWhca854WLUfgdT42bpXRYnoAmyWuNRVcO+TKsDeALmgM+VoDEuxXY\\n\\ncontent-type;host;x-amz-content-sha256;x-amz-date;x-amz-security-token\\nf4b76bd989acd2cf009c9e1db13936cd9a42d0171059e64f68aabea314ce750f'\\n\\nThe String-to-Sign should have been\\n'AWS4-HMAC-SHA256\\n20230822T172009Z\\n20230822/us-west-2/es/aws4_request\\na7c6b12e3b00c802279f11ab89e70a3a23e035be29836c9b570f96692adb5d6c'\\n\"}"}
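
For what it's worth, the failure is consistent with the body being hashed over one byte representation while another ends up on the wire; I haven't pinned down exactly where the re-encoding happens in this stack, but the byte-level difference is easy to see (illustrative only):

require 'digest'

# Illustrative only: the ISO-8859-1 and UTF-8 representations of the same
# text are different byte sequences, so they hash to different values. If
# the signature covers one representation and the other is transmitted,
# the service's recomputed hash will not match.
utf8  = 'boîte de réception'
latin = utf8.encode('ISO-8859-1')

puts utf8.bytesize    # UTF-8: î and é take two bytes each
puts latin.bytesize   # ISO-8859-1: one byte per character, so a shorter body

puts Digest::SHA256.hexdigest(utf8)
puts Digest::SHA256.hexdigest(latin)   # different bytes => different digest => 403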
kffallis commented 1 year ago

Are we looking to backport this to prior versions, as it seems to be present in several of them?

dblock commented 1 year ago

> Are we looking to backport this to prior versions, as it seems to be present in several of them?

We could if it's necessary. @dlvenable, what versions do we still maintain?