opensearch-project / opensearch-ruby

Ruby Client for OpenSearch
Apache License 2.0

inconsistent signature signing failures on PUT updates when using sigv4 signing #141

Closed phoffer closed 3 months ago

phoffer commented 1 year ago

What is the bug?

We are using the OpenSearch::Aws::Sigv4Client according to instructions, but we have signature errors when trying to update documents. We are using IAM users.

We have two indices in Opensearch. About 80% of PUT calls succeed for one index (user data), but updates to the other index (product info) fail 100% of the time. The initial import of all the data succeeds.

We are switching over from Elasticsearch, and have a lot built out using the previous Elasticsearch gems that these were continued from.

How can one reproduce the bug?

This is how we are setting up the Opensearch client:

    config = { url: Rails.application.config.opensearch_url, port: URI.parse(Rails.application.config.opensearch_url).port }
    aws_iam_credentials = Aws::AssumeRoleWebIdentityCredentials.new(
      region: Rails.application.config.aws_region,
      role_arn: Rails.application.config.aws_iam_role_arn,
      web_identity_token_file: Rails.application.config.aws_web_identity_token_file
    )

    signer = Aws::Sigv4::Signer.new(
      service: 'es',
      region: Rails.application.config.aws_region,
      credentials_provider: aws_iam_credentials
    )
    OpenSearch::Aws::Sigv4Client.new({ **config, adapter: :typhoeus, log: true }, signer)

We have also tried using access/secret like this:


    signer = Aws::Sigv4::Signer.new(service: 'es',
                                    region: Rails.application.config.aws_region,
                                    access_key_id: ENV['OPENSEARCH_ACCESS_KEY'],
                                    secret_access_key: ENV['OPENSEARCH_SECRET_KEY']
                                    )
    OpenSearch::Aws::Sigv4Client.new({ **config, adapter: :typhoeus, log: true }, signer)

What is the expected behavior?

We expect 100% of updates to be successful.

What is your host/environment?

Kubernetes EKS. 1.21

We are using the current HEAD for this repo in our Gemfile (both OS-ruby and sigv4) to include the recent fixes.

Do you have any screenshots?

Log example:

[403] {"message":"The request signature we calculated does not match the signature you provided. Check your AWS Secret Access Key and signing method. Consult the service documentation for details.

The Canonical String for this request should have been
'PUT
/users_staging_index/_doc/31794036

host:vpc-trr-opensearch-staging-joqhkufrgzviea7u5sgnj7tota.us-east-1.es.amazonaws.com
x-amz-content-sha256:fb489c2eba88386ac1d8670360ad211dac0268eaa8ef6bf9360c02f071c37bf1
x-amz-date:20230209T074519Z

host;x-amz-content-sha256;x-amz-date
259087aee273e588d1f839bb37f507a9e76da88e04251eef98f271f854651477'

The String-to-Sign should have been
'AWS4-HMAC-SHA256
20230209T074519Z
20230209/us-east-1/es/aws4_request
fde9b011da5d8acd05b802df02a64bff816cf51df244bde6226d58dc000a1fb9'
"}

Do you have any additional context?

It seems similar to this issue: https://github.com/amzn/selling-partner-api-models/issues/774

wbeckler commented 1 year ago

Between the two requests, those that are working and those that aren't, are there differences between the requests in terms of the verb, the existence of query parameters, and the existence of a body? This will help debug this.

Even better, if you're willing to try, would be if you could create a minimal example using dummy data of the exact request that fails.

nhtruong commented 1 year ago

@phoffer I'm unable to reproduce the issue (even after switching the adapter to typhoeus like you did). I wonder if it has anything to do with the **config passed to the constructor. Can you also specify a PUT endpoint that causes this issue?

phoffer commented 1 year ago

Sorry for delayed response, I was unexpectedly out of office a bit. To answer both of your questions:

  1. Are there differences in terms of the verbs, query parameters, or body existence? No, we can see matching requests fail. Here are some example requests, all of which have the same data shape in the body.
PUT https://redacted.es.amazonaws.com:443/users_staging_index/_doc/31794768 [status:201, request:0.039s, query:n/a]
PUT https://redacted.es.amazonaws.com:443/users_staging_index/_doc/31794126 [status:200, request:0.040s, query:n/a]
PUT https://redacted.es.amazonaws.com:443/users_staging_index/_doc/31794098 [status:403, request:0.028s, query:N/A]
  2. Produce an example app? I can't, as this only seems to pop up in our AWS hosted environments, and I don't have access to those services. I will go and ask, but we have pretty strict access policies.

  3. **config will just explode the config hash values into the other hash, so in effect, this is

OpenSearch::Aws::Sigv4Client.new({
  url: Rails.application.config.opensearch_url, port: URI.parse(Rails.application.config.opensearch_url).port,
  adapter: :typhoeus, log: true
}, signer)

The variable gets used differently for non-AWS environments, which is why it's not all together like this snippet.

Additionally, this seems to still occur when we use the AWS signing, but with AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY variables instead of role_arn strategy.

wbeckler commented 1 year ago

Thank you for that extra info! @nhtruong can you think of next steps for debugging this?

nhtruong commented 1 year ago

I'll start working on this later today.

phoffer commented 1 year ago

We did some additional testing Friday and this morning and have a little more information we can provide.

We added user/pass authentication to the Opensearch cluster, using the same policies as the IAM user we had been attempting to use, and then used the Opensearch Client instead of the sigv4 wrapper client. This is how we initialized the client:

OpenSearch::Client.new(
  host: Rails.application.config.opensearch_url,
  user: ENV['OPENSEARCH_USER'],
  password: ENV['OPENSEARCH_PASS'],
  adapter: :typhoeus,
  log: true
)

With this setup, everything worked as we expected. This makes me wonder/think there is something going on with the wrapping of the Opensearch client.

Is there any other info I can provide to help?

nhtruong commented 1 year ago

The sigv4 wrapper simply adds the Sigv4 headers to the request before sending it to AWS Cluster. It's only needed for services that require Sigv4 authentication. It looks like yours doesn't need sigv4 since you can use the regular client to communicate with the server?

phoffer commented 1 year ago

Our goal is to use IAM roles to verify access and requests to our newer infrastructure pieces (Opensearch is the first we've added and tried to do this with). Our devops team has a preference of using IAM policies over adding user/pass access to services. It seemed that the sigv4 wrapper is required in order to do IAM role based authentication. Is that the case, or am I misunderstanding how these pieces come together?

nhtruong commented 1 year ago

Yes, to use IAM based auth, you will have to sign your requests with Sigv4, hence the Sigv4 client. The internal username/password approach didn't get any sigv4 error because that method of auth doesn't require sigv4 signing.

The fact that nearly identical requests can sometimes fail the Sigv4 auth is very puzzling to me. I have a hunch that it's not the client, because of the flaky nature of this bug. I still couldn't replicate that error. Here's what I've tried:

nhtruong commented 1 year ago

I'm going to update the Sigv4 client with debugging capability (like printing out the canonical request). Then we can better help you with this issue ourselves or escalate it to AWS OS Service.
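For readers following the logs in this thread: in SigV4, the last line of the String-to-Sign is the hex SHA-256 of the canonical request, so any difference in the canonical request shows up as a different final line. The sketch below shows that relationship only. The method name `string_to_sign` and its parameters are hypothetical and are not part of the gem.

```ruby
require 'digest'

# Hypothetical helper: builds the SigV4 String-to-Sign from an
# already-assembled canonical request.
# `amz_date` is the x-amz-date value, e.g. "20230209T074519Z".
def string_to_sign(canonical_request, amz_date:, region:, service: 'es')
  # The credential scope uses only the date part (first 8 characters).
  scope = "#{amz_date[0, 8]}/#{region}/#{service}/aws4_request"
  [
    'AWS4-HMAC-SHA256',
    amz_date,
    scope,
    Digest::SHA256.hexdigest(canonical_request)
  ].join("\n")
end
```

If the fourth line printed by the client differs from the one in the 403 response, the two sides built different canonical requests, and comparing those line by line is the next step.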

phoffer commented 1 year ago

That debugging capability would be really helpful! I appreciate you tried those other aspects in debugging. I think we'll be able to get a lot more info with debugging updates in the client. Thank you!

nhtruong commented 1 year ago

@phoffer just released opensearch-aws-sigv4 1.2.0 with debug feature. Instructions on how to turn it one can be found here: https://github.com/opensearch-project/opensearch-ruby/blob/a3c308389c5abdf71ccd50f00a6c8ececf9c7a6d/opensearch-aws-sigv4/USER_GUIDE.md#debugging

phoffer commented 1 year ago

Here is an example with the debugging added:

ERROR:

{"message":"The request signature we calculated does not match the signature you provided. Check your AWS Secret Access Key and signing method. Consult the service documentation for details.

The Canonical String for this request should have been
'PUT
/products_staging_index/_doc/27625530

host:redacted.us-east-1.es.amazonaws.com
x-amz-content-sha256:9ae812aa93eb349823948c6b9612c8e33742c9e77b5c09e17936de952147459b
x-amz-date:20230228T224146Z
x-amz-security-token:IQoJb3JpZ2luX2VjENf//////////wEaCXVzLWVhc3QtMSJHMEUCIQDP4+i2bhcJrsfEneblkk2nctfHgYWZ3GJ0fzfa08n8zgIgUn/tmEbKhLLR0eb36CMC75Q85hxDfnjA5pR/ZoLRlZMqnwUIgP//////////ARACGgwzMTAyNDIyMTY5MzEiDN1z6aMoRwTmvkmdKirzBGw+r5rq3kKURuRWuuyb9G+67Y3nyW6dPn4jWlkVMA9bEmocNS9nhjaMm/p7TW7M5p+l9jl27QXwHeKSjiVCbYNkpKZLqCjdbgLn23d5AXO3hFKbR8k3Idw0pPqCZIstx1d5cwOHpJOGdcJGvljWFqRnw41k24TG75oltuswrkYDeIUlXMFDzrMc2H7PNGFHfPw2BCDAxLppMFT8Nk1s/9er6hOuR9K8hWc5CnmkAMowMgM6JuHuwnA2V0yoqX32NXd/NIJa1jQnzaTRpxYkwpU9FrmsAa4pSkKulSauRKqs9xZ0gDkD4BtTmoyYDhLPuOEKgPtfTnFst/Fglm4EwqORBLhkUX32Do2w4MS6S+j0S3ceuTg8nnMdYhbPo5xWhVwwVLTWpk6dj8lLurQbXw8iFd0el7Fv27pTHKCEOHQLR42rOWvXXn9Q+jtEeuWN5wyryVbW4PfQIMKNI1Vli4aEVfwdawKTcMdVwGkUKFxTMRIlTtyznR43qc7t8/oBrpE2o2UwXx8lQsqxOAEBLI0vH/CIwMUdBAcYeV4zt3K6t9sF2yeki9XWb9cspsBCUUkBYeUkEeiMvYUDKJ332g6Jvl1N1ttHS90nm7C8LQc6tCHXQoCpvTNxDl7JgeKNkofu8swVFY6QH8G6i/6RUrY7W8eBklHdhTad2H5741rv/UN1AtZNgB7atkY1jgdSmSLnfxgGEpMbxOHjL+fAkMf5+XCLZAjqxodf/edw/vW6uFtWVCdyUzM0dPmoUrr8f5SIMj9v/XYz/k62tJiHHqrXw20yjWGnIANat7YNFwOxbzYHLhRNcEH8cTFuGZBmpVvAjzCvhPqfBjqaATpV/a5AzT++fSNmnz52WXUQEgCUvW0Aj0QZiHvFcfS387E2UYFyOF2CH/D2xfyq6XCrmbE9tNFFtYGLCVfIAnu4tIOUoG9R1BfSX8F0213CgQM+TYkp98wCfKAh9Bh21P+I91ImIyngBIlz2y/HZQS6Bh1cfi9mcdEuBifpuO7ujS5DVF8DvRQ9Ja2v3X+rgpNBMgf1n9Gu3Xc=

host;x-amz-content-sha256;x-amz-date;x-amz-security-token
d65cb4efb2928a5dea3a33c6516fce43304c951887307712013b2d7c7d522690'

The String-to-Sign should have been
'AWS4-HMAC-SHA256
20230228T224146Z
20230228/us-east-1/es/aws4_request
3fe7f7af8f66878ff839d922117d1a0092e5da7454e706170588fcbb06274457'
"}

Debug data from logs:

# String to sign
AWS4-HMAC-SHA256
20230228T224146Z
20230228/us-east-1/es/aws4_request
d662232e30ea37c5863e4df2173672e55822b720eb34450a3fc14b9f09d4e8f4

# REQUEST STRING
PUT
/products_staging_index/_doc/27625530

host:redacted.us-east-1.es.amazonaws.com
x-amz-content-sha256:9ae812aa93eb349823948c6b9612c8e33742c9e77b5c09e17936de952147459b
x-amz-date:20230228T224146Z
x-amz-security-token:IQoJb3JpZ2luX2VjENf//////////wEaCXVzLWVhc3QtMSJHMEUCIQDP4+i2bhcJrsfEneblkk2nctfHgYWZ3GJ0fzfa08n8zgIgUn/tmEbKhLLR0eb36CMC75Q85hxDfnjA5pR/ZoLRlZMqnwUIgP//////////ARACGgwzMTAyNDIyMTY5MzEiDN1z6aMoRwTmvkmdKirzBGw+r5rq3kKURuRWuuyb9G+67Y3nyW6dPn4jWlkVMA9bEmocNS9nhjaMm/p7TW7M5p+l9jl27QXwHeKSjiVCbYNkpKZLqCjdbgLn23d5AXO3hFKbR8k3Idw0pPqCZIstx1d5cwOHpJOGdcJGvljWFqRnw41k24TG75oltuswrkYDeIUlXMFDzrMc2H7PNGFHfPw2BCDAxLppMFT8Nk1s/9er6hOuR9K8hWc5CnmkAMowMgM6JuHuwnA2V0yoqX32NXd/NIJa1jQnzaTRpxYkwpU9FrmsAa4pSkKulSauRKqs9xZ0gDkD4BtTmoyYDhLPuOEKgPtfTnFst/Fglm4EwqORBLhkUX32Do2w4MS6S+j0S3ceuTg8nnMdYhbPo5xWhVwwVLTWpk6dj8lLurQbXw8iFd0el7Fv27pTHKCEOHQLR42rOWvXXn9Q+jtEeuWN5wyryVbW4PfQIMKNI1Vli4aEVfwdawKTcMdVwGkUKFxTMRIlTtyznR43qc7t8/oBrpE2o2UwXx8lQsqxOAEBLI0vH/CIwMUdBAcYeV4zt3K6t9sF2yeki9XWb9cspsBCUUkBYeUkEeiMvYUDKJ332g6Jvl1N1ttHS90nm7C8LQc6tCHXQoCpvTNxDl7JgeKNkofu8swVFY6QH8G6i/6RUrY7W8eBklHdhTad2H5741rv/UN1AtZNgB7atkY1jgdSmSLnfxgGEpMbxOHjL+fAkMf5+XCLZAjqxodf/edw/vW6uFtWVCdyUzM0dPmoUrr8f5SIMj9v/XYz/k62tJiHHqrXw20yjWGnIANat7YNFwOxbzYHLhRNcEH8cTFuGZBmpVvAjzCvhPqfBjqaATpV/a5AzT++fSNmnz52WXUQEgCUvW0Aj0QZiHvFcfS387E2UYFyOF2CH/D2xfyq6XCrmbE9tNFFtYGLCVfIAnu4tIOUoG9R1BfSX8F0213CgQM+TYkp98wCfKAh9Bh21P+I91ImIyngBIlz2y/HZQS6Bh1cfi9mcdEuBifpuO7ujS5DVF8DvRQ9Ja2v3X+rgpNBMgf1n9Gu3Xc=
host;x-amz-content-sha256;x-amz-date;x-amz-security-token
9ae812aa93eb349823948c6b9612c8e33742c9e77b5c09e17936de952147459b

# HEADERS
"host"=>"redacted.us-east-1.es.amazonaws.com",
"x-amz-date"=>"20230228T224146Z",
"x-amz-security-token"=>"IQoJb3JpZ2luX2VjENf//////////wEaCXVzLWVhc3QtMSJHMEUCIQDP4+i2bhcJrsfEneblkk2nctfHgYWZ3GJ0fzfa08n8zgIgUn/tmEbKhLLR0eb36CMC75Q85hxDfnjA5pR/ZoLRlZMqnwUIgP//////////ARACGgwzMTAyNDIyMTY5MzEiDN1z6aMoRwTmvkmdKirzBGw+r5rq3kKURuRWuuyb9G+67Y3nyW6dPn4jWlkVMA9bEmocNS9nhjaMm/p7TW7M5p+l9jl27QXwHeKSjiVCbYNkpKZLqCjdbgLn23d5AXO3hFKbR8k3Idw0pPqCZIstx1d5cwOHpJOGdcJGvljWFqRnw41k24TG75oltuswrkYDeIUlXMFDzrMc2H7PNGFHfPw2BCDAxLppMFT8Nk1s/9er6hOuR9K8hWc5CnmkAMowMgM6JuHuwnA2V0yoqX32NXd/NIJa1jQnzaTRpxYkwpU9FrmsAa4pSkKulSauRKqs9xZ0gDkD4BtTmoyYDhLPuOEKgPtfTnFst/Fglm4EwqORBLhkUX32Do2w4MS6S+j0S3ceuTg8nnMdYhbPo5xWhVwwVLTWpk6dj8lLurQbXw8iFd0el7Fv27pTHKCEOHQLR42rOWvXXn9Q+jtEeuWN5wyryVbW4PfQIMKNI1Vli4aEVfwdawKTcMdVwGkUKFxTMRIlTtyznR43qc7t8/oBrpE2o2UwXx8lQsqxOAEBLI0vH/CIwMUdBAcYeV4zt3K6t9sF2yeki9XWb9cspsBCUUkBYeUkEeiMvYUDKJ332g6Jvl1N1ttHS90nm7C8LQc6tCHXQoCpvTNxDl7JgeKNkofu8swVFY6QH8G6i/6RUrY7W8eBklHdhTad2H5741rv/UN1AtZNgB7atkY1jgdSmSLnfxgGEpMbxOHjL+fAkMf5+XCLZAjqxodf/edw/vW6uFtWVCdyUzM0dPmoUrr8f5SIMj9v/XYz/k62tJiHHqrXw20yjWGnIANat7YNFwOxbzYHLhRNcEH8cTFuGZBmpVvAjzCvhPqfBjqaATpV/a5AzT++fSNmnz52WXUQEgCUvW0Aj0QZiHvFcfS387E2UYFyOF2CH/D2xfyq6XCrmbE9tNFFtYGLCVfIAnu4tIOUoG9R1BfSX8F0213CgQM+TYkp98wCfKAh9Bh21P+I91ImIyngBIlz2y/HZQS6Bh1cfi9mcdEuBifpuO7ujS5DVF8DvRQ9Ja2v3X+rgpNBMgf1n9Gu3Xc=",
"x-amz-content-sha256"=>"9ae812aa93eb349823948c6b9612c8e33742c9e77b5c09e17936de952147459b",
"authorization"=>"AWS4-HMAC-SHA256 Credential=ASIAUQO7ARPR4NNBUQ6O/20230228/us-east-1/es/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date;x-amz-security-token, Signature=f346ebdbf05b1e251fe554b95c93f61b3ea04a5a78ad16ae034fb8eb2f6fee28"

Everything looks good except for the security tokens not matching up 🤔 Is this at all helpful in diagnosing what is going wrong?
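Comparing two long dumps like these by eye is error-prone, so a small script may help. This is a minimal sketch, assuming both canonical requests have been pasted into strings. The method name `canonical_mismatches` is hypothetical. Blank lines are ignored because copy/paste from logs tends to drop them, even though they are significant in the real canonical request.

```ruby
# Hypothetical helper: returns [line_number, expected, actual] for each
# non-blank line that differs between two canonical-request dumps.
def canonical_mismatches(expected, actual)
  exp = expected.lines.map(&:strip).reject(&:empty?)
  act = actual.lines.map(&:strip).reject(&:empty?)
  length = [exp.length, act.length].max
  (0...length).filter_map do |i|
    [i + 1, exp[i], act[i]] unless exp[i] == act[i]
  end
end
```

Passing the "Canonical String" from the 403 response as `expected` and the client's debug "REQUEST STRING" as `actual` lists only the lines that disagree.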

nhtruong commented 1 year ago

If I have to guess, your security token is rotated/refreshed periodically, and you have a long running session that uses the same sigv4 client instance for all requests to OpenSearch. Eventually the cred will expire and you start getting 403.

To test this theory, write a script that makes a request every 10 seconds. Catch any 403 error, and see if after a while, all requests start returning 403.
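A minimal sketch of such a probe, under some assumptions: `client` is anything that responds to `perform_request` (such as the Sigv4 client), the 403 error message starts with `[403]` as in the log earlier in this thread, and the names `probe_for_403` and `sleeper` are hypothetical.

```ruby
# Hypothetical probe: issues a request every `interval` seconds and records
# whether each one succeeded or came back 403.
# `sleeper` is injectable so the loop can be exercised without waiting.
def probe_for_403(client, attempts:, interval: 10, sleeper: ->(secs) { sleep(secs) })
  outcomes = []
  attempts.times do |i|
    begin
      client.perform_request('GET', '/')
      outcomes << [Time.now, :ok]
    rescue StandardError => e
      # Assumption: a 403 surfaces as an error whose message starts with "[403]".
      raise unless e.message.start_with?('[403]')

      outcomes << [Time.now, :forbidden]
    end
    sleeper.call(interval) if i < attempts - 1
  end
  outcomes
end
```

If the theory holds, the outcomes should switch from `:ok` to `:forbidden` at some point and stay there. Intermittent 403s mixed with successes would point to a different cause.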

phoffer commented 1 year ago

We do initialize this client on app boot, do we need to re-initialize it periodically (or on failure)? That would make sense

nhtruong commented 1 year ago

If my theory is correct, then yes: re-instantiating the client or the signer (the signer is an accessible attribute of the client: client.signer = new_signer) once the cred has expired will solve it. How to implement it is up to you. See if the Ruby AWS SDK has the capability to refresh the cred on its own or at least tell you when it's about to expire. :-)
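One possible shape for "re-initialize on failure" is sketched below. It is an illustration, not something the gem provides: the class name `RebuildOn403` is hypothetical, and it assumes a 403 surfaces as an error whose message starts with `[403]`, as in the log earlier in this thread.

```ruby
# Hypothetical wrapper: rebuilds the client once when a request comes back 403,
# then retries that request a single time.
class RebuildOn403
  # The block must return a freshly built client (with a fresh signer).
  def initialize(&build_client)
    @build_client = build_client
    @client = build_client.call
  end

  def perform_request(*args)
    @client.perform_request(*args)
  rescue StandardError => e
    raise unless e.message.start_with?('[403]')

    @client = @build_client.call # new client, new signer, new credentials
    @client.perform_request(*args)
  end
end
```

The block passed to `RebuildOn403.new` would contain the signer and client setup from the original report. A second 403 on the retry is raised to the caller, which avoids looping when the cause is not expired credentials.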

phoffer commented 1 year ago

We are using Aws::AssumeRoleWebIdentityCredentials which will allegedly auto refresh. I wonder if there is some scenario under which it doesn't. Can you think of any reason this credential class wouldn't work, or we should be using a different one?

nhtruong commented 1 year ago

Haven't used Aws::AssumeRoleWebIdentityCredentials myself so I don't know the answer but that is strange. Have you tested that theory yet?

Dantemss commented 1 year ago

I'm also running into this. One case I noticed where this always happens is when sending a create index request that contains the char_filter mapping with => (the actual mapping doesn't matter).

I also noticed that if the index name had %25 the debug message complained about %252525. Something may be wrong with the escaping?
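To illustrate how `%25` can turn into `%252525`: each extra round of percent-encoding turns every `%` into `%25`, so two extra rounds produce exactly that pattern. This sketch uses the standard library's `CGI.escape` purely as a stand-in. Which component actually escapes the path is not established in this thread, and the index name is hypothetical.

```ruby
require 'cgi'

index_name = 'my%25index' # hypothetical name that already contains "%25"

once  = CGI.escape(index_name) # => "my%2525index"
twice = CGI.escape(once)       # => "my%252525index"

puts once
puts twice
```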

nhtruong commented 1 year ago

Thanks for reaching out. Would you mind turning on Debug mode and pasting the logs here?

Dantemss commented 1 year ago

I might have found the solution actually... the following patch seems to work for me:

OpenSearch::Aws::Sigv4Client.class_exec do
  def perform_request(method, path, params = {}, body = nil, headers = nil)
    signature_body = body.is_a?(Hash) ? body.to_json : body.to_s
    signature = sigv4_signer.sign_request(
      http_method: method,
      url: signature_url(path, params),
      headers: headers,
      body: signature_body
    )
    headers = (headers || {}).merge(signature.headers)

    log_signature_info(signature)

    # Patch to use signature_body instead of body on the following line:
    super(method, path, params, signature_body, headers)
  end
end

I'm still testing it though

nhtruong commented 1 year ago

@Dantemss nice find and that's really interesting. I wonder if perform_request alters the body somehow. Lemme do some digging.

maxfierke commented 1 year ago

If you have any observability integrations that instrument the various HTTP clients or AWS SDKs, you may want to check on those. In particular, we've found that OpenTelemetry's Faraday middleware instrumentation can interfere with the signatures

nhtruong commented 1 year ago

@Dantemss Looks like the OS Ruby gem uses Multi-JSON to serialize the body by default, while the signature is generated with a body serialized with the native JSON gem. This might be the cause of the mismatching signature in the body if what you described is correct (i.e. it happens consistently with certain characters). Lemme know if you run into any issues using the monkey patch workaround you mentioned above. Feel free to make a PR into the Sigv4 Gem repo.

Not the same issue that @phoffer was dealing with tho (where the mismatching happened in the creds). What @maxfierke mentioned above is also worth looking into if you're dealing with random sigv4 errors.
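The serializer point can be shown with the standard library alone. In this minimal sketch the same document is encoded in two byte-wise different ways, and the SHA-256 used for `x-amz-content-sha256` differs even though both parse to identical data. The escaped variant is produced by hand as an assumption about what a differently configured encoder might emit; it is not output captured from MultiJSON.

```ruby
require 'json'
require 'digest'

body = { 'mappings' => ['a => b'] }

# Encoding 1: what the native JSON gem emits.
plain = JSON.generate(body)

# Encoding 2 (simulated): the same document with ">" written as a JSON
# unicode escape, as some encoders can be configured to do.
escaped = plain.gsub('>') { '\u003e' }

# Hypothetical helper: the hex SHA-256 of the payload bytes.
def content_sha256(payload)
  Digest::SHA256.hexdigest(payload)
end

same_document = JSON.parse(plain) == JSON.parse(escaped)
same_digest   = content_sha256(plain) == content_sha256(escaped)

puts "same document: #{same_document}" # prints "same document: true"
puts "same digest:   #{same_digest}"   # prints "same digest:   false"
```

If the signature is computed over one encoding and the request is sent with the other, the service hashes different bytes and rejects the signature. Serializing once and reusing the resulting string for both signing and sending, as the patch above does, avoids that.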

Dantemss commented 1 year ago

The monkeypatch seems to have fixed the signature issues that we had. I'll make a PR.

nhtruong commented 1 year ago

This raises the question: Should we replace MultiJSON with JSON as the default serializer?

Dantemss commented 1 year ago

I think it's up to y'all but whichever one you use, I'd say it would be better to be consistent and use the same one everywhere.

dblock commented 1 year ago

@nhtruong yes, multi-json is just a shim in front of various serializers, and I suggest we remove it - Rails did it in 2013, https://github.com/rails/rails/pull/10576, or we could do what we did in Grape, ie. users can require multi_json explicitly in which case it will be used - https://github.com/ruby-grape/grape/pull/1623

dblock commented 1 year ago

@Dantemss want to try to PR ^ ?

praveen-ks commented 3 months ago

Hi @phoffer

I am facing the same issue, any workaround that can help here ? Or how did you fix it?

dblock commented 3 months ago

@praveen-ks

Looks like https://github.com/opensearch-project/opensearch-ruby-aws-sigv4/pull/24 fixed the bug as reported, and was released in opensearch-ruby-aws-sigv4 v1.2.1. I am going to close this issue to avoid confusion, please make sure your Gemfile references that or a newer version? Let us know if you still see the problem after that.

praveen-ks commented 3 months ago

@dblock I am using v1.2.1 only but still facing the issue.

dblock commented 3 months ago

@praveen-ks See above for turning on debugging, and open a new issue with details since it's likely not the same problem.

https://github.com/opensearch-project/opensearch-ruby/issues/141#issuecomment-1447342826

praveen-ks commented 3 months ago

@dblock Thanks for the next steps.

But I can't spend time on this currently, I am continuing with faraday_middleware-aws-sigv4 gem as I was using with elasticsearch-ruby gem. Are there any concerns with using faraday_middleware-aws-sigv4 with opensearch-ruby ?

dblock commented 3 months ago

Are there any concerns with using faraday_middleware-aws-sigv4 with opensearch-ruby ?

Not that I know of.