Failed to get cluster name for elasticsearch during health check routine

ascoppa commented 1 week ago

Description

After upgrading the New Relic agent from version 9.0.0 to 9.10.2, we began encountering multiple signature errors on our monitoring systems. Upon further inspection, the following errors were identified in our logs:

[NewRelic][2024-06-17 20:12:22 +0000 web.1 (60)] ERROR : Failed to get cluster name for elasticsearch

followed immediately by this line:

[NewRelic] ERROR : Elasticsearch::Transport::Transport::Errors::Forbidden: [403] {"message":"The request signature 
we calculated does not match the signature you provided.  Check your AWS Secret Access Key and signing method. 
Consult the service documentation for details. The Canonical String for this request should have been 'GET / 
content-type:application/json host:[HIDDEN-FOR-SECURITY-REASONS]  user-agent:Faraday v1.10.2 
x-amz-content-sha256:[HIDDEN-FOR-SECURITY-REASONS] x-amz-date:20240617T201222Z 
x-elastic-client-meta:es=6.8.3,rb=3.1.4,t=6.8.3,fd=1.10.2,ty=1.4.0 content-type;host;user-agent;
x-amz-content-sha256;x-amz-date;x-elastic-client-meta [HIDDEN-FOR-SECURITY-REASONS]' 
The String-to-Sign should have been '[HIDDEN-FOR-SECURITY-REASONS]' "}

The New Relic Ruby agent appears to make these requests as part of a health check routine (they recur over time). As a workaround, we disabled New Relic instrumentation for Elasticsearch by setting the environment variable NEW_RELIC_INSTRUMENTATION_ELASTICSEARCH to disabled. After doing so, the signature errors disappeared.
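
For anyone hitting the same errors, here is a minimal sketch of that workaround. It assumes the variable is exported in the environment the app process starts from; how you actually set it (platform config vars, a dotenv file, container settings) depends on your deployment.

```shell
# Turn off only the agent's Elasticsearch instrumentation.
# The rest of the New Relic agent keeps running as usual.
export NEW_RELIC_INSTRUMENTATION_ELASTICSEARCH=disabled

# Print the value so it can be confirmed before restarting the app.
echo "elasticsearch instrumentation: ${NEW_RELIC_INSTRUMENTATION_ELASTICSEARCH}"
```

The app has to be restarted for the agent to pick up the change. As far as I know, the same setting can also go in newrelic.yml as `instrumentation.elasticsearch: disabled`, but we only tested the environment variable.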

Expected Behavior

No signature errors should appear during the health check routine.

Your Environment

Ruby -> 3.1.4
Rails -> 6.1.7.7
Elasticsearch -> 6.2
newrelic_rpm -> 9.10.2

Note

If you need any further clarification, please don't hesitate to ask. Thank you.

workato-integration[bot] commented 1 week ago

https://new-relic.atlassian.net/browse/NR-284469

hannahramadan commented 1 week ago

Hi @ascoppa! Thanks for letting us know about the issue. Between agent versions 9.0.0 and 9.10.2, we started using a different endpoint to get the cluster name. This helped with performance but might be causing the error you're seeing.

We don't officially support Elasticsearch versions below 7, but we want to explore this issue a little further to see if this is an Elasticsearch version issue or something else. You also make a good point about the number of errors we're generating, so we're going to look into reducing those.
