noxdafox / rabbitmq-cloudwatch-exporter

RabbitMQ Plugin for publishing cluster metrics to AWS CloudWatch
Mozilla Public License 2.0
42 stars 9 forks source link

Errors in logs when publishing metrics #7

Closed curry-lee closed 4 years ago

curry-lee commented 5 years ago

Hi:

I had install and also enable this plugin, and setup credential access of AWS. However, CloudWatch console did not received any metrics from RabbitMQ, and also there are no info or error log output from RabbitMQ logger, do you have any idea about how to process this?

Running two nodes of cluster {rabbitmq_cloudwatch_exporter,"rabbitmq_cloudwatch_exporter","0.2.1"} {rabbit,"RabbitMQ","3.7.16"}, {erlang_version, "Erlang/OTP 21 [erts-10.3.5.1] [source] [64-bit] [smp:1:1] [ds:1:1:10] [async-threads:64] [hipe]\n"}

Thanks

noxdafox commented 5 years ago

Hello,

a typical mistake is to forget to configure the region.

CW metrics are published within a region and visibile only in that region. If you did not configure the region, metrics will end up in the default one: us-east-1.

If you think the region is correct, please provide me the configuration values you used to configure the plugin (remember to blank the secrets).

curry-lee commented 5 years ago

Hi:

Thanks for reply. We did check out the region for default value "us-east-1", and also no metric appear in console, the RabbitMQ config described as follows:

{rabbitmq_cloudwatch_exporter, [
   {metrics, [overview, vhost, node, exchange, queue, connection, channel]}
  ], [
   {aws, [
     {access_key_id, "xx"},
     {secret_access_key, "xx"},
     {region, "ap-northeast-1"}
   ]}
  ]}

And also the plugin list as follows:

[  ] rabbitmq_amqp1_0                  3.7.16
[  ] rabbitmq_auth_backend_cache       3.7.16
[  ] rabbitmq_auth_backend_http        3.7.16
[  ] rabbitmq_auth_backend_ldap        3.7.16
[  ] rabbitmq_auth_mechanism_ssl       3.7.16
[E*] rabbitmq_cloudwatch_exporter      0.2.1
[  ] rabbitmq_consistent_hash_exchange 3.7.16
[  ] rabbitmq_event_exchange           3.7.16
[  ] rabbitmq_federation               3.7.16
[  ] rabbitmq_federation_management    3.7.16
[  ] rabbitmq_jms_topic_exchange       3.7.16
[E*] rabbitmq_management               3.7.16
[e*] rabbitmq_management_agent         3.7.16

Thanks

noxdafox commented 5 years ago

The plugin follows RMQ standard logging. If there's any issue, you should be able to see it in the logs.

If you run RMQ under Linux with systemd, you might want to check there as well with the command:

$ sudo journalctl -u rabbitmq-server.service
curry-lee commented 5 years ago

Hi:

Sorry, it is my fault, I did restart the app but did not restart the rabbitmq-server service, and it works right, thank you for your help.

However, some error message list as follows:

{#{'__exception__' => true,'__struct__' => 'Elixir.ExAws.Error',message => <<"ExAws Request Error!\n\n{:error, {:http_error, 400, %{body: \"<ErrorResponse xmlns=\\\"http://monitoring.amazonaws.com/doc/2010-08-01/\\\">\\n  <Error>\\n    <Type>Sender</Type>\\n    <Code>MissingParameter</Code>\\n    <Message>The parameter MetricData.member.1.Dimensions.member.3.Value is required.\\nThe parameter MetricData.member.2.Dimensions.member.3.Value is required.</Message>\\n  </Error>\\n  <RequestId>e2c5137c-a50c-49f9-9318-9e2eb66b976c</RequestId>\\n</ErrorResponse>\\n\", headers: [{\"x-amzn-RequestId\", \"e2c5137c-a50c-49f9-9318-9e2eb66b976c\"}, {\"Content-Type\", \"text/xml\"}, {\"Content-Length\", \"399\"}, {\"Date\", \"Thu, 01 Aug 2019 09:55:12 GMT\"}, {\"Connection\", \"close\"}], status_code: 400}}}\n">>},[{'Elixir.ExAws','request!',2,[{file,"lib/ex_aws.ex"},{line,66}]},{'Elixir.Enum','-map/2-lists^map/1-0-',2,[{file,"lib/enum.ex"},{line,1327}]},{'Elixir.Enum','-map/2-lists^map/1-0-',2,[{file,"lib/enum.ex"},{line,1327}]},{'Elixir.RabbitMQCloudWatchExporter.Exporter',handle_info,2,[{file,"lib/rabbitmq_cloudwatch_exporter/exporter.ex"},{line,60}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,637}]},{gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,711}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]}
2019-08-01 09:55:12.873 [error] <0.1009.0> CRASH REPORT Process <0.1009.0> with 1 neighbours crashed with reason: #{'__exception__' => true,'__struct__' => 'Elixir.ExAws.Error',message => <<"ExAws Request Error!\n\n{:error, {:http_error, 400, %{body: \"<ErrorResponse xmlns=\\\"http://monitoring.amazonaws.com/doc/2010-08-01/\\\">\\n  <Error>\\n    <Type>Sender</Type>\\n    <Code>MissingParameter</Code>\\n    <Message>The parameter MetricData.member.1.Dimensions.member.3.Value is required.\\nThe parameter MetricData.member.2.Dimensions.member.3.Value is required.</Message>\\n  </Error>\\n  <RequestId>e2c5137c-a50c-49f9-...">>} in 'Elixir.ExAws':'request!'/2 line 66
2019-08-01 09:55:12.874 [error] <0.276.0> Supervisor 'Elixir.Singleton.Supervisor' had child 'Elixir.Singleton.Manager' started with 'Elixir.Singleton.Manager':start_link('Elixir.RabbitMQCloudWatchExporter.Exporter', [], rabbitmq_cloudwatch_exporter, 'singleton_0bpF3SXqPZTQc80tAf+lu7E5kqQ') at <0.1008.0> exit with reason #{'__exception__' => true,'__struct__' => 'Elixir.ExAws.Error',message => <<"ExAws Request Error!\n\n{:error, {:http_error, 400, %{body: \"<ErrorResponse xmlns=\\\"http://monitoring.amazonaws.com/doc/2010-08-01/\\\">\\n  <Error>\\n    <Type>Sender</Type>\\n    <Code>MissingParameter</Code>\\n    <Message>The parameter MetricData.member.1.Dimensions.member.3.Value is required.\\nThe parameter MetricData.member.2.Dimensions.member.3.Value is required.</Message>\\n  </Error>\\n  <RequestId>e2c5137c-a50c-49f9-...">>} in 'Elixir.ExAws':'request!'/2 line 66 in context child_terminated
noxdafox commented 5 years ago

This is odd, it complains a metric point is missing but the logic should already handle missing data points.

Could you please try this build and show me the output? It will log all metrics before publishing them helping me out understanding which metric gets corrupted and why.

You need Erlang 21 and elixir 1.8. Make sure your log level is set at least to info. You will see the data point after the Publishing metrics to CW: line. Please print that output so I can see what's wrong with your broker.

SethPyle376 commented 4 years ago

@noxdafox Here's some logs from it breaking. This ONLY happens when it attempts to export exchange metrics.

To reproduce:

  1. Enable exchange metrics
  2. Start the server
  3. Attempt to send some messages through an exchange
2019-12-16 14:40:13.907 [info] <0.1501.0> Publishing metrics to CW: [[{metric_name,<<"MessagesReady">>},{unit,<<"Count">>},{value,101},{dimensions,[{<<"Cluster">>,<<"rabbit@seth-laptop">>},{<<"Metric">>,<<"ClusterOverview">>}]}],[{metric_name,<<"MessagesUnacknowledged">>},{unit,<<"Count">>},{value,0},{dimensions,[{<<"Cluster">>,<<"rabbit@seth-laptop">>},{<<"Metric">>,<<"ClusterOverview">>}]}],[{metric_name,<<"Messages">>},{unit,<<"Count">>},{value,101},{dimensions,[{<<"Cluster">>,<<"rabbit@seth-laptop">>},{<<"Metric">>,<<"ClusterOverview">>}]}],[{metric_name,<<"Publish">>},{unit,<<"Count">>},{value,1},{dimensions,[{<<"Cluster">>,<<"rabbit@seth-laptop">>},{<<"Metric">>,<<"ClusterOverview">>}]}],[{metric_name,<<"Confirm">>},{unit,<<"Count">>},{value,1},{dimensions,[{<<"Cluster">>,<<"rabbit@seth-laptop">>},{<<"Metric">>,<<"ClusterOverview">>}]}],[{metric_name,<<"ReturnUnroutable">>},{unit,<<"Count">>},{value,0},{dimensions,[{<<"Cluster">>,<<"rabbit@seth-laptop">>},{<<"Metric">>,<<"ClusterOverview">>}]}],[{metric_name,<<"Memory">>},{unit,<<"Bytes">>},{value,36916},{dimensions,[{<<"Cluster">>,<<"rabbit@seth-laptop">>},{<<"Metric">>,<<"Queue">>},{<<"Queue">>,<<"cw-export">>},{<<"VHost">>,<<"/">>}]}],[{metric_name,<<"Consumers">>},{unit,<<"Count">>},{value,0},{dimensions,[{<<"Cluster">>,<<"rabbit@seth-laptop">>},{<<"Metric">>,<<"Queue">>},{<<"Queue">>,<<"cw-export">>},{<<"VHost">>,<<"/">>}]}],[{metric_name,<<"MessagesReady">>},{unit,<<"Count">>},{value,78},{dimensions,[{<<"Cluster">>,<<"rabbit@seth-laptop">>},{<<"Metric">>,<<"Queue">>},{<<"Queue">>,<<"cw-export">>},{<<"VHost">>,<<"/">>}]}],[{metric_name,<<"MessagesUnacknowledged">>},{unit,<<"Count">>},{value,0},{dimensions,[{<<"Cluster">>,<<"rabbit@seth-laptop">>},{<<"Metric">>,<<"Queue">>},{<<"Queue">>,<<"cw-export">>},{<<"VHost">>,<<"/">>}]}],[{metric_name,<<"Messages">>},{unit,<<"Count">>},{value,78},{dimensions,[{<<"Cluster">>,<<"rabbit@seth-laptop">>},{<<"Metric">>,<<"Queue">>},{<<"Queue">>,<<"cw-export">>},{<<"VHost">>,<<"/">>}]}],[{metric_name,<<"Publish">>},{unit,<<"Count">>},{value,1},{dimensions,[{<<"Cluster">>,<<"rabbit@seth-laptop">>},{<<"Metric">>,<<"Queue">>},{<<"Queue">>,<<"cw-export">>},{<<"VHost">>,<<"/">>}]}],[{metric_name,<<"Memory">>},{unit,<<"Bytes">>},{value,13724},{dimensions,[{<<"Cluster">>,<<"rabbit@seth-laptop">>},{<<"Metric">>,<<"Queue">>},{<<"Queue">>,<<"game_topic">>},{<<"VHost">>,<<"/">>}]}],[{metric_name,<<"Consumers">>},{unit,<<"Count">>},{value,0},{dimensions,[{<<"Cluster">>,<<"rabbit@seth-laptop">>},{<<"Metric">>,<<"Queue">>},{<<"Queue">>,<<"game_topic">>},{<<"VHost">>,<<"/">>}]}],[{metric_name,<<"MessagesReady">>},{unit,<<"Count">>},{value,0},{dimensions,[{<<"Cluster">>,<<"rabbit@seth-laptop">>},{<<"Metric">>,<<"Queue">>},{<<"Queue">>,<<"game_topic">>},{<<"VHost">>,<<"/">>}]}],[{metric_name,<<"MessagesUnacknowledged">>},{unit,<<"Count">>},{value,0},{dimensions,[{<<"Cluster">>,<<"rabbit@seth-laptop">>},{<<"Metric">>,<<"Queue">>},{<<"Queue">>,<<"game_topic">>},{<<"VHost">>,<<"/">>}]}],[{metric_name,<<"Messages">>},{unit,<<"Count">>},{value,0},{dimensions,[{<<"Cluster">>,<<"rabbit@seth-laptop">>},{<<"Metric">>,<<"Queue">>},{<<"Queue">>,<<"game_topic">>},{<<"VHost">>,<<"/">>}]}],[{metric_name,<<"Memory">>},{unit,<<"Bytes">>},{value,13724},{dimensions,[{<<"Cluster">>,<<"rabbit@seth-laptop">>},{<<"Metric">>,<<"Queue">>},{<<"Queue">>,<<"testqueue">>},{<<"VHost">>,<<"/">>}]}],[{metric_name,<<"Consumers">>},{unit,<<"Count">>},{value,0},{dimensions,[{<<"Cluster">>,<<"rabbit@seth-laptop">>},{<<"Metric">>,<<"Queue">>},{<<"Queue">>,<<"testqueue">>},{<<"VHost">>,<<"/">>}]}],[{metric_name,<<"MessagesReady">>},{unit,<<"Count">>},{value,0},{dimensions,[{<<"Cluster">>,<<"rabbit@seth-laptop">>},{<<"Metric">>,<<"Queue">>},{<<"Queue">>,<<"testqueue">>},{<<"VHost">>,<<"/">>}]}],[{metric_name,<<"MessagesUnacknowledged">>},{unit,<<"Count">>},{value,0},{dimensions,[{<<"Cluster">>,<<"rabbit@seth-laptop">>},{<<"Metric">>,<<"Queue">>},{<<"Queue">>,<<"testqueue">>},{<<"VHost">>,<<"/">>}]}],[{metric_name,<<"Messages">>},{unit,<<"Count">>},{value,0},{dimensions,[{<<"Cluster">>,<<"rabbit...">>},...]}],...]
2019-12-16 14:40:14.366 [error] <0.1501.0> ** Generic server rabbitmq_cloudwatch_exporter terminating 
** Last message in was export_metrics
** When Server state == [[{period,15},{collectors,[overview,queue,node,exchange]},{namespace,<<"RabbitMQ">>}],[{secret_access_key,<<"">>},{access_key_id,<<"">>},{region,<<"us-east-1">>}]]
** Reason for termination == 
** {#{'__exception__' => true,'__struct__' => 'Elixir.ExAws.Error',message => <<"ExAws Request Error!\n\n{:error, {:http_error, 400, %{body: \"<ErrorResponse xmlns=\\\"http://monitoring.amazonaws.com/doc/2010-08-01/\\\">\\n  <Error>\\n    <Type>Sender</Type>\\n    <Code>MissingParameter</Code>\\n    <Message>The parameter MetricData.member.12.Dimensions.member.3.Value is required.\\nThe parameter MetricData.member.13.Dimensions.member.3.Value is required.</Message>\\n  </Error>\\n  <RequestId>65a8c6d4-a5e2-42ea-bafc-3d52d8a6aee3</RequestId>\\n</ErrorResponse>\\n\", headers: [{\"x-amzn-RequestId\", \"65a8c6d4-a5e2-42ea-bafc-3d52d8a6aee3\"}, {\"Content-Type\", \"text/xml\"}, {\"Content-Length\", \"401\"}, {\"Date\", \"Mon, 16 Dec 2019 21:40:13 GMT\"}, {\"Connection\", \"close\"}], status_code: 400}}}\n">>},[{'Elixir.ExAws','request!',2,[{file,"lib/ex_aws.ex"},{line,66}]},{'Elixir.Enum','-map/2-lists^map/1-0-',2,[{file,"lib/enum.ex"},{line,1327}]},{'Elixir.Enum','-map/2-lists^map/1-0-',2,[{file,"lib/enum.ex"},{line,1327}]},{'Elixir.RabbitMQCloudWatchExporter.Exporter',handle_info,2,[{file,"lib/rabbitmq_cloudwatch_exporter/exporter.ex"},{line,61}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,637}]},{gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,711}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]}
2019-12-16 14:40:14.367 [error] <0.1501.0> CRASH REPORT Process <0.1501.0> with 1 neighbours crashed with reason: #{'__exception__' => true,'__struct__' => 'Elixir.ExAws.Error',message => <<"ExAws Request Error!\n\n{:error, {:http_error, 400, %{body: \"<ErrorResponse xmlns=\\\"http://monitoring.amazonaws.com/doc/2010-08-01/\\\">\\n  <Error>\\n    <Type>Sender</Type>\\n    <Code>MissingParameter</Code>\\n    <Message>The parameter MetricData.member.12.Dimensions.member.3.Value is required.\\nThe parameter MetricData.member.13.Dimensions.member.3.Value is required.</Message>\\n  </Error>\\n  <RequestId>65a8c6d4-a5e2-42e...">>} in 'Elixir.ExAws':'request!'/2 line 66
2019-12-16 14:40:14.367 [error] <0.343.0> Supervisor 'Elixir.Singleton.Supervisor' had child 'Elixir.Singleton.Manager' started with 'Elixir.Singleton.Manager':start_link('Elixir.RabbitMQCloudWatchExporter.Exporter', [], rabbitmq_cloudwatch_exporter, 'singleton_0bpF3SXqPZTQc80tAf+lu7E5kqQ') at <0.1500.0> exit with reason #{'__exception__' => true,'__struct__' => 'Elixir.ExAws.Error',message => <<"ExAws Request Error!\n\n{:error, {:http_error, 400, %{body: \"<ErrorResponse xmlns=\\\"http://monitoring.amazonaws.com/doc/2010-08-01/\\\">\\n  <Error>\\n    <Type>Sender</Type>\\n    <Code>MissingParameter</Code>\\n    <Message>The parameter MetricData.member.12.Dimensions.member.3.Value is required.\\nThe parameter MetricData.member.13.Dimensions.member.3.Value is required.</Message>\\n  </Error>\\n  <RequestId>65a8c6d4-a5e2-42e...">>} in 'Elixir.ExAws':'request!'/2 line 66 in context child_terminated
noxdafox commented 4 years ago

Thanks for providing the above. Does it fail consistently on MetricData.member.12.Dimensions.member.3.Value or is it failing on different values every time?

Looking at the metric itself it seems to miss the value for the queue cw-export which actually is present in the data.

This kind of suggests me a problem with truncation or internal batching the API might be doing.

SethPyle376 commented 4 years ago

Glancing up through the logs I'm only seeing it fail on MetricData.member.13.Dimensions.member.3.Value

SethPyle376 commented 4 years ago

One more piece of info I just learned that might help. This only happens when I publish on the AMQP default exchange. Publishing on other exchanges works fine. Perhaps AWS doesn't like the name of that default exchange (I think it's "/") perhaps you need to switch it to %2F before you send it over

noxdafox commented 4 years ago

I tested the output you provided and it worked. Issue is it's truncated and is probably missing the offending data points (no exchange related metrics where logged).

I built a new one hopefully avoiding the truncation issue. Could you please try this out?

plugin.zip

SethPyle376 commented 4 years ago

@noxdafox Sorry for the delay. I'm still getting the error using the build you included in your last comment after publishing a message to a default exchange when I have exchange metrics enabled.

noxdafox commented 4 years ago

The last build is not trying to address the error but to better print the cause. The previous output you provided me is truncated so I slightly modified the routine to avoid the truncation.

Could you please paste me the new output you see on the log?

You should now see this message:

Publishing CW Metrics chunk: ...

Followed by the error traceback. Could you please provide me the last message above before seeing the error traceback?

SethPyle376 commented 4 years ago

Oh gotcha sorry for the misunderstanding. I attached the logs. I noticed there is an exchange key with no value in there. cloudwatch_exporter_logs.txt

noxdafox commented 4 years ago

There we go! The default exchange has an empty string as name which is invalid as CW metric dimension. This went unnoticed as RMQ reveals exchange metrics only for exchanges which are being in use.

That's the reason why I was struggling to reproduce it. I had to trigger some activity with the default exchange in order to observe the issue.

As a solution I opted for a _ as default exchange name as it seemed the less prone to name collision (default might have been used as exchange name by somebody).

The new logic also avoids crashing but rather continues publishing the rest of the metrics and logs the problematic group of metrics.