Closed: erwindon closed this issue 2 months ago
@erwindon your proposal makes sense to me. ArtemisPrometheusMetricsPluginServlet could catch the exceptions caused by registry.scrape(), write those exceptions to the log at DEBUG level, and return SC_INTERNAL_SERVER_ERROR. Are you planning to contribute a PR?
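The proposed handling could be sketched like this (a hypothetical helper, not the actual servlet code; the real doGet() would write to the HttpServletResponse and use a proper logger):

```java
import java.util.function.Supplier;

// Sketch of the proposed error handling for the metrics servlet:
// catch exceptions from the scrape, log them, and return 500 instead
// of letting a stack trace escape. Names here are illustrative only.
public class ScrapeHandler {

    // Returns the HTTP status code doGet() would set: 200 with the scrape
    // text on success, 500 (SC_INTERNAL_SERVER_ERROR) when scraping throws.
    public static int handleScrape(Supplier<String> scrape, StringBuilder body) {
        try {
            body.append(scrape.get());
            return 200; // HttpServletResponse.SC_OK
        } catch (Exception ex) {
            // As proposed: log at DEBUG level; the response carries the status.
            System.err.println("DEBUG: metrics scrape failed: " + ex);
            return 500; // HttpServletResponse.SC_INTERNAL_SERVER_ERROR
        }
    }
}
```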
> Are you planning to contribute with a PR?
I did some research on that and at the time could not determine the best approach. There were doubts about where to catch the error: in a specific place (deep down) or in a generic place (at the top). And there were doubts about how to distinguish this cause from any other cause.
But all these questions have been answered by your reply, so creating a PR is now (almost) possible for me.
@brusdev
One remaining question though: I could not spot any logging code in the plugin, so how should errors be logged? Do I just take the log4j dependencies from the main Artemis project and then use LOG.debug?
@erwindon yes, that solution would require a provided dependency on org.slf4j:slf4j-api to get a logger, i.e.
```java
private static final Logger logger = LoggerFactory.getLogger(MethodHandles.lookup().lookupClass());
```
@jbertram WDYT?
@brusdev, catching any exception and logging it would certainly be preferable to dumping a stack trace like this. I'm thinking the logging should be done at DEBUG level. Once the logging is added it would be useful to also update the documentation to let users know that they can enable DEBUG logging if they are having trouble.
FYI: while testing I had to fix a local problem, "Error injecting: org.apache.maven.plugin.war.WarMojo", solved by following https://stackoverflow.com/questions/66920567/error-injecting-org-apache-maven-plugin-war-warmojo
@brusdev @jbertram
> I'm thinking the logging should be done at DEBUG level.
Using only that would completely hide the error on the broker side. Since it is a catch-all, it may hide any other problem besides the original problem from this issue. I propose to still log the original WARNing, but with just the exception name and message, and additionally provide the stack trace in a DEBUG log.
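The two-level logging proposed here could look roughly like this (a sketch with hypothetical names; java.util.logging is used only to keep the example self-contained, while the plugin would use SLF4J with warn()/debug() as discussed above):

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Sketch: a short WARN with only the exception name and message, plus the
// full stack trace at DEBUG-equivalent level.
public class TwoLevelLogging {
    private static final Logger logger = Logger.getLogger(TwoLevelLogging.class.getName());

    // Short form that is always visible in the broker log: no stack trace.
    static String shortForm(Exception ex) {
        return ex.getClass().getSimpleName() + ": " + ex.getMessage();
    }

    static void logScrapeFailure(Exception ex) {
        logger.warning("Could not collect metrics: " + shortForm(ex));
        // Full stack trace only when FINE (the DEBUG equivalent) is enabled.
        logger.log(Level.FINE, "Metrics collection failure details", ex);
    }
}
```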
@erwindon, that sounds good.
@brusdev @jbertram it's not as simple as we think...
The stack trace does not seem to be printed by the framework that calls doGet.
Using the code below, the stack trace is always printed in combination with "OK", and "ERROR" never occurs.
(simplified) code from ArtemisPrometheusMetricsPluginServlet.java:

```java
try (Writer writer = resp.getWriter()) {
    writer.write(registry.scrape());
    writer.flush();
    logger.error("OK");
} catch (Exception ex) {
    logger.error("ERROR", ex);
}
```
I then inspected the code of micrometer-metrics/micrometer. There, class DefaultGauge, around line 50, has a log statement that produces the warning that we see: the exception is not thrown, only logged. This fully explains why/how the stack trace is produced.
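The observed behavior can be paraphrased in a self-contained sketch (this is not the actual micrometer source, just an illustration of the pattern: the value function's exception is caught and logged by the registry itself, and NaN is reported, so the scrape still "succeeds"):

```java
import java.util.function.Supplier;

// Paraphrase of what DefaultGauge.value() effectively does when the
// value function throws: log the failure, swallow it, return NaN.
public class GaugeSketch {
    static double value(Supplier<Double> valueFunction) {
        try {
            return valueFunction.get();
        } catch (Throwable ex) {
            // This is where the warning (with stack trace) is produced.
            System.err.println("Failed to apply the value function for the gauge: " + ex);
            return Double.NaN; // the exception never reaches doGet()
        }
    }
}
```

This is why the servlet's own try/catch never fires: by the time registry.scrape() returns, the exception has already been consumed inside micrometer.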
I think the best solution is to somehow call server.isStarted() before trying to collect the actual metrics, and to return the HttpServletResponse.SC_INTERNAL_SERVER_ERROR result immediately when the broker is not started.
Obviously, that is easier said than done, because it will need changes in both Artemis and this plugin.
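The guard described above could be sketched like this (hypothetical names; `isStarted` stands in for whatever the plugin would eventually query on the Artemis server, and the real code would operate on the HttpServletResponse):

```java
import java.util.function.BooleanSupplier;
import java.util.function.Supplier;

// Sketch: check the broker state before scraping; return 500 immediately
// when the broker is not (yet, or no longer) started.
public class GuardedScrape {

    // Returns the status code that doGet() would set.
    static int scrapeStatus(BooleanSupplier isStarted, Supplier<String> scrape, StringBuilder out) {
        if (!isStarted.getAsBoolean()) {
            return 500; // HttpServletResponse.SC_INTERNAL_SERVER_ERROR
        }
        out.append(scrape.get());
        return 200; // HttpServletResponse.SC_OK
    }
}
```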
@erwindon a solution could be to add a registered method to ActiveMQMetricsPlugin (see a POC: https://github.com/apache/activemq-artemis/compare/main...brusdev:activemq-artemis:register_metrics_plugin?expand=1) and to extend PrometheusMeterRegistry to add the server property (see a POC: https://github.com/rh-messaging/artemis-prometheus-metrics-plugin/compare/add_artemis_registry?expand=1).
However, this solution cannot prevent those exceptions while the server is stopping.
@brusdev that looks very much like the solution I tried to set up (but did not complete), so instant happiness.
`unregistered()` does not seem to be used at this moment?
Note that there is still a very small risk of a race condition, as the server may go down immediately after the plugin verified isStarted() but before all metrics are actually collected. That is acceptable, also because it is not worse than the current situation.
> However this solution could not prevent those exceptions on server stopping.
The exceptions occur because the internal metrics functions all inspect server.isStarted(). Since that test is also added here, my expectation was that it would work for both startup and shutdown?
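The failure mode being discussed can be illustrated with a small sketch (paraphrased behavior with hypothetical names, not the actual Artemis source): the management-based value functions check the broker state first and throw, and that exception is what micrometer ends up logging.

```java
// Sketch: a metric value function that refuses to answer while the broker
// is not started, throwing the IllegalStateException seen in the logs.
public class ManagedMetric {
    static long queueMessageCount(boolean brokerStarted, long count) {
        if (!brokerStarted) {
            throw new IllegalStateException("Broker is not started. It can not be managed yet");
        }
        return count;
    }
}
```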
@erwindon TL;DR: I wrote the POCs only to validate the solution; feel free to reuse, change or ignore them 😀
I confirm unregistered() is not required.
> However this solution could not prevent those exceptions on server stopping.
I was referring to the very small risk of a race condition that you described better than me 😅
> feel free to reuse, change or ignore them
Lazy me was assuming the work was almost done now :-) I understand that you want me to turn this into two actual PRs?
@erwindon I could try to create the PR for artemis the next week if there is no rush.
@brusdev that would be great, as I cannot easily build a proper Artemis release. No problem for the artemis-prometheus-metrics-plugin though.
@erwindon I created the PR for artemis: https://github.com/apache/activemq-artemis/pull/4645
@jbertram has already merged https://github.com/apache/activemq-artemis/pull/4645, it will be included in ActiveMQ Artemis 2.32.0
@erwindon ActiveMQ Artemis 2.32.0 was released, are you working on this PR?
> @erwindon ActiveMQ Artemis 2.32.0 was released, are you working on this PR?
I did not jump on this one immediately, but otherwise yes.
@brusdev unfortunately, I have to give up on providing a full PR for this one. There is no problem with adding the real extra check, but the introduction of the ArtemisPrometheusMeterRegistry class for additional administration causes problems in the dependencies between the 2 subprojects. There may be a simple solution there (in Java or Maven), but I just don't see it.
@erwindon I created a PR from the POC that I did; could you review it?
Resolved by 52a1032eab354382fd522b9606916c2da498f3cc
When metrics are collected before the broker is completely started, the exception "IllegalStateException: Broker is not started. It can not be managed yet" may be raised. The exact occurrence frequency depends on the polling frequency of the metrics collector, and the same may happen when a "stop" command is given.
The error message looks like this:
That is a big, scary stack trace for something relatively common.
I propose that this stack trace be suppressed...