Closed: erwindon closed this issue 2 months ago
@erwindon your proposal makes sense to me. ArtemisPrometheusMetricsPluginServlet could catch the exceptions caused by registry.scrape(), write those exceptions to the log at DEBUG level, and return SC_INTERNAL_SERVER_ERROR. Are you planning to contribute a PR?
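The proposed handling could be sketched like this (a hypothetical helper, not the actual servlet code; the real doGet() would write to the HttpServletResponse and use a proper logger):

```java
import java.util.function.Supplier;

// Sketch of the proposed error handling for the metrics servlet:
// catch exceptions from the scrape, log them, and return 500 instead
// of letting a stack trace escape. Names here are illustrative only.
public class ScrapeHandler {

    // Returns the HTTP status code doGet() would set: 200 with the scrape
    // text on success, 500 (SC_INTERNAL_SERVER_ERROR) when scraping throws.
    public static int handleScrape(Supplier<String> scrape, StringBuilder body) {
        try {
            body.append(scrape.get());
            return 200; // HttpServletResponse.SC_OK
        } catch (Exception ex) {
            // As proposed: log at DEBUG level; the response carries the status.
            System.err.println("DEBUG: metrics scrape failed: " + ex);
            return 500; // HttpServletResponse.SC_INTERNAL_SERVER_ERROR
        }
    }
}
```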
> Are you planning to contribute with a PR?
I did some research on that and at the time could not determine the best approach. There were doubts about where to catch the error: in a specific place (deep down) or in a generic place (at the top). And there were doubts about how to distinguish this cause from any other cause.
But all these questions have been answered by your reply, so creating a PR is now (almost) possible for me.
@brusdev
One remaining question though: I could not spot any logging code in the plugin, so how should errors be logged? Do I just take the log4j dependencies from the main Artemis project and then use LOG.debug?
@erwindon yes, that solution would require a provided dependency on org.slf4j:slf4j-api to get a logger, i.e.
```java
private static final Logger logger = LoggerFactory.getLogger(MethodHandles.lookup().lookupClass());
```
@jbertram WDYT?
@brusdev, catching any exception and logging it would certainly be preferable to dumping a stack trace like this. I'm thinking the logging should be done at DEBUG level. Once the logging is added it would be useful to also update the documentation to let users know that they can enable DEBUG logging if they are having trouble.
FYI: while testing I had to fix a local problem, "Error injecting: org.apache.maven.plugin.war.WarMojo", solved by following https://stackoverflow.com/questions/66920567/error-injecting-org-apache-maven-plugin-war-warmojo
@brusdev @jbertram
> I'm thinking the logging should be done at DEBUG level.
Using only that would completely hide the error on the broker side. Since it is a catch-all, it may hide any other problem besides the original problem from this issue. I propose to still log the original WARNing, but with just the exception name and message, and additionally provide the stack trace in a DEBUG log.
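The two-level logging proposed here could look roughly like this (a sketch with hypothetical names; java.util.logging is used only to keep the example self-contained, while the plugin would use SLF4J with warn()/debug() as discussed above):

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Sketch: a short WARN with only the exception name and message, plus the
// full stack trace at DEBUG-equivalent level.
public class TwoLevelLogging {
    private static final Logger logger = Logger.getLogger(TwoLevelLogging.class.getName());

    // Short form that is always visible in the broker log: no stack trace.
    static String shortForm(Exception ex) {
        return ex.getClass().getSimpleName() + ": " + ex.getMessage();
    }

    static void logScrapeFailure(Exception ex) {
        logger.warning("Could not collect metrics: " + shortForm(ex));
        // Full stack trace only when FINE (the DEBUG equivalent) is enabled.
        logger.log(Level.FINE, "Metrics collection failure details", ex);
    }
}
```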
@erwindon, that sounds good.
@brusdev @jbertram it's not as simple as we think...
The stack trace does not seem to be printed by the framework that calls doGet.
Using the code below, the stack trace is always printed in combination with "OK", and "ERROR" never occurs.
(simplified) code from ArtemisPrometheusMetricsPluginServlet.java:

```java
try (Writer writer = resp.getWriter()) {
    writer.write(registry.scrape());
    writer.flush();
    logger.error("OK");
} catch (Exception ex) {
    logger.error("ERROR", ex);
}
```
I then inspected the code of micrometer-metrics/micrometer. There, class DefaultGauge, around line 50, has a log statement that produces the warning that we see: the exception is not thrown, only logged. This fully explains why/how the stack trace is produced.
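The observed behavior can be paraphrased in a self-contained sketch (this is not the actual micrometer source, just an illustration of the pattern: the value function's exception is caught and logged by the registry itself, and NaN is reported, so the scrape still "succeeds"):

```java
import java.util.function.Supplier;

// Paraphrase of what DefaultGauge.value() effectively does when the
// value function throws: log the failure, swallow it, return NaN.
public class GaugeSketch {
    static double value(Supplier<Double> valueFunction) {
        try {
            return valueFunction.get();
        } catch (Throwable ex) {
            // This is where the warning (with stack trace) is produced.
            System.err.println("Failed to apply the value function for the gauge: " + ex);
            return Double.NaN; // the exception never reaches doGet()
        }
    }
}
```

This is why the servlet's own try/catch never fires: by the time registry.scrape() returns, the exception has already been consumed inside micrometer.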
I think the best solution is to somehow call server.isStarted() before trying to collect the actual metrics, and to return the HttpServletResponse.SC_INTERNAL_SERVER_ERROR result immediately when the broker is not started.
Obviously, that is easier said than done, because it will need changes in both Artemis and this plugin.
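The guard described above could be sketched like this (hypothetical names; `isStarted` stands in for whatever the plugin would eventually query on the Artemis server, and the real code would operate on the HttpServletResponse):

```java
import java.util.function.BooleanSupplier;
import java.util.function.Supplier;

// Sketch: check the broker state before scraping; return 500 immediately
// when the broker is not (yet, or no longer) started.
public class GuardedScrape {

    // Returns the status code that doGet() would set.
    static int scrapeStatus(BooleanSupplier isStarted, Supplier<String> scrape, StringBuilder out) {
        if (!isStarted.getAsBoolean()) {
            return 500; // HttpServletResponse.SC_INTERNAL_SERVER_ERROR
        }
        out.append(scrape.get());
        return 200; // HttpServletResponse.SC_OK
    }
}
```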
@erwindon a solution could be to add a registered method to ActiveMQMetricsPlugin (see a POC: https://github.com/apache/activemq-artemis/compare/main...brusdev:activemq-artemis:register_metrics_plugin?expand=1) and to extend PrometheusMeterRegistry to add the server property (see a POC: https://github.com/rh-messaging/artemis-prometheus-metrics-plugin/compare/add_artemis_registry?expand=1).
However, this solution cannot prevent those exceptions while the server is stopping.
@brusdev that looks very much like the solution I tried to set up (but did not complete), so instant happiness.
`unregistered()` does not seem to be used at this moment?
Note that there is still a very small risk of a race condition, as the server may go down immediately after the plugin verified isStarted() but before all metrics are actually collected. That is acceptable, also because it is not worse than the current situation.
> However this solution could not prevent those exceptions on server stopping.
The exceptions occur because the internal metrics functions all inspect server.isStarted(). Since that test is also added here, my expectation was that it would work for both startup and shutdown?
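The failure mode being discussed can be illustrated with a small sketch (paraphrased behavior with hypothetical names, not the actual Artemis source): the management-based value functions check the broker state first and throw, and that exception is what micrometer ends up logging.

```java
// Sketch: a metric value function that refuses to answer while the broker
// is not started, throwing the IllegalStateException seen in the logs.
public class ManagedMetric {
    static long queueMessageCount(boolean brokerStarted, long count) {
        if (!brokerStarted) {
            throw new IllegalStateException("Broker is not started. It can not be managed yet");
        }
        return count;
    }
}
```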
@erwindon TL;DR: I wrote the POCs only to validate the solution; feel free to reuse, change or ignore them 😀
I confirm unregistered() is not required.
> However this solution could not prevent those exceptions on server stopping.
I was referring to the very small risk of a race condition that you described better than me 😅
> feel free to reuse, change or ignore them
Lazy me was assuming the work was almost done now :-) I understand that you want me to turn this into two actual PRs?
@erwindon I could try to create the PR for artemis the next week if there is no rush.
@brusdev that would be great, as I cannot easily build a proper Artemis release. No problem for the artemis-prometheus-metrics-plugin though.
@erwindon I created the PR for artemis: https://github.com/apache/activemq-artemis/pull/4645
@jbertram has already merged https://github.com/apache/activemq-artemis/pull/4645, it will be included in ActiveMQ Artemis 2.32.0
@erwindon ActiveMQ Artemis 2.32.0 was released, are you working on this PR?
> @erwindon ActiveMQ Artemis 2.32.0 was released, are you working on this PR?
I did not jump on this one immediately, but otherwise yes.
@brusdev unfortunately, I have to give up on providing a full PR for this one. There is no problem with adding the real extra check, but the introduction of the ArtemisPrometheusMeterRegistry class for additional administration causes problems in the dependencies between the 2 subprojects. There may be a simple solution there (in Java or Maven), but I just don't see it.
@erwindon I created a PR from the POC that I did; could you review it?
Resolved by 52a1032eab354382fd522b9606916c2da498f3cc
When metrics are collected before the broker is completely started, the exception "IllegalStateException: Broker is not started. It can not be managed yet" may be raised. The exact occurrence frequency depends on the polling frequency of the metrics collector, and the same may happen when a "stop" command is given.
The error message looks like this:
That is a big, scary stack trace for something relatively common.
I propose that this stack trace be suppressed...