micrometer-metrics / micrometer

An application observability facade for the most popular observability tools. Think SLF4J, but for observability.
https://micrometer.io
Apache License 2.0
4.48k stars 991 forks source link

Issue with stackdriver in case publish method of StackdriverMeterRegistry is called with too litle time distance #4763

Open juergencodes opened 8 months ago

juergencodes commented 8 months ago

Describe the bug This is a follow up of https://github.com/micrometer-metrics/micrometer/issues/4353. We tried the new version 1.12.3, for which the close now happens in a proper way. However, this revealed a new problem: When StackdriverMeterRegistry is closed, there is an error from GCP that complains about two or more time series with timestamps too close to each other.

com.google.api.gax.rpc.InvalidArgumentException: io.grpc.StatusRuntimeException: INVALID_ARGUMENT: One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric.: global{} timeSeries[49]: custom.googleapis.com/spring/batch/step/active/activeTasks{spring_batch_step_status:EXECUTING,spring_batch_step_name:pubsub,instanceId:20240221074957,application:my-application,environment:develop,spring_batch_step_job_name:my-job}; One or more points were written more frequently than the maximum sampling period configured for the metric.: global{} timeSeries[22]: custom.googleapis.com/spring/batch/step/max{spring_batch_step_name:pubsub,spring_batch_step_status:COMPLETED,instanceId:20240221074957,error:none,application:my-application,environment:develop,spring_batch_step_job_name:my-job};
...

custom.googleapis.com/hikaricp/connections/min{instanceId:20240221074957,application:my-application,environment:develop,pool:HikariPool-1}
    at com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:92) ~[gax-2.41.0.jar:2.41.0]
    at com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:98) ~[gax-grpc-2.41.0.jar:2.41.0]
    at com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:66) ~[gax-grpc-2.41.0.jar:2.41.0]
    at com.google.api.gax.grpc.GrpcExceptionCallable$ExceptionTransformingFuture.onFailure(GrpcExceptionCallable.java:97) ~[gax-grpc-2.41.0.jar:2.41.0]
    at com.google.api.core.ApiFutures$1.onFailure(ApiFutures.java:84) ~[api-common-2.24.0.jar:2.24.0]
    at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1127) ~[guava-32.1.3-jre.jar:na]
    at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31) ~[guava-32.1.3-jre.jar:na]
    at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1286) ~[guava-32.1.3-jre.jar:na]
    at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1055) ~[guava-32.1.3-jre.jar:na]
    at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:807) ~[guava-32.1.3-jre.jar:na]
    at io.grpc.stub.ClientCalls$GrpcFuture.setException(ClientCalls.java:568) ~[grpc-stub-1.61.0.jar:1.61.0]
    at io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose(ClientCalls.java:538) ~[grpc-stub-1.61.0.jar:1.61.0]
    at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39) ~[grpc-api-1.61.0.jar:1.61.0]
    at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23) ~[grpc-api-1.61.0.jar:1.61.0]
    at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40) ~[grpc-api-1.61.0.jar:1.61.0]
    at com.google.api.gax.grpc.ChannelPool$ReleasingClientCall$1.onClose(ChannelPool.java:570) ~[gax-grpc-2.41.0.jar:2.41.0]
    at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:574) ~[grpc-core-1.61.0.jar:1.61.0]
    at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:72) ~[grpc-core-1.61.0.jar:1.61.0]
    at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:742) ~[grpc-core-1.61.0.jar:1.61.0]
    at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:723) ~[grpc-core-1.61.0.jar:1.61.0]
    at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) ~[grpc-core-1.61.0.jar:1.61.0]
    at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133) ~[grpc-core-1.61.0.jar:1.61.0]
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[na:na]
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[na:na]
    at java.base/java.lang.Thread.run(Thread.java:1583) ~[na:na]
    Suppressed: com.google.api.gax.rpc.AsyncTaskException: Asynchronous task failed
        at com.google.api.gax.rpc.ApiExceptions.callAndTranslateApiException(ApiExceptions.java:57) ~[gax-2.41.0.jar:2.41.0]
        at com.google.api.gax.rpc.UnaryCallable.call(UnaryCallable.java:112) ~[gax-2.41.0.jar:2.41.0]
        at com.google.cloud.monitoring.v3.MetricServiceClient.createTimeSeries(MetricServiceClient.java:1906) ~[google-cloud-monitoring-3.35.0.jar:3.35.0]
        at io.micrometer.stackdriver.StackdriverMeterRegistry.publish(StackdriverMeterRegistry.java:222) ~[micrometer-registry-stackdriver-1.12.3.jar:1.12.3]
        at io.micrometer.core.instrument.push.PushMeterRegistry.publishSafelyOrSkipIfInProgress(PushMeterRegistry.java:64) ~[micrometer-core-1.12.3.jar:1.12.3]
        at io.micrometer.core.instrument.push.PushMeterRegistry.close(PushMeterRegistry.java:133) ~[micrometer-core-1.12.3.jar:1.12.3]
        at io.micrometer.core.instrument.step.StepMeterRegistry.close(StepMeterRegistry.java:163) ~[micrometer-core-1.12.3.jar:1.12.3]
        at io.micrometer.stackdriver.StackdriverMeterRegistry.close(StackdriverMeterRegistry.java:143) ~[micrometer-registry-stackdriver-1.12.3.jar:1.12.3]
        at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103) ~[na:na]
        at java.base/java.lang.reflect.Method.invoke(Method.java:580) ~[na:na]
        at org.springframework.beans.factory.support.DisposableBeanAdapter.invokeCustomDestroyMethod(DisposableBeanAdapter.java:316) ~[spring-beans-6.1.3.jar:6.1.3]
        at org.springframework.beans.factory.support.DisposableBeanAdapter.destroy(DisposableBeanAdapter.java:249) ~[spring-beans-6.1.3.jar:6.1.3]
        at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.destroyBean(DefaultSingletonBeanRegistry.java:587) ~[spring-beans-6.1.3.jar:6.1.3]
        at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.destroySingleton(DefaultSingletonBeanRegistry.java:559) ~[spring-beans-6.1.3.jar:6.1.3]
        at org.springframework.beans.factory.support.DefaultListableBeanFactory.destroySingleton(DefaultListableBeanFactory.java:1202) ~[spring-beans-6.1.3.jar:6.1.3]
        at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.destroySingletons(DefaultSingletonBeanRegistry.java:520) ~[spring-beans-6.1.3.jar:6.1.3]
        at org.springframework.beans.factory.support.DefaultListableBeanFactory.destroySingletons(DefaultListableBeanFactory.java:1195) ~[spring-beans-6.1.3.jar:6.1.3]
        at org.springframework.context.support.AbstractApplicationContext.destroyBeans(AbstractApplicationContext.java:1183) ~[spring-context-6.1.3.jar:6.1.3]
        at org.springframework.context.support.AbstractApplicationContext.doClose(AbstractApplicationContext.java:1144) ~[spring-context-6.1.3.jar:6.1.3]
        at org.springframework.context.support.AbstractApplicationContext.close(AbstractApplicationContext.java:1090) ~[spring-context-6.1.3.jar:6.1.3]
        at org.springframework.boot.SpringApplicationShutdownHook.closeAndWait(SpringApplicationShutdownHook.java:145) ~[spring-boot-3.2.2.jar:3.2.2]
        at java.base/java.lang.Iterable.forEach(Iterable.java:75) ~[na:na]
        at org.springframework.boot.SpringApplicationShutdownHook.run(SpringApplicationShutdownHook.java:114) ~[spring-boot-3.2.2.jar:3.2.2]
        ... 1 common frames omitted

We suspect that this is caused by following reasons:

Environment

To Reproduce Simply using StackdriverMeterRegistry will cause the issue.

Expected behavior Close methods of PushMeterRegistry and StepMeterRegistry should be fixed in a way that the publish method will get called only once. Moreoever, there should be a new config parameter in StackdriverConfig to specify the minimum time between two calls to its publish method. In case there is a call before the configured amount of time has passed, the method should sleep before doing the publish. Probably, this must happen before the local class Batch is instantiated, as this actually sets the timestamp of the time series.

Additional context This problem only appeared after https://github.com/micrometer-metrics/micrometer/issues/4353 was fixed.

shakuzen commented 8 months ago

Thanks for the report. I'm surprised other users didn't report this before when I asked for verification of the fix in https://github.com/micrometer-metrics/micrometer/issues/4353. Regardless, this is the kind of thing that could have been caught if we had an integration test setup with Stackdriver. I'll look into the feasibility of getting that setup.

Close methods of PushMeterRegistry and StepMeterRegistry should be fixed in a way that the publish method will get called only once.

We probably would not change those classes to fix a limitation that only affects Stackdriver. The StackdriverMeterRegistry could have its own implementation of close that does not delegate to those parent classes. I'm not immediately sure what it would look like, though.

Moreoever, there should be a new config parameter in StackdriverConfig to specify the minimum time between two calls to its publish method. In case there is a call before the configured amount of time has passed, the method should sleep before doing the publish.

I prefer to not make things configurable unless they need to be and there is enough benefit. It's a global Stackdriver limitation that reported time series can't be within a certain time of each other, right? If so, the minimum time is fixed and doesn't need to be configurable. That said, I don't especially like the idea of sleeping while something is waiting for the registry to close, but if it is short enough and there isn't a better way to solve this, it may be unavoidable. What is the minimum time difference required?

juergencodes commented 8 months ago

We probably would not change those classes to fix a limitation that only affects Stackdriver. The StackdriverMeterRegistry could have its own implementation of close that does not delegate to those parent classes. I'm not immediately sure what it would look like, though.

That's fair enough. My thought was most targeting towards if this double publish could potentially lead to issues for other technology adapters that I of course do not overview. I'm a bit skeptical with respect to an own implementation of close in StackdriverMeterRegistry, because you would again have to deal with the things that the super classes do, including the stuff in MeterRegistry. To my understanding, it would then make sense to follow only the path that StackdriverMeterRegistry makes sure to avoid time series with too little distance - no matter if it resulted from the two publish while closing, timing issues as described above or any other reason. I will change the title of the issue likewise.

Moreoever, there should be a new config parameter in StackdriverConfig to specify the minimum time between two calls to its publish method. In case there is a call before the configured amount of time has passed, the method should sleep before doing the publish.

I prefer to not make things configurable unless they need to be and there is enough benefit. It's a global Stackdriver limitation that reported time series can't be within a certain time of each other, right? If so, the minimum time is fixed and doesn't need to be configurable. That said, I don't especially like the idea of sleeping while something is waiting for the registry to close, but if it is short enough and there isn't a better way to solve this, it may be unavoidable. What is the minimum time difference required?

I totally second to avoid unnecessary config options. However, as far as I know, the minimum time is configurable in GCP. For our setup I got the value of 5 s. I will talk with the team later to see if we have more meaningful information on this.

PS: While trying workarounds for this issue, we of course thought of subclassing StackdriverMeterRegistry to add the sleep logic in an override implementation. However, this does not work for the fact that StackdriverMeterRegistry most important constructor is private and available to the builder only. Hence, we were not able to instantiate it with MetricServiceSettings. In case you do not see anything that contradicts making the second constructor protected - or provide another one with the MetricServiceSettings, this would simplify working around future issues.