timtebeek closed this issue 1 year ago.
@timtebeek Thanks for the issue!
ElasticMeterRegistry is a StepMeterRegistry, which uses StepCounter for Counter, whereas SimpleMeterRegistry, which is used when there's no backend in Spring Boot, uses CumulativeCounter. That's the reason you see the behavioral difference. StepCounter reports the number of events in the last complete interval, as mentioned in its Javadoc, so if you check the value after its step, you can see what you expected.
I updated your sample to demonstrate what I said here.
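The step-versus-cumulative distinction described above can be illustrated with a stdlib-only sketch. Note these are NOT Micrometer's real classes (the actual StepCounter implementation differs); the class names here are invented purely for illustration:

```java
// Minimal sketch (NOT Micrometer's real classes) contrasting cumulative
// and step-based counters. A step counter only reports the total from the
// last *completed* interval, so increments made in the currently open
// interval read as 0 until the step rolls over.
class CumulativeCounterSketch {
    private long count;
    void increment() { count++; }
    long value() { return count; }
}

class StepCounterSketch {
    private final long stepMillis;
    private long currentStepStart;
    private long current;   // accumulating in the open interval
    private long previous;  // what value() reports

    StepCounterSketch(long stepMillis, long nowMillis) {
        this.stepMillis = stepMillis;
        this.currentStepStart = nowMillis;
    }

    void increment(long nowMillis) { roll(nowMillis); current++; }

    long value(long nowMillis) { roll(nowMillis); return previous; }

    // When the step has elapsed, the open interval becomes the reported one.
    private void roll(long nowMillis) {
        if (nowMillis - currentStepStart >= stepMillis) {
            previous = current;
            current = 0;
            currentStepStart = nowMillis;
        }
    }
}

public class StepVsCumulativeDemo {
    public static void main(String[] args) {
        CumulativeCounterSketch cumulative = new CumulativeCounterSketch();
        cumulative.increment();
        System.out.println("cumulative now: " + cumulative.value()); // 1 immediately

        StepCounterSketch step = new StepCounterSketch(60_000, 0);
        step.increment(10);
        System.out.println("step, same interval: " + step.value(20));     // 0: interval not complete
        System.out.println("step, next interval: " + step.value(60_000)); // 1: previous interval's total
    }
}
```

This is why a process that increments a counter and immediately exits can publish a 0: the increment is sitting in an interval that never completes.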
Thanks for the quick response @izeye! I had no idea StepMeterRegistry and StepCounter were involved in producing this effect. It affects us most for short-lived processes/jobs that increment a counter at the end of their run, before a last-gasp push of metrics to Elastic. In these cases the behaviour we're seeing is that the counters are created with value 0, but aren't actually incremented before the final push and the process dies.
Are you suggesting we add a step sized (~1 minute) sleep to the end of our jobs to ensure the incremented counter values are pushed?
Or is there another way to force pushing all "current" values as a program is shutting down, rather than wait for a step duration?
@timtebeek The step-sized sleep was just for the demonstration, but I couldn't think of any way to push a value in an incomplete step with the current implementation except waiting for the completion of the step.
Can confirm adding the one minute sleep at the end of our jobs resolved the immediate issue we have with counters not showing up incremented, so thanks for that @izeye !
Now you've indicated that there's presently no way to push a value in an incomplete step, even when an application is shutting down and metrics are pushed as a result of that shutdown hook. Would it be worthwhile exploring such a use-case and keep this ticket open? Or do you suggest we close this one?
@timtebeek Thanks for the confirmation!
I think it would be nice if we could push a value in an incomplete step rather than waiting for a step to be completed in the case of shutdown, although I don't know how we could achieve that at the moment.
@shakuzen Thoughts?
Perfect thanks; I've (hopefully) updated the title to reflect this change in scope (feel free to refine as needed)
As a user of the micrometer for jobs it would be great if on application shutdown any values in incomplete metric steps are still pushed to Elastic or whatever else is configured.
I believe we have a previous ticket along these same lines - it came up for Influx as the backend if I remember correctly, but same use case. I do think we should have a better solution for this. I'll link the issues and add it to an upcoming release once I get done with the current release crunch.
Awesome, thanks for considering such a change!
Would really appreciate this one.
We run a big-data processing platform with a large number of processes, both long and short lived (where short-lived is even <10s, and long-lived even several hours, and we can't easily assume upfront whether a process will be short or long living), and we essentially have the same issue here - metrics from short-lived processes are not being published, therefore we really only have a subset of our processes being measured.
Putting a thread sleep call at the end is not ideal for us: due to the high number of short-lived processes we run, this might raise our costs significantly. I can also see that there is a possibility of reducing the step size, but then I'm concerned about the potential performance impact on long-running processes.
@shakuzen I'm wondering if you could recommend a preferred approach here. Perhaps we can modify the step at runtime? I'm not sure however if it's realistically achievable. I'd really appreciate your feedback on this.
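As a side note on shortening the step: for a standalone (non-Spring) setup, the step can be overridden through the registry config. The following is a hedged sketch (a fragment to be placed inside a setup method, not a complete program) using Micrometer's ElasticConfig; whether a shorter step is acceptable cost-wise depends on the backend:

```java
import io.micrometer.core.instrument.Clock;
import io.micrometer.elastic.ElasticConfig;
import io.micrometer.elastic.ElasticMeterRegistry;
import java.time.Duration;

// Sketch: shorten the publish interval so short-lived processes are more
// likely to complete at least one step before exiting. Trade-off: more
// frequent requests to the backend.
ElasticConfig config = new ElasticConfig() {
    @Override
    public String get(String key) {
        return null; // accept defaults for everything else
    }

    @Override
    public Duration step() {
        return Duration.ofSeconds(10); // default is 1 minute
    }
};
ElasticMeterRegistry registry = new ElasticMeterRegistry(config, Clock.SYSTEM);
```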
Dear @shakuzen
Any news about this feature? I think that Timers are in the same situation.
Hi @shakuzen
I have a similar use case, but with the Influx meter registry. During a graceful shutdown it doesn't report metrics due to the incomplete step. I'm really interested in sending metrics for the short-lived process and restarted service use cases.
Is there any estimated date for this feature?
Regards
Please review this PR https://github.com/micrometer-metrics/micrometer/pull/2519
I've made a note for our 1.8 planning to holistically look at our support for metrics in short-lived processes.
Thanks @shakuzen, is it possible to include this support in 1.3.x too? It would be very useful for me to have this feature. Regards
This seems to be a bigger issue in Google Cloud Run. We are not given any guarantees when an instance will shutdown, and Google only gives you 10 seconds to execute your shutdown hooks. So if the step interval does not overlap with that 10 second window, those metrics are dropped.
This is a real problem for us because it seems like Cloud Run instances like to shut down much more frequently if requests are failing. So we are not receiving the right number of request failure metrics, and thus not triggering our alerts on internal production errors. This should be a very high priority issue.
@ajs1998, I think if you can get the instance of the CompositeMeterRegistry and then call close on it within your shutdown hook, you should be ok. I ran into this with Micronaut/Micrometer and it turned out the framework wasn't calling close.
@ajs1998, I think if you can get the instance of the CompositeMeterRegistry and then call close on it within your shutdown hook, you should be ok.
I have tried this, but it does not seem to help. I haven't done enough debugging to figure out why, though. If I run the service locally, everything works fine, but if it's run in Cloud Run, the reported metrics are very flaky.
The only reliable way I got it working in AWS Lambda, prior to fixing the bug in Micronaut, was to publish metrics around every request that was handled.
Hi, I've found a related problem with this issue.
I have a batch job that only increments a counter and then exits. This reproduces the existing issue here.
If I make the batch wait for 1 minute after execution, I get duplicated metrics in the same minute, because the first scheduled publish and the publishSafely() in close() in PushMeterRegistry both send the metrics in that minute. (Full PoC)
This can also be fixed by the PR #2519 that @garridobarrera sent, if I understand correctly. About that fix: it requires the implementations (like New Relic, etc.) using PushMeterRegistry / StepMeterRegistry to implement publishGracefulShutdown().
Is this approach to the fix already confirmed? I haven't seen the PR merged.
I have another experimental fix that adds a wait on shutdown until the next metric publishing. This will make batches run longer (0 ~ stepMillis), but it won't require implementations to change their code.
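The "wait on shutdown for the next publish" idea can be sketched in plain Java. millisUntilNextStep is a hypothetical helper, not Micrometer API; it just shows the arithmetic for how long a batch would need to wait for its open step interval to complete, assuming step windows are aligned to wall-clock multiples of the step size:

```java
// Stdlib-only sketch of the "wait for the next step boundary on shutdown"
// idea. Given the wall clock and the step size, the remaining time in the
// current step is what a batch would have to sleep so that its increments
// land in a completed interval before the final publish.
public class StepBoundaryWait {

    /** Milliseconds until the current step interval completes. */
    static long millisUntilNextStep(long nowMillis, long stepMillis) {
        long intoStep = nowMillis % stepMillis;
        return intoStep == 0 ? 0 : stepMillis - intoStep;
    }

    public static void main(String[] args) {
        long step = 60_000; // 1-minute step
        System.out.println("on boundary: " + millisUntilNextStep(0, step));            // 0
        System.out.println("1s before boundary: " + millisUntilNextStep(59_000, step)); // 1000
        System.out.println("1.5s after boundary: " + millisUntilNextStep(61_500, step)); // 58500
    }
}
```

This also illustrates the worst case mentioned above: a process that exits just after a step boundary would have to wait nearly a full step before its values are publishable.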
+1 on this issue. I have tried to diagnose this issue and have created a small example for this scenario here: https://github.com/pirgeo/micrometer-counter-exporter-example
It feels quite counter-intuitive to me that the values from the previous export interval are exported a second time.
Is this something you would expect to be fixed in the respective registries, or are you considering a change to the underlying Counter instrument?
@shakuzen This issue has been around for about 2 years now, and it really hurts short-lived batches, where the final data is never reported, more than long-lived applications. I have added a PR which addresses this issue.
@lenin-jaganathan I created a PoC to solve this issue but some more exploratory work is needed, will look into it more soon.
Voting for this as well. This seems like a major flaw: you have to sleep at the end of each process or you will lose the last batch of meter values. This really hurts things like Spark, where you are launching short-lived processes to do some work and then terminating.
Is this being considered for 1.11, or do you think this is further out?
Is this being considered for 1.11, or do you think this is further out?
It's a high priority. We're planning on getting it into the 1.11 release. I've updated the milestone.
Hey, I have a very similar problem when recording data via DistributionSummary and publishing using ElasticMeterRegistry with a 1-minute step. So while fixing StepCounter, please take a look at StepDistributionSummary too.
PS. I have created a little semi-workaround by adding my own forcePublish method, overriding the super publish method and commenting out its body. Then during every job invocation I call the forcePublish method (code below). It works like a charm, except for the first and second job invocations. During the first job invocation and publish, the Counter is {count=0} and the DistributionSummary is {count=0, max=(correct value), mean=0, sum=0}. Then the second job invocation publishes the correct data from the current and previous invocations. Subsequent job invocations work and publish data correctly.
Note: this listing is not 100% real; it's just to show the idea.
public class MicrometerElasticRegistry extends ElasticMeterRegistry {

    public MicrometerElasticRegistry() {
        super(ElasticConfig.DEFAULT, Clock.SYSTEM);
    }

    public void forcePublish() {
        log.info("FORCE PUBLISH");
        super.publish();
    }

    @Override
    protected void publish() {
        // super.publish(); // disabled: publishing is triggered manually via forcePublish()
    }
}

public class TestApplication {

    public static void main(String[] args) {
        MicrometerElasticRegistry registry = new MicrometerElasticRegistry();
        Counter testCounter = Counter.builder("testCounter").register(registry);
        DistributionSummary testDist = DistributionSummary.builder("testDist").register(registry);
        for (int i = 0; i < 10; i++) {
            double value = new Random().nextDouble();
            invokeJob(registry, testCounter, testDist, value);
        }
    }

    public static void invokeJob(MicrometerElasticRegistry registry, Counter testCounter,
            DistributionSummary testDist, double value) {
        testCounter.increment(value);
        testDist.record(value);
        registry.forcePublish();
    }
}
We've committed a fix for this that will go out shortly in the 1.11.0-M2 milestone release. We would really appreciate if folks affected by this could try out that milestone and give any feedback on this. Let us know if this is not working in some cases. We can consider backporting this fix to maintenance releases, but we want to gauge how much that is needed and have testing done by users on this in the milestone version first.
Hey, thanks for the fix. It is better, but not ideal.
In the context of AWS Lambda, metrics are still not published on Lambda invocation shutdown.
But with this version I successfully created a workaround, and now it's working correctly.
As in my previous post, I have a forcePublish method which calls super.publish() in my registry class:
public class MicrometerElasticRegistry extends ElasticMeterRegistry {

    public MicrometerElasticRegistry() {
        super(ElasticConfig.DEFAULT, Clock.SYSTEM);
    }

    public void forcePublish() {
        log.info("FORCE PUBLISH");
        super.publish();
    }
}
MicrometerElasticRegistry myRegistry = new MicrometerElasticRegistry();
Counter counter = Counter.builder("counterName").register(myRegistry);
counter.increment(100);
if (LambdaUtil.isColdStart()) myRegistry.close(); else myRegistry.forcePublish();
So, as you can see, when it is a Lambda cold start I call close() on myRegistry, and when it is a Lambda warm start I call forcePublish() on myRegistry. If I call myRegistry.forcePublish() during a Lambda cold start, metrics are sent with zeros; similarly, zero metrics are sent if I call myRegistry.close() during a Lambda warm start.
This workaround is NOT working on 1.10.4 or 1.10.5 -> Lambda cold starts send metrics with 0 values (the same as I described in my previous post).
Hope that gives you at least a track to where the problem is.
Feel free to ask for more info ;)
What do you mean by "Lambda invocation shutdown": publishing metrics after each invocation, or publishing them during JVM shutdown?
What is the context of the second code block ("lambda flow")? Is it running for every request, or is it running once for the lifetime of that JVM (e.g. a static initializer)?
Where is LambdaUtil coming from? Is that something you wrote?
I'm not sure what the difference is, in terms of shipping metrics, between the JVM receiving the first request (cold start) and the Nth ("warm") one.
Could you please try initializing your registry in a static initializer and also register a shutdown hook, Runtime.getRuntime().addShutdownHook(...), for closing it? Here's an example: https://github.com/aws-samples/graceful-shutdown-with-aws-lambda/tree/main/java-demo
I've changed/extended my code a little for better understanding.
MicrometerElasticRegistry class:
public class MicrometerElasticRegistry extends ElasticMeterRegistry {

    public MicrometerElasticRegistry() {
        super(ElasticConfig.DEFAULT, Clock.SYSTEM);
    }

    public void forcePublish() {
        super.publish();
    }

    public void incrementCounter(String metricsName, long count) {
        getCounter(metricsName).increment(count);
    }

    private Counter getCounter(String metricsName) {
        try {
            return this.get(metricsName).counter();
        } catch (MeterNotFoundException e) {
            return Counter.builder(metricsName).register(this);
        }
    }
}
Lambda main handler class (ver. 1):
@Log4j2
public class MetricsLambdaConsumer implements RequestHandler<KinesisEvent, Void> {

    private static final MicrometerElasticRegistry metricsRegistry = new MicrometerElasticRegistry();
    private static boolean coldStart = true;

    @Override
    public Void handleRequest(KinesisEvent event, Context context) {
        try {
            metricsRegistry.incrementCounter("test_counter", new Random().nextInt(100) + 1);
        } catch (Exception ex) {
            log.error("Exception", ex);
        } finally {
            coldStart = false;
            metricsRegistry.close();
        }
        return null;
    }
}
With that code (ver. 1), Micrometer sends the correct value of the counter 'test_counter' only during a cold start. On every Nth warm start, Micrometer sends 0. I don't know why, but I think it's strongly connected with the static initialization of MicrometerElasticRegistry, which is done only on a cold start. The shutdown hook is not working for me, so I've decided to close the registry in the finally block, which should work more or less similarly.
If I apply my previous workaround to that code, it looks like this:
Lambda main handler class (ver. 2 - with workaround):
@Log4j2
public class MetricsLambdaConsumer implements RequestHandler<KinesisEvent, Void> {

    private static final MicrometerElasticRegistry metricsRegistry = new MicrometerElasticRegistry();
    private static boolean coldStart = true;

    @Override
    public Void handleRequest(KinesisEvent event, Context context) {
        try {
            metricsRegistry.incrementCounter("test.counter", new Random().nextInt(100) + 1);
        } catch (Exception ex) {
            log.error("Exception", ex);
        } finally {
            if (coldStart) metricsRegistry.close();
            else metricsRegistry.forcePublish();
            coldStart = false;
        }
        return null;
    }
}
And this code (ver. 2) makes Micrometer send the correct values on both cold and warm starts.
With that code (ver. 1), Micrometer sends the correct value of the counter 'test_counter' only during a cold start. On every Nth warm start, Micrometer sends 0. I don't know why, but I think it's strongly connected with the static initialization
I'm still not getting what the difference should be for your function between receiving the first request and receiving the second. It should not matter how many requests your function has processed. Also, you should not call publish for each request, and you definitely should not call close. After you call close, the registry is useless, and since you have it in a static block you are keeping that registry for the whole JVM lifespan. So every time AWS starts a JVM, it will be ok only for the first request and then should not work for any subsequent request. :) (AWS does not start a new JVM for every request.)
Shutdown hook is not working for me so I've decided to close registry in finally block which should work more/less similar.
Could you please elaborate on what is not working? Calling close in finally is not similar at all; see my comment above.
And this code (ver. 2) will make Micrometer to send correct values on cold and warm starts.
This is still publishing for every request, and I think you should not do that. Could you please try something like this:
@Log4j2
public class Consumer implements RequestHandler<KinesisEvent, Void> {

    // You might want to configure a more frequent schedule for publishing to test things;
    // by default it is 1m but you can set it to 30s, 10s, it's up to you.
    private static final MeterRegistry registry = new ElasticMeterRegistry(...);

    static {
        Runtime.getRuntime().addShutdownHook(new Thread(Consumer::shutdown));
    }

    // Must be static so that Consumer::shutdown is valid as a Runnable here.
    private static void shutdown() {
        log.info("Shutting down...");
        registry.close();
    }

    @Override
    public Void handleRequest(KinesisEvent event, Context context) {
        log.info("Handling a request...");
        // Maybe you can add the PID as a tag for testing purposes,
        // so that you can see how many lambdas are receiving traffic and publishing data.
        // Please do not do this in prod, since it can have high enough cardinality
        // to cause issues for you.
        registry.counter("test.counter").increment();
        return null;
    }
}
I've copy-pasted your code with the default 1m step and deployed it on my AWS Lambda. BTW, my Lambda is triggered by a Kinesis event every 60 seconds using the Lambda batch window feature.
Could you please explain to me what you mean by "during cold start" and "during warm start"? It confuses me a lot. This is what I think:
Every minute, though, Micrometer should emit metrics. Your screenshot above proves this.
If you see a data point that is 0, it means that in that minute your function did not receive any requests. Maybe a request never arrived (preceded and followed by 1 in your case), or it arrived before (or after) Micrometer published metrics, so it was accounted for in the previous (or next) publish (preceded or followed by 2 in your case). You can cross-reference this with the "Handling a request..." log entries.
After deploying your code several times, I don't have "Shutting down..." in my logs, so I can assume shutdown() is not called.
This either means that your lambda never shuts down or for some reason the shutdown hook is not called. I think your lambda never shuts down if it gets a request every minute. I would start investigating this first, I think this should simply tell you when a cold start happened:
static {
    log.info("Cold start!");
    Runtime.getRuntime().addShutdownHook(new Thread(Consumer::shutdown));
}
If your lambda does not shut down, you should see one "Cold start!" entry (after deployment) and that's it. Also, you should see "Shutting down..." when you deploy a new version. If not, please contact AWS support, since the code above is from their official example.
Actually, that's what I want to achieve. My Lambda is triggered every 60 seconds by Kinesis, and I want to send data from that Kinesis event to Elastic on every Lambda invocation.
Ok, now I get it. I think the misunderstanding lies in the behavior of StepMeterRegistry. StepMeterRegistry (ElasticMeterRegistry) publishes data with a certain frequency: not when you want, but when the registry schedules it. You can set the step property in its config to modify this. Since you called close, you killed the scheduler, so the registry did not send more data on its own and you needed to publish manually. I would not recommend doing this. If you want, you can close the registry right after you create it in the static initializer so that you don't need to handle any cold/warm cases, but I'm not sure I can tell the consequences of doing this. Or you can write your own PushMeterRegistry that publishes to Elastic. Based on this, I think the fix this issue is about has nothing to do with your use case.
I think what is happening in your screenshot is ok (if one request never arrived); could you please verify that the number of "Handling a request..." messages matches what Micrometer published to Elastic?
OK, everything is clear now. It was just my misunderstanding of this issue.
I've created my custom PushMeterRegistry and now it works like a charm without any workarounds.
Thanks for your help and time.
@shakuzen / @jonatan-ivanov Now that the issues around this fix are completed, can this be considered a candidate for a 1.9.x patch?
Can this be considered a candidate for 1.9.x patch?
Eventually, I think so. Recent follow-up fixes related to this are only available in snapshots so far, and we haven't received much feedback that people have tried this and verified it is working well for them and not causing any other issues. 1.11.0-RC1 will be released on Monday. Hopefully that will be an opportunity to get more people to try it out. After that, we can consider the risk of backporting, and I'm cautiously optimistic we can do it. We haven't got to a GA release with the current solution yet, so there may be some more tinkering, and I would rather do that in milestone/RC releases than in GA patch releases.
Edit: opened #3759 for backporting
@ppawlowski15 @jonatan-ivanov I am running into a similar issue while publishing metrics from Lambda. If I set the "cloudwatch.step" value to 1 second and my lambda finishes before 1 second, I am not seeing any metrics. Shutdown hooks do not work for AWS Lambda. From your discussions, I am not clear on how I can implement a custom PushMeterRegistry, as CloudWatchRegistry extends StepRegistry. Any examples or pointers would be helpful. Thanks!
@supriya-albal-polestar could you please open a new issue with details of what you're seeing? As far as I remember, shutdown hooks do work on AWS Lambda for Java, but the process will not be shut down after each invocation, and it may take some time after the last invocation before the process is shut down.
Thanks @shakuzen. I have raised https://github.com/micrometer-metrics/micrometer/issues/4060. I tried waiting for quite a long time and still was not able to see metrics after adding a shutdown hook.
Hi! Had posted this in Slack, but thought to report it here as well pending response there
In short: I have a Spring Boot 2.2.5 application with spring-boot-starter-actuator and micrometer-registry-elastic; when I set management.metrics.export.elastic.enabled=true, my counters do not increment anymore.
Not sure if this should be a Micrometer issue or related to Spring Boot; ElasticMeterRegistry is in Micrometer whereas ElasticMetricsExportAutoConfiguration is in Spring Boot. But since the latter seems to mostly delegate and is rather minimal otherwise, I thought to report it here first.
To replicate the problem:
Create a new project with: https://start.spring.io/#!type=maven-project&language=java&platformVersion=2.2.5.RELEASE&packaging=jar&jvmVersion=11&groupId=com.example&artifactId=demo&name=demo&description=Demo%20project%20for%20Spring%20Boot&packageName=com.example.demo&dependencies=actuator
Add a dependency on micrometer-registry-elastic
Create a test class containing
If you then run both tests you will get the log output that's listed in each test method.
I would expect the Counter to increment regardless of whether metrics are exported; what can be done so that the counters increase?