micrometer-metrics / micrometer

An application observability facade for the most popular observability tools. Think SLF4J, but for observability.
https://micrometer.io
Apache License 2.0
4.46k stars 989 forks

Publish partial step values on application shutdown for step meters #1882

Closed timtebeek closed 1 year ago

timtebeek commented 4 years ago

Hi! I had posted this in Slack, but thought I'd report it here as well, pending a response there.

In short: I have a Spring Boot 2.2.5 application with spring-boot-starter-actuator and micrometer-registry-elastic; when I set management.metrics.export.elastic.enabled=true, then my counters do not increment anymore.

Not sure if this should be a micrometer issue or related to Spring Boot; ElasticMeterRegistry is in micrometer whereas ElasticMetricsExportAutoConfiguration is in Spring Boot. But since the latter seems to mostly delegate and is rather minimal otherwise, I thought to report it here first.

To replicate the problem:

  1. Create a new project with: https://start.spring.io/#!type=maven-project&language=java&platformVersion=2.2.5.RELEASE&packaging=jar&jvmVersion=11&groupId=com.example&artifactId=demo&name=demo&description=Demo%20project%20for%20Spring%20Boot&packageName=com.example.demo&dependencies=actuator

  2. Add a dependency on micrometer-registry-elastic

    <dependency>
        <groupId>io.micrometer</groupId>
        <artifactId>micrometer-registry-elastic</artifactId>
    </dependency>
  3. Create a test class containing

@Test
void counterStaysAtZero() {
    DebugApplication.main("", "--management.metrics.export.elastic.enabled=true");
    // DebugApplication : Counter: 0.0 , [Measurement{statistic='COUNT', value=0.0}]
    // DebugApplication : Counter: 0.0 , [Measurement{statistic='COUNT', value=0.0}]
}
@Test
void counterAddsToOne() {
    DebugApplication.main("", "--management.metrics.export.elastic.enabled=false");
    // DebugApplication : Counter: 0.0 , [Measurement{statistic='COUNT', value=0.0}]
    // DebugApplication : Counter: 1.0 , [Measurement{statistic='COUNT', value=1.0}]
}
@SpringBootApplication
@RequiredArgsConstructor
@Slf4j
public static class DebugApplication implements ApplicationRunner {
    public static void main(String... args) {
        log.info("{}", args[1]);
        SpringApplication.run(DebugApplication.class, args).close();
    }
    private final MeterRegistry registry;
    @Override
    public void run(ApplicationArguments args) throws Exception {
        Counter counter = registry.counter("someCounter");
        log.info("Counter: {} , {}", counter.count(), counter.measure());
        counter.increment();
        log.info("Counter: {} , {}", counter.count(), counter.measure());
    }
}

If you then run both tests you will get the log output that's listed in each test method.

I would expect the Counter to increment regardless of whether metrics are exported. What can be done to make the counters increase?

izeye commented 4 years ago

@timtebeek Thanks for the issue!

ElasticMeterRegistry is a StepMeterRegistry which uses StepCounter for Counter whereas SimpleMeterRegistry, which is used when there's no backend in Spring Boot, uses CumulativeCounter. That's the reason why you see the behavioral difference. StepCounter reports the number of events in the last complete interval as mentioned in its Javadoc, so if you check the value after its step, you can see what you expected.

I updated your sample to demonstrate what I said here.
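
The step semantics described above can be sketched with a toy model (plain Java; `StepWindow` is a hypothetical class for illustration, not the real Micrometer implementation): increments land in the in-progress interval, while `count()` only ever reports the total of the last completed interval, so a read right after an increment still shows 0.

```java
// Toy illustration of step-counter semantics; not the real Micrometer classes.
public class StepVsCumulative {

    /** Buffers increments; count() reports only the last COMPLETE interval. */
    static class StepWindow {
        private final long stepMillis;
        private long windowStart;
        private double current;   // increments in the in-progress interval
        private double previous;  // total from the last completed interval

        StepWindow(long stepMillis, long now) {
            this.stepMillis = stepMillis;
            this.windowStart = now;
        }

        void increment(long now) {
            roll(now);
            current++;
        }

        double count(long now) {
            roll(now);
            return previous;
        }

        // Close out the interval if the step boundary has been crossed.
        private void roll(long now) {
            if (now - windowStart >= stepMillis) {
                previous = current;
                current = 0;
                windowStart = now;
            }
        }
    }

    public static void main(String[] args) {
        StepWindow step = new StepWindow(60_000, 0);
        step.increment(0);
        // Inside the first interval the step counter still reports 0 ...
        System.out.println(step.count(10_000));  // 0.0
        // ... but once the interval completes, the increment shows up.
        System.out.println(step.count(60_000));  // 1.0
    }
}
```

A cumulative counter, by contrast, would return the running total immediately, which is why SimpleMeterRegistry behaves differently in the tests above.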

timtebeek commented 4 years ago

Thanks for the quick response @izeye ! I had no idea StepMeterRegistry and StepCounter were involved in producing this effect. It affects us most for short-lived processes/jobs that increment a counter at the end of their run, before a last-gasp push of metrics to Elastic. In these cases the behaviour we're seeing is that the counters are created with value 0, but aren't actually incremented before the final push and the process dies.

Are you suggesting we add a step-sized (~1 minute) sleep at the end of our jobs to ensure the incremented counter values are pushed?

Or is there another way to force pushing all "current" values as a program is shutting down, rather than wait for a step duration?

izeye commented 4 years ago

@timtebeek The step-sized sleep was just for demonstration, but I couldn't think of any way to push a value in an incomplete step with the current implementation, except waiting for the completion of the step.

timtebeek commented 4 years ago

Can confirm that adding the one-minute sleep at the end of our jobs resolved the immediate issue we had with counters not showing up incremented, so thanks for that @izeye !

Now you've indicated that there's presently no way to push a value in an incomplete step, even when an application is shutting down and metrics are pushed as a result of that shutdown hook. Would it be worthwhile exploring such a use case and keeping this ticket open? Or do you suggest we close this one?

izeye commented 4 years ago

@timtebeek Thanks for the confirmation!

I think it would be nice if we could push a value in an incomplete step rather than waiting for a step to be completed in case of shutdown, although I don't know how we could achieve that at the moment.

@shakuzen Thoughts?

timtebeek commented 4 years ago

Perfect, thanks; I've (hopefully) updated the title to reflect this change in scope (feel free to refine as needed).

As a user of Micrometer for jobs, it would be great if, on application shutdown, any values in incomplete metric steps were still pushed to Elastic or whatever else is configured.

shakuzen commented 4 years ago

I believe we have a previous ticket along these same lines - it came up with Influx as the backend if I remember correctly, but it's the same use case. I do think we should have a better solution for this. I'll link the issues and add it to an upcoming release once I'm done with the current release crunch.

timtebeek commented 4 years ago

Awesome, thanks for considering such a change!

pnaw94 commented 4 years ago

Would really appreciate this one.

We run a big-data processing platform with a large number of processes, both long- and short-lived (where short-lived can mean under 10s and long-lived several hours, and we can't easily assume upfront whether a process will be short- or long-lived), and we essentially have the same issue here - metrics from short-lived processes are not being published, so we really only have a subset of our processes being measured.

Putting in a thread sleep call is not ideal for us: due to the high number of short-lived processes we run, this might raise our costs significantly. I can also see that there is a possibility of reducing the step size, but then I'm concerned about the potential performance impact on long-running processes.

@shakuzen I'm wondering if you could recommend a preferred approach here. Perhaps we could modify the step at runtime? I'm not sure, however, if that's realistically achievable. I'd really appreciate your feedback on this.
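
For reference, the step size mentioned above is configurable. In a Spring Boot 2.x application with the Elastic registry it could be reduced like this (a sketch; `10s` is an arbitrary example value, and a shorter step means more frequent publishing and thus more overhead for long-running processes):

```properties
# Publish to Elastic every 10 seconds instead of the default 1 minute.
# (Spring Boot 2.x property naming; shorter steps increase publish
# frequency and therefore overhead for long-running processes.)
management.metrics.export.elastic.step=10s
```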

admoca60 commented 3 years ago

Dear @shakuzen

Any news about this feature? I think Timers are in the same situation.

garridobarrera commented 3 years ago

Hi @shakuzen
I have a similar use case, but with the Influx meter registry. During a graceful shutdown it doesn't report metrics because of the incomplete step. I'm really interested in sending metrics for short-lived processes and restarted-service use cases. Is there any estimated date for this feature?

Regards

garridobarrera commented 3 years ago

Please review this PR https://github.com/micrometer-metrics/micrometer/pull/2519

shakuzen commented 3 years ago

I've made a note for our 1.8 planning to holistically look at our support for metrics in short-lived processes.

garridobarrera commented 3 years ago

Thanks @shakuzen , is it possible to include this support in 1.3.x too? It would be very useful for me to have this feature. Regards

ajs1998 commented 2 years ago

This seems to be a bigger issue in Google Cloud Run. We are not given any guarantees about when an instance will shut down, and Google only gives you 10 seconds to execute your shutdown hooks. So if the step interval does not overlap with that 10-second window, those metrics are dropped.

This is a real problem for us because it seems like Cloud Run instances like to shut down much more frequently if requests are failing. So we are not receiving the right number of request failure metrics, and thus not triggering our alerts on internal production errors. This should be a very high priority issue.

tburch commented 2 years ago

@ajs1998, I think if you can get the instance of the CompositeMeterRegistry and then call close on it within your shutdown hook, you should be ok. I ran into this with Micronaut/Micrometer and it turned out the framework wasn't calling close.

ajs1998 commented 2 years ago

@ajs1998, I think if you can get the instance of the CompositeMeterRegistry and then call close on it within your shutdown hook, you should be ok.

I have tried this but it does not seem to help. I haven't done enough debugging to figure out why, though. If I run the service locally, everything works fine, but if it's run in Cloud Run, the reported metrics are very flaky.

tburch commented 2 years ago

The only reliable way I got it working in AWS Lambda, prior to fixing the bug in Micronaut, was to publish metrics around every request that was handled.

joshua-paypay commented 2 years ago

Hi, I've run into a related problem while looking at this issue.

I have a batch job that only increments a counter and then exits. This reproduces the existing issue here. If I make the batch wait for 1 minute after execution, I get duplicated metrics in the same minute, because the first scheduled publish and the publishSafely() in close() in PushMeterRegistry both send the metrics for that minute. (Full PoC

This can also be fixed by the PR #2519 @garridobarrera sent, if I understand correctly. About the fix there: it requires the implementations (like New Relic etc.) using PushMeterRegistry / StepMeterRegistry to implement publishGracefulShutdown().

Has this fix approach already been confirmed? I haven't seen the PR merged.

I have another experimental fix that adds a wait on shutdown for the next metric publish. This makes batches run longer (0 ~ stepMillis), but it won't require implementations to change their code.
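
The duplicate described above can be sketched with a toy timeline (plain Java; `ToyStepRegistry` is a hypothetical name for illustration, not the real Micrometer internals): a scheduled publish and a naive publish-on-close both ship the value of the same completed step.

```java
import java.util.ArrayList;
import java.util.List;

// Toy reproduction of the double-publish problem: a scheduled publish and a
// publish-on-close both report the same completed step. Hypothetical names.
public class DoublePublishSketch {

    static class ToyStepRegistry {
        final List<Double> published = new ArrayList<>();
        double lastCompletedStep;

        // Runs on the step schedule (e.g. once a minute).
        void scheduledPublish() {
            published.add(lastCompletedStep);
        }

        // Naive close(): publishes again without checking whether the current
        // step's value was already shipped by the scheduler.
        void close() {
            published.add(lastCompletedStep);
        }
    }

    public static void main(String[] args) {
        ToyStepRegistry registry = new ToyStepRegistry();
        registry.lastCompletedStep = 1.0; // batch incremented once, then slept a full step
        registry.scheduledPublish();      // scheduled publish ships the value...
        registry.close();                 // ...and close ships it a second time
        System.out.println(registry.published); // [1.0, 1.0]
    }
}
```

A fix along the lines discussed in the thread would need close() to track (or skip) intervals the scheduler has already published.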

pirgeo commented 2 years ago

+1 on this issue. I have tried to diagnose this issue and have created a small example for this scenario here: https://github.com/pirgeo/micrometer-counter-exporter-example

It feels quite counter-intuitive to me that the values from the previous export interval are exported a second time.

Is this something you would expect to be fixed in the respective Registries, or are you considering a change to the underlying Counter instrument?

lenin-jaganathan commented 1 year ago

@shakuzen This issue has been around for about 2 years, and it really hurts short-lived batches, where the data is never reported, more than long-lived applications. I have added a PR which addresses this issue.

jonatan-ivanov commented 1 year ago

@lenin-jaganathan I created a PoC to solve this issue, but some more exploratory work is needed; I will look into it more soon.

mbbeaubi commented 1 year ago

Voting for this as well. This seems like a major flaw: you have to sleep at the end of each process termination or you will lose the last batch of meter values. This really kills things like Spark, where you are launching short-lived processes to do some work and then terminating.

pirgeo commented 1 year ago

Is this being considered for 1.11 or do you think this is further out?

shakuzen commented 1 year ago

Is this being considered for 1.11 or do you think this is further out?

It's a high priority. We're planning on getting it into the 1.11 release. I've updated the milestone.

ppawlowski15 commented 1 year ago

Hey, I have a very similar problem recording data with a DistributionSummary and publishing using ElasticMeterRegistry with a 1-minute step. So while fixing StepCounter, please take a look at StepDistributionSummary too.

PS. I have created a little semi-workaround by adding my own forcePublish method, overriding the super publish method, and commenting out its body. Then on every job invocation I call forcePublish (code below). It works like a charm, except for the first and second job invocations. During the first job invocation and publish, Counter is {count=0} and DistributionSummary is {count=0, max=(correct value), mean=0, sum=0}. Then the second job invocation publishes correct data from the current and previous invocations. Subsequent job invocations publish data correctly.

Note: this listing is not 100% real; it's just to show the idea.

@Slf4j
public class MicrometerElasticRegistry extends ElasticMeterRegistry {
    public MicrometerElasticRegistry() {
        super(ElasticConfig.DEFAULT, Clock.SYSTEM);
    }

    public void forcePublish() {
        log.info("FORCE PUBLISH");
        super.publish();
    }

    @Override
    protected void publish() {
//        super.publish();
    }

}

public static class TestApplication {
    public static void main(String[] args) {
        MicrometerElasticRegistry registry = new MicrometerElasticRegistry();
        Counter testCounter = Counter.builder("testCounter").register(registry);
        DistributionSummary testDist = DistributionSummary.builder("testDist").register(registry);
        for (int i = 0; i < 10; i++) {
            double value = new Random().nextDouble();
            invokeJob(registry, testCounter, testDist, value);
        }
    }

    public static void invokeJob(MicrometerElasticRegistry registry, Counter testCounter,
                                 DistributionSummary testDist, double value) {
        testCounter.increment(value);
        testDist.record(value);
        registry.forcePublish();
    }
}

shakuzen commented 1 year ago

We've committed a fix for this that will go out shortly in the 1.11.0-M2 milestone release. We would really appreciate it if folks affected by this could try out that milestone and give any feedback. Let us know if it is not working in some cases. We can consider backporting this fix to maintenance releases, but we want to gauge how much that is needed, and have users test this in the milestone version first.
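
At the time, Micrometer milestones were published to the Spring milestone repository, so trying the release out in a Maven build looked roughly like this (a sketch; repository details are an assumption and may have changed since):

```xml
<!-- Milestone builds are not on Maven Central; add the Spring milestone repo. -->
<repositories>
    <repository>
        <id>spring-milestones</id>
        <url>https://repo.spring.io/milestone</url>
    </repository>
</repositories>

<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-elastic</artifactId>
    <version>1.11.0-M2</version>
</dependency>
```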

ppawlowski15 commented 1 year ago

Hey, thanks for the fix. It is better, but not ideal.

In the context of AWS Lambda, metrics are still not published on Lambda invocation shutdown.

But with this version I've successfully created a workaround, and now it's working correctly.

  1. As in my previous post, I have a forcePublish method which calls super.publish() in my registry class

    public class MicrometerElasticRegistry extends ElasticMeterRegistry {
        public MicrometerElasticRegistry() {
            super(ElasticConfig.DEFAULT, Clock.SYSTEM);
        }

        public void forcePublish() {
            log.info("FORCE PUBLISH");
            super.publish();
        }
    }
  2. Lambda flow
    
    MicrometerElasticRegistry myRegistry = new MicrometerElasticRegistry();
    Counter counter = Counter.builder("counterName").register(myRegistry);
    counter.increment(100);

if (LambdaUtil.isColdStart()) myRegistry.close(); else myRegistry.forcePublish();


So, as you can see, when it is a Lambda cold start I call close() on myRegistry, and when it is a Lambda warm start I call forcePublish() on myRegistry. If I call myRegistry.forcePublish() during a Lambda cold start, metrics are sent with zeros. Similarly, zero metrics are sent if I call myRegistry.close() during a Lambda warm start.
This workaround does NOT work for 1.10.4 or 1.10.5 -> Lambda cold starts send metrics with 0 values (the same as I described in my previous post).

Hope that at least gives you a trail to where the problem is.

Feel free to ask for more info ;)

jonatan-ivanov commented 1 year ago

What do you mean by "Lambda invocation shutdown": publishing metrics after each invocation, or publishing them during JVM shutdown?

What is the context of the second code block ("Lambda flow")? Is it running for every request, or is it running once for the lifetime of that JVM (e.g. in a static initializer)?

Where is LambdaUtil coming from? Is that something you wrote?

I'm not sure what the difference is, in terms of shipping metrics, between the JVM receiving the first request (cold start) and the Nth ("warm").

Could you please try initializing your registry in a static initializer and also registering a shutdown hook Runtime.getRuntime().addShutdownHook(...) for closing it? Here's an example: https://github.com/aws-samples/graceful-shutdown-with-aws-lambda/tree/main/java-demo

ppawlowski15 commented 1 year ago

I've changed/extended my code a little for better understanding.

MicrometerElasticRegistry class:

public class MicrometerElasticRegistry extends ElasticMeterRegistry {

    public MicrometerElasticRegistry() {
      super(ElasticConfig.DEFAULT, Clock.SYSTEM);
    }

    public void forcePublish() {
        super.publish();
    }

    public void incrementCounter(String metricsName, long count) {
        getCounter(metricsName).increment(count);
    }

    private Counter getCounter(String metricsName) {
        try {
            return this.get(metricsName).counter();
        } catch (MeterNotFoundException e) {
            return Counter.builder(metricsName).register(this);
        }
    }
}

Lambda main handler class (ver. 1):

@Log4j2
public class MetricsLambdaConsumer implements RequestHandler<KinesisEvent, Void> {
    private static final MicrometerElasticRegistry metricsRegistry = new MicrometerElasticRegistry();
    private static boolean coldStart = true;

    @Override
    public Void handleRequest(KinesisEvent event, Context context) {
        try {
            metricsRegistry.incrementCounter("test_counter", new Random().nextInt(100) + 1);
        } catch (Exception ex) {
            log.error("Exception", ex);
        } finally {
            coldStart = false;
            metricsRegistry.close();
        }
        return null;
    }
}

With that code (ver. 1) Micrometer sends the correct value of counter 'test_counter' only during cold start. On every Nth warm start Micrometer sends 0. I don't know why, but I think it's strongly connected with the static initialization of MicrometerElasticRegistry, which is done only on cold start. The shutdown hook is not working for me, so I've decided to close the registry in the finally block, which should work more or less similarly.

If I apply my previous workaround to that code, it looks like this:

Lambda main handler class (ver. 2 - with workaround):

@Log4j2
public class MetricsLambdaConsumer implements RequestHandler<KinesisEvent, Void> {
    private static final MicrometerElasticRegistry metricsRegistry = new MicrometerElasticRegistry();
    private static boolean coldStart = true;

    @Override
    public Void handleRequest(KinesisEvent event, Context context) {
        try {
            metricsRegistry.incrementCounter("test.counter", new Random().nextInt(100) + 1);
        } catch (Exception ex) {
            log.error("Exception", ex);
        } finally {
            if (coldStart) metricsRegistry.close();
            else metricsRegistry.forcePublish();

            coldStart = false;
        }
        return null;
    }
}

And this code (ver. 2) makes Micrometer send correct values on both cold and warm starts.

jonatan-ivanov commented 1 year ago

With that code (ver. 1) Micrometer sends the correct value of counter 'test_counter' only during cold start. On every Nth warm start Micrometer sends 0. I don't know why, but I think it's strongly connected with the static initializing

I'm still not getting what the difference should be for your function between receiving the first request and receiving the second. It should not matter how many requests your function has processed. Also, you should not call publish for each request, and you definitely should not call close. After you call close, the registry is useless, and since you have it in a static block, you are keeping that registry for the whole JVM lifespan. So every time AWS starts a JVM it will be ok only for the first request, and then every subsequent request should not work. :) (AWS does not start a new JVM for every request.)

Shutdown hook is not working for me so I've decided to close registry in finally block which should work more/less similar.

Could you please elaborate on what is not working? Calling close in finally is not similar at all; see my comment above.

And this code (ver. 2) will make Micrometer to send correct values on cold and warm starts.

This is still publishing for every request and I think you should not do that. Could you please try something like this:

@Log4j2
public class Consumer implements RequestHandler<KinesisEvent, Void> {
    // You might want to configure a more frequent schedule for publishing to test things,
    // by default it is 1m but you can set it for 30s, 10s, its up to you.
    private static final MeterRegistry registry = new ElasticMeterRegistry(...);

    static {
        Runtime.getRuntime().addShutdownHook(new Thread(Consumer::shutdown));
    }

    private static void shutdown() {
        log.info("Shutting down...");
        registry.close();
    }

    @Override
    public Void handleRequest(KinesisEvent event, Context context) {
        log.info("Handling a request...");
        // Maybe you can add the PID as a tag for testing purposes
        // so that you can see how many lambdas are receiving traffic and publishing data.
        // Please do not do this in prod, since it can have high enough cardinality
        // to cause issues for you.
        registry.counter("test.counter").increment();
        return null;
    }
}

ppawlowski15 commented 1 year ago

I've copy-pasted your code with the default 1m step and deployed it on my AWS Lambda. BTW, my Lambda is triggered by a Kinesis event every 60 seconds, using the Lambda batch window feature.

  1. During cold start, Micrometer doesn't send anything to my Elastic
  2. During warm start, Micrometer sends the counter sometimes correctly, sometimes twice, sometimes nothing at all (screenshot)

I'm still not getting what should be the difference for your function between receiving the first request or receiving the second. It should not matter how many requests your function has processed. Also, you should not call publish for each request and you definitely should not call close. After you call close, the registry is useless and since you have it in a static block you are keeping that registry for the whole JVM lifespan. So every time AWS starts a JVM it will be ok only for the first request and then every subsequent requests it should not work. :) (AWS does not start a new JVM for every request.)

  1. As I said previously, I don't know where the difference is. Maybe there is no difference, but during cold start, when the registry is initialized, Micrometer waits the full step time to send metrics. I've noticed Micrometer calls publish() AT THE VERY BEGINNING during Nth warm starts.
  2. I know I shouldn't call close or publish on my own, but I'm forced to because Micrometer is not sending metrics properly. As I said previously, it's just a workaround; I didn't say it's the correct way.
  3. I understand that calling close() should make the registry useless, but for some reason it doesn't. Although I call close() on cold start (with my workaround ver. 2), Micrometer sends metrics correctly. How is that possible? A bug in Micrometer?

Could you please elaborate what is not working? Calling close in finally is not similar at all, see my comment above.

  1. After deploying your code several times, I don't have "Shutting down..." in my logs - so I can assume shutdown() is not called.

This is still publishing for every request and I think you should not do that.

  1. Actually it's what I want to achieve. My Lambda is triggered every 60 seconds by Kinesis, and I want to send data from that Kinesis event to Elastic on every Lambda invocation. I thought your fix would force Micrometer to detect that the Lambda call is finishing and send the collected metrics to Elastic just before the Lambda call finishes, totally ignoring the Micrometer step.

Could you please try something like this:

  1. It's not so easy for me to explain my results/insights, even in my native language, without showing a real example. Could you create your own AWS Lambda and test it? (You can use EventBridge as a trigger to mock Kinesis.) It would be easier to reproduce and explain my case.

jonatan-ivanov commented 1 year ago

Could you please explain what you mean by "during cold start" and "during warm start"? It confuses me a lot. This is what I think:

  1. To me, "cold start" means that a new JVM process starts, a registry is created (static init), your function processes the request, and the counter is incremented. Micrometer should not send anything anywhere during this process; it should record that the counter was incremented, and that's it.
  2. To me, "warm start" does not make sense: nothing starts when the JVM is already up; your function processes the request and the counter is incremented. Micrometer should not send anything anywhere during this process; it should record that the counter was incremented, and that's it.

Every minute, though, Micrometer should emit metrics. Your screenshot above proves this. If you see a data point that is 0, it means that in that minute your function did not receive any requests. Maybe a request never arrived (preceded and followed by 1 in your case), or it arrived before (or after) Micrometer published metrics, so it will be accounted for in the previous (or next) publish (preceded or followed by 2 in your case). You can cross-reference this with the "Handling a request..." log entries.

After deploy your code several times I don't have "Shutting down..." in my logs - so I can assume shutdown() is not called.

This either means that your Lambda never shuts down, or that for some reason the shutdown hook is not called. I think your Lambda never shuts down if it gets a request every minute. I would start investigating this first; I think this should simply tell you when a cold start happened:

static {
    log.info("Cold start!");
    Runtime.getRuntime().addShutdownHook(new Thread(Consumer::shutdown));
}

If your Lambda does not shut down, you should see one "Cold start!" entry (after deployment) and that's it. Also, you should see "Shutting down..." when you deploy a new version. If not, please contact AWS support, since the code above is from their official example.

Actually it's what I want to achieve. My Lambda is triggered every 60 seconds by Kinesis, and I want to send data from that Kinesis event to Elastic on every Lambda invocation.

Ok, now I get it. I think the misunderstanding lies in the behavior of StepMeterRegistry. A StepMeterRegistry (like ElasticMeterRegistry) publishes data at a certain frequency: not when you want, but when the registry schedules it. You can set the step property in its config to modify this. Since you called close, you killed the scheduler, so the registry did not send more data on its own; you needed to publish manually. I would not recommend doing this. You could close the registry right after you create it in the static initializer so that you don't need to handle any cold/warm cases, but I'm not sure I can tell the consequences of doing that. Or you can write your own PushMeterRegistry that publishes to Elastic. Based on this, I think the fix this issue is about has nothing to do with your use case.

I think what is happening in your screenshot is ok (if one request never arrived); could you please verify that the number of "Handling a request..." messages matches what Micrometer published to Elastic?

ppawlowski15 commented 1 year ago

OK, everything is clear now. It was just my misunderstanding of this issue.

I've created my custom PushMeterRegistry and now it works like a charm, without any workarounds.

Thanks for your help and time.

lenin-jaganathan commented 1 year ago

@shakuzen / @jonatan-ivanov Now that the issues around this fix are resolved, can this be considered a candidate for a 1.9.x patch?

shakuzen commented 1 year ago

Can this be considered a candidate for 1.9.x patch?

Eventually, I think so. Recent follow-up fixes related to this are only available in snapshots so far, and we haven't received much feedback that people have tried this and verified it is working well for them and not causing any other issues. 1.11.0-RC1 will be released on Monday. Hopefully that will be an opportunity to get more people to try it out. After that, we can consider the risk of backporting, and I'm cautiously optimistic we can do it. We haven't got to a GA release with the current solution yet, so there may be some more tinkering, and I would rather do that in milestone/RC releases than in GA patch releases.

Edit: opened #3759 for backporting

supriya-albal-polestar commented 1 year ago

@ppawlowski15 @jonatan-ivanov I am running into a similar issue while publishing metrics from a Lambda. If I set the "cloudwatch.step" value to 1 second and my Lambda finishes before 1 second, I am not seeing any metrics. Shutdown hooks do not work for AWS Lambda. From your discussion, I am not clear on how I can implement a custom PushMeterRegistry, as CloudWatchRegistry extends the step registry. Any examples or pointers would be helpful. Thanks!

shakuzen commented 1 year ago

@supriya-albal-polestar could you please open a new issue with details of what you're seeing? As far as I remember, shutdown hooks do work on AWS Lambda for Java, but the process will not be shut down after each invocation, and it may take some time after the last invocation before the process is shut down.

supriya-albal-polestar commented 1 year ago

Thanks @shakuzen . I have raised https://github.com/micrometer-metrics/micrometer/issues/4060. I tried waiting for quite a long time, and still was not able to see metrics after adding a shutdown hook.