radekmie / changestream-to-redis

Automatically publish MongoDB changes to Redis for Meteor.
MIT License

Expose service metrics #3

Closed radekmie closed 3 months ago

radekmie commented 5 months ago

As the readme says:

Limitations: No monitoring. There is no monitoring of any kind, but both a health-checking endpoint and Prometheus metrics are planned.

It feels even more important now, since #2 seems not debuggable without them. For starters, it would be fine to have only the following:

radekmie commented 4 months ago

@landland would you be able to try out the new https://github.com/radekmie/changestream-to-redis/tree/metrics branch? It exposes three metrics: time of the last processed MongoDB event, number of processed MongoDB events, and number of Redis invocations. The last two should be equal at all times, and the first should not fall far behind.

landland commented 4 months ago

Cool thank you! It looks like it is running. Here's some sample metrics output:

# HELP last_event Timestamp of last MongoDB event
# TYPE last_event gauge
last_event 1719169380
# HELP mongo Number of MongoDB events
# TYPE mongo counter
mongo 39
# HELP redis Number of Redis invokes
# TYPE redis counter
redis 39
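For anyone wiring these up: the output above is the Prometheus text exposition format, which a tiny parser can read. A minimal sketch (it ignores labels, which is enough for these three metrics; not part of the project itself):

```python
def parse_metrics(text):
    """Parse the Prometheus text exposition format shown above into a
    dict of metric name -> value. Skips HELP/TYPE comment lines."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # comments and blanks carry no sample values
        name, value = line.split()
        values[name] = float(value)
    return values
```

With the sample output above, `parse_metrics` yields `mongo` and `redis` both at 39, making the equality check trivial to automate.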

I'll keep this in testing for a week and then, if all is well, I'll push it live. Question: could you clarify what you mean by "the first should not fall far behind"? I guess you mean that it should be close to the current time, and if it's really old, something is not quite right? And what would it mean if the last two metrics are not equal?

Thanks again for this. 🙏

radekmie commented 4 months ago
  1. last_event should be roughly the current timestamp, assuming your database has no "breaks". If it falls behind, e.g., shows a timestamp a few minutes old, that may explain the lack of reactivity we're seeing.
  2. Every incoming event increments mongo, and once it is processed and pushed to Redis, redis is incremented. The two should stay equal as long as there isn't a huge backlog of events being processed.
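The two checks above can be expressed as a small watchdog. A hedged Python sketch (the metric names are the ones from the output earlier; the 60-second lag threshold is an arbitrary choice, not a project default):

```python
import time

def health(metrics, max_lag_seconds=60):
    """Apply the two sanity checks: last_event should be recent,
    and the mongo/redis counters should match."""
    lag = time.time() - metrics["last_event"]
    return {
        "lag_seconds": lag,
        "lag_ok": lag <= max_lag_seconds,
        "counters_ok": metrics["mongo"] == metrics["redis"],
    }
```

Feeding it scraped values would flag both failure modes discussed in this thread: a stale last_event and diverging counters.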
landland commented 3 months ago

It looks like mongo and redis did indeed deviate.

# HELP last_event Timestamp of last MongoDB event
# TYPE last_event gauge
last_event 1720224971
# HELP mongo Number of MongoDB events
# TYPE mongo counter
mongo 36492903
# HELP redis Number of Redis invokes
# TYPE redis counter
redis 36491878

The last_event is about 2-3 minutes behind the time I grabbed the metrics. MongoDB CPU is running around 36%; Redis around 66%, which is higher than usual.

When I refreshed the metrics, last_event was pretty much within seconds of the current time, but mongo and redis were still not equal.

Any ideas on what might be happening or what to check next?

radekmie commented 3 months ago

Wow, I didn't expect Redis to be the issue. So let's recap:

  1. The last processed event is 2-3 minutes old, so changestream-to-redis cannot keep up with MongoDB.
  2. The difference between redis and mongo is exactly 1025, which makes sense since the buffer has 1024 slots.
  3. That means MongoDB is not a problem, but pushing events to Redis is.
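The 1025 gap is exactly what a bounded buffer between the change stream and the Redis writer produces: a full 1024-slot buffer plus the one event in the consumer's hands. A toy sketch of that shape (Python threads, purely illustrative, not the actual implementation):

```python
import queue
import threading

BUFFER_SLOTS = 1024  # buffer size from the recap above

def run_pipeline(total_events, consume):
    """Producer increments `mongo` after placing an event in the bounded
    buffer; the consumer increments `redis` after the simulated Redis push.
    The gap mongo - redis is therefore bounded by BUFFER_SLOTS + 1:
    a full buffer plus the one event currently being processed."""
    counters = {"mongo": 0, "redis": 0}
    buf = queue.Queue(maxsize=BUFFER_SLOTS)
    done = object()  # sentinel to stop the consumer

    def producer():
        for event in range(total_events):
            buf.put(event)             # blocks while the buffer is full
            counters["mongo"] += 1
        buf.put(done)

    def consumer():
        while True:
            event = buf.get()
            if event is done:
                return
            consume(event)             # stands in for the Redis round-trip
            counters["redis"] += 1

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counters
```

Once the consumer drains the backlog, both counters converge again, which matches the recovery behavior reported later in the thread.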

A few questions then:

  1. Are you using the DEDUPLICATION flag and deploying multiple changestream-to-redis instances?
  2. How "far" is the changestream-to-redis from your Redis instance (e.g., is it the same data center)? Is there some unnecessary network hop you could get rid of? (I'm asking since I think Redis would hit almost 100% with low-enough latency.)
  3. How much CPU is available to the changestream-to-redis and Redis? (I don't expect the former to be an issue, but maybe the latter...? I don't know yet.)
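On the latency point: without pipelining, each event costs one full Redis round-trip, so serial throughput is capped at roughly 1/RTT. A back-of-the-envelope sketch (the numbers are illustrative, not measured):

```python
def max_events_per_second(rtt_seconds):
    """Upper bound on serial throughput when every event costs one
    Redis round-trip and nothing is batched or pipelined."""
    return 1.0 / rtt_seconds

# Illustrative: a 1 ms round-trip caps serial throughput near
# ~1,000 events/s; 0.1 ms in the same rack allows ~10,000 events/s.
```

That is why even a modest extra network hop can matter far more than CPU here.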
landland commented 3 months ago

My responses:

  1. No, just one changestream-to-redis instance and no DEDUPLICATION
  2. They're both located in the same data center - NYC3 in Digital Ocean, as is the mongodb
  3. This one I'm not sure how to answer. Redis is set up with one standby node, and both nodes have 1 vCPU. changestream-to-redis has 2 vCPUs using Premium Intel as defined by Digital Ocean. Is there something more I can check here to give you a better answer? Do you think upgrading the Redis db specs would help?

By the way, I noticed this after re-installing monti apm. We were running into an incompatibility issue with monti apm and redis-oplog that was fixed. Last night I deployed it live, and that's when I noticed the app non-reactivity and then the metrics discrepancy.

radekmie commented 3 months ago

In that case I'd suggest throwing in one more vCPU for Redis and reducing changestream-to-redis to one vCPU (I strongly believe that will be enough, but please do confirm with your metrics so far).

In the meantime, I'll think about whether I could optimize Redis processing, e.g., by batching sent events.
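A hedged sketch of what such batching could look like, using a redis-py-style pipeline() API (the FakeClient classes are stand-ins so the example runs without a server; this is not the actual changestream-to-redis implementation):

```python
def publish_batched(client, events, batch_size=128):
    """Send events in pipelined batches so that N events cost
    ceil(N / batch_size) round-trips instead of N. `client` is assumed
    to follow the redis-py pipeline() API."""
    for i in range(0, len(events), batch_size):
        pipe = client.pipeline(transaction=False)
        for channel, payload in events[i:i + batch_size]:
            pipe.publish(channel, payload)
        pipe.execute()  # one network round-trip per batch


# Minimal stand-ins so the sketch runs without a Redis server.
class FakePipe:
    def __init__(self, log):
        self.log, self.commands = log, 0

    def publish(self, channel, payload):
        self.commands += 1  # queued, not yet sent

    def execute(self):
        self.log.append(self.commands)  # record batch size per round-trip


class FakeClient:
    def __init__(self):
        self.round_trips = []

    def pipeline(self, transaction=False):
        return FakePipe(self.round_trips)
```

With 300 events and a batch size of 128, this makes 3 round-trips instead of 300, which is the kind of reduction that would relieve a latency-bound Redis link.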

This incompatibility with the MontiAPM agent doesn't seem like a problem since it was happening on the server side. Thanks for the heads up!

landland commented 3 months ago

Thank you! I've upgraded Redis with more RAM and 2 vCPU. I'll let you know how it goes. I checked the metrics and mongo and redis are equal again.

landland commented 3 months ago

It seems like the upgraded Redis didn't entirely solve the issue: when MongoDB gets busy, the metrics show a difference of 1025 between redis and mongo, as you previously noted, and it takes quite a while to recover. Hopefully you'll be able to come up with a way to optimize the Redis processing 🤞

radekmie commented 3 months ago

Ouch. And do you maybe graph those metrics? Or at least collect them periodically? I'd like to see how many events per second/minute we're talking about here.

landland commented 3 months ago

I can do that the next time this happens. Or do you mean even when it isn't happening as well?

radekmie commented 3 months ago

Ideally you'd collect them 24/7. These are Prometheus-style metrics, so you can gather them and chart them in Grafana or a similar tool.
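For the collection side, a minimal Prometheus scrape configuration could look like this (a sketch: the target address, port 4000, and the intervals are assumptions to adjust for your deployment):

```yaml
# Minimal prometheus.yml sketch for scraping changestream-to-redis.
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: changestream-to-redis
    # metrics_path may need adjusting depending on where the
    # metrics endpoint is mounted in your setup.
    static_configs:
      - targets: ["172.17.0.2:4000"]
```

Grafana then points at the Prometheus Server, not at changestream-to-redis itself.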

landland commented 3 months ago

Feel free to tell me this is beyond the scope of what you can help with, but I'm having a heck of a time getting Prometheus to return JSON, as required by Grafana. It seems when I try to add the source to Grafana I get the error "ReadObject: expect { or , or } or n, but found #, error found in #1 byte of...".

It's returning the text format. According to this article from Prometheus, I'm supposed to query the API using the /api/v1/query endpoint.

When I try to curl from the host machine of the changestream-to-redis docker container, I get the same text format.

curl http://172.17.0.2:4000/api/v1/query?query=up
# HELP last_event Timestamp of last MongoDB event
# TYPE last_event gauge
last_event 1721352857
# HELP mongo Number of MongoDB events
# TYPE mongo counter
mongo 205
# HELP redis Number of Redis invokes
# TYPE redis counter
redis 205

172.17.0.2 is the changestream-to-redis docker internal IP. So it seems like I'm not able to get the JSON and so Grafana won't connect. Any ideas?

radekmie commented 3 months ago

It works like this: changestream-to-redis exposes metrics in the Prometheus text format, a Prometheus Server scrapes and stores them, and Grafana queries the Prometheus Server's HTTP API (the /api/v1/query endpoint), which is what returns JSON.

And it looks like you skipped Prometheus Server and wanted to connect Grafana to changestream-to-redis directly.

landland commented 3 months ago

That was it, you are the man. Thank you 🙏. I have it collecting now so I can share some graphs with you the next time this happens.
