radekmie closed this issue 3 months ago
@landland would you be able to try out the new https://github.com/radekmie/changestream-to-redis/tree/metrics branch? It exposes three metrics: time of the last processed MongoDB event, number of processed MongoDB events, and number of Redis invocations. The last two should be equal at all times, and the first should not fall far behind.
Cool thank you! It looks like it is running. Here's some sample metrics output:
```
# HELP last_event Timestamp of last MongoDB event
# TYPE last_event gauge
last_event 1719169380
# HELP mongo Number of MongoDB events
# TYPE mongo counter
mongo 39
# HELP redis Number of Redis invokes
# TYPE redis counter
redis 39
```
I'll keep this in testing for a week and then if all is well I'll push it live. Question, could you clarify what you mean by "the first should not fall far behind"? I guess you mean that it should be close to the current time, and if it's really old and not close to the current time, something is not quite right? And what would it mean if the last two metrics are not equal?
Thanks again for this. 🙏
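To make those two checks concrete, here is a minimal Python sketch that parses the exposition text above and computes both the counter backlog and the `last_event` lag. It assumes you fetch the endpoint yourself (e.g., with curl); `parse_metrics` is a hypothetical helper, not part of changestream-to-redis:

```python
import time

def parse_metrics(text):
    """Parse Prometheus text exposition into {name: float}."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comments and blank lines
        name, _, value = line.partition(" ")
        values[name] = float(value)
    return values

# Sample output copied from this thread.
SAMPLE = """\
# HELP last_event Timestamp of last MongoDB event
# TYPE last_event gauge
last_event 1719169380
# HELP mongo Number of MongoDB events
# TYPE mongo counter
mongo 39
# HELP redis Number of Redis invokes
# TYPE redis counter
redis 39
"""

m = parse_metrics(SAMPLE)
backlog = m["mongo"] - m["redis"]    # should stay at 0
lag = time.time() - m["last_event"]  # seconds since the last event
print(f"backlog={backlog:.0f} lag={lag:.0f}s")
```

A non-zero `backlog` or a `lag` of more than a few seconds under load would be the signals to look at.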
`last_event` should be roughly the current timestamp, assuming your database has no "breaks". If it falls behind, e.g., shows a timestamp a few minutes old, that may be the reason we see that lack of reactivity. Every MongoDB event increases `mongo`, and after it is processed, it is pushed to Redis, increasing `redis`. They should always be the same as long as there's no huge number of events being processed.

It looks like `mongo` and `redis` did indeed deviate.
```
# HELP last_event Timestamp of last MongoDB event
# TYPE last_event gauge
last_event 1720224971
# HELP mongo Number of MongoDB events
# TYPE mongo counter
mongo 36492903
# HELP redis Number of Redis invokes
# TYPE redis counter
redis 36491878
```
The `last_event` is about 2-3 minutes behind the time I grabbed the metrics. MongoDB CPU is running around 36%; Redis around 66%, which is higher than usual.

When I refreshed the metrics, `last_event` was pretty much within seconds, and `mongo` and `redis` were still not equal.
Any ideas on what might be happening or what to check next?
Wow, I didn't expect Redis to be the issue. So let's recap:

- `changestream-to-redis` cannot catch up with MongoDB.
- The difference between `redis` and `mongo` is exactly 1025, which makes sense since the buffer has 1024 slots.

A few questions then:

- Are you using the `DEDUPLICATION` flag and deploying multiple `changestream-to-redis` instances?
- How far is `changestream-to-redis` from your Redis instance (e.g., is it in the same data center)? Is there some unnecessary network hop you could get rid of? (I'm asking since I think Redis would hit almost 100% with low-enough latency.)
- What are the specs of `changestream-to-redis` and Redis? (I don't expect the former to be an issue, but maybe the latter...? I don't know yet.)

My responses:
We have one `changestream-to-redis` instance and no `DEDUPLICATION`. Redis has … vCPU, and `changestream-to-redis` has 2 vCPU using Premium Intel as defined by Digital Ocean. Is there something more I can check here that can give you a better answer? Do you think upgrading the Redis db specs would help?

By the way, I noticed this after re-installing `monti apm`. We were running into an incompatibility issue with `monti apm` and `redis-oplog` that was fixed. Last night I deployed it live, and that's when I noticed the app's non-reactivity and then the metrics discrepancy.
In that case I'd suggest throwing in one more CPU for Redis and reducing `changestream-to-redis` to one (I strongly believe it will be enough, but please do confirm with your metrics so far).
In the meantime, I'll think about whether I could optimize Redis processing, e.g., by batching sent events.
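One way such batching could work is a size- and time-bounded buffer that flushes many events in one round trip. This is only a sketch of the idea, not the project's actual implementation; `flush_fn`, the class name, and the thresholds are all assumptions (in practice `flush_fn` would be, e.g., a pipelined set of Redis publishes):

```python
import time

class Batcher:
    """Buffer events and flush them in batches to cut Redis round trips."""

    def __init__(self, flush_fn, max_batch=128, max_delay=1.0):
        self.flush_fn = flush_fn    # stands in for the real Redis call
        self.max_batch = max_batch  # flush after this many events...
        self.max_delay = max_delay  # ...or after this many seconds
        self.buffer = []
        self.last_flush = time.monotonic()

    def add(self, event):
        self.buffer.append(event)
        if (len(self.buffer) >= self.max_batch
                or time.monotonic() - self.last_flush >= self.max_delay):
            self.flush()

    def flush(self):
        if self.buffer:
            self.flush_fn(self.buffer)  # one round trip for many events
            self.buffer = []
        self.last_flush = time.monotonic()

batches = []
b = Batcher(batches.append, max_batch=3, max_delay=60)
for i in range(7):
    b.add(i)
b.flush()  # drain the tail
print(batches)  # → [[0, 1, 2], [3, 4, 5], [6]]
```

The trade-off is latency: events sit in the buffer until a flush triggers, so `max_delay` bounds how much reactivity you give up for throughput.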
This incompatibility with the MontiAPM agent doesn't seem like a problem since it was happening on the server side. Thanks for the heads up!
Thank you! I've upgraded Redis with more RAM and 2 vCPU. I'll let you know how it goes. I checked the metrics and `mongo` and `redis` are equal again.
It seems like the upgraded Redis didn't entirely solve the issue: when MongoDB gets busy, the metrics show a difference of 1025 between `redis` and `mongo`, as you previously noted, and it takes quite a while to recover. Hopefully you'll be able to come up with a way to optimize the Redis processing 🤞
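The stable 1025 gap matches a bounded buffer of 1024 slots plus one event held in flight. A toy Python model of that reading (the real service is not implemented like this; it only illustrates where 1025 comes from when the Redis side stalls completely):

```python
from queue import Queue, Full

buffer = Queue(maxsize=1024)  # the 1024-slot buffer from the thread

mongo = 0  # events read from the change stream
redis = 0  # events delivered to Redis (stalled: stays 0)

in_flight = None
while True:
    event = mongo  # pretend the change stream always has a next event
    if in_flight is None:
        in_flight = event  # one event held outside the buffer
        mongo += 1
        continue
    try:
        buffer.put_nowait(event)
        mongo += 1
    except Full:
        break  # the reader blocks once the buffer is full

print(mongo - redis)  # → 1025
```

Once Redis catches up, the buffer drains and the counters converge again, which is exactly the slow recovery described above.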
Ouch. And do you maybe graph those metrics? Or at least collect them periodically? I'd like to see how many events per second/minute we're talking about here.
I can do that the next time this happens. Or do you mean even when it isn't happening as well?
Ideally you'd collect them 24/7. These are Prometheus-style metrics, so you can gather them and chart in Grafana or a similar tool.
Feel free to tell me this is beyond the scope of what you can help with, but I'm having a heck of a time getting Prometheus to return JSON, as required by Grafana. When I try to add the source to Grafana, I get the error "ReadObject: expect { or , or } or n, but found #, error found in #1 byte of...". It's returning the text format. According to this article from Prometheus, I'm supposed to query the API using the `/api/v1/query` endpoint.
When I try to curl from the host machine of the changestream-to-redis docker container, I get the same text format.
```
curl http://172.17.0.2:4000/api/v1/query?query=up
# HELP last_event Timestamp of last MongoDB event
# TYPE last_event gauge
last_event 1721352857
# HELP mongo Number of MongoDB events
# TYPE mongo counter
mongo 205
# HELP redis Number of Redis invokes
# TYPE redis counter
redis 205
```
`172.17.0.2` is the changestream-to-redis Docker-internal IP. So it seems like I'm not able to get the JSON, and so Grafana won't connect. Any ideas?
It works like this:

- `changestream-to-redis` is your Prometheus data source (metric source, target, etc.).
- Prometheus Server scrapes it periodically and stores the samples.
- Grafana queries Prometheus Server (that's where `/api/v1/query` lives), not the data source itself.

And it looks like you skipped Prometheus Server and wanted to connect Grafana to `changestream-to-redis` directly.
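In other words, a Prometheus Server needs to sit between the two. A minimal `prometheus.yml` sketch for that setup (the job name and scrape interval are assumptions; the target address is the one from the curl example above):

```yaml
# prometheus.yml — Prometheus Server scrapes the metrics endpoint;
# Grafana then uses Prometheus itself (default port 9090) as its data source.
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: changestream-to-redis
    static_configs:
      - targets: ["172.17.0.2:4000"]
```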
That was it, you are the man. Thank you 🙏. I have it collecting now so I can share some graphs with you the next time this happens.
As the readme says:
It feels even more important now, since #2 doesn't seem debuggable without them. For starters, it would be fine to have only the following: