pnxtech / hydra

A light-weight library for building distributed applications such as microservices
https://www.hydramicroservice.com
MIT License

A warning about storing application data in the same Redis db as Hydra metadata #192

Closed · rantecki closed 6 years ago

rantecki commented 6 years ago

I've just learned the hard way that it is NOT a good idea to mix your application data with Hydra's data in the default Redis database (db 0) if your services generate a lot of keys.

I've just spent an entire day trying to get to the bottom of some very puzzling performance issues in a large system we launched a few days ago. As the number of keys in Redis kept climbing, the system got slower and slower until it was unusable. After a lot of digging, it turned out that the SCAN MATCH command used by Hydra's checkPresence() function was the culprit. SCAN MATCH iterates through the entire keyspace looking for a needle in a haystack, so with a database of 100,000+ keys that's 1000+ SCAN calls just to check service presence (with a chunk size of 100). It easily brought an R4.large ElastiCache instance to its knees under moderate client load. Moving the app data out to a separate db immediately solved the issue.
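To illustrate the mechanics (this is not Hydra's actual code; the ioredis client and the key pattern are just stand-ins), here's roughly what a SCAN MATCH based lookup costs: the number of round trips scales with the total number of keys in the db, not with the number of matching keys.

```js
// Illustration only (not Hydra's actual code): a SCAN MATCH loop walks the
// *entire* keyspace and filters with MATCH afterwards, so the number of
// round trips grows with total keys, not with the number of matches.
const Redis = require('ioredis');
const redis = new Redis({ host: '127.0.0.1', port: 6379, db: 0 });

async function countScanCalls(pattern) {
  let cursor = '0';
  let calls = 0;
  let matches = 0;
  do {
    // Each call inspects roughly COUNT keys, so a db with 100,000+ keys
    // needs 1000+ round trips even if only a handful of keys match.
    const [next, keys] = await redis.scan(cursor, 'MATCH', pattern, 'COUNT', 100);
    cursor = next;
    calls += 1;
    matches += keys.length;
  } while (cursor !== '0');
  return { calls, matches };
}

// The pattern below is a placeholder; the point is the call count, not the keys.
countScanCalls('*:service-name:*:presence')
  .then(({ calls, matches }) => console.log(`SCAN calls: ${calls}, matching keys: ${matches}`))
  .finally(() => redis.quit());
```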

Perhaps it wouldn't be a bad idea to add a prominent note to the docs somewhere about this situation, to save some other people a world of pain.

Heed my warnings!

sjmcdowall commented 6 years ago

Very interesting, and yes, I would think it would always be a pretty good idea to isolate such different workloads.

However, it does make me wonder if there isn't some inherent performance issue with checkPresence itself as well. That behavior (the heavy resource usage under load) sounds odd…


cjus commented 6 years ago

@rantecki thanks for your investigation and post! The issue you discovered applies to any Redis command that matches against key patterns; KEYS is particularly bad in that context.

We learned of that issue early on and ALWAYS choose Redis DB number 15 to store hydra keys. We're also moving towards using a dedicated Redis cluster for Hydra, letting our microservices use another Redis cluster if they need Redis functionality. This information will be part of my RedisConf18 presentation, and I'll also update our documentation to highlight this issue.
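For reference, here's a minimal config sketch of that setup. It assumes the hydra.redis block with url/port/db fields shown in the docs at hydramicroservice.com (newer versions may accept a single Redis URL instead); the service name and the ioredis client used for application data are just placeholders.

```js
// Sketch: keep Hydra's metadata in its own Redis db (or its own cluster)
// and give the application a separate client for its own data.
const hydra = require('hydra');
const Redis = require('ioredis'); // any Redis client will do for app data

const config = {
  hydra: {
    serviceName: 'example-service',   // placeholder service name
    serviceIP: '',
    servicePort: 3000,
    serviceType: '',
    serviceDescription: 'example service',
    redis: {
      url: '127.0.0.1',               // or a dedicated Redis cluster endpoint
      port: 6379,
      db: 15                          // Hydra metadata isolated in db 15
    }
  }
};

// Application data lives elsewhere: a different db index at minimum, or
// ideally a separate Redis instance so Hydra's SCANs never walk app keys.
const appRedis = new Redis({ host: '127.0.0.1', port: 6379, db: 0 });

hydra.init(config)
  .then(() => appRedis.set('app:example-key', 'value'))
  .catch((err) => console.error(err));
```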

rantecki commented 6 years ago

Yes, well I guess my thought process was that Hydra's already doing all the hard work in connecting to Redis, so I'd just piggyback off that and use the same connection. It saved me some time when building the initial prototype.

Anyway, now that that's been sorted out, the whole thing is running pretty smoothly, so well done for building such a great platform.