seomoz / pyreBloom

Fast Redis Bloom Filters in Python

Connection pooling? #8

Open skyshard opened 10 years ago

skyshard commented 10 years ago

Is there a good way to share redis connections across different instances of pyreBloom?

I'm currently using a rotating pool of filters to implement a sort of TTL: newly seen URLs get added to the current filter, but all filters are checked for membership. At some set interval (an hour, a day, etc.), the oldest filter gets cleared out and reused as the current one.

This works out pretty well, except that it uses up lots of connections. Is there a good way to reuse connections between filters, or to specify the key name to check? Or should I take an entirely different approach to expiring old URLs?
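For reference, here's a minimal sketch of what that rotation looks like on my end; the pyreBloom constructor and method names follow the project README, and the slot count, capacity, and error rate are just illustrative:

```python
# Rotating pool of bloom filters as a poor-man's TTL (sketch; values illustrative).
from pyreBloom import pyreBloom

NUM_SLOTS = 7  # e.g. one filter per day

filters = [
    pyreBloom('urls:slot:%d' % i, 10000000, 0.01, host='localhost', port=6379)
    for i in range(NUM_SLOTS)
]
current = 0  # index of the filter that receives new URLs


def seen(url):
    """A URL counts as seen if any live filter contains it."""
    return any(f.contains(url) for f in filters)


def add(url):
    filters[current].add(url)


def rotate():
    """Run on the chosen interval: the oldest filter is cleared and becomes current."""
    global current
    current = (current + 1) % NUM_SLOTS
    filters[current].delete()
```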

dlecocq commented 10 years ago

There's not currently a way to pool connections between filters :-/ That said, what you're doing to implement expiring bloom filters is exactly how we do it and is commonly how it's done elsewhere.

How many filters do you have at any one time?

skyshard commented 10 years ago

Ah, it's good to hear that you do it the same way. I only have 7 filters at once, but that gets multiplied by each Celery worker making its own connections to the filters, which ends up being a few thousand connections in practice. I'll probably try sharing them across the different processes.

dlecocq commented 10 years ago

Is the number of connections problematic at the redis server level? It uses epoll/kqueue as available, so the number of connections shouldn't be an issue on that front. If you're hitting limits, there are both redis-level limits (maxclients) and ulimit open file descriptor limits, and they can be bumped substantially.
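For reference, bumping those limits looks roughly like this (the numbers are just examples, not recommendations):

```
# redis.conf: raise the client connection cap
maxclients 50000

# shell, before starting redis-server: raise the open file descriptor limit
ulimit -n 65536
```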

Assuming it's not actual networking overhead causing the heartache, at the end of the day all your Celery workers are interacting with a single shared resource, so it seems likely that Redis's performance will eventually become the bottleneck.

For some context, there are a few projects for which we use pyreBloom (in fact, for URL deduping, too). One of them uses 2 modest machines with 4 redis-server instances each, and processes tens of millions of URLs per day using about 1% CPU average. The other uses 4 m2.xlarges with 4 redis-server instances each to sift through hundreds of millions of URLs per day using about 10% CPU average.

skyshard commented 10 years ago

Those are good numbers for reference, thanks! Are you partitioning across multiple redis-server instances on the same EC2 machines for performance reasons, and did you find that to be better than running a single redis-server instance on each box?

I'm currently at around 45k reads per second without pipelining (on a hosted solution actually, on what appears to be m2.2xlarges) and was somewhat concerned about the number of open connections, but judging by your experiences it shouldn't be a big issue (except for hosted plans with connection limits). Thanks for all the advice!
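(Aside: as I understand the pyreBloom interface from its README, extend() and contains() also accept lists and pipeline the underlying Redis commands, so batching checks is one way to squeeze out more throughput. A rough sketch, with made-up names and values:)

```python
# Hedged sketch of batched checks/adds; the key, capacity, and error are illustrative.
from pyreBloom import pyreBloom

bloom = pyreBloom('urls:current', 10000000, 0.01)

urls = ['http://example.com/a', 'http://example.com/b', 'http://example.com/c']
found = bloom.contains(urls)                    # with a list, returns the subset already present
new_urls = [u for u in urls if u not in set(found)]
if new_urls:
    bloom.extend(new_urls)                      # pipelined: one round trip for all the adds
```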

dlecocq commented 10 years ago

We treat the servers as just host:port pairs from the client side, but we run multiple redis-server processes on each box. The reason is just that redis-server is single-threaded (aside from its background saves), so each process can use only one core.

It may also help to give some context about the bloom filter capacities. IIRC, we generally use a capacity of about 1e9 for each month partition, and I think it uses 7 or so hashes for each filter.
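(Side note: by the usual rule of thumb k ≈ -log2(p), an error rate around 1% works out to about 7 hash functions, which lines up with that figure.) For illustration only, client-side partitioning along those lines might look like the sketch below; the hosts, ports, key name, and capacity/error values are made up, not our actual configuration:

```python
# Sketch: hash each URL to one of several independent redis-server processes,
# each holding its own bloom filter. All names and numbers are illustrative.
import zlib
from pyreBloom import pyreBloom

SHARDS = [('10.0.0.1', 6379), ('10.0.0.1', 6380),
          ('10.0.0.2', 6379), ('10.0.0.2', 6380)]

filters = [
    pyreBloom('urls:2014-06', 1000000000, 0.01, host=h, port=p)
    for h, p in SHARDS
]


def shard_for(url):
    """Map a URL to the same shard every time so lookups stay consistent."""
    return filters[zlib.crc32(url.encode('utf-8')) % len(filters)]


def add(url):
    shard_for(url).add(url)


def seen(url):
    return shard_for(url).contains(url)
```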

skyshard commented 10 years ago

For what it's worth, apparently there is performance degradation with high connection counts:

[Benchmark chart from http://redis.io/topics/benchmarks: requests/second vs. number of open connections]

As a rule of thumb, an instance with 30,000 connections can only process half the throughput achievable with 100 connections.