vutran1710 / PyrateLimiter

⚔️Python Rate-Limiter using Leaky-Bucket Algorithm Family
https://pyratelimiter.readthedocs.io
MIT License

Async & Redis problem #55

Closed. insspb closed this issue 1 year ago.

insspb commented 3 years ago

I faced a problem when using pyratelimiter with an async + Redis + Celery multi-worker setup in the same docker-compose stack (scary? :)

Sometimes requests slip past the limiter's own protection (raise_or_delay) and are only stopped by the final protection of the API itself (a real 429 response). At first I thought it was related to the monotonic clock, but after reading Python's PEP 418 it became clear that within a single Linux instance the clocks are the same.

So I continued the research and found at least one problem: the Redis bucket uses a regular list instead of a sorted set. With the async approach and delays, we get this:

[screenshot attached to the original issue]

I am continuing the research and may open a PR. For now I wanted to share this info for future reference/investigation.
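For reference, a minimal sketch of what a sorted-set-based check could look like with redis-py (hypothetical key and helper names, not the library's actual implementation): scoring members by their timestamp lets expired entries be trimmed with ZREMRANGEBYSCORE regardless of insertion order, which a plain list cannot guarantee with concurrent, delayed writers.

    import time
    import uuid

    import redis

    r = redis.Redis()

    def allow_request(key, limit, window_seconds):
        """Sliding-window check backed by a sorted set (illustration only)."""
        now = time.time()
        # Trim by score (timestamp), so out-of-order inserts are still expired correctly.
        r.zremrangebyscore(key, 0, now - window_seconds)
        if r.zcard(key) < limit:
            # The score is the timestamp; the member just needs to be unique.
            r.zadd(key, {f"{now}-{uuid.uuid4().hex}": now})
            return True
        return False

    # Note: this check-then-add is still two round trips and not atomic across
    # processes; see the Lua discussion further down in this thread.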

vutran1710 commented 3 years ago

What you are facing seems related to issue #53.

Apparently there are some problems in distributed systems regarding timing consensus between clocks. I'll get to it once I find some spare time.

insspb commented 3 years ago

@vutran1710 #53 is only part of the problem.

I have an idea how to fix it, but for now I have implemented only #53 in #56, as it will be required to complete the fix for this issue.

vutran1710 commented 3 years ago

Interesting, I would love to hear about it from you.

On the other hand, today I spent some time thinking about it. Apparently this problem of clock synchronization has something to do with NTP. The point is, to ensure correct timing between multiple nodes in a cluster or a distributed system, the time has to come from a single source of truth, and all the working nodes have to rely on it instead of their local clocks.

In our particular case with Redis, following the official documentation (https://docs.redis.com/latest/rs/administering/designing-production/synchronizing-clocks/), it seems the proper solution would be to utilize Redis's TIME command. So one of the solutions I've been thinking about is to implement the limiter logic directly on Redis itself with a Lua script.
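As a rough illustration of that idea (a sketch only, with hypothetical key and argument names, not a finished design), the whole trim-check-insert sequence can run atomically inside Redis and use the server's own clock via TIME. A script that calls TIME before writing relies on Redis replicating script effects, which is the default in recent Redis versions:

    import redis

    r = redis.Redis()

    # KEYS[1] = bucket key, ARGV[1] = request limit, ARGV[2] = window in seconds.
    SLIDING_WINDOW_LUA = """
    -- Enable effect replication so TIME can precede writes (a no-op on modern Redis).
    redis.replicate_commands()
    local t = redis.call('TIME')
    local now = tonumber(t[1]) + tonumber(t[2]) / 1000000
    redis.call('ZREMRANGEBYSCORE', KEYS[1], 0, now - tonumber(ARGV[2]))
    if redis.call('ZCARD', KEYS[1]) < tonumber(ARGV[1]) then
        redis.call('ZADD', KEYS[1], now, tostring(now) .. '-' .. tostring(math.random()))
        return 1
    end
    return 0
    """

    allow = r.register_script(SLIDING_WINDOW_LUA)

    def allow_request(key, limit, window_seconds):
        # The script runs atomically on the Redis server, using the server clock.
        return bool(allow(keys=[key], args=[limit, window_seconds]))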

insspb commented 3 years ago

@vutran1710

I am thinking about lua too as an option.

Currently I am locally playing with just a RedisBucket extension:

    def get_time(self):
        conn = self.get_connection()
        seconds, microseconds = conn.time()
        return seconds + microseconds / 1_000_000

This allows using the Redis server time as the time source.

vutran1710 commented 3 years ago

> @vutran1710
>
> I am thinking about lua too as an option.
>
> Currently I am locally playing with just a RedisBucket extension:
>
>     def get_time(self):
>         conn = self.get_connection()
>         seconds, microseconds = conn.time()
>         return seconds + microseconds / 1_000_000
>
> This allows using the Redis server time as the time source.

Yes, but I'm concerned that this approach can be problematic and might not work if the network latency is unstable.

insspb commented 3 years ago

@vutran1710 That's why I did not include it in the current PR. I'm thinking about this too.

insspb commented 3 years ago

@vutran1710 Today's runs show that #56 plus the time approach mentioned above is enough for my purposes at 300 req/min and will work for me. I just set 290 as my own safety margin.

For high-frequency limits my tests show some unexpected requests slipping through. I.e. if we set 5/sec (the same as 300/min, but with a different burst) we see 6 or 7 requests pass when using 2/3/4 processes. I never caught this with a single-process run.
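If I read this right (my interpretation, not a confirmed diagnosis), the slippage comes from the check and the write being separate round trips, so several workers can all pass the check before any of them records its request. A toy, self-contained reproduction of that race, using threads and an artificial delay in place of separate processes and Redis latency:

    import threading
    import time

    LIMIT = 5
    WINDOW_SECONDS = 1.0
    bucket = []    # shared "bucket"; nothing makes check + write atomic
    results = []

    def try_request():
        now = time.time()
        count = len([t for t in bucket if t > now - WINDOW_SECONDS])  # check
        time.sleep(0.01)  # stand-in for the round trip between check and write
        if count < LIMIT:
            bucket.append(now)  # write lands after other workers already checked
            results.append(True)
        else:
            results.append(False)

    workers = [threading.Thread(target=try_request) for _ in range(8)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    # Typically prints more than LIMIT, because every worker checked an empty bucket.
    print(f"{sum(results)} of 8 requests passed with a limit of {LIMIT}")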

medihack commented 2 years ago

I want to use this great rate-limiting library with Dagster and the DagsterRunLauncher, which starts every task in its own Docker container. Could this issue also occur when all containers run on the same host?

Regarding the Lua script, I accidentally found plsmphnx/redis-bucket, a rate limiter for Node.js, which also implements the leaky bucket algorithm using a Lua script. Maybe this helps.

vutran1710 commented 2 years ago

@medihack The thing about rate limiting is that the calculation should be correct in most situations. In CS terms, I believe it belongs to the Consistency category of the CAP theorem. Because of that, it becomes very hard to maintain availability without compromising correctness when you apply it in a distributed environment, i.e. the latency between nodes, whether they are Redis or Python-based services, will badly affect the rate-check result.

So my 2 cents: if you want rate limiting, you may want to come up with a system design where all rate-limit checks are done by a single machine only, either Redis with a Lua script or a request proxy server written in Python that uses this library, with a minimal number of external data sources.

I don't know enough about your use case, but I think that's how I would approach the problem, for now.
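For illustration, a minimal sketch of such a single gate, assuming the library's 2.x API (RequestRate, Duration, Limiter.try_acquire, BucketFullException) and a hypothetical call_upstream helper; every worker would send its outbound calls through this one process instead of rate-checking locally:

    import requests
    from pyrate_limiter import BucketFullException, Duration, Limiter, RequestRate

    # One limiter instance in one process; all outbound traffic funnels through it,
    # so there is no clock or latency disagreement between workers.
    limiter = Limiter(RequestRate(290, Duration.MINUTE))

    def call_upstream(url):
        try:
            limiter.try_acquire("outbound-api")   # raises when the bucket is full
        except BucketFullException:
            return None                           # caller backs off and retries
        return requests.get(url, timeout=10)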

vutran1710 commented 2 years ago

@medihack If all containers run on the same host, you likely won't have problems with latency, I think. But still, some experiments should be done to confirm that.

medihack commented 2 years ago

@vutran1710 Thanks for the information and the excellent library. I will experiment a bit and report back if any problems occur.

vutran1710 commented 1 year ago

Closing as it is outdated.