praw-dev / praw

PRAW, an acronym for "Python Reddit API Wrapper", is a Python package that allows for simple access to Reddit's API.
http://praw.readthedocs.io/
BSD 2-Clause "Simplified" License

Throw rate limit exception when wait time is greater than X seconds #1970

Closed pongstylin-rht closed 7 months ago

pongstylin-rht commented 10 months ago

Describe the solution you'd like

I would like a PRAW request to throw an exception if the rate limit wait time is more than X seconds. I would also like that exception to include the wait time so that I can reschedule another attempt at a later time.

My use case involves a queue of API calls to various API providers including reddit. Processing that queue is multi-threaded, but it is still a waste of time for a thread to sleep for 8 minutes when it could be processing other items in the queue. So, by having praw throw an exception with the wait time included, I can reschedule that item in the queue to be started when it is ready.
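
For illustration, the calling side might look something like this (RateLimitExceeded, its sleep_seconds attribute, and the queue API are all hypothetical; no such exception exists in PRAW today):

import time

try:
    submission = reddit.submission(id="abc123")
    comments = submission.comments.list()
except RateLimitExceeded as exc:  # hypothetical exception
    # Requeue the task instead of sleeping this worker thread.
    task.ready_at = time.time() + exc.sleep_seconds  # hypothetical attribute
    queue.requeue(task)  # hypothetical per-user queue API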

Describe alternatives you've considered

Not using PRAW to make requests, or using a third-party API to get Reddit data.

Additional context

There is already a "ratelimit_seconds" parameter in the Reddit constructor, which confusingly does not give me what I want. It is only applied to rate limit errors returned in the JSON body of an API response. I believe this is called the "pause" time, and it only applies to certain "write" operations. My use case is strictly read access, where rate limiting is communicated via HTTP headers, which I believe is called the "wait" time. So, you may choose to apply "ratelimit_seconds" to both contexts or add a second parameter.
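
For reference, here is how ratelimit_seconds is passed today (placeholder credentials; as noted, it only governs rate limit errors returned in the JSON body, not the header-based throttling):

import praw

reddit = praw.Reddit(
    client_id="...",
    client_secret="...",
    user_agent="my-app/1.0 (by u/example)",
    ratelimit_seconds=300,  # upper bound on how long PRAW will wait out a JSON-body ratelimit error
)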

github-actions[bot] commented 9 months ago

This issue is stale because it has been open for 30 days with no activity.

Remove the Stale label or comment or this will be closed in 30 days.

pongstylin-rht commented 9 months ago

Not stale bump. Still interested in a solution.

Watchful1 commented 9 months ago

I'm a bit curious about your use case here. PRAW automatically throttles your requests as you make them, so you should never have a wait longer than a handful of seconds. How are you ending up with an 8 minute wait? You would have to bypass PRAW's rate limiter somehow and make lots of requests in a short time.

pongstylin-rht commented 9 months ago

Yes, PRAW does automatically throttle the requests. That's fine unless the wait time gets up to several minutes, hence my request. As for why you may never have seen such long throttle times, perhaps you just don't hit the API frequently enough. My use case is user-driven: the more users requesting data, the more requests made of the API in a given timeframe. The queue I mentioned is a per-user queue, but there is no global queue that forces one user to wait for another user's request to complete. So there can be a lot of concurrency.

There are several solutions to this problem, of course. I could place limits on concurrency by implementing a kind of global queue mentioned earlier. I could round-robin multiple sets of API credentials. And, we could manage long throttle times better, which is the subject of this thread. Perhaps some combination of solutions will ultimately be needed. But this would be a useful part.

Watchful1 commented 9 months ago

Are you authenticating or making anonymous requests? Are you creating new PRAW instances for each request? Even if you have multiple PRAW instances, when the rate limit is shared, each instance will independently read the rate limit headers and throttle its own requests, waiting several seconds each time so it doesn't get ahead of itself, specifically to prevent a multi-minute wait.

It's only if you make requests without any context on how many requests have already been used, like by creating a new PRAW instance for each request, that you would not get that throttling and could use up all your requests.

I'm curious about more details of your implementation. I specifically designed the rate limiting in PRAW to account for this type of case so you don't get the multi-minute waits you're seeing.

pongstylin-rht commented 9 months ago

Are you authenticating or making anonymous requests? All requests are authenticated via client_id and client_secret.

Are you creating new PRAW instances for each request? No. A global praw.Reddit instance is created per thread and reused for multiple requests. Given that there are multiple queues and multiple threads per queue, I would estimate that 100 threads running at the same time would be a normal, if somewhat busy, situation. And by thread, I mean entirely separate running instances of the Python interpreter.

I'm curious about more details of your implementation. I'm not sure how curious you might be, but it is a solution built on top of Kubernetes. A user can submit a request for data on a given topic. This triggers the creation of a data collection job in Kubernetes with a parallelism of 10. That job is responsible for pulling data from multiple sources, of which Reddit is just one.

The first thing the job does is break down the work into tasks; executing a task might create more tasks. Each of the 10 workers / threads (i.e. Docker containers) runs in a loop, picking up a task from the job's queue and executing it before picking up another. So, for example, a task might search Reddit for a list of submissions and then create a task for each of those submissions to collect comments. All 10 workers share the load of executing those tasks. To get even more detailed, the submissions search task stores the submission data, and the comments list task retrieves that data and constructs a new Submission object from it for the purposes of calling the comments.list method. Once all tasks are done, the job is complete, all workers are shut down, and any computing resources are released back to the cloud.

Generally, we're only using subreddit.search and submission.comments.list.

So, with that context, what am I experiencing? Well, a job might have 1000 tasks in the queue, but all 10 workers might be unlucky enough to pick up 10 Reddit tasks that are throttled for 8 minutes. Now all 10 workers are sleeping when they could be processing other tasks that interrogate other providers (e.g. the Twitter API). This makes the job take longer, and the user has to wait longer for results. If the API call threw an exception with a datetime set 8 minutes in the future, then I could send the task back to the queue with its "ready_at" field set 8 minutes in the future to ensure that a worker won't try to pick it up before then. We already use this solution for the Twitter API, since it also returns headers indicating when we may send another request. So it was low-hanging fruit to adopt a similar solution for Reddit before considering more radical solutions.

Watchful1 commented 9 months ago

All requests are authenticated via client_id and client_secret.

I suspect this is the root of your issue. Authenticating fully requires a username and password in addition to the client id and secret; otherwise the requests are still anonymous. Anonymous requests have a rate limit of 100 requests per 600 seconds, so your simultaneous requests are much more likely to exhaust the limit, especially if you have, potentially, dozens of separate PRAW instances. If you include the password when authenticating, you'll get the full 1000.
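
In PRAW terms, that means supplying the username and password along with the app credentials (placeholder values):

import praw

reddit = praw.Reddit(
    client_id="...",
    client_secret="...",
    username="...",
    password="...",
    user_agent="my-app/1.0 (by u/example)",
)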

Also, are you using PRAW or AsyncPRAW? With AsyncPRAW you can do an async await for your calls, but I'm fairly sure you still can't share one instance across threads.

Are you only calling submission.comments.list or are you also calling replace_more to fetch all comments in the thread? A single .list call should be fairly fast.

pongstylin-rht commented 9 months ago

Well, that was eye-opening. I'll look into the missing password issue and see if the wait times are still high. I expect they would be if the rate of concurrency is high enough, but any relief in rate limits is a very good thing.

We are not using async. Given my context, I'm not sure how to leverage it to suit my needs unless it exposes the wait time. I do understand why you bring it up, though. Conceivably, a thread could start a 2nd task while the 1st task is pending. But in my context, a thread may only work on a single task at a time. Even if that were to change, I would need some means of knowing in a non-blocking way that an async operation has completed (async without await). I am curious whether that is possible, even if it seems infeasible to make use of it. Not having used async in Python before, it seems to be possible: https://stackoverflow.com/questions/55709417/python-a-clean-way-to-check-if-a-task-has-finished
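
For what it's worth, the non-blocking check from that answer boils down to asyncio.Task.done(); a minimal sketch (fetch_task is a stand-in coroutine, not AsyncPRAW code):

import asyncio

async def fetch_task():
    await asyncio.sleep(1)  # stand-in for an AsyncPRAW call
    return "comments"

async def main():
    task = asyncio.create_task(fetch_task())
    # ... do other work here ...
    if task.done():            # non-blocking completion check
        print(task.result())
    else:
        print(await task)      # eventually await to collect the result

asyncio.run(main())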

No, we are not using "replace_more" at this time (intentionally).

Watchful1 commented 9 months ago

If you're still having problems after adding the password, you can try installing this dev version of prawcore by doing

pip uninstall prawcore

and then

pip install -e git+https://github.com/praw-dev/prawcore#egg=prawcore

and then setting up PRAW logging. That will print out the rate limit headers with each request and help you figure out exactly why the requests are being used up without sleeping.
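
The usual PRAW logging setup for that looks like:

import logging

handler = logging.StreamHandler()
handler.setLevel(logging.DEBUG)
for logger_name in ("praw", "prawcore"):
    logger = logging.getLogger(logger_name)
    logger.setLevel(logging.DEBUG)
    logger.addHandler(handler)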

pongstylin-rht commented 9 months ago

Well, I'm having trouble importing the "praw" and "prawcore" packages after installing prawcore using the lines you gave. But I already have logging set up and do see DEBUG lines reporting how long PRAW is sleeping before making an API call.

So, I ran a sample data collection job locally in my development environment with the password in place. I have not yet deployed this change to the production job cluster, so there should not be any outside use of these credentials. The job broke down the work into 434 tasks, which is roughly equivalent to the number of API calls we make. The job took about 6 minutes to complete using 10 threads. So that is fewer than 1000 API calls in a 15 minute window, which is well within the rate limit you mentioned. And yet, I see sleep times that top out at about 8 seconds. Why does it sleep?

Watchful1 commented 9 months ago

It would depend on how you're installing prawcore in your environment. The dev version has some slightly updated ratelimit code, which might make a difference in your specific case. But it also has an updated log message that prints out the rate limit headers, so you can calculate exactly why it's sleeping.

It slept because it doesn't know how many requests you're planning to use. It tries to average your 1000 requests over the entire window, but you, presumably, do a bunch of requests and then stop before using them all. If you had more tasks and the whole thing went longer than 10 minutes, you would top out at sleeps of about 8 seconds but never go much over that, and you wouldn't see the multi-minute sleeps you were seeing before. (The 600 second window is 10 minutes, not 15.)

I have a spreadsheet here that I used to compare the old rate limit implementation to the new one in the dev version. You can make a copy of it and edit the numbers in the boxes at the top to see how the graphs on the right change.
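
As a rough illustration of the averaging idea (simplified; not prawcore's exact code), the delay comes from spreading the remaining request budget over the rest of the window:

def throttle_delay(remaining: int, seconds_to_reset: int) -> float:
    # Inputs come from the x-ratelimit-remaining and x-ratelimit-reset headers.
    if remaining <= 0:
        return float(seconds_to_reset)   # budget exhausted: wait out the window
    return seconds_to_reset / remaining  # spread the remaining budget evenly

# e.g. 600 remaining with 360 s left -> 0.6 s; 40 remaining with 320 s left -> 8 s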

bboe commented 9 months ago

@pongstylin-rht I'd recommend having a separate worker queue with a single thread dedicated to only handling Reddit requests. There's little reason to have more than one worker issuing Reddit requests (especially with the recent API changes). Then you won't starve other types of work.

Realistically, I don't think we'll expose a way to estimate how long a request will sleep, because it is a very specific, uncommon use case that can be worked around.

pongstylin-rht commented 9 months ago

A separate queue for each API is certainly something to consider. It would make it easier to implement proactive rate limiting, where we can configure how often we make requests to each API. But in that context, I would still want to disable the sleep. No matter the queueing strategy, sleeping is a waste of resources. And if we can disable that sleep feature, then we can control throttling outside of PRAW. So, if you are unwilling to expose how long you would have slept, can we have a feature to disable sleeping entirely?

bboe commented 9 months ago

Given this more niche use case, I would recommend manually replacing the RateLimiter class on the session objects. There isn't currently an easy way to accomplish that other than the following:

reddit = praw.Reddit(...)
# Define NoRateLimitLimiter to adhere to this spec:
# https://github.com/praw-dev/prawcore/blob/main/prawcore/rate_limit.py#L47
rate_limiter = NoRateLimitLimiter()
reddit._authorized_core._rate_limiter = rate_limiter
reddit._read_only_core._rate_limiter = rate_limiter
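
A minimal sketch of such a class, assuming the call signature from the linked spec (verify against your installed prawcore version):

class NoRateLimitLimiter:
    # Duck-typed stand-in for prawcore's RateLimiter that never sleeps;
    # throttling must then be handled entirely outside PRAW.
    def call(self, request_function, set_header_callback, *args, **kwargs):
        kwargs["headers"] = set_header_callback()
        return request_function(*args, **kwargs)
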
github-actions[bot] commented 8 months ago

This issue is stale because it has been open for 30 days with no activity.

Remove the Stale label or comment or this will be closed in 30 days.

github-actions[bot] commented 7 months ago

This issue was closed because it has been stale for 30 days with no activity.