A small (multithreaded, asynchronous) API wrapper for PullPush.io, the third-party replacement API for Reddit.
After the 2023 Reddit API controversy,
PushShift.io (along with wrappers such as PSAW and PMAW) is now only available to Reddit admins, and Reddit's PRAW is
of little use when you need large amounts of data or data from a specific timeframe.
PullPush.io solves this issue, and this is a wrapper for that API. For more information on the
API (TOS, forum, docs, etc.), go to PullPush.io.
BAScraper (Blue Archive Scraper) was initially made and used for the 2023 recap/wrap-up of the Blue Archive subreddit, hence the name. It's pretty basic, but more features are planned over time. It uses multithreading to make requests to the PullPush.io endpoint and returns the result as a JSON (dict) object.
currently it can:
Also, please ask the PullPush.io owner before making a large number of requests, and respect cool-down times. Heavy use stresses the server and can cause inconvenience for everyone.
[!NOTE] As of Feb. 2024, the PullPush API enforces rate limiting!
The soft limit kicks in after 15 req/min and the hard limit after 30 req/min. There's also a long-term (hard) limit of 1000 req/hr.
Recommended request pacing:
- to prevent soft-limit: 4 sec sleep per request
- to prevent hard-limit: 2 sec sleep per request
- for 1000+ requests: 3.6 ~ 4 sec sleep per request
The API's rate limiting automatically paces response times to keep your requests within the hard limits above. Even so, pace_mode still applies cooldowns just in case, and following the recommended pacing above is advised.
[!WARNING] The long-term hard rate limit of 1000 req/hr is not yet handled by the automatic rate-limit mitigation cooldowns. Until it is implemented, set the sleep time manually using the sleepsec parameter of PullPushAsync.__init__, following the guidelines above.
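The recommended sleep values follow directly from the limits quoted above: a limit of N requests per window allows at most one request every window/N seconds. A minimal sketch of that arithmetic, assuming requests are sent one at a time (the helper name `min_sleep_seconds` is hypothetical and not part of BAScraper):

```python
def min_sleep_seconds(max_requests: int, window_seconds: float) -> float:
    """Smallest fixed delay between sequential requests that stays within
    `max_requests` per `window_seconds`."""
    if max_requests <= 0:
        raise ValueError("max_requests must be positive")
    return window_seconds / max_requests


# Limits quoted in the note above.
soft_limit_sleep = min_sleep_seconds(15, 60)      # soft limit: 15 req/min
hard_limit_sleep = min_sleep_seconds(30, 60)      # hard limit: 30 req/min
hourly_limit_sleep = min_sleep_seconds(1000, 3600)  # long-term limit: 1000 req/hr

print(soft_limit_sleep, hard_limit_sleep, hourly_limit_sleep)
```

If several tasks send requests concurrently, the effective request rate is higher than this single-sender estimate, so a larger sleep value may be needed.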
You can install the package via pip:
pip install BAScraper
Python 3.11+ is required (asyncio.TaskGroup
is used).
Example usage
from datetime import datetime, timedelta
from BAScraper.BAScraper_async import PullPushAsync
import asyncio
# `log_stream_level` can be one of DEBUG, INFO, WARNING, ERROR
ppa = PullPushAsync(log_stream_level="INFO")
# basic fetching
result1 = asyncio.run(ppa.get_submissions(
    subreddit='bluearchive',
    after=datetime.timestamp(datetime(2024, 7, 1)),
    before=datetime.timestamp(datetime(2024, 7, 8)),
    file_name='result1',
))
# basic fetching with comments
result2 = asyncio.run(ppa.get_submissions(
    subreddit='bluearchive',
    after=datetime.timestamp(datetime.now() - timedelta(hours=6)),
    file_name='result2',
    get_comments=True,
))
# basic comment fetching
result3 = asyncio.run(ppa.get_comments(
    subreddit='bluearchive',
    after=datetime.timestamp(datetime.now() - timedelta(hours=6)),
    file_name='result3',
))
# all results are saved to 'resultX.json' since the `file_name` field was specified.
# it'll save all the results in the current directory since `save_dir` wasn't specified
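The example above builds its `after`/`before` values from naive `datetime` objects, which Python's `datetime.timestamp` interprets in the machine's local time zone. If you want the window to start and end on UTC day boundaries instead, one option is to attach `timezone.utc` explicitly. The sketch below only computes the timestamps; the helper name `utc_timestamp` is hypothetical, and whether UTC boundaries are what you want is an assumption about your use case.

```python
from datetime import datetime, timezone


def utc_timestamp(year: int, month: int, day: int, hour: int = 0) -> float:
    """Epoch seconds for the given date and hour, interpreted as UTC
    rather than local time."""
    return datetime(year, month, day, hour, tzinfo=timezone.utc).timestamp()


# Same one-week window as in the example, pinned to UTC midnight.
after_ts = utc_timestamp(2024, 7, 1)
before_ts = utc_timestamp(2024, 7, 8)

print(after_ts, before_ts)
```

These values could then be passed as `after=after_ts, before=before_ts` in the same way the example passes its timestamps.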
[!NOTE] When making multiple requests (that is, calling multiple functions of
PullPushAsync
), it is highly recommended to call them all on the same class instance, because all request-pool-related variables are shared in that case. Also, the pools that record request status are reset every time a script is re-run, so unexpected soft/hard rate limits may occur when scripts are (re-)run frequently. Consider waiting a few seconds or minutes before re-running a script if needed.
[!WARNING] One possible problem when using filters is premature termination of request chains (none reported yet). It would stem from the logic that determines when to stop requesting behaving in unexpected ways. If requests end earlier than expected, or only certain date segments are returned, consider removing filters or search restrictions and filtering the results yourself after fetching everything.
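The workaround described above (fetch everything, then filter locally) could look roughly like the following sketch. It works on plain dicts shaped like the library's return value; the helper names `filter_entries` and `keep_fields` and the sample entries are hypothetical and not part of BAScraper.

```python
from typing import Any, Callable

Entry = dict[str, Any]


def filter_entries(results: dict[str, Entry],
                   predicate: Callable[[Entry], bool]) -> dict[str, Entry]:
    """Keep only the entries for which predicate(entry) is true, preserving order."""
    return {entry_id: entry for entry_id, entry in results.items() if predicate(entry)}


def keep_fields(results: dict[str, Entry], fields: list[str]) -> dict[str, Entry]:
    """Trim every entry down to the listed fields (missing fields are skipped)."""
    return {
        entry_id: {name: entry[name] for name in fields if name in entry}
        for entry_id, entry in results.items()
    }


# Made-up stand-in for an unfiltered fetch result.
fetched = {
    "21jh54": {"title": "something something", "score": 12, "over_18": False},
    "54jp5i": {"title": "something else", "score": 3, "over_18": False},
}

popular = filter_entries(fetched, lambda entry: entry.get("score", 0) >= 10)
trimmed = keep_fields(popular, ["title", "score"])
print(trimmed)
```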
PullPushAsync.__init__
all parameters are optional
parameter | type | description | default value |
---|---|---|---|
sleepsec | int | cooldown time (in seconds) between each request | 1 |
backoffsec | int | backoff time (in seconds) for each failed request | 3 |
max_retries | int | number of retries for failed requests before it gives up | 5 |
timeout | int | time until a request is considered a timeout error | 10 |
pace_mode | str | one of 'auto-soft', 'auto-hard', 'manual'. Sets the pace to mitigate rate-limiting. ('auto-soft' and 'auto-hard' currently behave identically; 'auto-hard' is still recommended) | 'auto-hard' |
save_dir | str | directory to save the results, defaults to current directory | os.getcwd() (current directory) |
task_num | int | number of async tasks to be made | 3 |
log_stream_level | str | sets the log level for logs streamed on the terminal | 'INFO' |
log_level | str | sets the log level for logging (file) | 'DEBUG' |
duplicate_action | str | one of 'keep_newest', 'keep_oldest', 'remove', 'keep_original', 'keep_removed'. Decides what to do with duplicate entries (usually caused by deletion) | 'keep_newest' |
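To give a feel for how parameters such as sleepsec, backoffsec, max_retries and timeout commonly interact, here is a generic retry loop. It is only an illustration of the general pattern under assumed semantics, not BAScraper's actual implementation; `fetch_with_retries` and `flaky_request` are hypothetical names, and the delays are set to 0 so the demo finishes immediately.

```python
import asyncio


async def fetch_with_retries(request, *, sleepsec=1, backoffsec=3,
                             max_retries=5, timeout=10):
    """Await request(), retrying failed or timed-out attempts.

    Assumed semantics: up to `max_retries` retries after the first attempt,
    `backoffsec` of waiting after each failure, `sleepsec` of cooldown
    after a success.
    """
    for attempt in range(max_retries + 1):
        try:
            result = await asyncio.wait_for(request(), timeout=timeout)
        except (asyncio.TimeoutError, ConnectionError):
            if attempt == max_retries:
                raise  # out of retries, give up
            await asyncio.sleep(backoffsec)
        else:
            await asyncio.sleep(sleepsec)
            return result


calls = {"n": 0}


async def flaky_request():
    """Simulated request that fails twice before succeeding."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated failure")
    return {"ok": True}


result = asyncio.run(fetch_with_retries(flaky_request, sleepsec=0, backoffsec=0))
print(result, calls["n"])
```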
PullPushAsync.get_submissions
& PullPushAsync.get_comments
All parameters are optional. Pass all parameters as keyword arguments (kwargs), since the parameters have no set order.
These functions return a dict
object.
parameter | type | description | default value | get_submissions | get_comments |
---|---|---|---|---|---|
file_name | str | file name to use for the saved JSON result. If None, the file isn't saved. | None | ✅ | ✅ |
get_comments | bool | If True, each result will contain a comments field holding all the comments for that post (List[dict]) | False | ✅ | |
after | datetime.datetime | Return results after this date (inclusive >=) | | ✅ | ✅ |
before | datetime.datetime | Return results before this date (exclusive <) | | ✅ | ✅ |
filters | List[str] | filters the result to only include the fields you want | | ✅ | ✅ |
sort | str | Sort results in a specific order. Accepts: 'desc', 'asc' | desc | ✅ | ✅ |
sort_type | str | Sort by a specific attribute. If after and before are used, defaults to 'created_utc'. Accepts: 'created_utc', 'score', 'num_comments' | created_utc | ✅ | ✅ |
limit | int | Number of results to return per request. Maximum value of 100; recommended to keep at default | 100 | ✅ | ✅ |
ids | List[str] | Get specific submissions via their ids | | ✅ | ✅ |
link_id | str | Return results from a particular submission | | | ✅ |
q | str | Search term. Will search ALL possible fields | | ✅ | ✅ |
title | str | Searches the title field only | | ✅ | |
selftext | str | Searches the selftext field only | | ✅ | |
author | str | Restrict to a specific author | | ✅ | ✅ |
subreddit | str | Restrict to a specific subreddit | | ✅ | ✅ |
score | int | Restrict results based on score | | ✅ | |
num_comments | int | Restrict results based on number of comments | | ✅ | |
over_18 | bool | Restrict to nsfw or sfw content | | ✅ | |
is_video | bool | Restrict to video content (as of writing, this parameter is broken and returns error 500) | | ✅ | |
locked | bool | Return locked or unlocked threads only | | ✅ | |
stickied | bool | Return stickied or un-stickied content only | | ✅ | |
spoiler | bool | Exclude or include spoilers only | | ✅ | |
contest_mode | bool | Exclude or include contest mode submissions | | ✅ | |
PullPushAsync.get_submissions
and PullPushAsync.get_comments
each return a dict
object keyed by the unique Reddit submission/comment ID.
Entries are ordered according to the sort order you specified when scraping
(the sort
parameter).
The general structure looks like this (regardless of whether the entries are submissions or comments):
{
    "21jh54": {
        "approved_at_utc": null,
        "subreddit": "Cars",
        "selftext": "",
        "author_fullname": "t2_culcgvve",
        "saved": false,
        "mod_reason_title": null,
        "gilded": 0,
        "clicked": false,
        "title": "something something",
        ...
    },
    "54jp5i": {
        "approved_at_utc": null,
        "subreddit": "Cars",
        "selftext": "",
        "author_fullname": "t2_kdbbiwo",
        "saved": false,
        "mod_reason_title": null,
        "gilded": 0,
        "clicked": false,
        "title": "something something",
        ...
    },
    ...
}
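Because the result is an ordinary dict keyed by ID, it can be handled with standard Python tools. A small sketch using a trimmed-down copy of the sample above (only a few of the fields are kept; nothing here calls BAScraper itself):

```python
from collections import Counter

# Trimmed-down stand-in for a returned result.
results = {
    "21jh54": {"subreddit": "Cars", "author_fullname": "t2_culcgvve",
               "title": "something something"},
    "54jp5i": {"subreddit": "Cars", "author_fullname": "t2_kdbbiwo",
               "title": "something something"},
}

# Direct lookup by submission/comment ID.
entry = results["21jh54"]

# dicts preserve insertion order, so the IDs come out in the stored order.
ordered_ids = list(results)

# Simple aggregation over all entries.
per_subreddit = Counter(item["subreddit"] for item in results.values())

print(entry["author_fullname"], ordered_ids, per_subreddit)
```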
If the get_comments
parameter is set to True,
the returned result looks like this (for submissions):
{
    "21jh54": {
        "approved_at_utc": null,
        "subreddit": "Cars",
        "selftext": "",
        "author_fullname": "t2_culcgvve",
        "saved": false,
        "mod_reason_title": null,
        "gilded": 0,
        "clicked": false,
        "title": "something something",
        "comments": [
            {
                ...
                info related to comments
                ...
            },
            ...
        ],
        ...
    },
    ...
}
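With this nested shape, a common follow-up step is to walk every comment together with the ID of the submission it belongs to. A minimal sketch, assuming each submission may carry a `comments` list of dicts as shown above; the helper name `iter_comments` and the sample comment fields (`id`, `body`) are illustrative choices, not something BAScraper defines:

```python
from typing import Any, Iterator


def iter_comments(submissions: dict[str, dict[str, Any]]) -> Iterator[tuple[str, dict[str, Any]]]:
    """Yield (submission_id, comment) pairs; submissions without a
    'comments' field are skipped."""
    for submission_id, submission in submissions.items():
        for comment in submission.get("comments", []):
            yield submission_id, comment


# Made-up sample shaped like a get_comments=True result.
submissions = {
    "21jh54": {
        "title": "something something",
        "comments": [
            {"id": "c1", "body": "first comment"},
            {"id": "c2", "body": "second comment"},
        ],
    },
    "54jp5i": {"title": "something something"},  # no comments field
}

pairs = list(iter_comments(submissions))
comment_counts = {sid: len(sub.get("comments", [])) for sid, sub in submissions.items()}
print(pairs, comment_counts)
```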