Closed XavierRigoulet closed 2 years ago
Thanks for opening an issue @XavierRigoulet , can you include the minimum amount of code required to recreate this bug?
Thank you for getting back to me. Sorry, here is the code:
from pmaw import PushshiftAPI
api = PushshiftAPI()
submissions = api.search_submissions(subreddit=['wallstreetbets'], limit=None, num_workers=10, mem_safe=True, safe_exit=True)
submission_list = [submission for submission in submissions]
The error only seems to happen when calling the search_submissions() method; the search_comments() method seems fine...
I am also now getting this error, randomly, out of nowhere. Except, unlike @XavierRigoulet, it occurs when searching comments as well.
Here's my code:
import time
import sqlite3
import pandas as pd
from pmaw import PushshiftAPI
# Setup logging:
import sys
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(sys.stdout))
# Main scraping function:
def main():
    # Timer:
    start = time.time()
    before = int(time.time())
    DATA = "SQL_DATABASE.db"
    tables = ["comments", "posts"]
    subreddit = "SUBREDDIT"
    # Initialize API:
    api = PushshiftAPI()
    for table in tables:
        if table == "comments":
            # Desired data fields for comments:
            fields = ['id', 'permalink', 'body', 'author', 'distinguished',
                      'author_flair_text', 'created_utc', 'subreddit', 'subreddit_id',
                      'link_id', 'parent_id', 'score', 'retrieved_on', 'stickied']
            threads = 4
            # Get start date:
            conn = sqlite3.connect(DATA)
            after = pd.read_sql("SELECT max(created_utc) AS max_date FROM comments", conn)
            after = int(after["max_date"][0])
            results = api.search_comments(subreddit=subreddit,
                                          fields=fields,
                                          before=before,
                                          after=after,
                                          safe_exit=True,
                                          workers=threads)
            print('Getting all JSON')
            comments = [c for c in results]
            print('Creating dataframe')
            df = pd.DataFrame(comments)
            print(f"Saving {table} data")
            conn = sqlite3.connect(DATA)
            df.to_sql(table, conn, if_exists="append")
            conn.close()
            print(f'Finished {table}')
        elif table == "posts":
            # Desired data fields for posts:
            fields = ["id", "author", "title", "selftext", "score", "upvote_ratio", "total_awards_received",
                      "stickied", "pinned", "num_comments", "num_crossposts",
                      "subreddit", "subreddit_id", "author_flair_text", "author_fullname", "author_premium",
                      "created_utc", "retrieved_on", "domain", "permalink", "full_link",
                      "url", "is_meta", "is_original_content", "is_reddit_media_domain", "is_self",
                      "locked", "media_only", "over_18", "removed_by_category"]
            threads = 1
            # Get start date:
            conn = sqlite3.connect(DATA)
            after = int(pd.read_sql("SELECT MAX(created_utc) AS after_date FROM posts", conn)["after_date"][0])
            conn.close()
            results = api.search_submissions(subreddit=subreddit,
                                             fields=fields,
                                             before=before,
                                             after=after,
                                             safe_exit=True,
                                             workers=threads)
            print('Getting all JSON')
            posts = [c for c in results]
            print('Creating dataframe')
            df = pd.DataFrame(posts)
            print(f"Saving {table} data")
            conn = sqlite3.connect(DATA)
            df.to_sql("posts", conn, if_exists="append")
            conn.close()
            print(f'Finished {table}')
    # End timer:
    end = time.time()
    print(f'Finished program in {round((end-start)/60, 3)} minutes')

if __name__ == '__main__':
    main()
Same error here. I do one api.search_comments call for every day (shifting before and after by 24 hours between calls) and it works fine for some days (before + after time ranges) but always fails for one particular day (time range). So it's probably not random, rather systematic.
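The per-day windowing described above can be sketched as a small helper (day_windows and DAY are illustrative names, not pmaw API; the actual pmaw call is shown commented):

```python
import time

DAY = 24 * 60 * 60  # seconds in one day

def day_windows(end, days):
    """Yield (after, before) pairs covering `days` consecutive
    24-hour windows ending at `end`, newest first."""
    for i in range(days):
        before = end - i * DAY
        yield before - DAY, before

# Usage against pmaw would then look roughly like:
# api = PushshiftAPI()
# for after, before in day_windows(int(time.time()), 7):
#     comments = list(api.search_comments(subreddit='some_subreddit',
#                                         after=after, before=before))
```

With a loop like this, a failure that always hits the same (after, before) pair points at the data in that window rather than at flaky networking.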
Thanks for the updates! I'll be pushing out a fix for this later this week
Adding DEBUG log:
2021-11-29 11:59:15,379 - pmaw.PushshiftAPIBase - INFO - 47 result(s) available in Pushshift
2021-11-29 11:59:16,661 - pmaw.PushshiftAPIBase - DEBUG - Remaining limit 42
2021-11-29 11:59:16,661 - pmaw.PushshiftAPIBase - DEBUG - Time slice from 1637992800 - 1637993160 returned 5 results
2021-11-29 11:59:16,661 - pmaw.PushshiftAPIBase - DEBUG - 5 total results for this time slice
2021-11-29 11:59:17,513 - pmaw.PushshiftAPIBase - DEBUG - Remaining limit 38
2021-11-29 11:59:17,514 - pmaw.PushshiftAPIBase - DEBUG - Time slice from 1637993160 - 1637993520 returned 4 results
2021-11-29 11:59:17,514 - pmaw.PushshiftAPIBase - DEBUG - 4 total results for this time slice
2021-11-29 11:59:18,341 - pmaw.PushshiftAPIBase - DEBUG - Remaining limit 33
2021-11-29 11:59:18,341 - pmaw.PushshiftAPIBase - DEBUG - Time slice from 1637993520 - 1637993880 returned 5 results
2021-11-29 11:59:18,341 - pmaw.PushshiftAPIBase - DEBUG - 5 total results for this time slice
2021-11-29 11:59:19,315 - pmaw.PushshiftAPIBase - DEBUG - Remaining limit 29
2021-11-29 11:59:19,315 - pmaw.PushshiftAPIBase - DEBUG - Time slice from 1637993880 - 1637994240 returned 4 results
2021-11-29 11:59:19,315 - pmaw.PushshiftAPIBase - DEBUG - 4 total results for this time slice
2021-11-29 11:59:21,416 - pmaw.PushshiftAPIBase - DEBUG - Remaining limit 22
2021-11-29 11:59:21,416 - pmaw.PushshiftAPIBase - DEBUG - Time slice from 1637994600 - 1637994960 returned 7 results
2021-11-29 11:59:21,416 - pmaw.PushshiftAPIBase - DEBUG - 7 total results for this time slice
2021-11-29 11:59:21,801 - pmaw.PushshiftAPIBase - DEBUG - Remaining limit 17
2021-11-29 11:59:21,801 - pmaw.PushshiftAPIBase - DEBUG - Time slice from 1637994240 - 1637994600 returned 5 results
2021-11-29 11:59:21,802 - pmaw.PushshiftAPIBase - DEBUG - 5 total results for this time slice
2021-11-29 11:59:22,377 - pmaw.PushshiftAPIBase - DEBUG - Remaining limit 10
2021-11-29 11:59:22,377 - pmaw.PushshiftAPIBase - DEBUG - Time slice from 1637994960 - 1637995320 returned 7 results
2021-11-29 11:59:22,377 - pmaw.PushshiftAPIBase - DEBUG - 7 total results for this time slice
2021-11-29 11:59:23,440 - pmaw.PushshiftAPIBase - DEBUG - Remaining limit 6
2021-11-29 11:59:23,440 - pmaw.PushshiftAPIBase - DEBUG - Time slice from 1637995320 - 1637995680 returned 4 results
2021-11-29 11:59:23,440 - pmaw.PushshiftAPIBase - DEBUG - 4 total results for this time slice
2021-11-29 11:59:25,337 - pmaw.PushshiftAPIBase - DEBUG - Remaining limit 3
2021-11-29 11:59:25,337 - pmaw.PushshiftAPIBase - DEBUG - Time slice from 1637996040 - 1637996400 returned 3 results
2021-11-29 11:59:25,337 - pmaw.PushshiftAPIBase - DEBUG - 3 total results for this time slice
2021-11-29 11:59:27,597 - pmaw.PushshiftAPIBase - DEBUG - Remaining limit 2
2021-11-29 11:59:27,597 - pmaw.PushshiftAPIBase - DEBUG - Time slice from 1637995680 - 1637996040 returned 1 results
2021-11-29 11:59:27,597 - pmaw.PushshiftAPIBase - DEBUG - 2 total results for this time slice
2021-11-29 11:59:27,810 - pmaw.PushshiftAPIBase - DEBUG - Remaining limit 2
2021-11-29 11:59:27,810 - pmaw.PushshiftAPIBase - DEBUG - Time slice from 1637995680 - 1637995779 returned 0 results
2021-11-29 11:59:27,811 - pmaw.PushshiftAPIBase - DEBUG - 1 total results for this time slice
Then the error is raised because remaining is > 0 but len(data) is 0, so there is no record to get created_utc from.
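The failure mode can be reproduced in isolation: indexing the last element of an empty batch is exactly the "list index out of range" the original report shows.

```python
data = []      # empty batch returned by Pushshift for the time slice
remaining = 1  # Pushshift still claims one result is left

try:
    before = data[-1]['created_utc']  # raises IndexError on an empty list
except IndexError as exc:
    print(exc)  # prints: list index out of range
```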
My dirty "hotfix" (it could probably be done better) in PushshiftAPIBase._futures_handler:
if num > 0:
    # find minimum `created_utc` to set as the `before` parameter in next timeslices
    if remaining == 1 and len(data) == 0:
        log.warning('Remaining 1 records to fetch but 0 data returned now, ignoring')
    else:
        before = data[-1]['created_utc']
        # generate payloads
        self.req.gen_slices(
            url, payload, after, before, num)
Thanks @pavkriz, I implemented a modified version of your hotfix!
INFO:pmaw.PushshiftAPIBase:51028 result(s) available in Pushshift
INFO:pmaw.PushshiftAPIBase:Checkpoint:: Success Rate: 100.00% - Requests: 100 - Batches: 10 - Items Remaining: 41030
INFO:pmaw.PushshiftAPIBase:Checkpoint:: Success Rate: 100.00% - Requests: 200 - Batches: 20 - Items Remaining: 31039
INFO:pmaw.PushshiftAPIBase:Checkpoint:: Success Rate: 100.00% - Requests: 300 - Batches: 30 - Items Remaining: 21094
INFO:pmaw.PushshiftAPIBase:Checkpoint:: Success Rate: 100.00% - Requests: 400 - Batches: 40 - Items Remaining: 13231
INFO:pmaw.PushshiftAPIBase:Checkpoint:: Success Rate: 100.00% - Requests: 500 - Batches: 50 - Items Remaining: 4676
INFO:pmaw.PushshiftAPIBase:Checkpoint:: Success Rate: 100.00% - Requests: 600 - Batches: 60 - Items Remaining: 276
INFO:pmaw.PushshiftAPIBase:Checkpoint:: Success Rate: 100.00% - Requests: 615 - Batches: 70 - Items Remaining: 266
INFO:pmaw.PushshiftAPIBase:Checkpoint:: Success Rate: 100.00% - Requests: 625 - Batches: 80 - Items Remaining: 266
INFO:pmaw.PushshiftAPIBase:Checkpoint:: Success Rate: 100.00% - Requests: 635 - Batches: 90 - Items Remaining: 266
INFO:pmaw.PushshiftAPIBase:Checkpoint:: Success Rate: 100.00% - Requests: 645 - Batches: 100 - Items Remaining: 266
INFO:pmaw.PushshiftAPIBase:Checkpoint:: Success Rate: 100.00% - Requests: 655 - Batches: 110 - Items Remaining: 266
...
This hotfix works for me:

if num > 0:
    # find minimum `created_utc` to set as the `before` parameter in next timeslices
    if len(data) > 0:
        # make sure that index error wont occur
        # we want to slice using the payload['before'] if we dont get any results
        before = data[-1]['created_utc']
        # generate payloads
        self.req.gen_slices(
            url, payload, after, before, num)
    else:
        log.warning('Remaining some records to fetch but 0 data returned now, ignoring')
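The guarded branch can be exercised in isolation (a sketch with a stand-in name; log, gen_slices, and the surrounding state belong to pmaw's internals):

```python
import logging

log = logging.getLogger(__name__)

def next_before(data):
    """Stand-in for the guarded branch above: return the created_utc of the
    last record when data came back, or None to signal that re-slicing
    should be skipped (mirroring the hotfix, which only calls gen_slices
    when data is non-empty)."""
    if len(data) > 0:
        return data[-1]['created_utc']
    log.warning('Remaining some records to fetch but 0 data returned now, ignoring')
    return None
```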
@pavkriz just added to v2.1.2 :D
Hello,
After running the script for some time, I get a "list index out of range" error. Here is a screenshot of the error:
Do you know the cause of the error?
Thank you!