mattpodolak / pmaw

A multithread Pushshift.io API Wrapper for reddit.com comment and submission searches.
MIT License

list index out of range #28

Closed. XavierRigoulet closed this issue 2 years ago.

XavierRigoulet commented 2 years ago

Hello,

After running the script for some time, I get "list index out of range". [Screenshot of the error: Screenshot 2021-10-07 150928]

Do you know the cause of the error?

Thank you!

mattpodolak commented 2 years ago

Thanks for opening an issue @XavierRigoulet, can you include the minimum amount of code required to recreate this bug?

XavierRigoulet commented 2 years ago

Thank you for getting back to me. Sorry, here is the code:

from pmaw import PushshiftAPI

api = PushshiftAPI()

submissions = api.search_submissions(subreddit=['wallstreetbets'], limit=None, num_workers=10, mem_safe=True, safe_exit=True)

submission_list = [submission for submission in submissions]

The error only seems to happen when calling the search_submissions() method; calling the search_comments() method seems fine...
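For reference, the full traceback (rather than a screenshot) can be captured while running the same search; a minimal sketch, adding only the standard-library traceback module on top of the code above:

import traceback
from pmaw import PushshiftAPI

api = PushshiftAPI()

try:
    # same call as above; with mem_safe=True the requests may run lazily,
    # so the error can surface here or while iterating
    submissions = api.search_submissions(subreddit=['wallstreetbets'], limit=None,
                                         num_workers=10, mem_safe=True, safe_exit=True)
    submission_list = [submission for submission in submissions]
except IndexError:
    traceback.print_exc()  # prints the exact pmaw line raising "list index out of range"
    raise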

sean-doody commented 2 years ago

I am also now getting this error, seemingly at random. Unlike @XavierRigoulet, though, it occurs when searching comments as well.

Here's my code:

import time
import sqlite3
import pandas as pd
from pmaw import PushshiftAPI

# Setup logging:
import sys
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(sys.stdout))

# Main scraping function:
def main():
    # Timer:
    start = time.time()

    before = int(time.time())

    DATA = "SQL_DATABASE.db"
    tables = ["comments", "posts"]
    subreddit = "SUBREDDIT"

    #Initialize API:
    api = PushshiftAPI()

    for table in tables:
        if table == "comments":
            # Desired data fields for comments:
            fields = ['id', 'permalink', 'body', 'author', 'distinguished', 
                    'author_flair_text', 'created_utc', 'subreddit', 'subreddit_id', 
                    'link_id', 'parent_id', 'score', 'retrieved_on', 'stickied']

            threads = 4

            # Get start date:
            conn = sqlite3.connect(DATA)
            after = pd.read_sql("SELECT MAX(created_utc) AS max_date FROM comments", conn)
            after = int(after["max_date"][0])
            conn.close()

            results = api.search_comments(subreddit=subreddit, 
                                            fields=fields,
                                            before=before,
                                            after=after, 
                                            safe_exit=True, 
                                            workers=threads)

            print('Getting all JSON')
            comments = [c for c in results]

            print('Creating dataframe')
            df = pd.DataFrame(comments)

            print(f"Saving {table} data")
            conn = sqlite3.connect(DATA)
            df.to_sql(table, conn, if_exists="append")
            conn.close()

            print(f'Finished {table}')

        elif table == "posts":

            fields = ["id", "author", "title", "selftext", "score", "upvote_ratio", "total_awards_received", "stickied", "pinned", "num_comments", "num_crossposts",
            "subreddit", "subreddit_id", "author_flair_text", "author_fullname", "author_premium", "created_utc", "retrieved_on", "domain", "permalink", "full_link", 
            "url", "is_meta", "is_original_content", "is_reddit_media_domain", "is_self", "locked", "media_only", "over_18", "removed_by_category"]

            threads = 1

            # Get start date:
            conn = sqlite3.connect(DATA)
            after = int(pd.read_sql("SELECT MAX(created_utc) AS after_date FROM posts", conn)["after_date"][0])
            conn.close()

            results = api.search_submissions(subreddit=subreddit,
                                            fields=fields,
                                            before=before,
                                            after=after, 
                                            safe_exit=True, 
                                            workers=threads)

            print('Getting all JSON')
            posts = [c for c in results]

            print('Creating dataframe')
            df = pd.DataFrame(posts)

            print(f"Saving {table} data")
            conn = sqlite3.connect(DATA)
            df.to_sql("posts", conn, if_exists="append")
            conn.close()

            print(f'Finished {table}')

    # End timer:
    end = time.time()
    print(f'Finished program in {round((end-start)/60, 3)} minutes')

if __name__ == '__main__':
    main()

pavkriz commented 2 years ago

Same error here. I run one api.search_comments call for every day (shifting before and after by 24 hours between calls), roughly as in the sketch below. It works fine for some days (before+after time ranges) and always fails for one particular day's range, so it is probably systematic rather than random.
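A simplified sketch of the loop (the subreddit name and date range are placeholders):

import datetime as dt
from pmaw import PushshiftAPI

api = PushshiftAPI()
DAY = 24 * 60 * 60

# placeholder range; one of these windows fails reproducibly
after = int(dt.datetime(2021, 11, 1).timestamp())
end = int(dt.datetime(2021, 11, 30).timestamp())

while after < end:
    before = after + DAY
    # one search_comments call per 24-hour window
    comments = list(api.search_comments(subreddit='SUBREDDIT',
                                        after=after, before=before))
    after = before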

mattpodolak commented 2 years ago

Thanks for the updates! I'll be pushing out a fix for this later this week

pavkriz commented 2 years ago

Adding a DEBUG log.
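pmaw logs through Python's standard logging module (the logger names below sit under the pmaw namespace), so DEBUG output like this can presumably be enabled with something along these lines:

import logging

# matches the "asctime - name - levelname - message" format in the output below
logging.basicConfig(format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
                    level=logging.DEBUG)
logging.getLogger('pmaw').setLevel(logging.DEBUG)

The captured output: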

2021-11-29 11:59:15,379 - pmaw.PushshiftAPIBase - INFO - 47 result(s) available in Pushshift
2021-11-29 11:59:16,661 - pmaw.PushshiftAPIBase - DEBUG - Remaining limit 42
2021-11-29 11:59:16,661 - pmaw.PushshiftAPIBase - DEBUG - Time slice from 1637992800 - 1637993160 returned 5 results
2021-11-29 11:59:16,661 - pmaw.PushshiftAPIBase - DEBUG - 5 total results for this time slice
2021-11-29 11:59:17,513 - pmaw.PushshiftAPIBase - DEBUG - Remaining limit 38
2021-11-29 11:59:17,514 - pmaw.PushshiftAPIBase - DEBUG - Time slice from 1637993160 - 1637993520 returned 4 results
2021-11-29 11:59:17,514 - pmaw.PushshiftAPIBase - DEBUG - 4 total results for this time slice
2021-11-29 11:59:18,341 - pmaw.PushshiftAPIBase - DEBUG - Remaining limit 33
2021-11-29 11:59:18,341 - pmaw.PushshiftAPIBase - DEBUG - Time slice from 1637993520 - 1637993880 returned 5 results
2021-11-29 11:59:18,341 - pmaw.PushshiftAPIBase - DEBUG - 5 total results for this time slice
2021-11-29 11:59:19,315 - pmaw.PushshiftAPIBase - DEBUG - Remaining limit 29
2021-11-29 11:59:19,315 - pmaw.PushshiftAPIBase - DEBUG - Time slice from 1637993880 - 1637994240 returned 4 results
2021-11-29 11:59:19,315 - pmaw.PushshiftAPIBase - DEBUG - 4 total results for this time slice
2021-11-29 11:59:21,416 - pmaw.PushshiftAPIBase - DEBUG - Remaining limit 22
2021-11-29 11:59:21,416 - pmaw.PushshiftAPIBase - DEBUG - Time slice from 1637994600 - 1637994960 returned 7 results
2021-11-29 11:59:21,416 - pmaw.PushshiftAPIBase - DEBUG - 7 total results for this time slice
2021-11-29 11:59:21,801 - pmaw.PushshiftAPIBase - DEBUG - Remaining limit 17
2021-11-29 11:59:21,801 - pmaw.PushshiftAPIBase - DEBUG - Time slice from 1637994240 - 1637994600 returned 5 results
2021-11-29 11:59:21,802 - pmaw.PushshiftAPIBase - DEBUG - 5 total results for this time slice
2021-11-29 11:59:22,377 - pmaw.PushshiftAPIBase - DEBUG - Remaining limit 10
2021-11-29 11:59:22,377 - pmaw.PushshiftAPIBase - DEBUG - Time slice from 1637994960 - 1637995320 returned 7 results
2021-11-29 11:59:22,377 - pmaw.PushshiftAPIBase - DEBUG - 7 total results for this time slice
2021-11-29 11:59:23,440 - pmaw.PushshiftAPIBase - DEBUG - Remaining limit 6
2021-11-29 11:59:23,440 - pmaw.PushshiftAPIBase - DEBUG - Time slice from 1637995320 - 1637995680 returned 4 results
2021-11-29 11:59:23,440 - pmaw.PushshiftAPIBase - DEBUG - 4 total results for this time slice
2021-11-29 11:59:25,337 - pmaw.PushshiftAPIBase - DEBUG - Remaining limit 3
2021-11-29 11:59:25,337 - pmaw.PushshiftAPIBase - DEBUG - Time slice from 1637996040 - 1637996400 returned 3 results
2021-11-29 11:59:25,337 - pmaw.PushshiftAPIBase - DEBUG - 3 total results for this time slice
2021-11-29 11:59:27,597 - pmaw.PushshiftAPIBase - DEBUG - Remaining limit 2
2021-11-29 11:59:27,597 - pmaw.PushshiftAPIBase - DEBUG - Time slice from 1637995680 - 1637996040 returned 1 results
2021-11-29 11:59:27,597 - pmaw.PushshiftAPIBase - DEBUG - 2 total results for this time slice
2021-11-29 11:59:27,810 - pmaw.PushshiftAPIBase - DEBUG - Remaining limit 2
2021-11-29 11:59:27,810 - pmaw.PushshiftAPIBase - DEBUG - Time slice from 1637995680 - 1637995779 returned 0 results
2021-11-29 11:59:27,811 - pmaw.PushshiftAPIBase - DEBUG - 1 total results for this time slice

Then the error is raised: remaining > 0 but len(data) == 0, so there is no record to take created_utc from.
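In other words, the failing pattern reduces to something like this (an illustrative sketch, not pmaw's actual code):

data = []       # the last time slice returned 0 results
remaining = 1   # but Pushshift still reports an outstanding result

if remaining > 0:
    before = data[-1]['created_utc']  # IndexError: list index out of range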

pavkriz commented 2 years ago

My dirty "hotfix" (probably could be done better) in PushshiftAPIBase._futures_handler:

                        if num > 0:
                            # find minimum `created_utc` to set as the `before` parameter in next timeslices
                            if remaining == 1 and len(data) == 0:
                                log.warning('Remaining 1 records to fetch but 0 data returned now, ignoring')
                            else:
                                before = data[-1]['created_utc']

                                # generate payloads
                                self.req.gen_slices(
                                    url, payload, after, before, num)

mattpodolak commented 2 years ago

Thanks @pavkriz, I implemented a modified version of your hotfix!

pavkriz commented 2 years ago

The #32 fix does not work well. In some situations it gets stuck in an infinite loop, repeating the last request (which returns no data):

INFO:pmaw.PushshiftAPIBase:51028 result(s) available in Pushshift
INFO:pmaw.PushshiftAPIBase:Checkpoint:: Success Rate: 100.00% - Requests: 100 - Batches: 10 - Items Remaining: 41030
INFO:pmaw.PushshiftAPIBase:Checkpoint:: Success Rate: 100.00% - Requests: 200 - Batches: 20 - Items Remaining: 31039
INFO:pmaw.PushshiftAPIBase:Checkpoint:: Success Rate: 100.00% - Requests: 300 - Batches: 30 - Items Remaining: 21094
INFO:pmaw.PushshiftAPIBase:Checkpoint:: Success Rate: 100.00% - Requests: 400 - Batches: 40 - Items Remaining: 13231
INFO:pmaw.PushshiftAPIBase:Checkpoint:: Success Rate: 100.00% - Requests: 500 - Batches: 50 - Items Remaining: 4676
INFO:pmaw.PushshiftAPIBase:Checkpoint:: Success Rate: 100.00% - Requests: 600 - Batches: 60 - Items Remaining: 276
INFO:pmaw.PushshiftAPIBase:Checkpoint:: Success Rate: 100.00% - Requests: 615 - Batches: 70 - Items Remaining: 266
INFO:pmaw.PushshiftAPIBase:Checkpoint:: Success Rate: 100.00% - Requests: 625 - Batches: 80 - Items Remaining: 266
INFO:pmaw.PushshiftAPIBase:Checkpoint:: Success Rate: 100.00% - Requests: 635 - Batches: 90 - Items Remaining: 266
INFO:pmaw.PushshiftAPIBase:Checkpoint:: Success Rate: 100.00% - Requests: 645 - Batches: 100 - Items Remaining: 266
INFO:pmaw.PushshiftAPIBase:Checkpoint:: Success Rate: 100.00% - Requests: 655 - Batches: 110 - Items Remaining: 266
...

This hotfix works for me:

                        if num > 0:
                            # find minimum `created_utc` to set as the `before` parameter in next timeslices
                            if len(data) > 0:
                                # make sure that an index error won't occur;
                                # we want to slice using the payload['before'] if we don't get any results
                                before = data[-1]['created_utc']

                                # generate payloads
                                self.req.gen_slices(
                                    url, payload, after, before, num)
                            else:
                                log.warning('Some records remain to fetch but 0 were returned now, ignoring')
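This version terminates because no new time slices are generated for a window whose response is empty; the trade-off is that whatever remainder Pushshift still reports for that window is skipped. That seems acceptable, since the reported count can evidently exceed what is actually retrievable.
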
mattpodolak commented 2 years ago

@pavkriz just added to v2.1.2 :D