mattpodolak / pmaw

A multithread Pushshift.io API Wrapper for reddit.com comment and submission searches.
MIT License

IndexError: list index out of range #35

Closed ReichYang closed 2 years ago

ReichYang commented 2 years ago

Hi there. I'm running into this error when scraping submissions. Could you let me know why it happens and how I can get past it?

subs=['depression','anxiety','suicidewatch']
for sub in subs:
    posts = api.search_submissions(subreddit=sub, mem_safe=True, after=1585713600, before=1637207011, safe_exit=True)
    print(f'{len(posts)} posts retrieved from Pushshift')
    post_list = [post for post in posts]
    pd.DataFrame(post_list).to_pickle(f"{sub}_submissions.pkl")
IndexError: list index out of range
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-10-61805bcbf485> in <module>
      1 for sub in subs:
----> 2     posts = api.search_submissions(subreddit=sub, mem_safe=True, after=1585713600, before=1637207011, safe_exit=True)
      3     print(f'{len(posts)} posts retrieved from Pushshift')
      4     post_list = [post for post in posts]
      5     pd.DataFrame(post_list).to_pickle(f"{sub}_submissions.pkl")

~\AppData\Roaming\Python\Python37\site-packages\pmaw\PushshiftAPI.py in search_submissions(self, **kwargs)
     72             Response generator object
     73         """
---> 74         return self._search(kind='submission', **kwargs)

~\AppData\Roaming\Python\Python37\site-packages\pmaw\PushshiftAPIBase.py in _search(self, kind, max_ids_per_request, max_results_per_request, mem_safe, search_window, dataset, safe_exit, cache_dir, filter_fn, **kwargs)
    261             self.req.gen_url_payloads(
    262                 url, self.batch_size, search_window)
--> 263 
    264             # check for exit signals
    265             self.req.check_sigs()

~\AppData\Roaming\Python\Python37\site-packages\pmaw\PushshiftAPIBase.py in _multithread(self, check_total)
     98 
     99                 futures = {executor.submit(
--> 100                     self._get, url_pay[0], url_pay[1]): url_pay for url_pay in reqs}
    101 
    102                 self._futures_handler(futures, check_total)

~\AppData\Roaming\Python\Python37\site-packages\pmaw\PushshiftAPIBase.py in _futures_handler(self, futures, check_total)
    166                             num = 0
    167 
--> 168                         if num > 0:
    169                             # find minimum `created_utc` to set as the `before` parameter in next timeslices
    170                             if len(data) > 0:

IndexError: list index out of range

mattpodolak commented 2 years ago

Hi, what version of pmaw are you using? I was working on fixing this issue in the latest release.

ReichYang commented 2 years ago

@mattpodolak Hi, I'm using 2.00. Also, I encountered a memory error even though I have mem_safe set to True.

MemoryError
---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-14-61805bcbf485> in <module>
      2     posts = api.search_submissions(subreddit=sub, mem_safe=True, after=1585713600, before=1637207011, safe_exit=True)
      3     print(f'{len(posts)} posts retrieved from Pushshift')
----> 4     post_list = [post for post in posts]
      5     pd.DataFrame(post_list).to_pickle(f"{sub}_submissions.pkl")

<ipython-input-14-61805bcbf485> in <listcomp>(.0)
      2     posts = api.search_submissions(subreddit=sub, mem_safe=True, after=1585713600, before=1637207011, safe_exit=True)
      3     print(f'{len(posts)} posts retrieved from Pushshift')
----> 4     post_list = [post for post in posts]
      5     pd.DataFrame(post_list).to_pickle(f"{sub}_submissions.pkl")

D:\Python37\lib\_collections_abc.py in __next__(self)
    315         When exhausted, raise StopIteration.
    316         """
--> 317         return self.send(None)
    318 
    319     @abstractmethod

~\AppData\Roaming\Python\Python37\site-packages\pmaw\Response.py in send(self, ignored_arg)
     30             Response generator object
     31         """
---> 32         cache = Cache.load_with_key(key, cache_dir)
     33         return Response(cache)
     34 

~\AppData\Roaming\Python\Python37\site-packages\pmaw\Cache.py in load_resp(self, cache_num)
     56             with gzip.open(f'{self.folder}/{self.key}_info.pickle.gz', 'rb') as handle:
     57                 return pickle.load(handle)
---> 58         except FileNotFoundError:
     59             log.info('No previous requests to load')
     60             return None

MemoryError:

mattpodolak commented 2 years ago

Can you try updating to the latest version? This will solve the issue with the index error.

Enabling memory safety means that pmaw won't trigger a memory error during retrieval, as the results are stored in a cache on disk.

The memory issue arises when you try to hold every result in memory at once, as the list comprehension in your snippet does.

I would recommend iterating through the generator in batches to solve this.
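
As a rough illustration of the batching suggestion, here is a minimal sketch that consumes any iterator in fixed-size chunks and pickles each chunk to its own file, so only one batch is in memory at a time. The helper names `iter_batches` and `save_in_batches`, the output file naming, and the default batch size are all hypothetical and not part of pmaw. The sketch only assumes that the object returned by `search_submissions` behaves like a standard Python generator, which the traceback above suggests.

```python
import itertools
import pickle
from pathlib import Path


def iter_batches(iterable, batch_size):
    """Yield lists of at most `batch_size` items from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(iterable)
    while True:
        # islice pulls only the next `batch_size` items from the iterator
        batch = list(itertools.islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def save_in_batches(posts, out_dir, prefix, batch_size=10_000):
    """Pickle each batch to a separate file; return the number of items written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    total = 0
    for i, batch in enumerate(iter_batches(posts, batch_size)):
        # hypothetical naming scheme: <prefix>_00000.pkl, <prefix>_00001.pkl, ...
        with open(out_dir / f"{prefix}_{i:05d}.pkl", "wb") as handle:
            pickle.dump(batch, handle)
        total += len(batch)
    return total
```

In the loop from the original snippet, this would replace the list comprehension and the single `to_pickle` call with something like `save_in_batches(posts, "output", f"{sub}_submissions")`. A smaller `batch_size` lowers peak memory at the cost of more files.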