mattpodolak / pmaw

A multithread Pushshift.io API Wrapper for reddit.com comment and submission searches.
MIT License

always 100 unique ids despite the size of returned comments #58

Open chaee opened 1 year ago

chaee commented 1 year ago

Hi! I am fetching comments from a subreddit using before and after dates, but I found that the number of unique items per day is always 100. The total number of results varies and seems right, but there are a lot of duplicates. The unique count is always 100, which is also the limit of the Reddit API, so I wonder whether there is a connection. Do I need to specify something additional in the query? I tried adding size or limit, but that didn't solve the problem (other than returning zero results when the limit is too big, as others have pointed out). Below is how I am sending the query now:

import calendar

from pmaw import PushshiftAPI

# since_date and until_date are date/datetime objects defined elsewhere
api = PushshiftAPI()
api_request_generator = list(api.search_comments(subreddit='The_Donald',
                                                 before=calendar.timegm(until_date.timetuple()),
                                                 after=calendar.timegm(since_date.timetuple()),
                                                 safe_exit=True,
                                                 size=500,
                                                 mem_safe=True,
                                                 until=calendar.timegm(until_date.timetuple())))
SoonBanned commented 1 year ago

Did you find a way around this? I have the same problem with submissions: I got 19k submissions, but only 200 unique ones, repeated in a loop.

SoonBanned commented 1 year ago

Oh, I got it to work! I checked this issue, https://github.com/mattpodolak/pmaw/issues/57, and replaced before and after with their new names, and that did the trick :D
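
For anyone landing here later, a rough sketch of what that fix might look like. It assumes the new names are since (replacing after) and until (replacing before); the original query already passes until, but the exact mapping should be confirmed in issue #57. The helpers rename_time_params and count_unique are hypothetical names made up for this example and are not part of pmaw, and the dates are placeholders.

```python
import calendar
from datetime import datetime

# Assumed mapping from the old time-window parameter names to the new ones;
# verify against issue #57 before relying on it.
RENAMED = {"after": "since", "before": "until"}


def to_epoch(dt):
    """Convert a naive datetime, interpreted as UTC, to an integer epoch timestamp."""
    return calendar.timegm(dt.timetuple())


def rename_time_params(params):
    """Return a copy of params with the old time-window keys renamed (hypothetical helper)."""
    return {RENAMED.get(key, key): value for key, value in params.items()}


def count_unique(items, key="id"):
    """Return (total, unique) counts, assuming each item is a dict with an 'id' field."""
    items = list(items)
    return len(items), len({item[key] for item in items})


# Placeholder dates, for illustration only.
since_date = datetime(2020, 1, 1)
until_date = datetime(2020, 1, 2)

old_params = {
    "subreddit": "The_Donald",
    "after": to_epoch(since_date),
    "before": to_epoch(until_date),
}
new_params = rename_time_params(old_params)

# With pmaw installed, the query could then be sent roughly like this:
#   from pmaw import PushshiftAPI
#   comments = list(PushshiftAPI().search_comments(**new_params))
#   total, unique = count_unique(comments)
```

If total and unique still differ by a wide margin after the rename, the duplicates probably have another cause.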