mattpodolak / pmaw

A multithread Pushshift.io API Wrapper for reddit.com comment and submission searches.
MIT License
212 stars, 28 forks

Issue with limit? #57

Open ranbix666 opened 1 year ago

ranbix666 commented 1 year ago

Hi Matthew, thank you so much for your great work on PMAW!

I tried to use your example with a limit = 100000. It seems 0 comments will be retrieved if the limit is greater than 1000.

import datetime as dt
import pmaw

api = pmaw.PushshiftAPI()
before = int(dt.datetime(2021,2,1,0,0).timestamp())
after = int(dt.datetime(2020,12,1,0,0).timestamp())

subreddit="wallstreetbets"
limit=100000
comments = api.search_comments(subreddit=subreddit, limit=limit, before=before, after=after)
print(f'Retrieved {len(comments)} comments from Pushshift')

The log:

WARNING:pmaw.PushshiftAPIBase:Not all PushShift shards are active. Query results may be incomplete.
Retrieved 0 comments from Pushshift

I have tried limit = 100, 1000, and 1001, and 0 comments are retrieved whenever the limit is greater than 1000.

Can you please let me know if I missed anything? Thanks!

eddvrs commented 1 year ago

Hi @ranbix666

The parameter names for before and after have changed to "until" and "since", so try this line instead:

    comments = api.search_comments(subreddit=subreddit, limit=limit, until=before, since=after)

Additionally, the Pushshift API itself is undergoing a major migration. As a result, there is not (yet) any data from before November 2022, so along with the above change, try changing the date range as well.

The following code returns the expected count for me:

    import datetime as dt
    import pmaw

    api = pmaw.PushshiftAPI()

    before = int(dt.datetime(2023, 1, 25, 0, 0).timestamp())
    after = int(dt.datetime(2023, 1, 1, 0, 0).timestamp())

    subreddit = "wallstreetbets"
    limit = 301
    comments = api.search_comments(subreddit=subreddit, limit=limit, until=before, since=after)
    print(f'Retrieved {len(comments)} comments from Pushshift')
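
For scripts written against the older parameter names, the rename can be handled in one place. This is only a minimal sketch, assuming the change is exactly "before" to "until" and "after" to "since" as described above; `translate_time_params` is a hypothetical helper and not part of PMAW.

```python
# Hypothetical helper (not part of PMAW): maps the old time-range keyword
# names to the new ones mentioned in this thread.
RENAMED_PARAMS = {"before": "until", "after": "since"}


def translate_time_params(**kwargs):
    """Return a copy of kwargs with 'before'/'after' renamed to 'until'/'since'."""
    translated = {}
    for name, value in kwargs.items():
        new_name = RENAMED_PARAMS.get(name, name)
        if new_name in translated:
            # Refuse ambiguous input such as before=... together with until=...
            raise ValueError(f"both old and new names given for '{new_name}'")
        translated[new_name] = value
    return translated
```

It could then be used as `api.search_comments(**translate_time_params(subreddit=subreddit, limit=limit, before=before, after=after))`, which leaves the rest of an existing script unchanged.
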

AntLrm commented 1 year ago

Hello,

I have the same issue: the request succeeds only if limit <= 1000. @eddvrs, your example works because your limit is under 1000. This:

import pmaw
import datetime as dt
api = pmaw.PushshiftAPI()

before = int(dt.datetime(2023, 1, 25, 0, 0).timestamp())
after = int(dt.datetime(2023, 1, 1, 0, 0).timestamp())

subreddit = "wallstreetbets"
limit = 1000
comments = api.search_comments(subreddit=subreddit, limit=limit, until=before, since=after)
print(f'Retrieved {len(comments)} comments from Pushshift')

returns Retrieved 1000 comments from Pushshift

While this (which is the exact same code but with a limit at 1001 instead of 1000):

import pmaw
import datetime as dt
api = pmaw.PushshiftAPI()
before = int(dt.datetime(2023, 1, 25, 0, 0).timestamp())
after = int(dt.datetime(2023, 1, 1, 0, 0).timestamp())
subreddit = "wallstreetbets"
limit = 1001
comments = api.search_comments(subreddit=subreddit, limit=limit, until=before, since=after)
print(f'Retrieved {len(comments)} comments from Pushshift')

returns Not all PushShift shards are active. Query results may be incomplete. Retrieved 0 comments from Pushshift

AntLrm commented 1 year ago

Using the parameter "size" instead of "limit" fixed the issue for me. It is probably due to the pushshift migration.

manu6287 commented 1 year ago

> Using the parameter "size" instead of "limit" fixed the issue for me. It is probably due to the pushshift migration.

I set "size = 2000" and after about 15 minutes of runtime, I interrupted the process to find myself with over 86000 results. Could someone please help?

FamiliarBreakfast commented 1 year ago

The size parameter doesn't work right now.

Adam-R26 commented 1 year ago

Was there ever any resolution to this problem? If both size and limit parameters aren't working as expected, how can we retrieve a desired number of records?
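
One possible workaround, given the reports above that requests with limit <= 1000 still work, is to page through the time range manually in batches of at most 1000 and stop at the desired count. This is a rough sketch and has not been verified against the post-migration API. It assumes that each batch comes back newest first, that every record has a `created_utc` field, and that `until` is an exclusive upper bound. `fetch_up_to` and `fetch_page` are hypothetical names; `fetch_page` stands for a small wrapper you would write around `api.search_comments`.

```python
def fetch_up_to(fetch_page, since, until, total, page_size=1000):
    """Collect up to `total` records by repeatedly requesting small pages.

    fetch_page(since, until, limit) is a hypothetical callable returning an
    iterable of dicts with a 'created_utc' key, newest first (an assumption).
    """
    results = []
    while len(results) < total and until > since:
        want = min(page_size, total - len(results))
        # Truncate in case the backend returns more than was asked for.
        kept = list(fetch_page(since, until, want))[:want]
        if not kept:
            break  # nothing left in the remaining window
        results.extend(kept)
        # Move the upper bound down to the oldest record kept so far, so the
        # next request continues further back in time. Records sharing that
        # exact timestamp may be skipped if `until` is exclusive.
        oldest = min(item["created_utc"] for item in kept)
        if oldest >= until:
            break  # no progress; avoid looping forever
        until = oldest
    return results
```

A wrapper might look like `lambda since, until, limit: api.search_comments(subreddit="wallstreetbets", since=since, until=until, limit=limit)`, assuming the return value is iterable. Each individual request then stays at or below the 1000 threshold reported in this thread.
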

ranbix666 commented 1 year ago

> Using the parameter "size" instead of "limit" fixed the issue for me. It is probably due to the pushshift migration.
>
> I set "size = 2000" and after about 15 minutes of runtime, I interrupted the process to find myself with over 86000 results. Could someone please help?

You get more than you asked for. Isn't it great? LOL, just joking.