mattpodolak / pmaw

A multithread Pushshift.io API Wrapper for reddit.com comment and submission searches.
MIT License

api call seems to return nothing #49

Closed moonrabbitt closed 1 year ago

moonrabbitt commented 1 year ago

Hi, I copied the code below from the example page. It used to work but has recently stopped working and now returns an empty list.

from pmaw import PushshiftAPI

api = PushshiftAPI()
comments = api.search_comments(subreddit="science", limit=1000)
comment_list = [comment for comment in comments]
comment_list

returns

[]

search_submissions also returns nothing....

Thank you so much.

jaspark-ea commented 1 year ago

I am seeing the same thing. Nothing returned.

MatchaOnMuffins commented 1 year ago

Same here, nothing is being returned. However, the Pushshift API itself is working

Sellitus commented 1 year ago

Same here, returning nothing despite the parameters being used

Dominyk4s commented 1 year ago

Same here: the Pushshift API is working, but the pmaw wrapper is not. Here is an example that gets results from Pushshift without any wrapper (and empty results from pmaw):

import pandas as pd
from pmaw import PushshiftAPI
import datetime as dt
import requests
import time

start_date = dt.date(2022, 12, 1)
end_date = dt.date(2022, 12, 15)

start_date = dt.datetime.fromordinal(start_date.toordinal())
end_date = dt.datetime.fromordinal(end_date.toordinal())

api = PushshiftAPI()

start_epoch = int(start_date.timestamp())
end_epoch = int(end_date.timestamp())

submissions = api.search_submissions(subreddit='politics', q='biden', after=start_epoch,
                                     before=end_epoch, num_workers=20)

sub_df = pd.DataFrame(submissions)
print('---------------------------------------------------------------')
print(f'pmaw df size: {sub_df.shape}')
print(sub_df.head())

time.sleep(10)
# Pushshift API directly
api_query = 'https://api.pushshift.io/reddit/submission/search/?q=' + 'biden' \
            + '&after=' + str(start_epoch) \
            + '&before=' + str(end_epoch) \
            + '&subreddit=' + 'politics' \
            + '&limit=' + str(100)

r = requests.get(api_query)
json = r.json()
df_pushshift = pd.DataFrame(json['data'])

print('---------------------------------------------------------------')
print(f'Pushshift direct df size: {df_pushshift.shape}')
print(df_pushshift.head())
print('---------------------------------------------------------------')

Results:

---------------------------------------------------------------
pmaw df size: (0, 0)
Empty DataFrame
Columns: []
Index: []
---------------------------------------------------------------
Pushshift direct df size: (100, 89)
  subreddit selftext  ... updated_utc     utc_datetime_str
0  politics           ...  1671053635  2022-12-14 21:33:38
1  politics           ...  1671053019  2022-12-14 21:23:28
2  politics           ...  1671048929  2022-12-14 20:15:13
3  politics           ...  1671048448  2022-12-14 20:07:17
4  politics           ...  1671038417  2022-12-14 17:20:01

[5 rows x 89 columns]
---------------------------------------------------------------

Dominyk4s commented 1 year ago

Just a quick temp fix:

I've made some changes to the code (see it here). With them, two new parameters are available when calling Pushshift: sort_var='order' and check_totals=False.

For now it works with recent data only (older data cannot yet be queried even through Pushshift directly).

Code example (with sort_var='order' and check_totals=False on calling it):

import pandas as pd
from pmaw import PushshiftAPI
import datetime as dt
import requests
import time

start_date = dt.date(2022, 12, 1)
end_date = dt.date(2022, 12, 15)

start_date = dt.datetime.fromordinal(start_date.toordinal())
end_date = dt.datetime.fromordinal(end_date.toordinal())

api = PushshiftAPI()

start_epoch = int(start_date.timestamp())
end_epoch = int(end_date.timestamp())

submissions = api.search_submissions(subreddit='politics', q='biden', after=start_epoch,
                                     before=end_epoch, limit=100, sort_var='order', check_totals=False, praw=True,
                                     num_workers=1)

sub_df = pd.DataFrame(submissions)
print('---------------------------------------------------------------')
print(f'pmaw df size: {sub_df.shape}')
print(sub_df.head())

time.sleep(10)
# Pushshift API directly
api_query = 'https://api.pushshift.io/reddit/submission/search/?q=' + 'biden' \
            + '&after=' + str(start_epoch) \
            + '&before=' + str(end_epoch) \
            + '&subreddit=' + 'politics' \
            + '&limit=' + str(100)

r = requests.get(api_query)
json = r.json()
df_pushshift = pd.DataFrame(json['data'])

print('---------------------------------------------------------------')
print(f'Pushshift direct df size: {df_pushshift.shape}')
print(df_pushshift.head())
print('---------------------------------------------------------------')

Results:

---------------------------------------------------------------
pmaw df size: (100, 89)
  subreddit selftext  ... updated_utc     utc_datetime_str
0  politics           ...  1671053635  2022-12-14 21:33:38
1  politics           ...  1671053019  2022-12-14 21:23:28
2  politics           ...  1671048929  2022-12-14 20:15:13
3  politics           ...  1671048448  2022-12-14 20:07:17
4  politics           ...  1671038417  2022-12-14 17:20:01

[5 rows x 89 columns]
---------------------------------------------------------------
Pushshift direct df size: (100, 89)
  subreddit selftext  ... updated_utc     utc_datetime_str
0  politics           ...  1671053635  2022-12-14 21:33:38
1  politics           ...  1671053019  2022-12-14 21:23:28
2  politics           ...  1671048929  2022-12-14 20:15:13
3  politics           ...  1671048448  2022-12-14 20:07:17
4  politics           ...  1671038417  2022-12-14 17:20:01

[5 rows x 89 columns]
---------------------------------------------------------------

Process finished with exit code 0

Security-Chief-Odo commented 1 year ago

Thanks for showing those changes. For some reason, implementing them makes PMAW very slow to respond. What used to finish in about 30 seconds now takes 5+ minutes.

Start 2022-12-18 15:33:13

resPosts = api.search_submissions(since=start_epoch, subreddit=<sub>, author=user, limit=10, check_totals=False)
resComments = api.search_comments(since=start_epoch, subreddit=<sub>, author=user, limit=25, check_totals=False)

End 2022-12-18 15:38:45

qosmio commented 1 year ago

@Dominyk4s

> # Pushsift api directily
> api_query = 'https://api.pushshift.io/reddit/submission/search/?q=' + 'biden' \
>             + '&after=' + str(start_epoch) \
>             + '&before=' + str(end_epoch) \
>             + '&subreddit=' + 'politics' \
>             + '&limit=' + str(100)
> 
> 

Just a heads up, before and after are deprecated. Not sure how long that will work.

after | string (After) Search after this epoch time (inclusive) -- deprecated (use since)

before | string (Before) Search before this epoch time (exclusive) -- deprecated (use until)

Sellitus commented 1 year ago

> @Dominyk4s
>
> > # Pushsift api directily
> > api_query = 'https://api.pushshift.io/reddit/submission/search/?q=' + 'biden' \
> >             + '&after=' + str(start_epoch) \
> >             + '&before=' + str(end_epoch) \
> >             + '&subreddit=' + 'politics' \
> >             + '&limit=' + str(100)
>
> Just a heads up, before and after are deprecated. Not sure how long that will work.
>
> after | string (After) Search after this epoch time (inclusive) -- deprecated (use since)
>
> before | string (Before) Search before this epoch time (exclusive) -- deprecated (use until)

Wow, no one is going to use the Pushshift API after those go away lol

eddvrs commented 1 year ago

Before/after have been renamed to until/since. Some of the param names for sorting have changed as well.

I've addressed the before/after => since/until change, as well as the sort/sort_type, and some other changes in my PR for PMAW, because at the moment it's giving 0 results!

mattpodolak commented 1 year ago

this will be fixed in v3.0.0 after #52 is merged + released

Arobnett commented 1 year ago

There's no working alternative version in the meantime?

YS-SHI-93 commented 1 year ago

I also encountered this issue.

Using "api.pushshift.io/reddit/submission/search/" directly is useful, but it seems to work only for a very limited period of time (i.e., one month or so).

If I want to retrieve something older than a month, say, something from 12 months ago, calling this link (see below) in a browser only returns an empty list:

link: https://api.pushshift.io/reddit/submission/search?q=&after=1633010400&before=1640236250&subreddit=science&limit=999

Specific return: {"data":[],"error":null,"metadata":{"es":{"took":8,"timed_out":false,"_shards":{"total":4,"successful":4,"skipped":3,"failed":0},"hits":{"total":{"value":0,"relation":"eq"},"max_score":null}},"es_query":{"size":999,"query":{"bool":{"must":[{"bool":{"must":[{"range":{"created_utc":{"gte":1633010400000}}},{"range":{"created_utc":{"lt":1640236250000}}}]}},{"bool":{"should":[{"match":{"subreddit":"science"}}],"minimum_should_match":1}}]}},"aggs":{},"sort":{"created_utc":"desc"}},"es_query2":"{\"size\":999,\"query\":{\"bool\":{\"must\":[{\"bool\":{\"must\":[{\"range\":{\"created_utc\":{\"gte\":1633010400000}}},{\"range\":{\"created_utc\":{\"lt\":1640236250000}}}]}},{\"bool\":{\"should\":[{\"match\":{\"subreddit\":\"science\"}}],\"minimum_should_match\":1}}]}},\"aggs\":{},\"sort\":{\"created_utc\":\"desc\"}}"}}
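
For anyone trying to tell "the query failed" apart from "the query ran but matched nothing", here is a minimal sketch that reads the relevant fields out of a response like the one above. The sample string is a trimmed copy of that response (only the fields that are read are kept), and the helper name `summarize` is made up for illustration; it is not part of pmaw or Pushshift.

```python
import json

# Trimmed copy of the response quoted above; only the fields read below are kept.
raw = (
    '{"data":[],"error":null,"metadata":{"es":{"took":8,"timed_out":false,'
    '"_shards":{"total":4,"successful":4,"skipped":3,"failed":0},'
    '"hits":{"total":{"value":0,"relation":"eq"},"max_score":null}}}}'
)


def summarize(response_text):
    """Hypothetical helper: collect the fields that hint at why a result is empty."""
    body = json.loads(response_text)
    es = body["metadata"]["es"]
    return {
        "rows": len(body["data"]),
        "error": body["error"],
        "timed_out": es["timed_out"],
        "total_hits": es["hits"]["total"]["value"],
        "shards_skipped": es["_shards"]["skipped"],
        "shards_failed": es["_shards"]["failed"],
    }


summary = summarize(raw)
print(summary)
```

With no error, no timeout, no failed shards and zero total hits, the response suggests the query itself ran fine and simply found nothing in that time range.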

mattpodolak commented 1 year ago

> There's no working alternative version in the meantime?

Not right now. I'm hesitant to release a version that I haven't fully tested. However, if everything goes well, I will be able to release today 🙏🏾

mattpodolak commented 1 year ago

> I also encountered this issue.
>
> Using "api.pushshift.io/reddit/submission/search/" directly is useful but it seems only work for very limited period of time (i.e., one month or so).
>
> If I want to retrieve something earlier than a month, say, something in 12 months ago, calling this link (see below) in browser will only generate blank list:
>
> link: https://api.pushshift.io/reddit/submission/search?q=&after=1633010400&before=1640236250&subreddit=science&limit=999
>
> Specific return: {"data":[],"error":null,"metadata":{"es":{"took":8,"timed_out":false,"_shards":{"total":4,"successful":4,"skipped":3,"failed":0},"hits":{"total":{"value":0,"relation":"eq"},"max_score":null}},"es_query":{"size":999,"query":{"bool":{"must":[{"bool":{"must":[{"range":{"created_utc":{"gte":1633010400000}}},{"range":{"created_utc":{"lt":1640236250000}}}]}},{"bool":{"should":[{"match":{"subreddit":"science"}}],"minimum_should_match":1}}]}},"aggs":{},"sort":{"created_utc":"desc"}},"es_query2":"{\"size\":999,\"query\":{\"bool\":{\"must\":[{\"bool\":{\"must\":[{\"range\":{\"created_utc\":{\"gte\":1633010400000}}},{\"range\":{\"created_utc\":{\"lt\":1640236250000}}}]}},{\"bool\":{\"should\":[{\"match\":{\"subreddit\":\"science\"}}],\"minimum_should_match\":1}}]}},\"aggs\":{},\"sort\":{\"created_utc\":\"desc\"}}"}}

there have been some parameter changes: until is the new before, and since is the new after

looking at the COLO switchover bug thread, it looks like some other people have had the 1 month of data issue
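
For reference, the rename can be sketched as a small helper that maps the old parameter names onto the new ones before building the request URL. This is only an illustration under the assumptions stated in this thread (after becomes since, before becomes until); `build_query` and `RENAMED` are made-up names, not part of pmaw, and the snippet only builds the URL without sending any request.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.pushshift.io/reddit/submission/search"

# Old name -> new name, per the rename described in this thread.
RENAMED = {"after": "since", "before": "until"}


def build_query(**params):
    """Hypothetical helper: rename deprecated parameters and return the full URL."""
    renamed = {RENAMED.get(key, key): value for key, value in params.items()}
    return BASE_URL + "?" + urlencode(renamed)


# Same subreddit and epoch range as the link quoted above, with a smaller limit.
url = build_query(subreddit="science", after=1633010400, before=1640236250, limit=100)
print(url)
```

Parameters that were not renamed (subreddit, limit) pass through unchanged, so existing calls only need their time bounds adjusted.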

eddvrs commented 1 year ago

Yes, they've not loaded in the old historical data yet. I think it's due soon, but I've not been following too closely for the last couple of days. I'd be happy to revisit my changes once more testing is possible.

mattpodolak commented 1 year ago

closing this as the COLO switch over fixes have been merged + released in version 3.0.0!

RadoslavL commented 8 months ago

I am still getting the same issue in version 3.0.0 with this code:

api = PushshiftAPI(praw=reddit)
posts = api.search_submissions(subreddit="science", limit=10000)
post_list = [p for p in posts]
print(len(post_list))

The print call outputs 0.

Edit: Scratch that, it's a completely different problem. The API is locked for unregistered users.