pushshift / api

Pushshift API
1.29k stars 107 forks source link

Scraping comments from pushshift #141

Open swaheera opened 1 year ago

swaheera commented 1 year ago

I am using the Reddit API (Pushshift) : https://github.com/pushshift/api

Using the documentation, I understand how I can use this to extract every comment containing the word "covid" that was left in a certain time period:

https://api.pushshift.io/reddit/search/comment?q=covid&after=3h&before=2h&size=1

The output looks something like this:

{"data":[{"subreddit_id":"t5_2qh6p","author_is_blocked":false,"comment_type":null,"edited":false,"author_flair_type":"richtext","total_awards_received":0,"subreddit":"Conservative","author_flair_template_id":null,"id":"j98zf27","gilded":0,"archived":false,"collapsed_reason_code":null,"no_follow":false,"author":"VamboRoolOkay","send_replies":true,"parent_id":41917615743,"score":1,"author_fullname":"t2_7uxkru5f","all_awardings":[],"body":"I will never believe that election fraud wasn't a significant factor. Go ahead - call it a conspiracy theory. But I also maintained that Covid was lab-created. Truth is the Daughter of Time.","top_awarded_type":null,"author_flair_css_class":null,"author_patreon_flair":false,"collapsed":false,"author_flair_richtext":[{"e":"text","t":"Conservative"}],"is_submitter":false,"gildings":{},"collapsed_reason":null,"associated_award":null,"stickied":false,"author_premium":false,"can_gild":true,"link_id":"t3_116l7ct","unrepliable_reason":null,"author_flair_text_color":"dark","score_hidden":true,"permalink":"/r/Conservative/comments/116l7ct/kamala_harris_plans_on_running_with_biden_in_2024/j98zf27/","subreddit_type":"public","locked":false,"author_flair_text":"Conservative","treatment_tags":[],"created_utc":1676866031,"subreddit_name_prefixed":"r/Conservative","controversiality":0,"author_flair_background_color":"","collapsed_because_crowd_control":null,"distinguished":null,"retrieved_utc":1676866047,"updated_utc":1676866048,"body_sha1":"328df3784d15f77b98a84418c4ce720822227cfe","utc_datetime_str":"2023-02-20 04:07:11"}],"error":null,"metadata":{"es":{"took":98,"timed_out":false,"_shards":{"total":828,"successful":828,"skipped":824,"failed":0},"hits":{"total":{"value":573,"relation":"eq"},"max_score":null}},"es_query":{"size":1,"query":{"bool":{"must":[{"bool":{"must":[{"simple_query_string":{"fields":["body"],"query":"covid","default_operator":"and"}},{"range":{"created_utc":{"gte":1676862433000}}},{"range":{"created_utc":{"lt":1676866033000}}}]}}]}},"aggs":{},"sort":{"created_utc":"desc"}},"es_query2":"{\"size\":1,\"query\":{\"bool\":{\"must\":[{\"bool\":{\"must\":[{\"simple_query_string\":{\"fields\":[\"body\"],\"query\":\"covid\",\"default_operator\":\"and\"}},{\"range\":{\"created_utc\":{\"gte\":1676862433000}}},{\"range\":{\"created_utc\":{\"lt\":1676866033000}}}]}}]}},\"aggs\":{},\"sort\":{\"created_utc\":\"desc\"}}","api_launch_time":1673017478.254743,"api_request_start":1676873233.6143198,"api_request_end":1676873233.7406816,"api_total_time":0.12636184692382812}}

My Question: Suppose I identify a post that contains the word "covid" - now, I want to retrieve EVERY comment on this post (regardless if it contains the word "covid" or not) : Is this possible to do?

For instance, based on the output of these results, I see that :

Can I somehow use this information to write an API query to retrieve all comments from this post?

I tried the following query but got an empty result: https://api.pushshift.io/reddit/comment/search/?link_id=t3_116cjib

Thanks!

Note:  Is it possible to perform this task using a "two stage approach"? E.g. Stage 1 - identify a post where the word "covid" was left in the comments. Stage 2 - begin to extract ALL comments from this post (regardless if they contain "covid" or not)

turkeyscramble commented 1 year ago

I'm trying to run https://api.pushshift.io/reddit/search/submission/comment_ids/%s" % submission['id'] and it fails with 404 so im assuming its broken