mattpodolak / pmaw

A multithreaded Pushshift.io API wrapper for reddit.com comment and submission searches.
MIT License

Help encoding a search query for multiple keywords #37

Closed jonlee112 closed 2 years ago

jonlee112 commented 2 years ago

Hi there, I would like to use your wonderful wrapper PMAW to search for all comments containing both of the keywords "don't hate" AND "Thomas" (i.e., any comment that contains "don't hate" and "Thomas" somewhere).

Calling the Pushshift API directly works fine: https://api.pushshift.io/reddit/comment/search?q=%22don%27t%20hate%22+Thomas

However, I cannot figure out how to translate this q="..." search query into a format that works in PMAW (or PSAW for that matter)...

For instance, the following returns 0: comments = api.search_comments(q="%22don%27t%20hate%22+Thomas", limit=limit, before=before, after=after)

Been googling all day without any success.

mattpodolak commented 2 years ago

Hey @jonlee112, you can do it like this:

comments = api.search_comments(q='"don\'t hate"+Thomas') - returns 1044 comments, which matches the count reported when querying directly: https://api.pushshift.io/reddit/comment/search?q=%22don%27t%20hate%22+Thomas&metadata=true
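
For context, a minimal self-contained sketch of that call might look like the following (limit=None is just one way to fetch everything that matches; it is not a value from this thread):

from pmaw import PushshiftAPI

api = PushshiftAPI()

# The phrase is wrapped in double quotes inside the query string, and the
# apostrophe is escaped for the single-quoted Python literal.
comments = api.search_comments(q='"don\'t hate"+Thomas', limit=None)
comment_list = [c for c in comments]
print(len(comment_list))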

jonlee112 commented 2 years ago

@mattpodolak thanks so much!

Ldudu1108 commented 2 years ago


Hi Matt, could I ask how to scrape comments that contain either "keyword A" or "keyword B", that is, to gather comments that contain only A, only B, or both A and B? For instance, the keywords are "GameStop" and "GME". Thank you so much!

mattpodolak commented 2 years ago

@Ldudu1108 hey, you might have some success with the || operator. I did a quick test using the api directly and it appears to have worked: https://api.pushshift.io/reddit/search/comment/?q=rome||greece&subreddit=askhistorians&after=30d
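
As a rough sketch, the same OR-style search could be run through PMAW like this (the subreddit and limit are only illustrative; as noted in the reply below, a single | also appears to work):

from pmaw import PushshiftAPI

api = PushshiftAPI()

# "rome||greece" matches comments containing either keyword (or both)
comments = api.search_comments(q='rome||greece', subreddit='askhistorians', limit=500)
print(len([c for c in comments]))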

Ldudu1108 commented 2 years ago

Dear Matt,

Thank you so much for your reply. I tried '|' and it worked. However, I encountered another problem that I hope you can help me with.

I want to gather two sets of comments: 1. comments that themselves contain the keyword, and 2. comments on posts that contain the keyword (the comment itself might not include the keyword), then concatenate them, dropping all duplicates based on comment id.

The result I get is that there are no duplicates between these two sets of comments, but I can clearly see that there are comments from set 2 that contain the keyword. So I think maybe the set 1 I gathered is not complete?

I have attached the ipynb file and would really appreciate it if you could take a quick look.

Best, Ang



mattpodolak commented 2 years ago

@Ldudu1108 I can't see an attached ipynb file, but I think someone else was facing a similar issue here: https://github.com/mattpodolak/pmaw/issues/31

If you have further questions, please share the minimal amount of code required to reproduce the problem you're having.

Ldudu1108 commented 2 years ago


@mattpodolak Sorry about that, and I think the problem is different. I would really appreciate it if you could take a look. Thank you so much! I searched for two sets of comments in the same time period: one for comments that themselves contain 'GameStop|GME', and a second for comments whose posts contain 'GameStop|GME'. The result showed no overlap between the two sets of comments, but I have found that some of the comments in the second set do contain 'GameStop|GME'. I guess this means my first set is not complete. It is just a little strange that my first set of comments can perfectly avoid every comment from the second set.

The following is my code:

# pip install pmaw
import datetime as dt

import pandas as pd
from pmaw import PushshiftAPI

api = PushshiftAPI()

start_epoch = int(dt.datetime(2020, 12, 1, 0, 0).timestamp())
end_epoch = int(dt.datetime(2020, 12, 1, 0, 30).timestamp())

# Set 1: submissions and comments that mention the keywords directly
submissions = api.search_submissions(q='GAMESTOP|GME', after=start_epoch, before=end_epoch, subreddit="wallstreetbets", limit=None)
comments = api.search_comments(q='GAMESTOP|GME', after=start_epoch, before=end_epoch, subreddit="wallstreetbets", limit=None)
sub_df = pd.DataFrame(submissions)
con_df = pd.DataFrame(comments)

# Set 2: all comments attached to the matching submissions
comment = api.search_submission_comment_ids(ids=sub_df['id'])
comment_id_list = [c_id for c_id in comment]
post_comments = api.search_comments(ids=comment_id_list, after=start_epoch, before=end_epoch, subreddit="wallstreetbets", filter=['author', 'body', 'created_utc', 'id', 'link_id', 'parent_id', 'score'], limit=None)
post_comments_df = pd.DataFrame(post_comments)

# Concatenate both sets and drop duplicate comment ids
concat = pd.concat([post_comments_df, con_df])
no_dup_concat = concat.drop_duplicates(subset=['id'])
no_dup_concat

mattpodolak commented 2 years ago

@Ldudu1108

Can you share the ids of the comments that were found in the second set containing Gamestop|GME but were not found in the first set?

There might be some issues with the parameters for the second set: when you search comments using their ids, the before and after parameters may not be respected, and the filter parameter should be fields.
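
A rough sketch of the second-set query with those adjustments (fields instead of filter, no before/after on the id-based search, and the time window applied client-side on created_utc) might look like this, reusing the names from the snippet above:

# Assumes api, pd, comment_id_list, start_epoch and end_epoch from the snippet above.
post_comments = api.search_comments(ids=comment_id_list, fields=['author', 'body', 'created_utc', 'id', 'link_id', 'parent_id', 'score'])
post_comments_df = pd.DataFrame(post_comments)

# Apply the time window client-side, since before/after may be ignored for id-based searches
mask = (post_comments_df['created_utc'] >= start_epoch) & (post_comments_df['created_utc'] < end_epoch)
post_comments_df = post_comments_df[mask]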

Ldudu1108 commented 2 years ago


Thank you so much for your reply, and sorry about the late response. I realised that I should not put a time restriction on my search for posts. I should get all GME-related posts and then select the attached comments within my restricted time window. However, it shows that I have 201414 GME-related posts in total, and when I try to use the post ids to get all the comments, the search speed is extremely slow. It can only go through fewer than 100 per search, so I am still working on it to see if the results this time are correct.
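
For anyone following along, a rough sketch of batching that id lookup (the chunk size is an arbitrary illustrative value, not something from this thread) might look like the following:

# Assumes api, pd, sub_df, start_epoch and end_epoch from the earlier snippet.
post_ids = list(sub_df['id'])
all_comment_ids = []

chunk_size = 1000  # illustrative value
for i in range(0, len(post_ids), chunk_size):
    chunk = post_ids[i:i + chunk_size]
    # Collect the comment ids attached to this chunk of submissions
    all_comment_ids.extend(api.search_submission_comment_ids(ids=chunk))

# Fetch the comments by id, then apply the time window client-side
post_comments_df = pd.DataFrame(api.search_comments(ids=all_comment_ids))
in_window = (post_comments_df['created_utc'] >= start_epoch) & (post_comments_df['created_utc'] < end_epoch)
post_comments_df = post_comments_df[in_window]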