Closed jonlee112 closed 2 years ago
Hey @jonlee112, you can do it like this:
comments = api.search_comments(q='"don\'t hate"+Thomas')
- returns 1044 comments which is the same as what is reported when querying directly https://api.pushshift.io/reddit/comment/search?q=%22don%27t%20hate%22+Thomas&metadata=true
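For anyone comparing the two forms, the string in the direct URL is the percent-encoded version of the q value above. The minimal sketch below uses only the standard library to show the correspondence. It assumes (not confirmed in this thread) that the wrapper encodes q itself when it builds the request, which would explain why an already-encoded string returns nothing.

```python
from urllib.parse import unquote

# Query string copied from the direct Pushshift URL in this thread.
encoded_q = "%22don%27t%20hate%22+Thomas"

# unquote() reverses percent-encoding; it leaves "+" untouched.
raw_q = unquote(encoded_q)

print(raw_q)
```

The decoded value is the string passed as q to api.search_comments in the answer above.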
@mattpodolak thanks so much!
Hi Matt, could I ask how to scrape comments that contain either "Key word A" or "Key word B", that is, to gather comments that contain only A, only B, or both A and B? For instance, the keywords are "GameStop" and "GME". Thank you so much!
@Ldudu1108 hey, you might have some success with the || operator. I did a quick test using the api directly and it appears to have worked: https://api.pushshift.io/reddit/search/comment/?q=rome||greece&subreddit=askhistorians&after=30d
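A small sketch of building such an OR query from a keyword list. build_or_query and matches_any are hypothetical helpers, not part of pmaw. The separator is a parameter because both "||" and "|" are mentioned in this thread, and the local check is only a case-insensitive substring match, which may not reproduce Pushshift's own search behaviour exactly.

```python
def build_or_query(keywords, sep="|"):
    """Join keywords into one query string; sep is "|" or "||" as tried in this thread."""
    return sep.join(keywords)


def matches_any(body, keywords):
    """Local sanity check: True if body contains any keyword, ignoring case."""
    lowered = body.lower()
    return any(k.lower() in lowered for k in keywords)


keywords = ["GameStop", "GME"]
q = build_or_query(keywords)
print(q)
print(matches_any("Holding gme until the end", keywords))
```

The resulting q could then be passed to api.search_comments, and matches_any used afterwards to spot-check the returned bodies.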
Dear Matt,
Thank you so much for your reply. I tried '|' and it worked. However, I encountered another problem that I hope you could help me with.
I want to gather two sets of comments: 1. comments that themselves contain the keyword 2. comments on posts that contain the keyword (the comment itself might not include the keyword), and then concatenate them, dropping all duplicates based on comment id.
The result is that there are no duplicates between these two sets of comments, but I can clearly see that some comments in set 2 contain the keyword. So I think the set 1 I gathered may be incomplete?
I have attached the ipynb file and would really appreciate it if you could take a quick look.
Best, Ang
@Ldudu1108 I can't see an attached ipynb file, but I think someone else was facing a similar issue here: https://github.com/mattpodolak/pmaw/issues/31
If you have further questions, please share the minimal amount of code required to reproduce the problem you're having.
@mattpodolak Sorry about that. I think the problem is different, and I would really appreciate it if you could take a look. Thank you so much! I searched for two sets of comments in the same time period: the first for comments that themselves contain 'GameStop|GME', and the second for comments whose posts contain 'GameStop|GME'. The result showed no overlap between the two sets. But I found that some comments in the second set contain 'GameStop|GME'. I guess this means my first set is incomplete. It is just a little strange that my first set of comments perfectly avoids every comment from the second set.
The following is my code:
# Install first if needed: pip install pmaw (in a notebook: %pip install pmaw)
import datetime as dt
import pandas as pd
from pmaw import PushshiftAPI
api = PushshiftAPI()
start_epoch=int(dt.datetime(2020, 12, 1, 0, 0).timestamp())
end_epoch=int(dt.datetime(2020, 12, 1, 0, 30).timestamp())
submissions = api.search_submissions(q='GAMESTOP|GME', after=start_epoch, before=end_epoch,subreddit="wallstreetbets",limit=None)
comments = api.search_comments(q='GAMESTOP|GME', after=start_epoch, before=end_epoch,subreddit="wallstreetbets",limit=None)
sub_df = pd.DataFrame(submissions)
con_df = pd.DataFrame(comments)
comment = api.search_submission_comment_ids(ids=sub_df['id'])
comment_id_list = [c_id for c_id in comment]
post_comments = api.search_comments(ids=comment_id_list, after=start_epoch, before=end_epoch,subreddit="wallstreetbets",filter=['author','body','created_utc','id', 'link_id', 'parent_id','score'],limit=None)
post_comments_df = pd.DataFrame(post_comments)
concat = pd.concat([post_comments_df,con_df])
no_dup_concat = concat.drop_duplicates(subset=['id'])
no_dup_concat
@Ldudu1108
Can you share the ids of the comments that were found in the second set containing Gamestop|GME but were not found in the first set?
There might be some issues with the parameters for the second set: when you search comments using their ids, the before and after parameters may not be respected. Also, the filter parameter should be fields.
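One way to produce the ids asked for here, and to apply the time window locally instead of relying on before and after, is sketched below. in_window and missing_from_keyword_set are hypothetical helpers, the records are made-up dicts shaped like Pushshift comments, and the half-open window [after, before) is an assumption about the intended bounds.

```python
import re


def in_window(comments, after, before):
    """Keep comments whose created_utc lies in [after, before)."""
    return [c for c in comments if after <= c["created_utc"] < before]


def missing_from_keyword_set(keyword_comments, post_comments, pattern):
    """Ids of post comments that match pattern but are absent from the keyword search."""
    found_ids = {c["id"] for c in keyword_comments}
    regex = re.compile(pattern, re.IGNORECASE)
    return [c["id"] for c in post_comments
            if regex.search(c.get("body", "")) and c["id"] not in found_ids]


# Made-up records shaped like Pushshift comment dicts (id, body, created_utc).
keyword_comments = [{"id": "a1", "body": "GME to the moon", "created_utc": 100}]
post_comments = [
    {"id": "a1", "body": "GME to the moon", "created_utc": 100},
    {"id": "b2", "body": "I like gamestop", "created_utc": 150},
    {"id": "c3", "body": "unrelated", "created_utc": 160},
    {"id": "d4", "body": "GME again", "created_utc": 900},
]

windowed = in_window(post_comments, after=0, before=500)
missing = missing_from_keyword_set(keyword_comments, windowed, r"gamestop|gme")
print(missing)
```

With real data, keyword_comments would be the result of the keyword search and post_comments the comments fetched by id. Any ids printed are the ones worth sharing in the thread.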
Thank you so much for your reply, and sorry about the late response. I realised that I should not put a time restriction on the search for posts. I should get all GME-related posts and then select the attached comments within my restricted time window. However, I have 201414 GME-related posts in total, and when I try to use the post ids to get all comments, the search is extremely slow. It can only go through fewer than 100 per search, so I am still working on it to see whether the results are correct this time.
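With that many post ids, one option is to process them in fixed-size batches so progress can be saved between requests. The sketch below shows only the batching step. chunked is a hypothetical helper, the batch size of 100 is an arbitrary assumption, and the ids are placeholders.

```python
def chunked(ids, size):
    """Yield consecutive slices of ids with at most size items each."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


post_ids = [f"id{i}" for i in range(250)]  # placeholder ids, not real posts
batches = list(chunked(post_ids, 100))
print(len(batches), len(batches[-1]))
```

Each batch could then be passed to api.search_submission_comment_ids(ids=batch), as in the code earlier in this thread, with the results written to disk before moving on to the next batch.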
Hi there, I would like to use your wonderful wrapper PMAW to search for all comments with the keywords "don't hate" AND "Thomas" (i.e., any comments that contain "don't hate" and "Thomas" somewhere).
The call to the pushshift api directly works fine: https://api.pushshift.io/reddit/comment/search?q=%22don%27t%20hate%22+Thomas
However, I cannot figure out how to translate this q="..." search query into a format that works in PMAW (or PSAW for that matter)...
For instance, the following returns 0: comments = api.search_comments(q="%22don%27t%20hate%22+Thomas", limit=limit, before=before, after=after)
Been googling all day without any success.