Closed SwiftWinds closed 2 years ago
Hey @SwiftWinds, unfortunately, the way you described is the best way to get a list of comments from a list of submission IDs.
I see... would it make sense to run each api.search_comments call in parallel (e.g., via asyncio or multiple threads)? Or would running them sequentially in a for loop make more sense?
Running each search_comments call in parallel will cause a bottleneck due to the rate limiting; the for loop should be preferred, as Pushshift will reject fewer requests.
Another thing you could try is using two PMAW methods sequentially, maybe something like this:
comment_ids = api.search_submission_comment_ids(ids=post_ids)
comments = api.search_comments(ids=comment_ids)
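A minimal sketch of that two-step approach, assuming pmaw is installed. The flatten_ids helper is hypothetical: it covers the case where search_submission_comment_ids returns IDs grouped per submission (as the comment_id_dict name later in this thread suggests) as well as a flat iterable.

```python
def flatten_ids(result):
    """Flatten a per-submission mapping of comment IDs, or pass a flat
    iterable of IDs through unchanged. (Hypothetical helper.)"""
    if isinstance(result, dict):
        return [cid for ids in result.values() for cid in ids]
    return list(result)


def fetch_comments(api, post_ids):
    """Two-step retrieval: resolve comment IDs first, then fetch them."""
    comment_ids = flatten_ids(api.search_submission_comment_ids(ids=post_ids))
    return list(api.search_comments(ids=comment_ids))


if __name__ == "__main__":
    # Network calls left commented out so the sketch runs standalone:
    # from pmaw import PushshiftAPI
    # api = PushshiftAPI()
    # comments = fetch_comments(api, ["n5w7p3", "cbxpnj"])
    pass
```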
OK, I ran this method:
comment_ids = api.search_submission_comment_ids(ids=post_ids)
comments = api.search_comments(ids=comment_ids)
and it's way faster than doing so via api.search_comments(subreddit=subreddit, link_id=post_id)
in a for loop (~20 seconds vs ~400 seconds). However, it's returning far fewer comments (1095 vs 5012). Do you know why that might be the case? I can provide code examples comparing the two if you'd like.
That's a huge discrepancy. Can you provide the code required so I can run this and look into it further?
Sure! Here's a minimal example: https://replit.com/@matetoes/fetch-comments-pmaw#main.py :)
Sorry for the bump, but did you get a chance to take a look at my example?
I removed the usage of pyfunctional and made the example much more readable :)
Hey, I found some time to look into this.
I would use a combination of the results, removing the duplicate comment_ids, and maybe even verifying that they belong to the correct posts.
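A sketch of that combine-and-verify idea in plain Python. The helper names (combine_comment_ids, verify_link) are hypothetical; the `t3_` prefix on link_id is reddit's fullname convention for submissions, which Pushshift comment records also use.

```python
def combine_comment_ids(*id_lists):
    """Merge comment-ID lists from multiple retrieval methods, dropping
    duplicates while preserving first-seen order."""
    seen = set()
    merged = []
    for ids in id_lists:
        for cid in ids:
            if cid not in seen:
                seen.add(cid)
                merged.append(cid)
    return merged


def verify_link(comments, post_ids):
    """Keep only comments whose link_id points at one of the requested
    submissions (link_id carries a 't3_' fullname prefix)."""
    wanted = {f"t3_{pid}" for pid in post_ids}
    return [c for c in comments if c.get("link_id") in wanted]
```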
It looks like there are some data consistency issues occurring in the Pushshift DBs.
From search_submission_comment_ids, it appears to return a number of comments that matches the num_comments reported by search_submissions for n5w7p3 (147), cbxpnj (140), k7x6xv (8), and e3kndv (51). The exceptions are 4gib6l, which returns 0 instead of 5797, and evryo, which returns 0 instead of 3. For 8l7g9t, g4jrx8, and 1ukv5u we can't verify the number of comments, as search_submissions returns nothing.
Hm, I see. Why is search_submissions returning such inconsistent results? Any workarounds? I guess I could fall back to the slow method when I see that search_submissions has returned an inconsistent result, but how would I know that it's inconsistent? (E.g., it's possible that there really are 0 comments in that particular thread.)
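One way to sketch that fallback check. counts_consistent is a hypothetical helper, and the tolerance value is an assumption: num_comments on reddit includes removed comments, so it rarely matches Pushshift's count exactly.

```python
def counts_consistent(num_comments, retrieved, tolerance=0.5):
    """Heuristic: trust the fast result only when the retrieved count is
    within a fractional `tolerance` of the num_comments reported for the
    submission. When num_comments is 0 there is nothing to cross-check,
    so any retrieved comments are treated as inconsistent."""
    if num_comments == 0:
        return retrieved == 0
    return abs(num_comments - retrieved) / num_comments <= tolerance
```

Fall back to the slow per-post query whenever this returns False for a submission.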
It's more so that Pushshift itself is returning the inconsistent results, and the PMAW wrapper is passing them along to you. I can't really speak to what's happened to the DB to cause this, but if you find a large disparity for your queries and are dealing with low volumes, you can use PRAW and grab the data directly from reddit.
I see. My application is real-time, so I can't really deal with the slow retrieval times of PRAW or of the api.search_comments(subreddit, link_id=post_id) method. I've resolved to simply add .json at the end of the Reddit URL (like this: https://www.reddit.com/r/algotrading/comments/cr7jey/ive_reproduced_130_research_papers_about.json), and it seems to perform way faster than Pushshift or PRAW. Here's what I found:
praw - 6644 comments - 831.5879390239716 seconds
pmawfast - 1145 comments - 16.530805110931396 seconds
pmawslow - 5192 comments - 492.28449511528015 seconds
pushshiftraw - 1678 comments - 15.926753997802734 seconds
addjson - 1523 comments - 3.22 seconds
The method of adding .json truncates threads longer than 500 comments (which is why the number of comments above is so low), but threads longer than 500 comments are a tiny minority of the Reddit threads relevant to me, and I'll just store those in my database for fast retrieval.
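A sketch of working with the .json endpoint. Both helpers (to_json_url, count_comments) are hypothetical names; the actual fetch would use something like requests.get with a custom User-Agent header, which reddit requires, and is left out so the sketch stays offline.

```python
def to_json_url(permalink):
    """Append .json to a reddit thread URL to get its JSON representation."""
    return permalink.rstrip("/") + ".json"


def count_comments(listing):
    """Recursively count 't1' comments in a reddit comment Listing.

    A thread's .json response is a two-element list; index 1 holds the
    comment Listing. 'more' stubs (the truncated tail past ~500 comments)
    are skipped, and empty reply trees come back as "" rather than a dict.
    """
    total = 0
    for child in listing["data"]["children"]:
        if child["kind"] != "t1":
            continue  # skip 'more' stubs and non-comment children
        total += 1
        replies = child["data"].get("replies")
        if isinstance(replies, dict):
            total += count_comments(replies)
    return total
```

Fetching would then look like requests.get(to_json_url(url), headers={"User-Agent": "my-app/0.1"}) (hypothetical app name), followed by count_comments(resp.json()[1]).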
I suppose since this issue is upstream with Pushshift, we can close it now. Thanks for the help and guidance!
I know you can get the list of comment IDs with
comment_id_dict = api.search_submission_comment_ids(ids=post_ids)
and you can get a list of actual comments from a single submission ID with
comments = list(api.search_comments(subreddit, link_id=post_id))
Is there any way to get a list of actual comments from a list of submission IDs?
I know I can probably for-loop through the submission IDs and call
api.search_comments(subreddit, link_id=post_id)
on each of them, but I'm assuming that pmaw's multithreading magic wouldn't be used to the best of its abilities if I did it this way. Thanks! :)