mattpodolak / pmaw

A multithread Pushshift.io API Wrapper for reddit.com comment and submission searches.
MIT License
212 stars 28 forks

Get comments given list of submission ids #31

Closed SwiftWinds closed 2 years ago

SwiftWinds commented 2 years ago

I know you can get the list of comment IDs with comment_id_dict = api.search_submission_comment_ids(ids=post_ids) and you can get a list of actual comments from a single submission ID with comments = list(api.search_comments(subreddit, link_id=post_id)).

Any way to get a list of actual comments from a list of submission IDs?

I know I can probably loop through the submission IDs and call api.search_comments(subreddit, link_id=post_id) on each of them, but I'm assuming that pmaw's multithreading magic wouldn't be used to the best of its abilities if I did it this way.

Thanks! :)

mattpodolak commented 2 years ago

Hey @SwiftWinds, unfortunately, the way you described is the best way to get a list of comments from a list of submission IDs.
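A minimal sketch of that loop, assuming a PMAW `PushshiftAPI` instance is passed in (the `fetch_comments_for_posts` helper name is my own, not part of pmaw):

```python
def fetch_comments_for_posts(api, subreddit, post_ids):
    """Collect comments for each submission id, one search_comments call per id."""
    comments = []
    for post_id in post_ids:
        # each call still benefits from pmaw's internal multithreading
        comments.extend(api.search_comments(subreddit=subreddit, link_id=post_id))
    return comments

# usage sketch:
# from pmaw import PushshiftAPI
# api = PushshiftAPI()
# comments = fetch_comments_for_posts(api, "algotrading", post_ids)
```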

SwiftWinds commented 2 years ago

Hey @SwiftWinds, unfortunately, the way you described is the best way to get a list of comments from a list of submission IDs.

I see... would it make sense to run each api.search_comments call in parallel (e.g., via asyncio or multiple threads)? Or would sequentially via a for loop make more sense?

mattpodolak commented 2 years ago

Running each search_comments call in parallel will cause a bottleneck due to rate limiting; the for loop should be preferred, as Pushshift will reject fewer requests.

Another thing you could try is using two PMAW methods sequentially, maybe something like this:

comment_ids = api.search_submission_comment_ids(ids=post_ids)
comments = api.search_comments(ids=comment_ids)

If you run api.search_comments via

SwiftWinds commented 2 years ago

OK, I ran this method:

comment_ids = api.search_submission_comment_ids(ids=post_ids)
comments = api.search_comments(ids=comment_ids)

and it's way faster than doing so via api.search_comments(subreddit=subreddit, link_id=post_id) in a for loop (~20 seconds vs ~400 seconds). However, it's returning far fewer comments (1095 vs 5012). Do you know why that might be the case? I can provide code examples comparing the two if you'd like.

mattpodolak commented 2 years ago

That's a huge discrepancy. Can you provide the code required so I can run this and look into it further?

SwiftWinds commented 2 years ago

Sure! Here's a minimal example: https://replit.com/@matetoes/fetch-comments-pmaw#main.py :)

SwiftWinds commented 2 years ago

Sorry for the bump, but did you get a chance to take a look at my example?

I removed the usage of pyfunctional and made the example much more readable :)

mattpodolak commented 2 years ago

Hey, I found some time to look into this.

I would use a combination of the results, removing the duplicate comment_ids, and maybe even verifying that they belong to the correct posts.

It looks like there are some data consistency issues occurring in the Pushshift DBs.

search_submission_comment_ids appears to return a number of comment ids that matches the num_comments reported by search_submissions for n5w7p3 (147), cbxpnj (140), k7x6xv (8), and e3kndv (51). However, 4gib6l returns 0 instead of 5797, and evryo 0 instead of 3.

When looking at 8l7g9t, g4jrx8, and 1ukv5u we can't verify the number of comments, as search_submissions returns nothing.
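A sketch of the merging step suggested above, assuming both result sets are iterables of comment dicts with `id` and `link_id` fields (the helper name is my own; the `t3_` prefix on `link_id` is how reddit fullnames reference submissions):

```python
def merge_comment_results(results_a, results_b, post_ids):
    """Combine comments from two retrieval methods, dropping duplicate comment ids
    and anything whose link_id doesn't point at one of the requested posts."""
    wanted = {f"t3_{pid}" for pid in post_ids}  # link_id values carry the t3_ prefix
    merged = {}
    for comment in list(results_a) + list(results_b):
        if comment.get("link_id") in wanted:
            merged[comment["id"]] = comment  # later duplicates overwrite earlier ones
    return list(merged.values())
```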

SwiftWinds commented 2 years ago

search_submission_comment_ids appears to return a number of comment ids that matches the num_comments reported by search_submissions for n5w7p3 (147), cbxpnj (140), k7x6xv (8), and e3kndv (51). However, 4gib6l returns 0 instead of 5797, and evryo 0 instead of 3.

When looking at 8l7g9t, g4jrx8, and 1ukv5u we can't verify the number of comments, as search_submissions returns nothing.

Hm, I see. Why is search_submissions returning such inconsistent results? Any workarounds? I guess I could fall back to the slow method if I see that search_submissions has returned an inconsistent result, but how would I know that it is inconsistent (e.g., it's possible that there really are 0 comments in that particular thread)?
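One hedged way to detect the inconsistency: compare the num_comments metadata from search_submissions against the count of ids that search_submission_comment_ids actually returned, and fall back to the slow method only when they disagree badly. The helper name and the tolerance threshold below are my own choices, not anything built into pmaw:

```python
def looks_inconsistent(reported_num_comments, retrieved_ids, tolerance=0.9):
    """Return True when the retrieved comment-id count falls well short of the
    num_comments metadata, or when there is no metadata to check against."""
    if reported_num_comments is None:
        return True  # search_submissions returned nothing; can't verify
    if reported_num_comments == 0:
        return False  # a genuinely empty thread is self-consistent
    return len(retrieved_ids) < reported_num_comments * tolerance
```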

mattpodolak commented 2 years ago

It's more so that Pushshift itself is returning the inconsistent results, and the PMAW wrapper is passing them along to you. I can't really speak to what's happened to the DB to cause this, but if you find a large disparity for your queries and are dealing with low volumes, you can use PRAW and grab the data directly from reddit.

SwiftWinds commented 2 years ago

I see. My application is real-time, so I can't really deal with the slow retrieval times of PRAW or the api.search_comments(subreddit, link_id=post_id) method. I've resolved to simply add .json at the end of the Reddit URL (like this: https://www.reddit.com/r/algotrading/comments/cr7jey/ive_reproduced_130_research_papers_about.json), and it seems to perform way faster than Pushshift or PRAW. Here's what I found:

praw - 6644 comments - 831.59 seconds
pmawfast - 1145 comments - 16.53 seconds
pmawslow - 5192 comments - 492.28 seconds
pushshiftraw - 1678 comments - 15.93 seconds
addjson - 1523 comments - 3.22 seconds

The method of adding .json truncates for threads longer than 500 comments (which is why the # of comments is so abysmally low), but for threads longer than 500 (which is the tiny minority of Reddit threads relevant to me), I'll just store them in my database for fast retrieval.
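A sketch of the .json approach using only the standard library. It assumes reddit's two-element payload shape (post Listing first, then comment Listing) and only grabs top-level comments, leaving nested replies and "more" stubs alone; the function names and User-Agent string are my own:

```python
import json
from urllib.request import Request, urlopen

def fetch_thread_listing(thread_url):
    """Fetch a reddit thread via the public .json endpoint (truncated ~500 comments)."""
    req = Request(thread_url + ".json", headers={"User-Agent": "comment-fetcher/0.1"})
    with urlopen(req) as resp:
        return json.loads(resp.read())

def extract_comments(listing):
    """Pull top-level comment objects (kind == 't1') out of the comment Listing."""
    return [child["data"]
            for child in listing[1]["data"]["children"]
            if child["kind"] == "t1"]
```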

I suppose since this issue is upstream with Pushshift, we can close it now. Thanks for the help and guidance!