shadowmoose / RedditDownloader

Scrapes Reddit to download media of your choice.
1.11k stars 99 forks

Other Updates, but pushshift.io NOT WORK! #285

Open gianfelicevincenzo opened 1 year ago

gianfelicevincenzo commented 1 year ago

The last update was a month ago, but when trying to download from pushshift.io it doesn't work! Why don't you fix this situation that has been going on for months? Thank you

shadowmoose commented 1 year ago

PushShift is currently broken, due to API restrictions that Reddit staff are implementing. As a result, I will be unable to support any further PushShift development until (and if) they work something out with Reddit.

toadthetoad commented 1 year ago

So the CSV download is effectively dead now? The --full_csv flag seems to imply it will bypass the need for PushShift but it fails in the same way. Is there an easy bypass?

JulianKauth commented 1 year ago

The psaw library seems to be pretty heavily integrated into the project. Not even direct URL download works without it.

That case might be easier to hack, though. As far as I can tell, PushShift is only needed there to get the metadata from a Reddit post in order to create an instance of the processing.wrappers.redditelement.RedditElement class. The unfortunate lack of type annotations in the project doesn't make it easy, though.

PS: @vincenzogianfelice I am appalled by the entitlement displayed in your comment. This software is provided entirely free of charge; the least you could do is be nice to the developer.

shadowmoose commented 1 year ago

Yes, this functionality is currently broken - and likely in both this version and the TypeScript rewrite. Reddit comments and submissions generally change or are lost over a long enough window, and coupled with the fact that the official Reddit API is (or was) extremely slow for individual lookups, PushShift was implemented as the sole solution for single targets.

For people using this functionality, the reason is generally that they have more saved posts than the official API will return (capped at 1000), so typical CSV downloads will have many thousands of posts to scan. Frankly, the Reddit API is unsuitable for this task. Due to the harsh rate limiting of their API, and also because of their generally slow response times, processing a CSV directly through official means would take a significant amount of time. Ignoring the API response times and skipping the actual download calls, which use additional API queries in some cases, the optimistic run time just to retrieve 1000 individual posts within API limits is 30+ minutes. This also ignores any old deleted or edited posts, where the data will be completely unrecoverable. In the rewritten TS version, PushShift functionality was mandatory in order to reliably build relationships between saved comments and their parents, in the event that the parent submission had been removed from the live site.
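
For a rough sense of where a figure like "30+ minutes" can come from, here is a back-of-the-envelope sketch. It assumes one API lookup per post and a budget of one request every two seconds. That pacing is an assumption chosen because it roughly reproduces the estimate above, not a confirmed Reddit limit, and the function name and parameters are hypothetical.

```python
def estimate_minutes(n_posts, seconds_per_request=2.0, requests_per_post=1):
    """Lower bound on minutes spent waiting on the rate limit alone.

    Ignores network latency, retries, and the extra queries some
    downloads need, so real runs would take longer.
    seconds_per_request=2.0 is an assumed pacing, not a documented limit.
    """
    if n_posts < 0 or seconds_per_request <= 0 or requests_per_post < 1:
        raise ValueError("invalid arguments")
    total_requests = n_posts * requests_per_post
    return total_requests * seconds_per_request / 60.0


if __name__ == "__main__":
    # 1000 posts at the assumed pacing
    print(round(estimate_minutes(1000), 1))
    # A CSV with several thousand saved posts scales linearly
    print(round(estimate_minutes(5000), 1))
```

Under these assumptions the wait grows linearly with the number of posts, so a CSV with many thousands of entries would spend hours on rate limiting alone.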

This probably isn't the best place to discuss this, but I may as well dump it here on the most recent issue caused by Reddit actions: bluntly, it's changes like this that have driven me away from supporting these Reddit-backed applications. They've been teasing for years now that they intend to restrict their API further and further, and it makes investing my energy into these projects seem like a tremendous waste. I've been directing my focus lately towards a more convenient, site-agnostic method of preserving media, which I'd rather push forward with than support a site that doesn't support its users in return.

Suffice it to say that I'm unlikely to expend much effort towards bringing these features back in the short term. I have very limited time to work on my passion projects these days, and I would prefer not to waste that time stepping into adversarial relationships with social media site developers.

If they get things sorted out with PushShift, then everything should start working again and I'll be more encouraged to move forward with completing the rewrite, which also heavily utilizes PS. If not... well, it will likely be impossible to reimplement the lost functionality to the level people expect from the application. The code to add a bandage fix exists - scattered around - within RMD already, and I'm very open to accepting Pull Requests, but I probably won't be the one implementing it. The fix would only get RMD limping along, and honestly that seems likely to only raise more complaints and issues. At this point I'll be keeping an eye out for future Reddit API developments, and should anything come up, I'll be happy to revisit this.

ghost commented 1 year ago

Good news: it seems they sorted things out with PushShift, and it is coming back in the following month. The bad news is that:

"use of Pushshift will be limited to moderation use cases only."

"Though access to Pushshift data for research purposes is not available at this time, we are keen to explore possibilities that might allow us to provide researchers with access to datasets essential for their valuable social media research. We understand the significance of empowering the academic community, and we are dedicated to working with Reddit to develop frameworks that responsibly balance data access, data security, and user privacy."
