mikf / gallery-dl

Command-line program to download image galleries and collections from several image hosting sites
GNU General Public License v2.0

Reddit - Doesn't download all posts in a subreddit #671

Open aeriessy opened 4 years ago

aeriessy commented 4 years ago

Not sure if this is a bug or not. I would love to download the entirety of a subreddit's posts if possible. My input is standard:

gallery-dl https://www.reddit.com/r/EarthPorn/

This particular subreddit downloads about 800 images before it stops. Using a different Reddit post extractor, I got about 300,000 images from it (though it was very buggy and froze a lot), so I know the whole subreddit has far more than 800 images. gallery-dl has been an absolute champ with deviantart and pixiv galleries, so ideally it would be able to do this too. If I could get some insight into how this might be fixed, I'm all ears. Please let me know if I'm doing something incorrectly.

Thank you for your time!

mikf commented 4 years ago

Reddit's API limits every listing to 1000 items (https://praw.readthedocs.io/en/v3.6.0/pages/getting_started.html#obfuscation-and-api-limitations), and to my knowledge there is nothing that can be done about that. Maybe you can get a few more images from the new, top, etc. listings of each subreddit.
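To make the "new, top, etc." suggestion concrete, here is a minimal sketch that builds one URL per listing variant of a subreddit. The listing names and the `t` time filter are Reddit's own; the helper name `listing_urls` is made up for illustration, and it is an assumption that gallery-dl accepts every one of these URL forms. Each listing is still capped at 1000 items and the listings overlap heavily, so this only recovers some extra posts, not the whole subreddit.

```python
from urllib.parse import urlencode

# URL template for a subreddit listing (e.g. .../r/EarthPorn/new/).
BASE = "https://www.reddit.com/r/{name}/{listing}/"


def listing_urls(subreddit):
    """Return one URL per listing variant of the given subreddit.

    Hypothetical helper: it only builds URLs, it does not download anything.
    """
    # Listings without a time filter.
    urls = [BASE.format(name=subreddit, listing=listing)
            for listing in ("hot", "new", "rising")]
    # Listings that take a time filter via the "t" query parameter.
    for listing in ("top", "controversial"):
        for t in ("hour", "day", "week", "month", "year", "all"):
            url = BASE.format(name=subreddit, listing=listing)
            urls.append(url + "?" + urlencode({"t": t}))
    return urls


if __name__ == "__main__":
    for url in listing_urls("EarthPorn"):
        print(url)
```

The printed URLs could be saved to a text file and passed to gallery-dl with `-i` / `--input-file`, so that all listings are tried in one run.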

a different Reddit post extractor

Which one? Maybe I can take a look at what they are doing and incorporate that here.

aeriessy commented 4 years ago

Hm.. alright, I'll see what I can do with it. I just wanted to make sure I didn't miss anything.

I used Reddit Media Downloader. While it worked well, I needed to download in half-year chunks to keep it from freezing. I set the time range in 6-month intervals, starting from the oldest post in 2011, and went from there, deleting the previous range and putting in a new one each time. At the end (over a span of a week or so), I had 8 folders for each year, about 328k files. It also didn't download the most recent posts (for example, if I collected the data from the beginning of the year until now, it would only download up to January 4, 2020 and then stop). I wouldn't choose it for long-term data collection or for downloading new content.
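The half-year slicing described above can be generated instead of typed in by hand. This is only a sketch of the windowing idea, not what Reddit Media Downloader does internally; the function name `half_year_windows` is made up for illustration.

```python
from datetime import datetime, timezone


def half_year_windows(start_year, end):
    """Yield (start, stop) UTC datetimes covering Jan 1 of start_year up to `end`.

    Windows run Jan 1 to Jul 1 and Jul 1 to Jan 1; the last one is cut at `end`.
    """
    year, month = start_year, 1
    while True:
        start = datetime(year, month, 1, tzinfo=timezone.utc)
        if start >= end:
            break
        # The next boundary is July 1 of the same year or January 1 of the next.
        if month == 1:
            nxt = datetime(year, 7, 1, tzinfo=timezone.utc)
        else:
            nxt = datetime(year + 1, 1, 1, tzinfo=timezone.utc)
        yield start, min(nxt, end)
        year, month = nxt.year, nxt.month


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    for start, stop in half_year_windows(2011, now):
        # Epoch seconds are what most time-filtered search APIs expect.
        print(int(start.timestamp()), int(stop.timestamp()), start.date(), stop.date())
```

Cutting the last window at the current time is also a cheap way to make sure the most recent posts fall inside some window.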

mikf commented 4 years ago

OK, so "Reddit Media Downloader" uses psaw to browse Reddit posts, which in turn uses the pushshift.io API, an external archiving "service" for Reddit posts. They even have their own subreddit: https://www.reddit.com/r/pushshift/
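For reference, a rough sketch of how an archive like this gets around the 1000-item cap: instead of walking a listing, you page through submissions by creation time. Everything here is an assumption for illustration and not gallery-dl or psaw code: `fetch` is a hypothetical callable that would wrap the archive's submission search, the parameter names (`subreddit`, `after`, `before`, `size`, `sort`, `sort_type`) follow how the Pushshift API was commonly described and may have changed, and the service may no longer be publicly reachable.

```python
def iter_submissions(fetch, subreddit, after, before, size=100):
    """Yield submissions created between `after` and `before` (epoch seconds).

    `fetch(params)` is a hypothetical callable that returns a list of dicts,
    each with a 'created_utc' key, sorted oldest first, and that treats
    `after` and `before` as exclusive bounds.
    """
    while after < before:
        batch = fetch({
            "subreddit": subreddit,
            "after": after,
            "before": before,
            "size": size,
            "sort": "asc",
            "sort_type": "created_utc",
        })
        if not batch:
            return
        yield from batch
        newest = batch[-1]["created_utc"]
        if newest <= after:
            # No forward progress (e.g. an inclusive `after`); stop, don't loop forever.
            return
        # Continue after the newest post seen so far. Posts sharing that exact
        # timestamp but cut off by `size` would be skipped by this simple scheme.
        after = newest


if __name__ == "__main__":
    # Offline stand-in for the real service, so the sketch runs without network access.
    posts = [{"id": i, "created_utc": 100 + i} for i in range(7)]

    def fake_fetch(params):
        hits = [p for p in posts
                if params["after"] < p["created_utc"] < params["before"]]
        return hits[:params["size"]]

    for post in iter_submissions(fake_fetch, "EarthPorn", 99, 200, size=3):
        print(post)
```

Because each request is bounded by time rather than by position in a listing, the number of posts that can be reached this way is limited only by what the archive has stored.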

I'll see what I can do to incorporate this into the current reddit code.

aeriessy commented 4 years ago

Thank you for looking into it! If there is anything I can do, please let me know. I greatly appreciate your dedication to developing this program and others.

ofifoto commented 1 year ago

I'd love to see this too if at all possible - thanks for all you do :)