mikf / gallery-dl

Command-line program to download image galleries and collections from several image hosting sites

[Feature Request] --abort ignoring type of extractor being used #1399

Closed: sourmilk01 closed this issue 3 years ago

sourmilk01 commented 3 years ago

I've noticed that using --abort with a site that pulls from several different image hosts (such as reddit with reddit-, imgur-, gfycat-, and redgifs-hosted content) causes the --abort counter to reset before it hits n whenever gallery-dl switches to a different extractor (e.g. with --abort 5, four repeated reddit posts are skipped, but then a repeated imgur post gets skipped and the count resets).

I haven't tested it yet, but I suspect that even if 5 posts of the same type that isn't the parent extractor (like imgur posts on a subreddit URL) are skipped, --abort won't apply because they aren't reddit-extracted posts.

Is there a way to have --abort ignore which type of extractor is being used, and if not, could that feature be added?
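
To make the scenario concrete, here is a minimal sketch of the kind of invocation being described (the subreddit URL is a placeholder):

```sh
# Stop after 5 consecutive already-downloaded files are skipped.
# In practice, the skip counter starts over whenever a child extractor
# (imgur, gfycat, redgifs, ...) takes over from the reddit extractor.
gallery-dl --abort 5 "https://www.reddit.com/r/<subreddit>/"
```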

mikf commented 3 years ago

> with --abort 5, four repeated reddit posts are skipped, but then a repeated imgur post gets skipped and the count resets

It doesn't completely reset, i.e. the number of skipped reddit posts is still at 4 and the next one will trigger --abort, but a separate "skipped" count is used for each URL.

> I haven't tested it yet, but I suspect that even if 5 posts of the same type that isn't the parent extractor (like imgur posts on a subreddit URL) are skipped, --abort won't apply because they aren't reddit-extracted posts.

Exactly. It will stop for the current site, e.g. imgur, but will continue with reddit regardless. So you'd need a "global" --abort that also counts any files from child-extractors?

sourmilk01 commented 3 years ago

> Exactly. It will stop for the current site, e.g. imgur, but will continue with reddit regardless. So you'd need a "global" --abort that also counts any files from child-extractors?

That was my thought. I turned off parent-metadata to test, and it appears that when queuing a reddit URL, --abort n only counts reddit-hosted images and ignores imgur images for the n count. That said, I saw it skip several dozen already-downloaded imgur files even though n was set to 5.

I'm not sure how you would implement that; would you add a new "global" variant of --abort, or would you change the original behavior of --abort to count all child extractors rather than only the parent extractor?
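
For context, the configuration-file counterpart of --abort is the per-extractor skip option, which at the time also kept a separate counter per extractor run. A sketch with illustrative values:

```json
{
    "extractor": {
        "reddit": {
            "skip": "abort:5"
        }
    }
}
```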

sourmilk01 commented 3 years ago

I forgot to mention, my main reason for requesting this was related to the imgur rate-limit issues I previously asked about (#1386).

By far, imgur has the worst rate limiting of any site I've seen: 1,250 requests per hour, 12,500 per day, and if the daily limit is hit 5 times in a month, your IP gets blocked for the rest of the month.

I've found that when scraping a subreddit or reddit user page that is mostly imgur links, the cap is hit fairly quickly; even when files are already downloaded, --abort fails to stop, so gallery-dl keeps skipping until it reaches the hourly cap.

sourmilk01 commented 3 years ago

@mikf I've managed to mitigate my imgur rate-limit issues with a shoddy workaround (manually identifying and setting aside subreddits and users that were imgur-post heavy).

I still have scrape-speed issues with gfycat/redgifs; some subreddits almost exclusively use media from those sites, so they essentially never abort and gallery-dl has to parse all ~1,000 available posts before moving to the next URL.

Any idea when this type of --abort could be implemented? If it would take too much time to add for every extractor, would it be easier to just add it for imgur/gfycat/redgifs (specifically for reddit)?

razielgn commented 3 years ago

This issue also comes up with behance, for example when using a profile (which contains multiple projects) as input: the skip counter resets on every project, since they are handled as separate jobs. Would it be reasonable to implement a global skip counter, or is there a different way to handle this?

mikf commented 3 years ago

@sourmilk01 I think 7ab83743 combined with c693db5b and dfe1e09d solves your problem.

sourmilk01 commented 3 years ago

@mikf Wow! Thank you so much!

I just tested it myself, and parent-skip and --terminate are working on reddit as intended; I think my scrape time is down to a fifth or maybe even a tenth of what it was before.
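
For anyone landing here later, a sketch of the combination tested above (the subreddit URL is a placeholder; parent-skip can also be set in the configuration file instead of via -o):

```sh
# --terminate 5 stops the current and parent extractor runs after
# 5 consecutive skipped downloads; parent-skip shares the skip count
# between parent and child extractors.
gallery-dl -o parent-skip=true --terminate 5 "https://www.reddit.com/r/<subreddit>/"
```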

@razielgn You should try testing it on behance.

razielgn commented 3 years ago

Works great, thank you @mikf!

Hrxn commented 3 years ago

@sourmilk01 Is there any specific reason for not using the archive file option here?
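
For reference, the archive file option records every downloaded file in a database and skips anything already listed in it. A minimal example (the file name and URL are arbitrary):

```sh
# Record downloaded files in archive.sqlite3 and skip any file
# already present in it on later runs.
gallery-dl --download-archive archive.sqlite3 "https://www.reddit.com/r/<subreddit>/"
```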

mikf commented 3 years ago

@Hrxn the problem here isn't detecting an already downloaded file, but gallery-dl's action when finding one in combination with parent and child extractors, e.g. Reddit and Imgur. Any skipped download on one Imgur URL didn't propagate to its parent or other children and didn't count towards the overall "skip limit". Hitting said "skip limit" on an Imgur URL also wasn't able to halt the download for its Reddit parent, only itself.
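
In configuration-file terms, the fixed behavior can be expressed as something like the following sketch (assuming the terminate skip action and the parent-skip option from the commits above can be applied at the top extractor level):

```json
{
    "extractor": {
        "parent-skip": true,
        "skip": "terminate:5"
    }
}
```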