Closed: sourmilk01 closed this issue 3 years ago
--abort 5 and 4 repeated reddit posts are skipped, but then a repeated imgur post gets skipped and it resets
It doesn't completely reset, i.e. the count of skipped reddit posts is still at 4 and the next one will trigger --abort, but it uses a different/new "skipped" count for each URL.
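Concretely, with an invocation like the following (subreddit URL hypothetical), the four skipped reddit posts accumulate in one counter, while a skipped imgur post starts a separate counter tied to its own queued URL:

```sh
gallery-dl --abort 5 "https://www.reddit.com/r/example/"
```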
I haven't tested it yet, but I suspect that even if 5 posts of the same type that isn't the parent-extractor (like imgur posts on a subreddit URL) are skipped, --abort won't apply because they aren't reddit-extracted posts.
Exactly. It will stop for the current site, e.g. imgur, but will continue with reddit regardless. So you'd need a "global" --abort that also counts any files from child-extractors?
That was my thought; I turned off parent-metadata to test, and it appears that when queuing a reddit URL, --abort n will only count reddit-hosted images and ignore imgur images for the n count. That being said, I saw it skip several dozen already-downloaded imgur files even though n was set at 5.

I'm not sure how you would implement that; would you add a new "global" variant of --abort, or would you change the original functionality of --abort to count all child-extractors as opposed to only the parent extractor?
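For illustration, a minimal Python sketch of that second option, a skip counter shared between a parent job and its child jobs; all names here are hypothetical and this is not gallery-dl's actual code (the reset of the consecutive-skip count on a successful download is omitted for brevity):

```python
class StopExtraction(Exception):
    """Raised when the skip limit is hit; bubbles up through parent jobs."""

class Job:
    def __init__(self, url, parent=None, abort_limit=5):
        self.url = url
        self.abort_limit = abort_limit
        # Reuse the parent's counter instead of starting a fresh one, so
        # skips from child extractors (e.g. imgur posts queued by a
        # subreddit) count toward the same limit as the parent's own skips.
        self.skipped = parent.skipped if parent else {"count": 0}

    def register_skip(self):
        self.skipped["count"] += 1
        if self.skipped["count"] >= self.abort_limit:
            raise StopExtraction(f"{self.abort_limit} skips reached")

# A child job for an imgur album shares the reddit job's counter:
reddit = Job("https://www.reddit.com/r/example/")
imgur = Job("https://imgur.com/a/example", parent=reddit)
```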
I forgot to mention, my main reason for requesting this was related to the imgur rate-limit issues I previously asked about (#1386).
By far, imgur has the worst rate-limiting out of all the sites I've seen (1,250 requests per hour; 12,500 per day; if the daily rate is hit 5 times in a month, your IP gets blocked for the rest of the month).
I've found that when scraping a subreddit or reddit user page that has mostly imgur links, the cap is hit fairly quickly; even when files are already downloaded, --abort fails to stop, so gallery-dl keeps skipping until it reaches the hourly cap.
@mikf I've managed to mitigate my imgur rate-limit issues with a shoddy workaround (manually identifying and setting aside subreddits and users that were imgur-post heavy).
I still have scrape speed issues with gfycat/redgifs; some subreddits almost exclusively use media from those sites, so they essentially never abort and have to parse all ~1,000 available posts before moving to the next URL.
Any idea on when this type of --abort could be implemented? If it would take too much time to set up for every extractor, would it be easier to just set it for imgur/gfycat/redgifs (specifically for reddit)?
This issue also comes up with behance, for example, when using a profile (which contains multiple projects) as input: the skip counter resets on every project, as they are handled as different jobs. Is it reasonable to implement a global skip counter, or is there a different way to handle this?
@sourmilk01 I think 7ab83743 combined with c693db5b and dfe1e09d solves your problem (the two combine as in the config sketch below):

- parent-skip to share the skip counter between parent and child (e.g. skipping 3 on reddit and 2 on imgur would count as 5 skipped files)
- skip: terminate (or -T/--terminate) to let the stop signal bubble up from child to parent (e.g. reaching 5 skipped files on imgur would also stop the parent reddit extractor)
@mikf Wow! Thank you so much!
I just tested it myself, and parent-skip and --terminate are working on reddit as intended; I think my scrape time is a fifth or maybe even a tenth of what it was before.
@razielgn You should try testing it on behance.
Works great, thank you @mikf!
@sourmilk01 Is there any specific reason for not using the archive file option here?
@Hrxn the problem here isn't detecting an already downloaded file, but gallery-dl's action when finding one in combination with parent and child extractors, e.g. Reddit and Imgur. Any skipped download on one Imgur URL didn't propagate to its parent or other children and didn't count towards the overall "skip limit". Hitting said "skip limit" on an Imgur URL also wasn't able to halt the download for its Reddit parent, only itself.
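To make that distinction concrete, a sketch (archive path hypothetical) where the archive file handles the detection side and the new options handle the propagation side:

```json
{
    "extractor": {
        "archive": "~/gallery-dl/archive.sqlite3",
        "reddit": {
            "parent-skip": true,
            "skip": "terminate:5"
        }
    }
}
```

The archive only answers "was this file already downloaded?"; parent-skip and terminate decide what a skip means across the reddit/imgur job boundary.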
I've noticed that using --abort with a site that uses different image hosts (such as reddit with reddit, imgur, gfycat, and redgifs content posted) will cause the --abort feature to get interrupted before it hits n if it switches to a different extractor (e.g. --abort 5 and 4 repeated reddit posts are skipped, but then a repeated imgur post gets skipped and it resets).

I haven't tested it yet, but I suspect that even if 5 posts of the same type that isn't the parent-extractor (like imgur posts on a subreddit URL) are skipped, --abort won't apply because they aren't reddit-extracted posts.

Is there a way to have --abort ignore the type of extractor being used and, if not, could that feature be added?