mikf / gallery-dl

Command-line program to download image galleries and collections from several image hosting sites
GNU General Public License v2.0

What happens if I run with `abort: 3` after a download for a page was interrupted, and more than 3 new uploads have been posted? #1283

Closed github-account1111 closed 3 years ago

github-account1111 commented 3 years ago

Will it download only the new uploads since the interruption, or will it also finish downloading the older posts that were not downloaded previously due to the interruption? Is that accounted for in the archive?

wankio commented 3 years ago

it will abort after 3 skipped files, no matter what

github-account1111 commented 3 years ago

Interesting. Can I make it so it doesn't? I have quite a few pages that have interrupted in the past due to various errors (both caused by me and that were out of my control) and I want those to be fully downloaded eventually. But I don't want to run with abort:false because that would temporarily ban me out of some sites due to too many requests (which was incidentally one of the reasons for the aforementioned errors).

mikf commented 3 years ago

You can use -o skip=true to switch to the default behavior (skip already downloaded files, don't abort) for that specific URL, and also -o sleep-request=… to set a delay between HTTP requests so you hopefully don't get banned.

Depending on the site, it might also be possible to quickly skip over a large chunk of already downloaded files without hammering the site with (useless) HTTP requests by using --range, but that is only possible for a select few sites.
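For example, a one-off invocation for an interrupted URL might look like this (a sketch of the options mentioned above; the URL, sleep value, and range are placeholders, not taken from this thread):

```
# One-off run for a single interrupted URL: skip files that already exist
# instead of aborting, and keep at least 2 seconds between HTTP requests.
gallery-dl -o skip=true -o sleep-request=2.0 "https://example.com/gallery/12345"

# On the few sites that support it, --range limits the run to a slice of the
# listing, e.g. only entries 1 through 200:
gallery-dl --range 1-200 "https://example.com/gallery/12345"
```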

github-account1111 commented 3 years ago

Oh sorry, by abort:false I actually meant skip:true. The former doesn't even exist haha.

> for that specific URL

The problem is I have multiple ones and don't know which ones they are. Is there a way to reliably find them?

Also, something I keep wondering is what is the difference between sleep and sleep-request? For some categories I have both. Searching issues doesn't seem to yield the answer.

mikf commented 3 years ago

No, not that I can think of. The only way is most likely going to be running gallery-dl with all URLs again.

`sleep` simply waits X seconds before each file download, but it won't wait when a file download gets skipped. This option was added fairly early on to have an equivalent to youtube-dl's --sleep (4fb6803f). `sleep-request` makes sure at least X seconds have passed since the last background HTTP request, and sleeps if necessary. It doesn't care about file downloads or skipped downloads. I think it should generally be preferred over `sleep`, but the old option is still there because of backwards compatibility.
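A minimal sketch of setting both at once (the values and URL are arbitrary; both options can also be set per extractor in the config file):

```
# sleep: wait 2 seconds before each actual file download (skipped files don't wait).
# sleep-request: keep at least 1.5 seconds between consecutive HTTP requests.
gallery-dl -o sleep=2 -o sleep-request=1.5 "https://example.com/user/artist"
```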

Butterfly-Dragon commented 3 years ago

@github-account1111 you could do as i do: do 2 downloads.

Basically, select the single gallery where you had the error, and run that with a different config file that has "skip:true".

Aaand slowly go through the list. A few at a time.

The only other way to do this is to run a simulation that only downloads the gallery pages and then check which ones have undownloaded stuff. But if you are worried about bans, that is also inadvisable.

Also i don't remember if gallery-dl has a function for that.

In general though. It's better to not hoard a ton of galleries because you then get 2 things: "fear of missing out" and also "sensory overstimulation".

The first is the nagging feeling that you're missing something. The second is the kind of thing where art just does not work anymore because you've waded through a ton of it.

github-account1111 commented 3 years ago

@mikf Just reran all the URLs with skip:true like you suggested, and it was a lot quicker than I originally thought! No bans either, even though I completely removed all the sleep and sleep-request flags from my conf file. Might be thanks to some categories having a default minimum sleep value now.

github-account1111 commented 3 years ago

@Butterfly-Dragon We're talking hundreds of URLs here, so doing it that way might take a couple months and a ton of manual labor.

This is mostly for archival purposes. I might not ever see most of those downloads. But stuff gets deleted very frequently, and I used to get very upset in the past when, for instance, visiting one of my YouTube playlists only to discover a third of it is gone.

If anything, this takes care of FOMO, because I now know that since the script runs periodically, I don't have to visit any of those pages anymore.

Butterfly-Dragon commented 3 years ago

Wait... so you are going through hundreds of urls to download (at least) thousands of videos?!?!

Uhm. Okay? I don't see the purpose of archiving stuff you will not have the physical time to appreciate. πŸ˜…

But sorry if i intruded.

github-account1111 commented 3 years ago

> Wait... so you are going through hundreds of url

I mean I'm not. The script is. That's the whole point haha

If I just wanted to download a couple pages I'd have probably just done it manually instead of figuring out how a new cli tool works.

> thousands of videos?!?!

Photos and videos. It's a mix of websites using different formats. E.g. artstation only has pics, youtube only has vids, and instagram has a mix of both.

> I don't see the purpose of archiving stuff you will not have the physical time to appreciate.

There is a chance I will. Just like in the example with youtube playlists, if something is deleted from the Internet, I will still have access to it. That's the whole point of archival (check out the Wayback Machine, for instance). Storage is cheap nowadays, so I don't see why not.

github-account1111 commented 3 years ago

Your amusement is justified though. It can sound pretty weird. There's not necessarily a rationale behind every single part of this. Like I said, this is to an extent psychological in that it's my way of coping with FOMO. If it weren't for this, I imagine I'd spend a lot more time browsing those websites than I currently do (which is fairly infrequently).

Butterfly-Dragon commented 3 years ago

i ... honestly use gallery-dl because my connection sucks and this way i have to download far less stuff and it gets done quicker πŸ˜…

aleksusklim commented 2 months ago

@mikf, can we download posts in reverse order? First get all links (memory-costly without --abort or --terminate, of course) and then reverse the list to download backwards? That way the problem described in this issue wouldn't hurt anymore: as long as gallery-dl is able each time to list all the new items and download at least a few files ("from the end of the gap at the beginning"), any subsequent run will catch up instead of leaving holes, no matter the actual abort value.

Is this feature implemented or planned?

Butterfly-Dragon commented 2 months ago

> Is this feature implemented or planned?

how is that any different than running without abort or terminate?

aleksusklim commented 2 months ago

> how is that any different than running without abort or terminate?

Downloads happen from newest to oldest. Each run has to list ALL pages, even if it downloads nothing. To avoid listing everything, we can use abort or terminate, which might leave a gap as described in this issue: the more recent files did get downloaded, but an abrupt termination of the process makes gallery-dl think it is "up to date" even though something in the middle was never retrieved.

Am I right?

aleksusklim commented 2 months ago

I mean, not LISTING from older to newer: keep listing from newer to older, but cache the list until we hit the abort/terminate condition, and THEN reverse it and download.

I should have been more clear about this, thank you.

Butterfly-Dragon commented 2 months ago

if you do not abort nor terminate, you keep downloading, and since you need to check the URLs anyway, it changes nothing, it is just slower.

The only solution I see to the problem above is for the SQL (archive) file to list the stored URLs and also record a previously known state of the gallery and the last known download state of the gallery.

If the last known download state is not recorded as having reached the previously known state (because the program was abruptly terminated), and the old state lists the gallery as having been fully downloaded, then it does not keep trying to download past the last known download.

Otherwise it could keep downloading because of "anomalous interruption".

If the gallery was never fully downloaded and a known state of the gallery as "fully downloaded" is not reached (with 3 overlaps because "abort:3") then you keep downloading, never aborting nor terminating.

This means the SQL archive needs to record which images belong to which gallery, and this is not done everywhere AFAIK. It might require a full re-check (but not re-download) of everything, so that it records which image belongs to which gallery and at which downloaded images a gallery counts as complete, and then writes all new image downloads with that same protocol.

This would make it possible to avoid "gaps" when a computer goes down mid-download of a gallery (something that happens quite often *sigh*).

Reverse image download would only make sense for galleries like "Tapas comics" which add the new chapters at the end of the gallery.

But, again, it is just faster to use "skip = true" (which is the default, or "do not abort nor terminate") for those edge cases.

aleksusklim commented 2 months ago

Again. Correct me if I'm wrong, but:

My sample command line: `gallery-dl --config-ignore --write-metadata --write-info-json --cookies-from-browser firefox --download-archive gallery-dl/pixiv/ID.db https://www.pixiv.net/en/users/ID`

Each time it lists everything only to find that nothing has to be downloaded. It has to read the whole gallery each time I run this.

If I added `--terminate 50`, then it would stop listing after a few "pages", downloading some new posts and then realizing that "the next 50 files were already downloaded, so the job is done".
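As a concrete sketch, that would be the sample command from above with the terminate option added (ID is still the placeholder from that example):

```
gallery-dl --config-ignore --write-metadata --write-info-json \
    --cookies-from-browser firefox \
    --download-archive gallery-dl/pixiv/ID.db \
    --terminate 50 \
    "https://www.pixiv.net/en/users/ID"
```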

Is this right? If it is, then imagine:

  1. I downloaded everything.
  2. The artists posted 200 new works.
  3. I run my command to update.
  4. After it downloaded 100 files the internet disconnects (assume I didn't notice this, because this was just one artist from the batch file with tens of them)
  5. The next time I run my job, it immediately sees 50 already-downloaded files and stops, never realizing the hole that was left behind!

Is this correct? (And yes, re-fetching everything would solve that, so periodical full retrieval will be needed)

If so, then why do you think inversion of the download order won't help? It fetches 250 links (200 new + 50 existing), starts downloading from 199 back to 0, disconnects at 100; then the next time it would list 150 and continue roughly from 99 to 0.

What am I missing here?

Butterfly-Dragon commented 2 months ago

Because that is the same as writing down that one artist (or checking which artist it was working on in the log, or looking at the files sorted by "most recent"), starting a download of that gallery and telling it not to abort/terminate.

Your system not only checks ALL the URLs in that gallery but does it twice: first one way, then the opposite way as it downloads.

And it still solves nothing, because what if the gallery is 250 images, you have "abort 50", and you downloaded both the first 51 and the last 51 images? You are still missing 148 images.

Telling it to not terminate downloads on that one gallery is just the easiest way to do it without rewriting how SQLs are handled and how the abort/terminate is handled.

aleksusklim commented 2 months ago

> what if the gallery is 250 images you have "abort 50" and you downloaded both the first 51 and the last 51 images?

It should LIST from the start "as now", stop when it has seen enough already-downloaded files ("as now"), but then start fetching them from oldest to newest.

Basically, the current algorithm is roughly this:

  1. Fetch first page
  2. Parse all links from it
  3. Download everything in order, skipping existing files
  4. Stop if the abort/terminate condition is fulfilled for the last N files
  5. Fetch the next page and repeat until all pages parsed

This can be seen as an optimized version of this algorithm:

  1. Fetch first page
  2. Parse all links from it
  3. Discard existing files
  4. Break the loop if abort/terminate
  5. Otherwise, store all links in memory and repeat for next page until all parsed
  6. Now, we have a list of links, download them in order

So what I am proposing is reversing the list before the last step here!

Currently we parse and download sequentially; I suggest parsing everything we are going to download first, and then starting the downloads, but in reversed order.

The only serious issue I can imagine is some kind of link expiration, e.g. for the very first one, because it would be downloaded potentially long after it was fetched. But this is not the case for most extractors, and thus is not a blocker.

aleksusklim commented 2 months ago

Oh, "first 51 + last 51" is impossible if using terminate + reverse correctly, since reverse would never make a gap. If you want to intentionally create such gap, then just the first 51 would be enough, since my proposed method would stop right away, listing from the head and never checking the tail.

Butterfly-Dragon commented 2 months ago

You still have not answered how that is different from "do not abort/terminate", aside from being slower. You keep focusing on the minutiae as if they mattered.

If you do not "abort/terminate", then you download everything you did not download before, leaving no gaps, and it requires a single pass rather than multiple: a gallery of even as few as 250 images usually requires 4 main pages, which you have to fetch either way. But on top of that you are asking the thing to ignore the first few because there was an anomalous termination or whatever, which means you have gaps. What if you have multiple gaps? What if you have... basically you want something which is an edge case at best, and which I can see working only on galleries that add new images at the very end. And even then, just setting that gallery to not abort/terminate is faster than what you are proposing, even when there are no new images to download.

aleksusklim commented 2 months ago

What are you talking about? Here is an illustration with 5 images per page.

Page 1: image15, image14, image13, image12, image11
Page 2: image10, image9, image8, image7, image6
Page 3: image5, image4, image3, image2, image1

For example, I have abort=3. First pass will: Fetch page 1, download images 15, 14, 13, 12, 11; then fetch page 2 and download images 10, 9, 8, 7, 6; then fetch the last page and download 5, 4, 3, 2, 1.

Imagine there are 4 new images. Now it looks like this:

Page 1: image19, image18, image17, image16, image15
Page 2: image14, image13, image12, image11, image10
Page 3: image9, image8, image7, image6, image5
Page 4: image4, image3, image2, image1

It will fetch page 1, downloading 19, 18, 17, 16 but skipping 15; then it will fetch page 2 and skip 14 and 13, then abort because 3 files were skipped already.

So far so good. Now imagine 6 new images:

Page 1: image25, image24, image23, image22, image21
Page 2: image20, image19, image18, image17, image16
Page 3: image15, image14, image13, image12, image11
Page 4: image10, image9, image8, image7, image6
Page 5: image5, image4, image3, image2, image1

It fetches page 1, downloading all 25, 24, 23, 22, 21, then fetches page 2 but assume the internet disconnects before the download of image 20 starts, ruining the job.

The user restarts later and here is what happens: it fetches page 1, skips 25, 24, 23 as already downloaded, and aborts! It never fetches page 2 and fails to grab image20 forever.

Now, my algorithm from the start (3 pages, 15 images) will do this: fetch page 1 and parse links; nothing is skipped (nothing has been downloaded yet), so continue fetching pages 2 and 3, keeping the links. We now have links from 15 down to 1, and start downloading in reverse order: 1, 2, 3, 4 … 13, 14, 15.

First update (4 pages, 19 images): Fetch page 1, parse links. Add 19, 18, 17, 16 but skip 15; fetch page 2, skip 14 and 13; the abort condition is met, so start downloading 16, 17, 18 and 19.

Second update (5 pages, 25 images): Fetch page 1, parse links. Add 25, 24, 23, 22, 21. Fetch page 2, add 20 but skip 19, 18, 17 and thus abort. Start downloading in reverse order: 20, 21, 22, 23, 24, 25.

No matter on which file the internet disconnects, there won't be a "gap" inside the sequence; it always grows backwards monotonically.

If I did not use abort, then the last case would be: fetch page 1, download 25, 24, 23, 22, 21; fetch page 2, download 20 and skip 19, 18, 17, 16; fetch page 3 and skip 15-11; fetch page 4 and skip 10-6; fetch page 5 and skip 5-1.

My method won't cause unnecessary page fetches (which is a huge problem with artists that have thousands of images) and at the same time guarantees that it never misses anything if you always enable it.

Butterfly-Dragon commented 2 months ago

Okay. Now validate it by showing how any of those scenarios is better than just fetching everything, in speed, resources and/or efficiency.

aleksusklim commented 2 months ago

I can run the script fetching 200+ artists each day, and at best it would make just 200 requests instead of 200 times the number of pages (per artist), which could be huge for some artists.

I am already hitting the pixiv flood limit 3 times per "just skip normally" run, for example.

Don't tell me to increase timeouts, don't tell me to run the script less often, and don't tell me to raise the abort value. All of those are WORKAROUNDS, while reversing the download sequence is the solution!

aleksusklim commented 2 months ago

A high abort value is a trade-off between "fetching too many pages unnecessarily" and "a high chance of missing something somewhere", while reverse downloading guarantees correctness and doesn't fetch more than one page if nothing new was found.

Also, the less often the script runs, the bigger the abort value should be; meaning if you want to keep it low (within 1-2 pages), you would have to schedule the whole job to run more often, so that no more than 2 pages of new content appear for any artist, to be 100% safe in all cases.

Butterfly-Dragon commented 2 months ago

You are clearly under intense stress and FOMO. FOMO is bad and you should get treatment for it.

I download daily 1500+ artists and i rarely see more than 20 images being added daily. Except for the AI """artists""".

Your screams of "i will lose something somewhere" are an obvious sign of FOMO turning into useless panic.

This is not dismissal, it is concern. Get help.

That said: for pixiv I suggest the dedicated downloader https://github.com/Nandaka/PixivUtil2/releases, which avoids a lot of your problems, as gallery-dl is a generic downloader and falls short of dedicated ones.

Pixiv artists do have a tendency to add stuff like 15 images all at once as a "manga". That utility lets you scour those pages unimpeded and check all the artists; from pixiv alone I have 300+ artists, and the worst I ever got with that utility was reduced download speed when rebuilding an archive.

aleksusklim commented 2 months ago

> I download daily 1500+ artists

So you need either a low abort value or reverse downloading. Period.

Butterfly-Dragon commented 2 months ago

Reverse downloading is just a worse version of forward downloading without abort, and yes, I keep abort at 5 at most, except for TAPAS webcomics, which add the new stuff at the very end, so I set those specifically to download without skipping.

aleksusklim commented 2 months ago

It may or may not be worse, depending on what is faster in each particular case: downloading all needed pages and then all needed images, or taking pages one by one with images in between. When no or little new content has been added (1 page), the speed is equal; only the order is different (making sure there will be no gaps).

When a lot of content is added, the abort value would be respected anyway, and the total speed will be the same. I think you are still misinterpreting what "download in reverse order" means here. It would still read pages from the first one, and abort just as now! But the actual content download would only start AFTER the abort condition takes effect (or after all pages are fetched).

We are not wasting anything: we download the same pages and the same images, just saving this limited set of images from oldest to newest.

Butterfly-Dragon commented 2 months ago

With skip enabled you just retrieve the URL to check if it was downloaded, which you need to do anyway. The normal way just downloads anything missing immediately, rather than building the image gallery tree first and then downloading what's missing. In most cases building a full image tree is impossible due to 429 blocks.

aleksusklim commented 2 months ago

You don't need ALL the links; you will get only the missing ones, and then the abort hits.

Are you still not getting the point? The difference between my method and just using abort normally is that no gaps would ever be created (but in any other respect, like speed or efficiency, it is the same). The difference between my method with abort and fetching without abort is that my method would fetch only the needed pages and not all of them (as the suggested "just retrieve everything without abort" would do).

Butterfly-Dragon commented 2 months ago

You need all of the links to know which to discard. There is no "future reading" + "I will not need this" feature that lets you discard stuff you do not need before checking what it is to tell you that you do not need it.

Reading backwards still requires you to get all the links in a page to go to the end to see if there are more pages.

aleksusklim commented 2 months ago

Currently:

  1. Enable abort/terminate
  2. Download a gallery with a few new posts
  3. Do you see it has fetched just one or two pages, but not all?
  4. Do you see files are downloaded from newest to oldest?

aleksusklim commented 2 months ago

I presume you don't understand how reverse downloading will "fill the gap" if it can't know about it? It cannot. Instead, it will prevent future gaps!

If you already made a gap somehow (by not using reverse downloading, for example), then the only way to fix it is to run without abort. But if all your subsequent retrievals use both abort and reverse downloading, then there will be no FUTURE gaps, ever, because each started download saves the oldest file of those that weren't downloaded yet.

And it perfectly knows which one exactly, just as abort does it currently.

Butterfly-Dragon commented 2 months ago

So. "Forward" downloading (without abort/terminate)

If you see a (few) artist(s) with a(/some) broken download(s) you just write down that artist(s) and do a special forward download just for them.

That is literally it.

It's not like it happens constantly, it can only happen once every time you do a full remap of an artist's site.

If you left the PC doing something, you know when it restarted, at which point you just find which artist's folder is the last one (by sorting the folders by "modified") and tell it to check that artist specifically.

otherwise it's handled by infinite retries or other systems already in place.

aleksusklim commented 2 months ago

What is "mapping the site"?

aleksusklim commented 2 months ago

> for no advantage over just "not skipping stuff"

Each run of "downloading without abort/terminate" will fetch ALL pages of the artist, no matter if it really had new stuff or older gaps.

On the other hand, using abort/terminate fixes all of the above EXCEPT for the possibility of leaving gaps! Reverse downloading does not introduce any drawbacks regarding user experience or website load, while still having all the benefits of abort (provided you set it) and getting rid of POSSIBLE gaps in between.

What you suggested is a workaround: "if gallery-dl failed, then do a full run" (or estimate where the gap is and change the abort value, etc.). Reverse downloading would make such workarounds unnecessary and obsolete, rendering gallery-dl more robust in all cases in the long term.

If by "mapping the site" you'd meant "store the list of links before starting the actual download", then yes, but this stage is cheap in term of the resource usage (especially if the abort value is low). The only reason why this could be hard to implement is the intermediate code complexity, regarding a pluggable infrastructure of gallery-dl codebase (for example, I've tried to find a place in the source code where I could actually add the list of links instead of downloading them right away – but could not, since I have to carefully study the internal mechanics of the project to make this future-proof and compatible with ALL of extractors, which is not trivial).

If @mikf were to say "I cannot do this due to complexity for now", I would perfectly understand. But not that my suggested feature is useless or wrong.

aleksusklim commented 2 months ago

Oh, I have another fair idea!

How about a new argument, like --no-abort-recent TIME, that takes a time span in seconds (defaulting to 0, which retains the current behavior)? When set, any file with a creation or modification date newer than "current date - TIME" would not count towards abort/terminate.

Meaning, you set it to a period roughly "since your previous run"; for example, 604800 (60x60x24x7) is one week, and so any file that was updated no more than a week ago would NOT be taken into account for abort.

This way, anything you might have been downloading recently will still be re-fetched anyway (but not re-downloaded, just as now), even if nothing changed. For artists that have had no new works for months, this does nothing, since everything would count towards abort (which you can now set pretty low).

But for those who posted, e.g., 100 new pictures yesterday, you will fetch those pages again, even if your abort value is much smaller. The download would still stop eventually, after fetching older pages that trigger the abort normally.

This would effectively prevent gaps as long as you keep re-running the job in case of errors (or just to be extra sure), because even if a previous download fails, the next run will retry that exact file, since nothing recently downloaded counts towards abort.

How does that sound? This would be much easier to implement right now.
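Purely as an illustration of the proposal (this option does not exist in gallery-dl; the name and semantics are only the suggestion above):

```
# Hypothetical flag from the suggestion above: files newer than one week
# (604800 seconds) would not count towards the abort/terminate counter.
gallery-dl --no-abort-recent 604800 --abort 3 "https://www.pixiv.net/en/users/ID"
```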

Butterfly-Dragon commented 2 months ago

Open a new issue for this; this one was closed, so it is probably not followed by anybody else anymore. But yeah, that's a more sensible way of dealing with a crashed/rebooted PC in the middle of a download.

aleksusklim commented 2 months ago

Oh, there is a thread: https://github.com/mikf/gallery-dl/issues/5255

aleksusklim commented 2 months ago

Wait, there is also this commit touching the archive: https://github.com/mikf/gallery-dl/commit/fd734b92223a02c0c392e4eece6bf82ba0da1fc8 I should try this...

aleksusklim commented 2 months ago

Hmm, it didn't help: even with `--terminate 5 -o archive-mode=memory`, when I hit Ctrl+C after the 6th file, the next run stops after seeing the first five.

I think this is mainly because the files are there, and gallery-dl relies on their existence and not on the archive.

UPD: Maybe if I automatically moved all files away but left the archive there... Yeah, that might do the trick! Oh wait, in that case it would redownload them, sigh.

aleksusklim commented 2 months ago

Wow, this is better: `-o "skip-filter=((datetime.today()-date).days>7)"`. It might actually work…
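Put together with the earlier pixiv command, the full invocation might look something like this (a sketch that simply drops the skip-filter expression from above into that command; whether it behaves as hoped is exactly what is being tested here):

```
gallery-dl --config-ignore --write-metadata --write-info-json \
    --cookies-from-browser firefox \
    --download-archive gallery-dl/pixiv/ID.db \
    --terminate 5 \
    -o "skip-filter=((datetime.today()-date).days>7)" \
    "https://www.pixiv.net/en/users/ID"
```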