IDs Added to Download Archive Despite Being Skipped

mikf / gallery-dl

Command-line program to download image galleries and collections from several image hosting sites

GNU General Public License v2.0


IDs Added to Download Archive Despite Being Skipped #5784

Open Drakovek opened 4 days ago

Drakovek commented 4 days ago

I've discovered that if gallery-dl skips downloading a file due to it having the same filename as an existing file, it will still add the media ID to the configured download archive. The issue isn't with the download being skipped (that's expected behavior), but that now the archive file contains IDs for media that was never actually downloaded.

You can recreate this behavior by adding an entry for an extractor in your config file with a static filename and a fresh archive file, like so:

"newgrounds": {
    "filename": {"": "AAA.{extension}"},
    "archive": "/home/[USER]/Downloads/test.sqlite3"
}

Then try to download an artist gallery. It will only download the first image since all the filenames would be the same, but if you check the archive file in a sqlite database viewer, it will contain the IDs for every image in the gallery, despite none of them actually being downloaded.
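If you would rather not open a database viewer, a few lines of Python can list what the archive contains. This is a minimal sketch that assumes the archive is an SQLite file with a single table named `archive` holding one text column named `entry`; those names are an assumption here, so check the real schema (for example with `.schema` in the sqlite3 shell) before relying on it. The helper name `list_archive_entries` is made up for illustration.

```python
import sqlite3


def list_archive_entries(path):
    """Return every ID stored in the archive file at `path`.

    Assumption: the file has a table `archive` with a text column `entry`.
    Adjust the query if your archive uses a different schema.
    """
    con = sqlite3.connect(path)
    try:
        rows = con.execute("SELECT entry FROM archive ORDER BY entry").fetchall()
    finally:
        con.close()
    # Each row is a 1-tuple; unpack it into a plain list of strings.
    return [row[0] for row in rows]
```

Calling `list_archive_entries("/home/[USER]/Downloads/test.sqlite3")` after the test run should make it easy to compare the number of archived IDs with the number of files on disk.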

I encountered this issue when downloading galleries from Newgrounds with my filename options set to use the post ID and title, only to later realize that posts with multiple images would use the same filename and thus hadn't downloaded properly.

This wouldn't be a huge issue, since I noticed my mistake and changed my file naming scheme, but now I can't just download the missing files from the gallery, because gallery-dl thinks it already downloaded them. And since I've got a ton of files and manually checking which ones didn't actually download is impractical, my only real option is to delete the archive file and download the gallery again in its entirety, which is what I wanted to avoid doing by using archive files in the first place.

biggestsonicfan commented 4 days ago

I feel that is by design, and not a mistake. Gallery-dl won't overwrite the file that already exists, and because it's going past that point in the gallery it adds an entry in your archive. I've encountered this issue with newgrounds before, but configuration options aren't usually a gallery-dl issue but a user issue. If you want to download an entire person's gallery as AAA.jpg that is not the fault of gallery-dl that you only end up with one file.

There are also cases in which a single post on other services can have identical filenames despite being different files altogether.

Drakovek commented 4 days ago

I understand why gallery-dl won't overwrite existing files and skips them instead, but I can't imagine any situation where one would want the ID to be added to the archive file even though it didn't actually download. It's not a log file. The whole point of the archive file is to let gallery-dl know which files have been downloaded, so if it wasn't downloaded, it shouldn't be in the archive file.

Setting every filename to AAA.jpg was just to show the extreme end of what can go wrong here. If you only allow one filename, gallery-dl should only download one file. That's fine. But it also tricks gallery-dl into thinking it downloaded the entire gallery even though it skipped it all. I don't think that's intended behavior. I think that's an oversight.

biggestsonicfan commented 4 days ago

But it also tricks gallery-dl into thinking it downloaded the entire gallery even though it skipped it all

This depends entirely on your archive-format config setup. If your archive-format is set identically to your filename format, then yes, it will absolutely think you have downloaded that file already. But if your archive format is something like {hash}, and you download one 1.png file, the second 1.png file will not be downloaded, but the hash will not be added to the archive as it did not download that file.

EDIT: With a fresh gallery-dl config and using an sql archive and only downloading filenames as {filename}.{extension}, I downloaded this post (nsfw) and gallery-dl grabbed 8 out of 10 files. The last two 2.png files were skipped and ... oh

Yes, it does appear that there are 10 entries in the sql database.

Interesting still, changing the archive-format changes nothing. There are 10 hashed entries with 2 files it did not process.

Drakovek commented 4 days ago

Yeah, I was about to comment the same thing, archive-format doesn't change anything, even if it's totally unique and separate from the filename. If anyone else wants to take a crack at it, here's an example I hope will clarify the issue.

Using this config for the Newgrounds extractor:

"newgrounds": {
    "filename": {"": "[{index}] {title}.{extension}"},
    "archive": "/home/[USER]/Downloads/test.sqlite3",
    "archive-format": "_{subcategory}_{index}_#{num}"
}

I used gallery-dl to attempt downloading this post, which contains three images. Since they all share the same index and title values, only one file is downloaded with the name [5812330] Samus Beyond.png

However, checking the test.sqlite3 archive file I specified, it contains the following entries.

newgrounds_image_5812330_#0
newgrounds_image_5812330_#1
newgrounds_image_5812330_#2

Which are the correct IDs for each of the three images in the format specified, even though only the image corresponding to ID newgrounds_image_5812330_#0 was actually downloaded.

I don't think this is a huge issue, since it can be avoided by taking more care with your filename config. Now that I noticed the duplicate name problem and updated my config file, I shouldn't run into this problem again even if this issue goes unsolved.

But it is annoying and can make it difficult to download missing gallery images even if you realize a mistake in your filename formatting.
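To make the mismatch concrete, here is a small sketch of how the two format strings from the config above expand for the three images. It uses Python's `str.format` as a stand-in for gallery-dl's own format-string handling, and the metadata dictionaries are assumed values modelled on the post described above, not real extractor output. `FIXED_FILENAME` is a hypothetical template shown only to illustrate adding `{num}`.

```python
def expand(template, metadata):
    # str.format is only a stand-in for gallery-dl's format-string engine.
    return template.format(**metadata)


FILENAME = "[{index}] {title}.{extension}"
# The "newgrounds" prefix mirrors the archive entries listed above.
ARCHIVE_ID = "newgrounds_{subcategory}_{index}_#{num}"
# Hypothetical filename template that includes {num}.
FIXED_FILENAME = "[{index}_{num}] {title}.{extension}"

# Assumed metadata for the three images of the post.
images = [
    {
        "subcategory": "image",
        "index": 5812330,
        "title": "Samus Beyond",
        "extension": "png",
        "num": n,
    }
    for n in range(3)
]

filenames = [expand(FILENAME, m) for m in images]
archive_ids = [expand(ARCHIVE_ID, m) for m in images]
fixed_filenames = [expand(FIXED_FILENAME, m) for m in images]

print(len(set(filenames)), "unique filename(s)")
print(len(set(archive_ids)), "unique archive ID(s)")
print(len(set(fixed_filenames)), "unique filename(s) with {num}")
```

All three images map to one filename but three distinct archive IDs, which is why two of the IDs end up describing files that were never written.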

mikf commented 3 days ago

A few years ago, a user complained that skipped files don't get added to the archive (#550), so this functionality got added (https://github.com/mikf/gallery-dl/commit/b5243297ffe303fd3b0a9ef8c14094a20717f42f).

Guess I'll add something like an archive-event option where you can select when to write archive IDs, like file,skip.

Drakovek commented 3 days ago

Huh. Can't say I understand why someone would want that behavior, but everyone has their own unique use case, I suppose. Maybe they wanted to be able to repopulate an archive file by downloading into a folder with existing images?

In any case, thank you for taking the time to look into this. The potential problem this causes is something that can largely be prevented by just taking more care with one's file naming format, so I understand if this isn't very high priority.

Hrxn commented 3 days ago

Yeah, I was about to comment the same thing, archive-format doesn't change anything, even if it's totally unique and separate from the filename. If anyone else wants to take a crack at it, here's an example I hope will clarify the issue.

I don't see how archive-format is the issue here.. it works as expected?

Using this config for the Newgrounds extractor:

"newgrounds": {
    "filename": {"": "[{index}] {title}.{extension}"},
    "archive": "/home/[USER]/Downloads/test.sqlite3",
    "archive-format": "_{subcategory}_{index}_#{num}"
}

It's filename that is problematic here..

I used gallery-dl to attempt downloading this post, which contains three images. Since they all share the same index and title values, only one file is downloaded with the name [5812330] Samus Beyond.png

However, checking the test.sqlite3 archive file I specified, it contains the following entries.

newgrounds_image_5812330_#0
newgrounds_image_5812330_#1
newgrounds_image_5812330_#2

Which are the correct IDs for each of the three images in the format specified, even though only the image corresponding to ID newgrounds_image_5812330_#0 was actually downloaded.

As expected. The concept of one post - multiple images is very common, and gallery-dl reflects that. You should think of it like this: index is like the post ID here, title is pretty self-explanatory (title of said post), and extension is obvious as well.

The gist of it is this: You should always use {num}, unless you know what you are doing.

Drakovek commented 3 days ago

Again, I understand why the files weren't downloaded. I already updated my config file to use both the {index} and {num} in my filenames before I posted this, so I won't run into this problem anymore.

My issue is that gallery-dl added the IDs of all the images that weren't downloaded to the archive file, so even though I noticed my mistake, I can't go back and download the images I missed without deleting the archive file and redownloading everything.

Which is what I did, so everything's solved on my end, but I'd like to prevent others from falling into the same trap.

Hrxn commented 3 days ago

My issue is that gallery-dl added the IDs of all the images that weren't downloaded to the archive file, so even though I noticed my mistake, I can't go back and download the images I missed without deleting the archive file and redownloading everything.

If you keep the same path settings (i.e. base-directory and directory) together with an identical filename setting, you can redownload missed files, skip existing files (-o skip=true) while temporarily turning off the archive on the command-line: -o archive=""
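For anyone scripting this, the two options can be assembled into an argument list as in the sketch below. The `-o skip=true` and `-o archive=""` options are the ones named above; the function name and the URL argument are placeholders. Note that the shell form `archive=""` reaches the program as `archive=` with an empty value, which is what the list spells out.

```python
def redownload_command(url):
    """Build a gallery-dl command line that skips existing files
    and disables the archive for this one run.

    `url` is a placeholder for the gallery or post to re-check.
    """
    return [
        "gallery-dl",
        "-o", "skip=true",   # keep files that already exist
        "-o", "archive=",    # empty value: no archive for this run
        url,
    ]
```

The resulting list can be handed to `subprocess.run(cmd, check=True)`; the directory and filename settings still have to match the original run for the skip check to find the existing files.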

Which is what I did, so everything's solved on my end, but I'd like to prevent others from falling into the same trap.

Granted, Newgrounds is a case where the default should probably be changed.

https://github.com/mikf/gallery-dl/blob/8bb793e21b28e280970af84bd76e41c7462898bf/gallery_dl/extractor/newgrounds.py#L18-L23

Historically, {num} was only missing in extractors for some specific sites, like booru-style ones, where the assumption of one post, one image holds true. Then, if my memory serves me right, one such booru site started to break "tradition" and had more than one image per post, and {num} was retroactively added to this site (and similar ones) in that case.

Drakovek commented 3 days ago

I appreciate your attempt to help, and that works if I were to keep all my downloaded files in the same place I downloaded them, but I do a lot of sorting images into folders after they're downloaded, as well as archiving images and text as .cbz and .epub files.

Basically a lot of non-standard stuff that I would never expect gallery-dl to keep track of, but that's why I care so much about keeping the archive file accurate. The archive files are my only way of keeping track of what I have and haven't downloaded.

My use case is admittedly unusual (and it's definitely my fault for not noticing a bunch of files missing before I started shuffling things around). Frankly the longer I continue this conversation, the less sure I am that this issue is all that relevant to most people besides me, but I will still argue that some people definitely rely on archive files, so they shouldn't contain phantom downloads, if at all possible.

Though probably an even lower priority issue than I had originally thought.

mikf commented 1 day ago

archive-event option added, and I've also changed it to no longer write IDs of skipped files by default. It probably did more harm than good and is kind of a special case anyway.
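The difference between the old and new defaults can be illustrated with a toy model. This is not gallery-dl's implementation, only a sketch of the idea that an archive ID is written when the event for a file ("file" for a completed download, "skip" for a skipped one, the two names mentioned earlier in this thread) is among the selected events. The function name `run` and its parameters are made up for illustration.

```python
def run(files, archive_events=("file",)):
    """Toy model of one download run.

    `files` is a list of (archive_id, filename) pairs.
    Returns the archive IDs that would be written.
    """
    on_disk = set()   # filenames that already exist
    archive = []      # archive IDs written during this run
    for archive_id, filename in files:
        if archive_id in archive:
            continue  # already archived: not processed again
        # A file whose name already exists is skipped, not overwritten.
        event = "skip" if filename in on_disk else "file"
        on_disk.add(filename)
        if event in archive_events:
            archive.append(archive_id)
    return archive


# Three images that share one filename, as in the Newgrounds example.
post = [
    ("newgrounds_image_5812330_#0", "[5812330] Samus Beyond.png"),
    ("newgrounds_image_5812330_#1", "[5812330] Samus Beyond.png"),
    ("newgrounds_image_5812330_#2", "[5812330] Samus Beyond.png"),
]

only_downloads = run(post, archive_events=("file",))
downloads_and_skips = run(post, archive_events=("file", "skip"))
```

With only "file" selected, the model archives a single ID; with both "file" and "skip" selected, it archives all three, which is the behavior reported at the top of this issue.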