mikf / gallery-dl

Command-line program to download image galleries and collections from several image hosting sites
GNU General Public License v2.0
11.36k stars 925 forks source link

Connectionpool establishes connection to file after information is grabbed and skip is true? #2603

Closed biggestsonicfan closed 9 months ago

biggestsonicfan commented 2 years ago

Greetings,

So I happened to download a lot from kemonoparty with the following filename stetting {user}—{id}—{type[0]}—{date:%Y.%m.%d}—{filename[0:140]}.{extension}. It wasn't until yesterday that I realize certain patreon creators are absolutely insane inserting files with the same filename (ex. "2.png") in a single post, which resulted in gallery-dl skipping the identical filenames (skip set to true in extractor config).

In order to avoid redownloading the entire kemono folder of data I already have, what I have done is setup a preprocessor rule to occur before downloading to check if the old filename exists and equal to the hash returned by gallery-dl:

extractor config:

        "kemonoparty":
        {
            "archive": "/run/media/XXXXX/bfd18/dl/gallery-dl/sql/kemonoparty.sqlite3",
            "sleep": 20.0,
            "sleep-request": 20.0,
            "cookies":
            {
                "__ddgid": XXXXX,
                "__ddgmark":XXXXX,
                "__ddg2": XXXXX,
                "__ddg1": XXXXX
            },
            "metadata": true,
            "new-directory": ["kemonoparty-v2", "{service}", "{user}—{username}"], 
            "directory": ["kemonoparty", "{service}", "{user}—{username}"],
            "old-filename": "{user}—{id}—{type[0]}—{date:%Y.%m.%d}—{filename[0:140]}.{extension}",
            "filename": "{id}_{title}_{num:>02}_{filename[:180]}.{extension}",
            "archive-format": "{service}_{user}_{id}_{num}",
            "postprocessors": ["embeds", "deupe"]
        },
    "postprocessor": 
    {
        "embeds": 
        {
            "name": "metadata",
            "event": "post",
            "filename": "{date:%Y-%m-%d}_{id}_{title}.json",
            "mode": "custom",
            "format": "{content}\n{embed[url]:?/\n/}",
            "filter": "embed or 'http://' in content or 'https://' in content",
            "directory": "metadata"
        },
        "deupe": {
            "name": "exec",
            "async": "false",
            "command": [
                "/run/media/XXXXX/bfd18/dl/verify_file.sh",
                "{hash}",
                "{_directory}{user}—{id}—{type[0]}—{date:%Y.%m.%d}—{filename[0:140]}.{extension}",
                "{_path}"
                ],
            "event": "prepare"
        }

    },

where verify_file.sh is as follows:

#!/bin/bash

if ! echo "$1  $2" | sha256sum --check --status --ignore-missing -; then
    echo "Checksum failed. File will download or be skipped if it exists." >&2
    exit 1
else
    echo "Checksum matched! Renaming $2 to $3" >&1
    mv "$2" "$3"
    exit 0
fi

I began executing this new configuration on the artist where I first noticed the post had two different "2.png" files in it (nsfw link). However, while things seemed to be going fine, I encountered this in my output:

/run/media/XXXXX/bfd18/dl/gallery-dl/kemonoparty/patreon/19911192—_Asura/56680904_Chel - Road to Eldorado (Patreon Poll winner)_04_ChelPlug.png
Checksum matched! Renaming /run/media/XXXXX/bfd18/dl/gallery-dl/kemonoparty/patreon/19911192—_Asura/19911192—56390374—a—2021.09.20—Nude.png to /run/media/XXXXX/bfd18/dl/gallery-dl/kemonoparty/patreon/19911192—_Asura/56390374_OC Notorious Ivy commission_01_Nude.png
downloader.http: '403 Forbidden' for 'https://kemono.party/data/68/06/68060d55bdcada9cf2d7678381914ed6d43ac3dd1ded2632ccfb8c62a2d89180.png'
download: Failed to download 56390374_OC Notorious Ivy commission_01_Nude.png

Apparently, the (pre)postprocessor runs correctly moving the file because the checksum matched, however skip:true must not be checked after the (pre)postprocessor and before the download starts, because the (now) identical file exists in it's place yet it's redownloading the data anyway.

What I am doing now to get around this is to use the --no-download flag to just rename the files. I will still have to rerun gallery-dl to download the missed items that I did not already have downloaded.

I am still wondering if I just have a config issue or if this might be an oversight in the order of operations gallery-dl uses.

EDIT: Apparently --no-download is still writing to the archive, so I can't create/use an archive file until I process all the renaming issues...

mikf commented 2 years ago

however skip:true must not be checked after the (pre)postprocessor and before the download starts

Whether to skip a download or not gets checked after event: prepare post processors run: 1) build path 2) run "prepare" post processors 3) check archive 4) check filesystem

https://github.com/mikf/gallery-dl/blob/688d6553b4dd009ddb1a2e50b6d45c9177a782ed/gallery_dl/job.py#L217-L233

Afterwards it proceeds to download, which for whatever reason failed with a 403 Forbidden error in your case. This error is completely unrelated to your post processor, by the way.

biggestsonicfan commented 2 years ago

Afterwards it proceeds to download, which for whatever reason failed with a 403 Forbidden error in your case. This error is completely unrelated to your post processor, by the way.

While I understood the error was unrelated to the postprocessor, my initial confusion is why an attempt at the file was made when the filesystem check returned true, yet an attempt was made to access the data anyway, even though all information regarding the file was already provided. After reading your response, I initiated (something I should have done earlier) a verbose ouput of gallery-dl:

gallery-dl: Version 1.22.0-dev
gallery-dl: Python 3.8.13 - Linux-5.17.7-1-default-x86_64-with-glibc2.34
gallery-dl: requests 2.27.1 - urllib3 1.25.11
gallery-dl: Starting DownloadJob for 'https://kemono.party/patreon/user/15617066'
kemonoparty: Using KemonopartyUserExtractor for 'https://kemono.party/patreon/user/15617066'
urllib3.connectionpool: Starting new HTTPS connection (1): kemono.party:443
urllib3.connectionpool: https://kemono.party:443 "GET /patreon/user/15617066 HTTP/1.1" 200 6042
kemonoparty: Sleeping for 19.99 seconds
urllib3.connectionpool: https://kemono.party:443 "GET /api/patreon/user/15617066?o=0 HTTP/1.1" 200 4952
kemonoparty: Using download archive '/run/media/XXXXX/bfd18/dl/gallery-dl/sql/kemonoparty.sqlite3'
kemonoparty: Active postprocessor modules: [MetadataPP, ExecPP]
postprocessor.exec: Running '['/run/media/XXXXX/bfd18/dl/verify_file.sh', '1a8fb2164388c738bfff94a4fb0963a54ae17c8a6e6fe6b8fd9b2bdd08b44f32', '/run/media/XXXXX/bfd18/dl/gallery-dl/kemonoparty/patreon/15617066—catcube/15617066—54118725—a—2021.07.26—2021_7.zip', '/run/media/XXXXX/bfd18/dl/gallery-dl/kemonoparty/patreon/15617066—catcube/54118725_ミドリ&モモイ(ブルーアーカイブ-Blue Archive-)_01_2021_7.zip']'
Checksum matched! Renaming /run/media/XXXXX/bfd18/dl/gallery-dl/kemonoparty/patreon/15617066—catcube/15617066—54118725—a—2021.07.26—2021_7.zip to /run/media/XXXXX/bfd18/dl/gallery-dl/kemonoparty/patreon/15617066—catcube/54118725_ミドリ&モモイ(ブルーアーカイブ-Blue Archive-)_01_2021_7.zip
urllib3.connectionpool: https://kemono.party:443 "GET /data/1a/8f/1a8fb2164388c738bfff94a4fb0963a54ae17c8a6e6fe6b8fd9b2bdd08b44f32.zip HTTP/1.1" 302 138
urllib3.connectionpool: Starting new HTTPS connection (1): data8.kemono.party:443
urllib3.connectionpool: https://data8.kemono.party:443 "GET /data/1a/8f/1a8fb2164388c738bfff94a4fb0963a54ae17c8a6e6fe6b8fd9b2bdd08b44f32.zip HTTP/1.1" 200 30496560
/run/media/XXXXX/bfd18/dl/gallery-dl/kemonoparty/patreon/15617066—catcube/54118725_ミドリ&モモイ(ブルーアーカイブ-Blue Archive-)_01_2021_7.zip

Why are the GETs and new connections being established if the file was verified to be skipped given the conditions?

EDIT: I should also add that the color behavior is odd as well. Usually, when files are skipped, the color of the font is not changed. In this case, the colors are turning that seafoam green color as if a download happened.

AlttiRi commented 2 years ago

Is https://kemono.party/patreon/user/15617066 one of problems URLs?

It has:

Total: 285 (358)
Attachments: 285 (285)
Files: 73 (73)
Embeds: 0 (0)
Inlines: 0 (0)

With

           "filename": "[{category}] {user}—{id}—{type[0]}—{date:%Y.%m.%d}—{title}—{filename[0:140]}.{extension}",
           "archive-format": "{service}_{user}_{type[0]}_{id}_{filename}.{extension}"

gallery-dl downloads 285 files as expected.

285 unique files of 358 total (with duplicates). Because of gallery-dl skips dups based on the hash by default.

In fact this filename pattern is suited to download all 358 files. Set extractor.kemonoparty.duplicates to true and it will download all 358 files.


The same filename within one post may have only the file and one of attachments. So, it's enough to use just the first letter of {type} instead of {num}.

Show me an example, if I'm wrong.

biggestsonicfan commented 2 years ago

Is https://kemono.party/patreon/user/15617066 one of problems URLs?

No, it's unrelated and just an example of the verbose output to begin the rename process for that folder. The problem URL was given in my first post. The verbose output was merely to show connections being established to the server directly to the file even though gallery-dl had all the information needed to skip those connections, unless those connections are needed somehow even when skip is set to true.

I had verified with my old configuration, files with identical names were being skipped. Reprocessing the user with the new configuration renamed the files as well as grabbing the one it previously skipped. That is no longer the issue. The issue now is the creation of connections directly to the file even after all checks have passed for gallery-dl to skip that file even when it won't download it, it still establishes a connection to it.

AlttiRi commented 2 years ago

Fuck. There are cases when the attachments of one post can be with same name. Helpfully, it happens very rare. A post example (NSFW): https://kemono.party/patreon/user/19911192/post/41033600

Downloading of this artist entire will have 3 unique (for the artist) images missed. (The artist has 252 unique images (265 with per post de-duplication), but currently it downloads only 262 (with this name pattern) and 372 total without de-duplication (344 are downloaded currently).)


Okay, the most trivial fix is to introduce a new key — unique_filename, or better — filename_num. In order to use it at the end of {filename}, for example, after trimming too long names {filename[0:140]}{filename_num:?_//}.

So, it's the proper usage:

"filename": "[{category}] {user}—{id}—{type[0]}—{date:%Y.%m.%d}—{title}—{filename[0:140]}{filename_num:?_//}.{extension}",
"archive-format": "{service}_{user}_{type[0]}_{id}_{filename}{filename_num:?_//}.{extension}"

On the next run the missed images will be downloaded.

Well, but currently there are no filename_num key. @mikf could you add it? I will also edit my prev comments in this case.

Here is a JavaScript "pseudo-code":

const attachments = [...]; // URLs

const nameCounts = new Map(); // here
let num = 0;
for (const attachment of attachments) {
    const {filename, extension, hash} = parseUrl(attachment);
    num++;

    // and here
    let filename_num = null; // None
    const count = nameCounts.get(filename) || 0;    
    if (count > 0) {
        filename_num = count;
    }
    nameCounts.set(filename, count + 1);

    yield {filename, extension, hash, num, filename_num};
}

So, if some filename is appeared multiple time the associated file will have filename_num with 1, 2... value. By default it should be None. No prefix by default for the first file, and _1, _2 in case a filename collision for the next files.

The more correct key name is attachment_filename_num. It should count filenames per attachments, not per the post entirely.

UPD: Also it's better to use attachment_filename_num in my case after {type[0]}, or {title} (instead of after {filename}) to prevent theoretical filename collision (if a post contains, for example: 2.png, 2_1.png, 2.png attachments):

"filename": "[{category}] {user}—{id}—{type[0]}{filename_num:?-//}—{date:%Y.%m.%d}—{title}—{filename[0:140]}.{extension}",
"archive-format": "{service}_{user}_{type[0]}{filename_num:?-//}_{id}_{filename}.{extension}"

Not ideally, but it will works fine. _(if attachment_filename_num could be introduced)_

That's all.

biggestsonicfan commented 2 years ago

A post example (NSFW): https://kemono.party/patreon/user/19911192/post/41033600

Yup, that's the one in my first post as well and what lead to my discovery. I discovered it completely by accident too lol.

AlttiRi commented 2 years ago

I have fixed it the same way as it is in my JS example above:

In config add {_a_dup_name_num:?-//} after {type[0]} for "filename" and "archive-format".

           "filename": "[{category}] {user}—{id}—{type[0]}{_a_dup_name_num:?-//}—{date:%Y.%m.%d}—{title}—{filename[0:140]}.{extension}",
           "archive-format": "{service}_{user}_{type[0]}{_a_dup_name_num:?-//}_{id}_{filename}.{extension}",

With this patch it works 100 % correctly. On next run you will download the missed files (attachments with duplicate names within one post).

+3 images with "duplicates": false, or +28 with "duplicates": true for https://kemono.party/patreon/user/19911192.

@mikf I'm not only one who uses {type[0]}, so I hope you will add this fix in any form.

biggestsonicfan commented 2 years ago

I just attempted to redownload a few images from a kemono gallery where some images were skipped due to time-out, but this time with the archive database skip enabled, and it absolutely zoomed through the already existing images.

Something is clearly wrong when using the filename skip.

biggestsonicfan commented 2 years ago

Where are both the filename skip and sqlite skip handled? I'm having trouble locating them.

biggestsonicfan commented 2 years ago

Okay, a way to absolutely confirm is something is wrong with filename skip is to do the following:

  1. Begin downloading an exhentai gallery without using the sqlite database
  2. Cancel the download at some point, and view your "Image Limits" in your e-hentai "My Home" tab
  3. Wait until the Image Limits resets to 0
  4. Start the gallery download again with filename skip enabled
  5. You will see your "Image Limits" usage increase even though you are not downloading any images
  6. Cancel the download
  7. Wait until the Image Limits resets to 0
  8. Create an entry for the sqlite archive in your config
  9. Begin downloading the gallery for a 3rd time
  10. Cancel and see your Image Limits get used again
  11. Wait until the Image Limits resets to 0
  12. Download the gallery for the 4th time
  13. Images that are skipped no longer contribute to Image Limit

This right there shows this is obviously a problem.

biggestsonicfan commented 2 years ago

I have had to modify both my "dedupe" post(pre)processor and verify_file.sh bash script as follows:

        "deupe": {
            "name": "exec",
            "async": "false",
            "command": [
                "/run/media/xxx/bfd18/dl/verify_file.sh",
                "{hash}",
                "{_directory}{user}—{id}—{type[0]}—{date:%Y.%m.%d}—{filename[0:140]}.{extension}",
                "{_path}",
                "{_directory}",
                "{user}—{id}—{type[0]}—{date:%Y.%m.%d}—"
                ],
            "event": "prepare"
#!/bin/bash

readarray -d '' findarray < <(find "$4" -name "$5*" -print0)

if ! echo "$1  $2" | sha256sum --check --status --ignore-missing -; then
    #If it isn't, lets see if an old part file exists
    if [ -f "$2.part" ]; then
        #if the part file exists and the new file doesn't already exist then rename the part file
        if ! [ -f "$3" ]; then
            echo "Partial data '$2.part' exists exists moving to '$3.part'!"
            mv "$2.part" "$3.part"
        #But if the hash file matches the new data, we don't need the .part anymore
        elif echo "$1  $3" | sha256sum --check --status --ignore-missing -; then
            echo "$3 hash matches new data, discarding old $2.part data"
            rm "$2.part"
        fi
    else 
        for i in "${findarray[@]}"
        do
            if echo "$1 $i" | sha256sum --check --status --ignore-missing -; then
                echo "Similarly named file found at $i with matching checksum!"
                if ! [ -f "$3" ]; then
                    echo "Renaming old data $i to $3"
                    mv "$i" "$3"
                elif echo "$1  $3" | sha256sum --check --status --ignore-missing -; then
                    echo "Checksum matched, but $3 also matches checkum. Deleting old $i"
                    rm "$i"
                fi
            fi
        done
    fi
    sleep 1
    exit 1
else
    #If the new data doesn't exist and the checksum matches, rename it
    if ! [ -f "$3" ]; then
        echo "Checksum matched! Renaming $2 to $3" >&1
        mv "$2" "$3"
    #If the new data exists and the new data matches the checksum, delete the old data
    elif echo "$1  $3" | sha256sum --check --status --ignore-missing -; then
        echo "Checksum matched, but $3 also matches checkum. Deleting old $2."
        rm "$2"
    fi
    sleep 1
    exit 0
fi

The reason is that kemonoparty has, for a reason not known to me, changed filenames, so I can no longer trust the filename parameter passed to bash script to search for a pre-existing file and I need to recursively search based on other parameters.

I am now running my script through 504 already downloaded, incorrectly named, galleries, and many are detecting checksums and correctly renaming the file... however for larger files, such as .zip files.... it's downloading them.

If the prepare post processor really does execute in this order as stated above:

    build path
    run "prepare" post processors
    check archive
    check filesystem

The prepare post processor should have created satisfactory conditions for the "check filesystem" check for skipping, but it's just not...

biggestsonicfan commented 2 years ago

Okay, this postprocessor isn't working, as it's yeeting files into the abyss. The files "exist", and can be opened directly if you use the path, but ls and file explorers won't report them as being a file in their respective folder. Rebooting into Windows and running chkdisk seems to be the only solution.

Can you run a python script instead of command line, and also can the postprocessor actually finish before gallery-dl resumes? I thought that's what "async": false, but it seems to be random when it actually does wait or not.

EDIT: Time I learned find does not play well with exec EDIT2: I've re-written my script from the ground up, but it's still yeeting files into the void and only chkdisk can allow the index to see them once again.

biggestsonicfan commented 1 year ago

Alright, I rewrote the pre-postprocessor in python and it still was corrupting my index of my NTFS drive, and because I am on linux, I was using the newer Paragon ntfs3 driver and that apparently was not happy with the access and index rewriting. I changed the driver back to ntfs-3g and was able to remove around 260 Gigs of duplicate files.

biggestsonicfan commented 9 months ago

Most likely resolved with https://github.com/mikf/gallery-dl/commit/43d0c49d7ec3605a125f99719887d3a5fdb3c276 per https://github.com/mikf/gallery-dl/issues/2842#issuecomment-1823195141