Closed biggestsonicfan closed 9 months ago
however skip:true must not be checked after the (pre)postprocessor and before the download starts
Whether to skip a download or not gets checked after event: prepare
post processors run:
1) build path
2) run "prepare" post processors
3) check archive
4) check filesystem
Afterwards it proceeds to download, which for whatever reason failed with a 403 Forbidden error in your case. This error is completely unrelated to your post processor, by the way.
Afterwards it proceeds to download, which for whatever reason failed with a 403 Forbidden error in your case. This error is completely unrelated to your post processor, by the way.
While I understood the error was unrelated to the postprocessor, my initial confusion is why an attempt at the file was made when the filesystem check returned true
, yet an attempt was made to access the data anyway, even though all information regarding the file was already provided. After reading your response, I initiated (something I should have done earlier) a verbose ouput of gallery-dl:
gallery-dl: Version 1.22.0-dev
gallery-dl: Python 3.8.13 - Linux-5.17.7-1-default-x86_64-with-glibc2.34
gallery-dl: requests 2.27.1 - urllib3 1.25.11
gallery-dl: Starting DownloadJob for 'https://kemono.party/patreon/user/15617066'
kemonoparty: Using KemonopartyUserExtractor for 'https://kemono.party/patreon/user/15617066'
urllib3.connectionpool: Starting new HTTPS connection (1): kemono.party:443
urllib3.connectionpool: https://kemono.party:443 "GET /patreon/user/15617066 HTTP/1.1" 200 6042
kemonoparty: Sleeping for 19.99 seconds
urllib3.connectionpool: https://kemono.party:443 "GET /api/patreon/user/15617066?o=0 HTTP/1.1" 200 4952
kemonoparty: Using download archive '/run/media/XXXXX/bfd18/dl/gallery-dl/sql/kemonoparty.sqlite3'
kemonoparty: Active postprocessor modules: [MetadataPP, ExecPP]
postprocessor.exec: Running '['/run/media/XXXXX/bfd18/dl/verify_file.sh', '1a8fb2164388c738bfff94a4fb0963a54ae17c8a6e6fe6b8fd9b2bdd08b44f32', '/run/media/XXXXX/bfd18/dl/gallery-dl/kemonoparty/patreon/15617066—catcube/15617066—54118725—a—2021.07.26—2021_7.zip', '/run/media/XXXXX/bfd18/dl/gallery-dl/kemonoparty/patreon/15617066—catcube/54118725_ミドリ&モモイ(ブルーアーカイブ-Blue Archive-)_01_2021_7.zip']'
Checksum matched! Renaming /run/media/XXXXX/bfd18/dl/gallery-dl/kemonoparty/patreon/15617066—catcube/15617066—54118725—a—2021.07.26—2021_7.zip to /run/media/XXXXX/bfd18/dl/gallery-dl/kemonoparty/patreon/15617066—catcube/54118725_ミドリ&モモイ(ブルーアーカイブ-Blue Archive-)_01_2021_7.zip
urllib3.connectionpool: https://kemono.party:443 "GET /data/1a/8f/1a8fb2164388c738bfff94a4fb0963a54ae17c8a6e6fe6b8fd9b2bdd08b44f32.zip HTTP/1.1" 302 138
urllib3.connectionpool: Starting new HTTPS connection (1): data8.kemono.party:443
urllib3.connectionpool: https://data8.kemono.party:443 "GET /data/1a/8f/1a8fb2164388c738bfff94a4fb0963a54ae17c8a6e6fe6b8fd9b2bdd08b44f32.zip HTTP/1.1" 200 30496560
/run/media/XXXXX/bfd18/dl/gallery-dl/kemonoparty/patreon/15617066—catcube/54118725_ミドリ&モモイ(ブルーアーカイブ-Blue Archive-)_01_2021_7.zip
Why are the GETs and new connections being established if the file was verified to be skipped given the conditions?
EDIT: I should also add that the color behavior is odd as well. Usually, when files are skipped, the color of the font is not changed. In this case, the colors are turning that seafoam green color as if a download happened.
Is https://kemono.party/patreon/user/15617066 one of problems URLs?
It has:
Total: 285 (358)
Attachments: 285 (285)
Files: 73 (73)
Embeds: 0 (0)
Inlines: 0 (0)
With
"filename": "[{category}] {user}—{id}—{type[0]}—{date:%Y.%m.%d}—{title}—{filename[0:140]}.{extension}",
"archive-format": "{service}_{user}_{type[0]}_{id}_{filename}.{extension}"
gallery-dl downloads 285 files as expected.
285 unique files of 358 total (with duplicates). Because of gallery-dl skips dups based on the hash by default.
In fact this filename pattern is suited to download all 358 files.
Set extractor.kemonoparty.duplicates
to true
and it will download all 358 files.
The same filename within one post may have only the file and one of attachments.
So, it's enough to use just the first letter of {type}
instead of {num}
.
Show me an example, if I'm wrong.
Is https://kemono.party/patreon/user/15617066 one of problems URLs?
No, it's unrelated and just an example of the verbose output to begin the rename process for that folder. The problem URL was given in my first post. The verbose output was merely to show connections being established to the server directly to the file even though gallery-dl had all the information needed to skip those connections, unless those connections are needed somehow even when skip is set to true.
I had verified with my old configuration, files with identical names were being skipped. Reprocessing the user with the new configuration renamed the files as well as grabbing the one it previously skipped. That is no longer the issue. The issue now is the creation of connections directly to the file even after all checks have passed for gallery-dl to skip that file even when it won't download it, it still establishes a connection to it.
Fuck. There are cases when the attachments of one post can be with same name. Helpfully, it happens very rare.
A post example (NSFW): https://kemono.party/patreon/user/19911192/post/41033600
Downloading of this artist entire will have 3 unique (for the artist) images missed. (The artist has 252 unique images (265 with per post de-duplication), but currently it downloads only 262 (with this name pattern) and 372 total without de-duplication (344 are downloaded currently).)
Okay, the most trivial fix is to introduce a new key — unique_filename
, or better — filename_num
.
In order to use it at the end of {filename}
, for example, after trimming too long names {filename[0:140]}{filename_num:?_//}
.
So, it's the proper usage:
"filename": "[{category}] {user}—{id}—{type[0]}—{date:%Y.%m.%d}—{title}—{filename[0:140]}{filename_num:?_//}.{extension}",
"archive-format": "{service}_{user}_{type[0]}_{id}_{filename}{filename_num:?_//}.{extension}"
On the next run the missed images will be downloaded.
Well, but currently there are no filename_num
key.
@mikf could you add it?
I will also edit my prev comments in this case.
Here is a JavaScript "pseudo-code":
const attachments = [...]; // URLs
const nameCounts = new Map(); // here
let num = 0;
for (const attachment of attachments) {
const {filename, extension, hash} = parseUrl(attachment);
num++;
// and here
let filename_num = null; // None
const count = nameCounts.get(filename) || 0;
if (count > 0) {
filename_num = count;
}
nameCounts.set(filename, count + 1);
yield {filename, extension, hash, num, filename_num};
}
So, if some filename is appeared multiple time the associated file will have filename_num
with 1
, 2
... value. By default it should be None
. No prefix by default for the first file, and _1
, _2
in case a filename collision for the next files.
The more correct key name is attachment_filename_num
. It should count filenames per attachments, not per the post entirely.
UPD:
Also it's better to use attachment_filename_num
in my case after {type[0]}
, or {title}
(instead of after {filename}
) to prevent theoretical filename collision (if a post contains, for example: 2.png
, 2_1.png
, 2.png
attachments):
"filename": "[{category}] {user}—{id}—{type[0]}{filename_num:?-//}—{date:%Y.%m.%d}—{title}—{filename[0:140]}.{extension}",
"archive-format": "{service}_{user}_{type[0]}{filename_num:?-//}_{id}_{filename}.{extension}"
Not ideally, but it will works fine. _(if attachment_filename_num
could be introduced)_
That's all.
A post example (NSFW): https://kemono.party/patreon/user/19911192/post/41033600
Yup, that's the one in my first post as well and what lead to my discovery. I discovered it completely by accident too lol.
I have fixed it the same way as it is in my JS example above:
post["type"] = file["type"]
post["num"] += 1
post["_http_headers"] = headers
post["_a_dup_name_num"] = file.get("_attachment_duplicate_filename_num", None)
if url[0] == "/":
url = self.root + "/data" + url
def _attachments(self, post):
In config add {_a_dup_name_num:?-//}
after {type[0]}
for "filename"
and "archive-format"
.
"filename": "[{category}] {user}—{id}—{type[0]}{_a_dup_name_num:?-//}—{date:%Y.%m.%d}—{title}—{filename[0:140]}.{extension}",
"archive-format": "{service}_{user}_{type[0]}{_a_dup_name_num:?-//}_{id}_{filename}.{extension}",
With this patch it works 100 % correctly. On next run you will download the missed files (attachments with duplicate names within one post).
+3 images with "duplicates": false
, or +28 with "duplicates": true
for https://kemono.party/patreon/user/19911192
.
@mikf I'm not only one who uses {type[0]}
, so I hope you will add this fix in any form.
I just attempted to redownload a few images from a kemono gallery where some images were skipped due to time-out, but this time with the archive database skip enabled, and it absolutely zoomed through the already existing images.
Something is clearly wrong when using the filename skip.
Where are both the filename skip and sqlite skip handled? I'm having trouble locating them.
Okay, a way to absolutely confirm is something is wrong with filename skip is to do the following:
This right there shows this is obviously a problem.
I have had to modify both my "dedupe" post(pre)processor and verify_file.sh
bash script as follows:
"deupe": {
"name": "exec",
"async": "false",
"command": [
"/run/media/xxx/bfd18/dl/verify_file.sh",
"{hash}",
"{_directory}{user}—{id}—{type[0]}—{date:%Y.%m.%d}—{filename[0:140]}.{extension}",
"{_path}",
"{_directory}",
"{user}—{id}—{type[0]}—{date:%Y.%m.%d}—"
],
"event": "prepare"
#!/bin/bash
readarray -d '' findarray < <(find "$4" -name "$5*" -print0)
if ! echo "$1 $2" | sha256sum --check --status --ignore-missing -; then
#If it isn't, lets see if an old part file exists
if [ -f "$2.part" ]; then
#if the part file exists and the new file doesn't already exist then rename the part file
if ! [ -f "$3" ]; then
echo "Partial data '$2.part' exists exists moving to '$3.part'!"
mv "$2.part" "$3.part"
#But if the hash file matches the new data, we don't need the .part anymore
elif echo "$1 $3" | sha256sum --check --status --ignore-missing -; then
echo "$3 hash matches new data, discarding old $2.part data"
rm "$2.part"
fi
else
for i in "${findarray[@]}"
do
if echo "$1 $i" | sha256sum --check --status --ignore-missing -; then
echo "Similarly named file found at $i with matching checksum!"
if ! [ -f "$3" ]; then
echo "Renaming old data $i to $3"
mv "$i" "$3"
elif echo "$1 $3" | sha256sum --check --status --ignore-missing -; then
echo "Checksum matched, but $3 also matches checkum. Deleting old $i"
rm "$i"
fi
fi
done
fi
sleep 1
exit 1
else
#If the new data doesn't exist and the checksum matches, rename it
if ! [ -f "$3" ]; then
echo "Checksum matched! Renaming $2 to $3" >&1
mv "$2" "$3"
#If the new data exists and the new data matches the checksum, delete the old data
elif echo "$1 $3" | sha256sum --check --status --ignore-missing -; then
echo "Checksum matched, but $3 also matches checkum. Deleting old $2."
rm "$2"
fi
sleep 1
exit 0
fi
The reason is that kemonoparty has, for a reason not known to me, changed filenames, so I can no longer trust the filename parameter passed to bash script to search for a pre-existing file and I need to recursively search based on other parameters.
I am now running my script through 504 already downloaded, incorrectly named, galleries, and many are detecting checksums and correctly renaming the file... however for larger files, such as .zip files.... it's downloading them.
If the prepare post processor really does execute in this order as stated above:
build path
run "prepare" post processors
check archive
check filesystem
The prepare post processor should have created satisfactory conditions for the "check filesystem" check for skipping, but it's just not...
Okay, this postprocessor isn't working, as it's yeeting files into the abyss. The files "exist", and can be opened directly if you use the path, but ls
and file explorers won't report them as being a file in their respective folder. Rebooting into Windows and running chkdisk
seems to be the only solution.
Can you run a python script instead of command line, and also can the postprocessor actually finish before gallery-dl resumes? I thought that's what "async": false,
but it seems to be random when it actually does wait or not.
EDIT: Time I learned find
does not play well with exec
EDIT2: I've re-written my script from the ground up, but it's still yeeting files into the void and only chkdisk
can allow the index to see them once again.
Alright, I rewrote the pre-postprocessor in python and it still was corrupting my index of my NTFS drive, and because I am on linux, I was using the newer Paragon ntfs3 driver and that apparently was not happy with the access and index rewriting. I changed the driver back to ntfs-3g and was able to remove around 260 Gigs of duplicate files.
Greetings,
So I happened to download a lot from kemonoparty with the following filename stetting
{user}—{id}—{type[0]}—{date:%Y.%m.%d}—{filename[0:140]}.{extension}
. It wasn't until yesterday that I realize certain patreon creators areabsolutely insaneinserting files with the same filename (ex. "2.png") in a single post, which resulted in gallery-dl skipping the identical filenames (skip set to true in extractor config).In order to avoid redownloading the entire kemono folder of data I already have, what I have done is setup a preprocessor rule to occur before downloading to check if the old filename exists and equal to the hash returned by gallery-dl:
extractor config:
where verify_file.sh is as follows:
I began executing this new configuration on the artist where I first noticed the post had two different "2.png" files in it (nsfw link). However, while things seemed to be going fine, I encountered this in my output:
Apparently, the (pre)postprocessor runs correctly moving the file because the checksum matched, however
skip:true
must not be checked after the (pre)postprocessor and before the download starts, because the (now) identical file exists in it's place yet it's redownloading the data anyway.What I am doing now to get around this is to use the
--no-download
flag to just rename the files. I will still have to rerun gallery-dl to download the missed items that I did not already have downloaded.I am still wondering if I just have a config issue or if this might be an oversight in the order of operations gallery-dl uses.
EDIT: Apparently
--no-download
is still writing to the archive, so I can't create/use an archive file until I process all the renaming issues...