mikf / gallery-dl

Command-line program to download image galleries and collections from several image hosting sites
GNU General Public License v2.0

[kemono.party - Patreon] Inconsistencies downloading main files vs attachments #1899

Closed ghost closed 1 year ago

ghost commented 3 years ago

[This might look like a wall of text, but I don't think it's actually that much information. Thanks in advance.]

I am attempting to download some files from kemono.party, but the behaviour of the downloader seems inconsistent depending on whether the target post has its content uploaded as files or attachments, and which ones are duplicates (because of course that's still a problem on kemono.party). I am using gallery-dl 1.18.4-dev.

Target URLs [no nudity, but NSFW]:

https://kemono.party/patreon/user/4577256/post/53549884 (no content, 5 attachments, file 1 is duplicated)
https://kemono.party/patreon/user/4577256/post/52864412 (no content, 3 attachments, file 2 is duplicated)
https://kemono.party/patreon/user/4577256/post/50117542 (2 inline files in content, no attachments)

It might be worth noting that link 2 doesn't have any images listed under "content" on the page, but if you look at the image URLs you can see that the first image is under hostname/files/etc and the others are hostname/attachments/etc

The JSON for my gallery-dl config file:

"kemonoparty":
{
    <cookie data>
    "filename": {
        "service == 'patreon'": "{id}-{filename:R /_/}.{extension}",
        ""            : "{id}-{num}.{extension}"
    },
    "image-filter": "extension != 'psd'"
}

I have configured it this way to force all Patreon attachment filenames to use underscores instead of spaces, which protects against duplicate files with slightly different filenames. It has worked for me for several months.
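For readers unfamiliar with the `R /_/` part of that format string, its effect can be sketched in a few lines of Python. This only approximates the replacement as described above and is not gallery-dl's own code; the function name and the sample filenames are invented for illustration.

```python
def normalize_filename(filename: str) -> str:
    """Replace every space with an underscore (the assumed effect of 'R /_/')."""
    return filename.replace(" ", "_")


# Two hypothetical uploads of the same attachment, differing only in spacing.
names = ["sketch page 01", "sketch_page_01"]

# After normalization both map to the same name, so the second download
# would be treated as an already-existing file instead of a new one.
unique = {normalize_filename(name) for name in names}
print(unique)
```

Because both names collapse to one, the usual "skip existing files" behaviour is what actually removes the duplicate; the replacement only makes the names comparable.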

When using this config, I downloaded all images except for animation 1 from link 2, and there were no duplicates, but because of the filenames the order of each picture was jumbled. I tried to change the JSON to download everything and put them in the correct order:

"kemonoparty":
{
    <cookie data>
    "filename": {
        "service == 'patreon'": "{id}-{num}.{extension}",
        ""            : "{id}-{num}.{extension}"
    },
    "image-filter": "extension != 'psd'"
}

This config improved the filenames to be in order, but it didn't download the missing picture from the first config and it downloaded the duplicate animation from link 2.

I tried to see what keywords/filters I could use in the filename by running gallery-dl -K [link 2], but that did not seem to help: according to gallery-dl, the num (index) of each picture in that link starts at 1 with the duplicate animations. Even when I remove the distinction between Patreon and other services (or remove the filename block entirely), gallery-dl does not download the first animation.

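In summary:

What is causing the first animation in link 2 not to be downloaded?
Does gallery-dl distinguish between inline content, "files" content, and "attachment" content when downloading from a Patreon service on kemono.party?
Have I simply configured something wrong?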

For reference, here is the command and verbose output when using the second config.

gallery-dl --verbose --dest . -o directory=[] -i targets.txt
[gallery-dl][debug] Version 1.18.4-dev
[gallery-dl][debug] Python 3.8.5 - Windows-7-6.1.7601-SP1
[gallery-dl][debug] requests 2.24.0 - urllib3 1.25.10
[1/3] https://kemono.party/patreon/user/4577256/post/53549884
[gallery-dl][debug] Starting DownloadJob for 'https://kemono.party/patreon/user/4577256/post/53549884'
[kemonoparty][debug] Using KemonopartyPostExtractor for 'https://kemono.party/patreon/user/4577256/post/53549884'
[urllib3.connectionpool][debug] Starting new HTTPS connection (1): kemono.party:443
[urllib3.connectionpool][debug] https://kemono.party:443 "GET /api/patreon/user/4577256/post/53549884 HTTP/1.1" 200 None
# .\53549884-1.png
# .\53549884-2.png
# .\53549884-3.png
# .\53549884-4.png
[2/3] https://kemono.party/patreon/user/4577256/post/52864412
[gallery-dl][debug] Starting DownloadJob for 'https://kemono.party/patreon/user/4577256/post/52864412'
[kemonoparty][debug] Using KemonopartyPostExtractor for 'https://kemono.party/patreon/user/4577256/post/52864412'
[urllib3.connectionpool][debug] Starting new HTTPS connection (1): kemono.party:443
[urllib3.connectionpool][debug] https://kemono.party:443 "GET /api/patreon/user/4577256/post/52864412 HTTP/1.1" 200 None
# .\52864412-1.gif
# .\52864412-2.gif
[3/3] https://kemono.party/patreon/user/4577256/post/50117542
[gallery-dl][debug] Starting DownloadJob for 'https://kemono.party/patreon/user/4577256/post/50117542'
[kemonoparty][debug] Using KemonopartyPostExtractor for 'https://kemono.party/patreon/user/4577256/post/50117542'
[urllib3.connectionpool][debug] Starting new HTTPS connection (1): kemono.party:443
[urllib3.connectionpool][debug] https://kemono.party:443 "GET /api/patreon/user/4577256/post/50117542 HTTP/1.1" 200 None
# .\50117542-1.png
# .\50117542-2.png

mikf commented 3 years ago

What is causing the first animation in link 2 not to be downloaded?

The patreon-skip-file option. (#1689, 48647480) In all patreon posts on kemono that I've seen until now, it was always the main file that was a duplicate of another attachment file, but that doesn't always seem to hold true. (#1751)

Does gallery-dl distinguish between inline content, "files" content, and "attachment" content when downloading from a Patreon service on kemono.party?

There's a type metadata field that is either "file", "attachment", or "inline".

Have I simply configured something wrong?

You haven't. It's just that every attempt at fixing this "duplicate files for patreon posts" issue has failed so far, including the current "ignore main file if there are attachments" approach.
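To make the "ignore main file if there are attachments" heuristic concrete, here is a minimal Python sketch of the idea as described in this thread. It is not gallery-dl's actual implementation: the function name, the dictionary layout, and the sample file names are assumptions, and only the "type" values ("file", "attachment", "inline") come from the discussion above.

```python
from typing import Dict, List


def skip_main_file(files: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Drop entries of type 'file' when the post also has attachments."""
    has_attachments = any(f["type"] == "attachment" for f in files)
    if not has_attachments:
        # Nothing to deduplicate against, so keep everything.
        return list(files)
    return [f for f in files if f["type"] != "file"]


# Hypothetical post shaped like link 2: the main file is unique,
# while the two attachments are copies of each other.
post = [
    {"type": "file", "name": "animation1.gif"},
    {"type": "attachment", "name": "animation2.gif"},
    {"type": "attachment", "name": "animation2_copy.gif"},
]

kept = skip_main_file(post)
print([f["name"] for f in kept])
```

With this input the unique main file is dropped and both duplicate attachments are kept, which is the behaviour reported for link 2. The heuristic only works when the main file really is the duplicate.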

TestPolygon commented 3 years ago

BTW, for new files, the SHA-256 hash taken from the URL can be used to determine whether two files are really the same or merely have the same name.

ghost commented 3 years ago

Ah, I see. Thanks for clearing that up. I suppose I'll just have to download everything and manually remove duplicates, then.

The patreon-skip-file option. (#1689, 4864748) In all patreon posts on kemono that I've seen until now, it was always the main file that was a duplicate of another attachment file, but that doesn't always seem to hold true. (#1751)

Yeah. I think I made the issue that led to that option being included, actually. Heh.

Does gallery-dl distinguish between inline content, "files" content, and "attachment" content when downloading from a Patreon service on kemono.party?

There's a type metadata field that is either "file", "attachment", or "inline".

That's good to know. There may be something I use that for.

Have I simply configured something wrong?

You haven't. It's just that every attempt at fixing this "duplicate files for patreon posts" issue has failed so far, including the current "ignore main file if there are attachments" approach.

Well, for what it's worth, the "ignore main file if there are attachments" approach does filter out the vast, vast majority of duplicates and it's mostly solved kemono's data duplication. I just seem to have found an artist or a post that happens to store data differently.

BTW, for new files, the SHA-256 hash taken from the URL can be used to determine whether two files are really the same or merely have the same name.

Is there a download comparison option in gallery-dl that does that? I've looked through some of the comparison options in the config documentation but I don't remember seeing something like that.

TestPolygon commented 3 years ago

It's the new URL format introduced 4 days ago. Currently not all files use it.
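As a rough illustration of that idea, the following Python sketch deduplicates a list of URLs by a SHA-256 hash embedded in them. It is not a gallery-dl option. The exact URL layout is an assumption (any run of exactly 64 hexadecimal characters is treated as the hash), and the example.org URLs and function names are made up.

```python
import re
from typing import List, Optional

# Assumption: a new-style URL contains the file's SHA-256 as 64 hex characters.
SHA256_RE = re.compile(r"(?<![0-9a-fA-F])[0-9a-fA-F]{64}(?![0-9a-fA-F])")


def hash_from_url(url: str) -> Optional[str]:
    """Return the embedded SHA-256 hash in lowercase, or None for old-style URLs."""
    match = SHA256_RE.search(url)
    return match.group(0).lower() if match else None


def dedupe(urls: List[str]) -> List[str]:
    """Keep the first URL per hash; URLs without a hash are always kept."""
    seen = set()
    result = []
    for url in urls:
        digest = hash_from_url(url)
        if digest is not None:
            if digest in seen:
                continue
            seen.add(digest)
        result.append(url)
    return result


fake_hash = "ab" * 32  # placeholder, not a real digest
urls = [
    f"https://example.org/data/ab/ab/{fake_hash}.png",
    f"https://example.org/data/ab/ab/{fake_hash}.png?f=other_name.png",
    "https://example.org/data/files/old_name.png",
]
print(dedupe(urls))
```

Old-style URLs carry no hash, so they pass through untouched and would still need a name-based or content-based comparison.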

skyvory commented 3 years ago

There are some cases where the images aren't posted in the 'files' area but in the 'content' area, and the downloader skips those. The images aren't links, just inline.

mikf commented 3 years ago

@TestPolygon

Currently not all files use it.

And they still do not, even more than a week later. Maybe these changes only got applied to patreon posts.

$ gallery-dl -g https://kemono.party/gumroad/user/trylsc/post/IURjT
https://kemono.party/data/files/gumroad/trylsc/IURjT/reward8.jpg
https://kemono.party/data/attachments/gumroad/trylsc/IURjT/$3.zip

@skyvory inline images are supposed to be supported, unless the URLs in newer posts got changed and aren't picked up by gallery-dl.

$ gallery-dl -g https://kemono.party/fanbox/user/7356311/post/802343
https://kemono.party/data/inline/fanbox/uaozO4Yga6ydkGIJFAQDixfE.jpeg

ghost commented 3 years ago

@mikf For the particular artist that I wanted to download, another factor may be that the inline images are links to an outside source (Imgur) instead of being direct uploads to Kemono. I'm not exactly sure how Patreon allows creators to upload images to posts, but if we look at https://kemono.party/patreon/user/4577256/post/53013824, and right click > view image/open image in new tab, we stay on kemono.party.

For my artist, if you look at https://kemono.party/patreon/user/4577256/post/53013824 (mostly SFW, some minor nudity) and right click > view image/open image in new tab, you are redirected to an Imgur page.

I'm not sure if this is something gallery-dl accounts for when crawling kemono patreon posts. From some minor testing, it doesn't seem to recognize that these embedded/inline images are even there.

In any event, the workaround that I'm using now is simple but somewhat tedious using JDownloader 2:

valdearg commented 3 years ago

Not sure if this is the best place, apologies. But I noticed with this URL that the main attachment 404s but the inline image isn't available to download:

https://kemono.party/patreon/user/7453087/post/33060907

Not too sure how that differs from the one posted earlier, which does come through as an inline post. Most likely because it has both a file and an inline image?

https://kemono.party/fanbox/user/7356311/post/802343

mikf commented 3 years ago

@valdearg fixed in https://github.com/mikf/gallery-dl/commit/db857b40d8e813926db44d00f4c95ea4544812b8. The inline image URL there started with https://kemono.party/ instead of the expected /inline.
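The general problem described here, an inline image source that may be either a site-relative path or a full URL, can be handled in one step with the standard library. The sketch below is only an illustration of that idea and does not reproduce the linked commit; the function name and the sample paths are hypothetical.

```python
from urllib.parse import urljoin

ROOT = "https://kemono.party"


def absolute_inline_url(src: str) -> str:
    """Resolve an inline image source against the site root.

    Relative paths get the root prepended; absolute URLs are returned unchanged.
    """
    return urljoin(ROOT, src)


# Hypothetical sources: one relative, one already absolute.
print(absolute_inline_url("/inline/fanbox/example.jpeg"))
print(absolute_inline_url("https://kemono.party/inline/fanbox/example.jpeg"))
```

Both calls resolve to the same absolute URL, so later code does not need to care which form the post used.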

valdearg commented 3 years ago

You're amazing! Thanks, that's got it!