mikf / gallery-dl

Command-line program to download image galleries and collections from several image hosting sites
GNU General Public License v2.0

Kemono.party Patreon posts always contain duplicate images #1667

Closed: ghost closed this issue 3 years ago

ghost commented 3 years ago

I have noticed an inconsistency in kemono.party. In short, I cannot seem to find a way to configure kemono.party to download non-duplicate pictures from Patreon posts even though my configuration works with other data sources like SubscribeStar. I believe this is due to the way kemono.party displays images from these two sites.

Example post (NSFW but no nudity): https://kemono.party/patreon/user/2909939/post/48126953

My config file:

    "kemonoparty":
    {
        <cookie data>,
        "filename": "{id}-{num}.{extension}"
    },

Attempting to download the example post with this config gets me two files with the same image and file size, but different names: 48126953-1.png and 48126953-2.png. For a SubscribeStar post, I would only get 48126953-1.png, which is fine for my organization needs.

I tried looking at the keywords for filenames to find something that would help, but there does not seem to be anything there that could help.

I also tried configuring a postprocessor option to compare images once they've been downloaded, but that has two problems:

Skyofflad commented 3 years ago

You can use "image-filter": "type != 'file'" to skip the duplicate header file and download only the attachments. But beware: due to a site bug(?), some posts only have the header.

Hrxn commented 3 years ago

Yeah, I mean this definitely seems like an issue with the site. The best would be to bring this up there, so it can get fixed.

mikf commented 3 years ago

The main issue is that the main file and the first attachment of any Patreon post refer to the same file. Before v1.18.0 this was "solved" by effectively using {filename}.{extension} without {num} as the default filename format, so that those two identical files had the same filename and the second one got skipped. That didn't work for other services like Fanbox, where it would skip files even though they weren't identical, so the default filenames got changed because downloading duplicates is still better than outright missing files.

It should be possible to distinguish between Patreon and everything else with conditional filenames to use the {filename} field there instead of {num}:

    "filename": {
        "service == 'patreon'": "{id}-{filename}.{extension}",
        ""                    : "{id}-{num}.{extension}"
    }

Or with image-filter like Skyofflad suggests, although I'd only apply it for patreon: "image-filter": "service != 'patreon' or type != 'file'"
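
To make the conditional form easier to reason about, here is a rough Python illustration of the idea. It is not gallery-dl's actual code: `pick_format` and `build_name` are hypothetical helpers, and the sketch assumes that conditions are tried in the order they are written and that the empty key acts as the fallback.

```python
# Conditional filename table, copied from the config snippet in this thread.
FORMATS = {
    "service == 'patreon'": "{id}-{filename}.{extension}",
    "": "{id}-{num}.{extension}",
}


def pick_format(conditions, metadata):
    """Return the first format string whose condition holds (hypothetical helper)."""
    for condition, fmt in conditions.items():
        # An empty condition is treated as the fallback entry.
        # eval() is used for illustration only, on a trusted config.
        if not condition or eval(condition, {"__builtins__": {}}, dict(metadata)):
            return fmt
    raise KeyError("no matching filename format")


def build_name(metadata):
    """Fill the selected format string with the file's metadata."""
    return pick_format(FORMATS, metadata).format(**metadata)


patreon_file = {"service": "patreon", "id": 48126953, "num": 2,
                "filename": "splat_1", "extension": "png"}
other_file = dict(patreon_file, service="subscribestar", num=1)

print(build_name(patreon_file))  # uses {filename} for Patreon
print(build_name(other_file))    # uses {num} for everything else
```

Under these assumptions the fallback entry has to come last, since an empty condition matches every file.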

> Yeah, I mean this definitely seems like an issue with the site. The best would be to bring this up there, so it can get fixed.

The devs are already working on a solution (https://desuarchive.org/g/thread/82346276/#q82366219), although just having file hashes or some way to detect duplicates other than unreliable filenames in API responses would be very handy (@kemono-bugs)

ghost commented 3 years ago

> The main issue is that the main file and the first attachment of any Patreon post refer to the same file.

But they're not quite the same file, because one uses spaces and the other uses underlines. I tried downloading the example post with the filename config block you provided, and it has the same problem: two identical images, one named 48126953-splat 1 and the other 48126953-splat_1. The same was true of the image-filter block.

> The devs are already working on a solution (https://desuarchive.org/g/thread/82346276/#q82366219), although just having file hashes or some way to detect duplicates other than unreliable filenames in API responses would be very handy (@kemono-bugs)

Well that's good to hear, and it saves me the trouble of contacting them directly. Thanks for your help. I suppose I'll just wait for a fix and get used to cleaning out duplicates when I'm downloading from Patreon.
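
For cleaning out duplicates after a download, comparing file contents is more reliable than comparing names. The following is a minimal sketch, assuming all files of interest sit directly in one directory; `file_digest` and `find_duplicates` are hypothetical helper names and not part of gallery-dl.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(directory):
    """Return files whose content matches an earlier file (in name order)."""
    seen = {}
    duplicates = []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        digest = file_digest(path)
        if digest in seen:
            # Same bytes as a file already seen: report it as a duplicate.
            duplicates.append(path)
        else:
            seen[digest] = path
    return duplicates
```

The sketch only reports duplicates. Reviewing the returned list before calling `path.unlink()` on each entry keeps the cleanup reversible.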

mikf commented 3 years ago

> because one uses spaces and the other uses underlines

You could use {filename:R /_/} to replace those spaces with underlines to make them match. Or the path-restrict option.
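
As a plain-Python illustration of why the replacement helps (a sketch of the idea only, not gallery-dl's formatter; `normalize` and `target_names` are hypothetical names): once spaces become underscores, both names map to the same target, so the second file can be recognized as already present.

```python
def normalize(filename):
    """Replace spaces with underscores, as {filename:R /_/} is meant to do."""
    return filename.replace(" ", "_")


def target_names(post_id, files):
    """Yield (target, is_new) for each (filename, extension) pair of a post."""
    seen = set()
    for filename, extension in files:
        target = f"{post_id}-{normalize(filename)}.{extension}"
        # is_new is False when an earlier file already produced this name.
        yield target, target not in seen
        seen.add(target)


# The two names reported for the example post in this thread.
files = [("splat 1", "png"), ("splat_1", "png")]
for target, is_new in target_names(48126953, files):
    print(target, "download" if is_new else "skip")
```

Matching on names alone can also skip files that differ in content but happen to share a name, so this only fits services where equal names reliably mean equal files.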

rautamiekka commented 3 years ago

I think it's more reasonable to change the Patreon extractor itself to convert spaces to underscores.

^ Better yet if the extractor can tell those files apart before starting to download the file, so that the downloading doesn't have to be aborted before moving to the next one.

ghost commented 3 years ago

> You could use {filename:R /_/} to replace those spaces with underlines to make them match. Or the path-restrict option.

That works when it's the only formatting in the filename value, but I get an error when I try to use different behaviour depending on the service: [kemonoparty][error] FilenameFormatError: Applying filename format string failed (TypeError: expected str, got dict).

I think the issue is that the filename format string is being processed as a dictionary object in Python. I'm not sure what caused the issue, since I'm just copying the formatting from earlier in this thread.

    "filename": {
        "service == 'patreon'": "{id}-{filename:R /_/}.{extension}",
        ""                    : "{id}-{num}.{extension}"
    }

Doofy420 commented 3 years ago

I also came across posts (lots of them) where the header file and a completely different attachment shared the same filename, so I guess that's also an issue, especially for people who prefer the {id}+{filename} format.

Sample 1 (NSFW): https://kemono.party/patreon/user/10215607/post/47941313
Sample 2 (NSFW): https://kemono.party/patreon/user/10215607/post/49961587

mikf commented 3 years ago

@jlazarskiparkin9815 filename being a dict is only supported since version 1.18.0 and raises a FilenameFormatError/TypeError in all prior versions.

ghost commented 3 years ago

> @jlazarskiparkin9815 filename being a dict is only supported since version 1.18.0 and raises a FilenameFormatError/TypeError in all prior versions.

Ah, that did it. I upgraded to 1.18.x and now the problem is solved when I use this filename format:

    "filename": {
        "service == 'patreon'": "{id}-{filename:R /_/}.{extension}",
        ""                    : "{id}-{num}.{extension}"
    }

Testing the example from the OP gives me one file with the format id-num, which is exactly what I want. It continues to behave the same way it always has for other services like SubscribeStar. Thanks.