Closed ghost closed 3 years ago
You can use "image-filter": "type != 'file'" to skip the duplicate header file and download only the attachments.
But beware - due to a site bug(?) some posts only have the header.
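To make the caveat concrete, here is a rough sketch of what such a filter does. It assumes the filter string is evaluated as a Python expression with each file's metadata fields available as variables. The `passes_filter` and `select` helpers, the `attachment` type value and the sample filenames are made up for illustration; this is not the tool's actual code.

```python
def passes_filter(expression, metadata):
    """Evaluate a filter expression with the metadata fields as variables.

    Illustration only: eval() on a trusted, hard-coded expression.
    """
    return bool(eval(expression, {"__builtins__": {}}, dict(metadata)))


IMAGE_FILTER = "type != 'file'"

# Hypothetical metadata for a post with a header file plus one attachment.
normal_post = [
    {"type": "file", "filename": "header_copy"},
    {"type": "attachment", "filename": "picture_1"},
]

# Hypothetical metadata for a post that only has the header file.
header_only_post = [
    {"type": "file", "filename": "cover"},
]


def select(files):
    """Return only the files that pass IMAGE_FILTER."""
    return [f for f in files if passes_filter(IMAGE_FILTER, f)]


print([f["filename"] for f in select(normal_post)])
print([f["filename"] for f in select(header_only_post)])
```

With this filter the header-only post yields nothing at all, which is the downside mentioned above.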
Yeah, I mean this definitely seems like an issue with the site. The best would be to bring this up there, so it can get fixed.
The main issue is that the main file and the first attachment of any Patreon post refer to the same file. Before v1.18.0 this was "solved" by effectively using {filename}.{extension} without {num} as the default filename format, so that those two identical files had the same filename and the second one got skipped. That didn't work for other services like Fanbox, where it would skip files even though they weren't identical, so the default filenames were changed: downloading duplicates is still better than outright missing files.
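A small simulation of that trade-off, as I understand it. It assumes a file is skipped whenever its target name already exists. The `download_all` helper and the sample file lists are hypothetical, and the real logic is certainly more involved.

```python
def download_all(files, filename_format):
    """Simulate downloads that skip a file if its target name already exists."""
    written = {}   # target filename -> content
    skipped = []   # target filenames that were skipped as "already there"
    for num, f in enumerate(files, 1):
        name = filename_format.format(num=num, **f)
        if name in written:
            skipped.append(name)
            continue
        written[name] = f["content"]
    return written, skipped


# Hypothetical Patreon post: header and first attachment are the same image.
patreon = [
    {"filename": "splat_1", "extension": "png", "content": b"A"},
    {"filename": "splat_1", "extension": "png", "content": b"A"},
]

# Hypothetical Fanbox post: same name, but different images.
fanbox = [
    {"filename": "image", "extension": "png", "content": b"A"},
    {"filename": "image", "extension": "png", "content": b"B"},
]

for label, files in (("patreon", patreon), ("fanbox", fanbox)):
    for fmt in ("{filename}.{extension}", "{num}.{extension}"):
        written, skipped = download_all(files, fmt)
        print(label, fmt, "written:", sorted(written), "skipped:", skipped)
```

Without {num}, the Patreon duplicate is dropped but the second Fanbox image is lost as well; with {num}, nothing is lost but the Patreon duplicate comes back.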
It should be possible to distinguish between Patreon and everything else with conditional filenames, using the {filename} field for Patreon instead of {num}:
"filename": {
"service == 'patreon'": "{id}-{filename}.{extension}",
"" : "{id}-{num}.{extension}"
}
Or with image-filter, like Skyofflad suggests, although I'd only apply it for Patreon:
"image-filter": "service != 'patreon' or type != 'file'"
Yeah, I mean this definitely seems like an issue with the site. The best would be to bring this up there, so it can get fixed.
The devs are already working on a solution (https://desuarchive.org/g/thread/82346276/#q82366219), although just having file hashes or some way to detect duplicates other than unreliable filenames in API responses would be very handy (@kemono-bugs)
The main issue is that the main file and the first attachment of any Patreon post refers to the same file.
But they're not quite the same file, because one uses spaces and the other uses underlines. I tried downloading the example post with the filename config block you provided, and it has the same problem: two identical images, one named 48126953-splat 1 and the other 48126953-splat_1. The same was true of the image-filter block.
The devs are already working on a solution (https://desuarchive.org/g/thread/82346276/#q82366219), although just having file hashes or some way to detect duplicates other than unreliable filenames in API responses would be very handy (@kemono-bugs)
Well that's good to hear, and it saves me the trouble of contacting them directly. Thanks for your help. I suppose I'll just wait for a fix and get used to cleaning out duplicates when I'm downloading from Patreon.
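For the cleanup step, comparing content hashes is more reliable than comparing filenames or file sizes. Below is a minimal sketch that groups the files in one directory by SHA-256 digest. The `file_digest` and `find_duplicates` helper names are my own, and it only reports duplicates, leaving the decision of what to delete to you.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(directory):
    """Group regular files in `directory` by content; return groups of 2+."""
    groups = defaultdict(list)
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            groups[file_digest(path)].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]
```

Calling `find_duplicates("some/download/dir")` returns a list of lists of `Path` objects, where each inner list holds files with identical content.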
because one uses spaces and the other uses underlines
You could use {filename:R /_/} to replace those spaces with underlines to make them match. Or the path-restrict option.
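In plain Python terms, the replacement idea boils down to the following. This is only a sketch of the effect, not how the format spec is implemented; the `normalize` and `target_name` helpers are hypothetical, and the two names are the ones reported above.

```python
def normalize(filename):
    """Replace spaces with underscores so both name variants collapse to one."""
    return filename.replace(" ", "_")


def target_name(post_id, filename, extension):
    """Build an '{id}-{filename}.{extension}' style name from normalized parts."""
    return f"{post_id}-{normalize(filename)}.{extension}"


names = ["splat 1", "splat_1"]
targets = {target_name("48126953", name, "png") for name in names}
print(targets)
```

Both variants end up with the same target name, so the second download can be recognized as already existing and skipped.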
I think it's more reasonable to change the Patreon extractor itself to convert spaces to underscores.
^ Better yet if the extractor can tell those files apart before starting to download the file, so that the downloading doesn't have to be aborted before moving to the next one.
You could use {filename:R /_/} to replace those spaces with underlines to make them match. Or the path-restrict option.
That works when it's the only formatting in the filename value, but I get an error when I try to use different behaviour depending on the service:
[kemonoparty][error] FilenameFormatError: Applying filename format string failed (TypeError: expected str, got dict)
I think the issue is that the filename format string is being processed as a dictionary object in Python. I'm not sure what caused the issue, since I'm just copying the formatting from earlier in this thread.
"filename": {
"service == 'patreon'": "{id}-{filename:R /_/}.{extension}",
"" : "{id}-{num}.{extension}"
}
I also came across posts (lots of them) where the header file and a completely different attachment shared the same filename, so I guess that's also an issue, especially for people who prefer the {id}+{filename} format. Sample 1 (nsfw): https://kemono.party/patreon/user/10215607/post/47941313 Sample 2 (nsfw): https://kemono.party/patreon/user/10215607/post/49961587
@jlazarskiparkin9815 filename being a dict is only supported since version 1.18.0 and raises a FilenameFormatError/TypeError in all prior versions.
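If you want to check this before editing a config, a version string can be compared numerically. A quick sketch, with helper names of my own choosing; it assumes plain dotted versions and treats a suffix such as "-dev" the same as the release it belongs to, which is a simplification.

```python
def parse_version(version):
    """Turn '1.18.0' (or '1.18.0-dev') into the tuple (1, 18, 0)."""
    numeric = version.split("-", 1)[0]  # drop any '-dev' style suffix
    return tuple(int(part) for part in numeric.split("."))


def supports_conditional_filenames(version):
    """True if the version is 1.18.0 or later, per the statement above."""
    return parse_version(version) >= (1, 18, 0)


for v in ("1.17.5", "1.18.0", "1.18.1"):
    print(v, supports_conditional_filenames(v))
```

Comparing tuples of integers avoids the trap of comparing version strings directly, where "1.9.2" would sort after "1.18.0".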
Ah, that did it. I upgraded to 1.18.x and now the problem is solved when I use this filename format:
"filename": {
"service == 'patreon'": "{id}-{filename:R /_/}.{extension}",
"" : "{id}-{num}.{extension}"
}
Testing the example from the OP gives me one file with the format id-num, which is exactly what I want. It continues to behave the same way it always has for other services like SubscribeStar. Thanks.
I have noticed an inconsistency in kemono.party. In short, I cannot seem to find a way to configure kemono.party to download non-duplicate pictures from Patreon posts even though my configuration works with other data sources like SubscribeStar. I believe this is due to the way kemono.party displays images from these two sites.
Example post (NSFW but no nudity): https://kemono.party/patreon/user/2909939/post/48126953 My config file:
Attempting to download the example post with this config gets me two files with the same image and file size, but different names: 48126953-1.png and 48126953-2.png. For a SubscribeStar post, I would only get 48126953-1.png, which is fine for my organization needs.
I tried looking at the keywords for filenames to find something that would help, but there does not seem to be anything there that could help.
I also tried configuring a postprocessor option to compare images once they've been downloaded, but that has two problems:
compare.shallow postprocessor option to work properly. I had thought that I could use that option to compare the filesizes as a pseudo-checksum, but either I configured it wrong or it didn't work.