mikf / gallery-dl

Command-line program to download image galleries and collections from several image hosting sites
GNU General Public License v2.0

[kemonoparty] patreon-skip-file=false behavior change #1991

Closed: God-damnit-all closed this issue 2 years ago

God-damnit-all commented 2 years ago

Most often (but not always) when the main file of a patreon post on kemono is different from the attachment of the same name, it's because it's a thumbnail.

I use compare enumerate, so having the 'correct' file for the post download first instead of last would be better. In other words, I'd like the attachments for kemonoparty's patreon posts to download first, and then, if patreon-skip-file=false, the file.

If you feel the change in behavior might mess up other people's setups, maybe it could be a new setting called "after"?

mikf commented 2 years ago

Patreon got a files option quite recently, which lets you set the order in which certain categories of files get downloaded. (8d676151) I could add something like that for kemono.

Regarding 'patreon-skip-file', I should remove this option altogether and use the file hashes that are now present in all(?) Patreon URLs to filter out duplicates. This option never 100% worked to begin with.
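As a rough illustration of the hash-based filtering idea, here is a minimal Python sketch. It is not gallery-dl's implementation: the `filter_duplicates` function and the `name`/`hash` dictionary keys are hypothetical, and the hash is assumed to have already been extracted from the URL. Files without a known hash are passed through unchanged.

```python
def filter_duplicates(files):
    """Yield files in order, skipping any whose hash was already seen.

    `files` is an iterable of dicts with hypothetical keys "name" and
    "hash"; a hash of None means the URL did not carry one.
    """
    seen = set()
    for f in files:
        digest = f.get("hash")
        if digest is None:
            # No hash available: cannot detect duplicates, keep the file.
            yield f
            continue
        if digest in seen:
            # Same content was already yielded under another name.
            continue
        seen.add(digest)
        yield f
```

With this approach the first file carrying a given hash wins, so the download order decides which filename survives.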

God-damnit-all commented 2 years ago

> Patreon got a files option quite recently, which lets you set the order in which certain categories of files get downloaded. (8d67615) I could add something like that for kemono.
>
> Regarding 'patreon-skip-file', I should remove this option altogether and use the file hashes that are now present in all(?) Patreon URLs to filter out duplicates. This option never 100% worked to begin with.

The file hashing option sounds great. Maybe it could include an option to append a portion of the hash to conflicting metadata filenames (before the extension)? Like the last 7 similar to what GitHub does.

So if more than one file in a metadata filename key is 1.png, but they both have different hashes, you'd end up with 1_abcdefg.png and 1_gfedcba.png, for instance.
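The renaming could look roughly like the following Python sketch. The `disambiguate` function and its `(name, hash)` input pairs are hypothetical and only illustrate the suggestion; nothing here is an existing gallery-dl option.

```python
import os
from collections import Counter


def disambiguate(files, length=7):
    """Append the last `length` hash characters to conflicting names.

    `files` is a list of (filename, hash) pairs. Names that occur only
    once, or that have no hash, are returned unchanged.
    """
    counts = Counter(name for name, _ in files)
    result = []
    for name, digest in files:
        if counts[name] > 1 and digest:
            stem, ext = os.path.splitext(name)
            # Insert the hash fragment before the extension.
            name = f"{stem}_{digest[-length:]}{ext}"
        result.append(name)
    return result
```

Two entries with the same name and the same hash would still collide here, which is acceptable if such files are treated as duplicates and filtered out beforehand.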

Skyofflad commented 2 years ago

> Regarding 'patreon-skip-file', I should remove this option altogether and use the file hashes that are now present in all(?) Patreon URLs to filter out duplicates.

Sadly, not all of them. For example, in this post the last URL does not include a hash.

God-damnit-all commented 2 years ago

There could be an option to download and manually hash check files that don't have reported hashes.
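A manual check of that kind might be sketched in Python as follows. The thread does not say which hash algorithm the URLs use, so SHA-256 is an assumption here, and `sha256_of` is a hypothetical helper name.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA-256 digest of the file at `path`.

    The file is read in chunks so large downloads do not have to fit
    in memory. SHA-256 is assumed; swap in another hashlib algorithm
    if the reported hashes use a different one.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as fp:
        for chunk in iter(lambda: fp.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The resulting digest could then be compared against the hashes of files that do report one, at the cost of having to download the file first.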

Additionally, the order attachments are processed should be changed. I notice a lot of the duplicates are files that end in _\d. This is what I would do:

If a filename regex matches (.*)_\d, check if Capture Group 1 matches any other filename - this includes, if enabled, the main file (with its spaces considered to be underscores for the purposes of this check). If the match returns true, move it to a new queue that processes after everything else. Then, once these checks are done, sort this new queue lexicographically.
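To make the proposal concrete, here is a minimal Python sketch of that reordering. It assumes the regex is applied to the filename without its extension, which the description above does not specify, and the `reorder` function name is hypothetical.

```python
import os
import re

# A name ending in "_" plus one digit; group 1 is everything before it.
SUFFIX = re.compile(r"(.*)_\d")


def reorder(attachments, main_file=None):
    """Move likely "_N" duplicates to the end, sorted lexicographically."""
    def stem(name):
        return os.path.splitext(name)[0]

    stems = [stem(name) for name in attachments]
    known = set(stems)
    if main_file is not None:
        # Spaces in the main file's name count as underscores.
        known.add(stem(main_file).replace(" ", "_"))

    primary, deferred = [], []
    for name, s in zip(attachments, stems):
        match = SUFFIX.fullmatch(s)
        # Group 1 is always shorter than the stem itself, so a hit in
        # `known` necessarily refers to some other filename.
        if match and match.group(1) in known:
            deferred.append(name)
        else:
            primary.append(name)

    deferred.sort()
    return primary + deferred
```

For example, `reorder(["pic_1.png", "pic.png", "other_2.png", "pic_0.png"])` keeps `pic.png` and `other_2.png` in place and moves `pic_0.png` and `pic_1.png` to the end, since only those two have a matching base name.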

Doofy420 commented 2 years ago

From what I've observed, in case of hash dupes, it's always(?) the 1st image that has the incorrect file name, so I think those should be skipped instead of attachments or post files. The way it currently is, not only do you get incorrect filenames, but weirdly ordered files as well (img > zip > img again).

Not sure if this warrants a new issue tho.

God-damnit-all commented 2 years ago

> From what I've observed, in case of hash dupes, it's always(?) the 1st image that has the incorrect file name, so I think those should be skipped instead of attachments or post files. The way it currently is, not only do you get incorrect filenames, but weirdly ordered files as well (img > zip > img again).
>
> Not sure if this warrants a new issue tho.

The first image, by default, is the postfile. Try this: -o "extractor.kemonoparty.files=attachments,inline,file"

Doofy420 commented 2 years ago

Thanks, that works. I just placed inline as the 3rd, seems to work better for posts like this one, or maybe it's just me (I use {id}_{num}...) https://kemono.party/patreon/user/4052716/post/36397649

God-damnit-all commented 2 years ago

> > From what I've observed, in case of hash dupes, it's always(?) the 1st image that has the incorrect file name, so I think those should be skipped instead of attachments or post files. The way it currently is, not only do you get incorrect filenames, but weirdly ordered files as well (img > zip > img again).
> >
> > Not sure if this warrants a new issue tho.
>
> The first image, by default, is the postfile. Try this: -o "extractor.kemonoparty.files=attachments,inline,postfile"

Due to https://github.com/mikf/gallery-dl/commit/9bc83af3a6ef3ab803dde6bcaf2e58a91db3b613 it's now -o "extractor.kemonoparty.files=attachments,inline,file"

mikf commented 2 years ago

postfile got renamed to just file for kemono to be consistent with the type values (https://github.com/mikf/gallery-dl/commit/9bc83af3a6ef3ab803dde6bcaf2e58a91db3b613) so it would be attachments,file,inline. Maybe the defaults should be changed to that? It's similar on Patreon. (images,attachments,postfile,content)

God-damnit-all commented 2 years ago

> postfile got renamed to just file for kemono to be consistent with the type values (9bc83af) so it would be attachments,file,inline. Maybe the defaults should be changed to that? It's similar on Patreon. (images,attachments,postfile,content)

attachments,file,inline definitely works the best for every scraped service that isn't Patreon. For Patreon I would go with attachments,inline,file, but the difference is minor.
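For readers unfamiliar with what such an order string does, the effect can be sketched in Python. This is only an illustration, not gallery-dl's code: `order_files` and the `category` key are hypothetical, and the sketch assumes that categories missing from the order are skipped.

```python
def order_files(files, order=("attachments", "file", "inline")):
    """Return files grouped by category in the configured order.

    `files` is a list of dicts with a hypothetical "category" key.
    Files whose category is not listed in `order` are dropped, which
    is an assumption of this sketch.
    """
    rank = {category: i for i, category in enumerate(order)}
    selected = [f for f in files if f["category"] in rank]
    # sorted() is stable, so files keep their relative order within
    # each category.
    return sorted(selected, key=lambda f: rank[f["category"]])
```

With the default order shown, attachments come first, so a same-named thumbnail in the `file` category is processed after the full-size attachment.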