mikf / gallery-dl

Command-line program to download image galleries and collections from several image hosting sites
GNU General Public License v2.0

[Feature Request] KemonoParty: Preventing duplicates with revisions #6096

Open tezrilet opened 2 months ago

tezrilet commented 2 months ago

I'm currently downloading all post revisions and organizing them with the following directory structure:

{username}/{service}/[{id}] ({date:%Y-%m-%d}) {title[:100]}

However, both the post's date and title could change between revisions. I can send example URLs privately for both situations, if needed. This creates a duplicate folder, and could potentially eat up space with all the content being redownloaded. I have also tried using {published[:10]} instead of date, but in later revisions it can be null, so it duplicates using "None". Though, that still wouldn't address title changes.

If there currently isn't a way to solve this (aside from obviously not using date/title), could we get some extra options to use with the format strings, such as {earliest_revision_date}, {latest_revision_date}, {earliest_revision_title}, and {latest_revision_title}?
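To make the problem concrete, here is a minimal Python sketch. It does not use gallery-dl's own formatter: `build_dir`, `stable_dir`, the sample revision dictionaries, and the `revision_index` ordering field used here are assumptions for illustration, and the directory pattern above is only mimicked with plain string formatting.

```python
from datetime import datetime

# Hypothetical metadata for two revisions of the same post (made-up values).
revisions = [
    {"username": "artist", "service": "patreon", "id": "123",
     "date": datetime(2024, 1, 5), "title": "Sketch dump",
     "revision_index": 1},
    {"username": "artist", "service": "patreon", "id": "123",
     "date": datetime(2024, 2, 9), "title": "Sketch dump (updated)",
     "revision_index": 2},
]


def build_dir(meta, date=None, title=None):
    """Mimic '{username}/{service}/[{id}] ({date:%Y-%m-%d}) {title[:100]}'."""
    date = date or meta["date"]
    title = title if title is not None else meta["title"]
    return "{}/{}/[{}] ({:%Y-%m-%d}) {}".format(
        meta["username"], meta["service"], meta["id"], date, title[:100])


def stable_dir(meta, all_revisions):
    """Use the date and title of the earliest revision for every revision."""
    first = min(all_revisions, key=lambda r: r["revision_index"])
    return build_dir(meta, date=first["date"], title=first["title"])


# One folder per revision with the current pattern ...
per_revision = {build_dir(r) for r in revisions}
# ... versus a single folder when the earliest values are reused.
stable = {stable_dir(r, revisions) for r in revisions}
print(len(per_revision), len(stable))
```

The second set collapses to one path because every revision borrows its date and title from the earliest one, which is roughly what `{earliest_revision_date}` and `{earliest_revision_title}` would provide.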

a84r7a3rga76fg commented 2 months ago

Remove the title from the file name and download only the post's unique files. Save the post's metadata from Kemono and use it to sort the files with symbolic or hard links, so that no storage space is wasted. Trying to sort files from Kemono without wasting space will only lead to frustration. In the URL below, replace creator_id and post_id with the actual IDs of the creator and the post.

https://kemono.su/api/v1/service/user/creator_id/post/post_id
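The linking idea described above could be sketched roughly as follows in Python. This is only an illustration: `link_into`, the folder names, and the file names are hypothetical, and it assumes the store and the sorted tree live on the same filesystem, since hard links cannot cross filesystems.

```python
import os
import tempfile


def link_into(store_path, target_dir, name):
    """Hard-link an already downloaded file into a human-readable folder."""
    os.makedirs(target_dir, exist_ok=True)
    target = os.path.join(target_dir, name)
    if not os.path.exists(target):
        # Both names point at the same data; no file content is copied.
        os.link(store_path, target)
    return target


# Demo in a temporary directory with made-up names.
root = tempfile.mkdtemp()
store = os.path.join(root, "store")
os.makedirs(store)

# Hypothetical file saved under a hash-based name.
src = os.path.join(store, "abc123.png")
with open(src, "wb") as fp:
    fp.write(b"image bytes")

# Expose the same file under a folder named after the post's metadata.
dst = link_into(src, os.path.join(root, "sorted", "[123] Sketch dump"),
                "page1.png")
print(os.path.samefile(src, dst))
```

Because the sorted tree contains only links, it can be deleted and rebuilt from the saved metadata whenever a title or date changes, without touching the downloaded files.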

"archive-format": "{subcategory}_{user}_{id}_{hash}",
"archive": "~/gallery-dl/archives/kemono/{subcategory} kemono {user}.sqlite",
"directory": ["{subcategory} kemono {user}", "{date!s:.10} {id}"],
"filename": "{hash}.{extension}"

tezrilet commented 2 months ago

Thanks, but

(aside from obviously not using date/title)

Yes, I'm already saving unique files using their hash. However, your suggestion still uses date, so it'll still create duplicates. I appreciate the suggestion, though! It's just that it'd be nice to navigate things with a file browser and search while using meaningful paths with a title. I don't mind if it has to make an extra request to get the earliest revision date/title, since I'm already grabbing them all anyway.

Hrxn commented 2 months ago

The suggestion above does not use date in the archive-format, though?

tezrilet commented 2 months ago

I was referring to the directory option ("directory": ["{subcategory} kemono {user}", "{date!s:.10} {id}"],), but the problem still occurs because the date changes between revisions. I tried it, and while it does prevent duplicating the files, I still end up with multiple folders. Ideally, I want to use a fixed value for the date and title, such as the earliest ones available from the first revision. I can provide a list of URLs privately if needed.

tezrilet commented 1 week ago

@mikf Bumping since an admin announced that Kemono is shutting down on November 22nd. Since this issue never got a label, is it considered a won't do/out of scope?

mikf commented 1 week ago

You might as well consider this "won't fix" then, as there is a good chance the next release will be after 2024.11.22.

If there currently isn't a way to solve this (aside from obviously not using date/title), could we get some extra options to use with the format strings, such as {earliest_revision_date}, {latest_revision_date}, {earliest_revision_title}, and {latest_revision_title}?

Each revision has 4 metadata fields:

The earliest revision entry has a revision_index of 1. The latest revision entry has revision_index == revision_count, and its revision_id is 0.

Using conditional file/directory names, you could do something like

        "directory": {
            "revision_index == 1"             : ["{username}", "{service}", "[{id}]", "earliest revision: {title}"],
            "revision_index == revision_count": ["{username}", "{service}", "[{id}]", "latest revision: {title}"],
            "":                                 ["{username}", "{service}", "[{id}]", "{revision_id}: {title}"]
        }
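
As a rough illustration of how such a conditional mapping selects a directory, here is a small Python model. It is not gallery-dl's implementation: `pick_directory`, `directory_for`, and the sample metadata are hypothetical, conditions are simply evaluated in order as Python expressions, and the segments are filled in with plain `str.format`.

```python
conditions = {
    "revision_index == 1":
        ["{username}", "{service}", "[{id}]", "earliest revision: {title}"],
    "revision_index == revision_count":
        ["{username}", "{service}", "[{id}]", "latest revision: {title}"],
    "":
        ["{username}", "{service}", "[{id}]", "{revision_id}: {title}"],
}


def pick_directory(conditions, meta):
    """Return the first format list whose condition holds; "" is the fallback."""
    for expr, fmt in conditions.items():
        # Evaluate the condition with the metadata fields as variables.
        if expr and eval(expr, {"__builtins__": {}}, dict(meta)):
            return fmt
    return conditions[""]


def directory_for(meta):
    """Fill the selected format list with the revision's metadata."""
    return [seg.format(**meta) for seg in pick_directory(conditions, meta)]


# Made-up metadata for a post with three revisions.
base = {"username": "artist", "service": "patreon", "id": "123",
        "revision_count": 3}
first = directory_for({**base, "revision_index": 1, "revision_id": 111,
                       "title": "Old title"})
middle = directory_for({**base, "revision_index": 2, "revision_id": 222,
                        "title": "Mid title"})
latest = directory_for({**base, "revision_index": 3, "revision_id": 0,
                        "title": "New title"})
print(first[-1], middle[-1], latest[-1], sep="\n")
```

In this model, all revisions share the stable `[{id}]` folder and differ only in the last path segment. A post with a single revision matches the first condition, so it ends up under "earliest revision".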

tezrilet commented 1 week ago

Thanks for the suggestion, though that still creates duplicate files. I think what I may end up having to do is to only use the post ID as the folder name, then write a script to rename them properly after downloading.
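
Such a rename script might look like the following Python sketch. Everything here is an assumption for illustration: `rename_post_dirs` is a hypothetical helper, folders are assumed to be named after the bare post ID, and the `names` mapping (post ID to final folder name) is assumed to have been built beforehand, for example from saved metadata.

```python
import os
import tempfile


def rename_post_dirs(parent, names):
    """Rename '<id>' folders inside `parent` using `names` (id -> new name)."""
    renamed = []
    for entry in sorted(os.listdir(parent)):
        src = os.path.join(parent, entry)
        new_name = names.get(entry)
        if not os.path.isdir(src) or new_name is None:
            continue  # not a post folder, or no name known for this ID
        dst = os.path.join(parent, new_name)
        if os.path.exists(dst):
            continue  # never overwrite an existing folder
        os.rename(src, dst)
        renamed.append(new_name)
    return renamed


# Demo with made-up post IDs in a temporary directory.
parent = tempfile.mkdtemp()
os.mkdir(os.path.join(parent, "123"))
os.mkdir(os.path.join(parent, "456"))

names = {"123": "[123] (2024-01-05) Sketch dump"}
result = rename_post_dirs(parent, names)
print(result, sorted(os.listdir(parent)))
```

Folders without an entry in the mapping are left untouched, so the script can be rerun safely after more metadata has been collected.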

mikf commented 1 week ago

though that still creates duplicate files

It won't if you use an archive with {hash} as archive-format, as suggested by https://github.com/mikf/gallery-dl/issues/6096#issuecomment-2313905785
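
The effect of a hash-based archive key can be modelled with a few lines of Python. This is a simplified sketch rather than gallery-dl's actual archive code: the `Archive` class, the `archive_key` helper, and the sample entries are hypothetical, and only the key pattern is taken from the archive-format shown earlier in this thread.

```python
import sqlite3


class Archive:
    """Tiny stand-in for a download archive: one table of unique keys."""

    def __init__(self, path=":memory:"):
        self.con = sqlite3.connect(path)
        self.con.execute(
            "CREATE TABLE IF NOT EXISTS archive (entry TEXT PRIMARY KEY)")

    def add_if_new(self, entry):
        """Return True if the key was new (download), False if seen before."""
        cur = self.con.execute(
            "INSERT OR IGNORE INTO archive (entry) VALUES (?)", (entry,))
        self.con.commit()
        return cur.rowcount == 1


def archive_key(meta):
    # Same pattern as "archive-format": "{subcategory}_{user}_{id}_{hash}"
    return "{subcategory}_{user}_{id}_{hash}".format(**meta)


# Made-up files: revision 2 repeats one file of revision 1 and adds another.
files = [
    {"subcategory": "patreon", "user": "42", "id": "123", "hash": "aaa"},
    {"subcategory": "patreon", "user": "42", "id": "123", "hash": "aaa"},
    {"subcategory": "patreon", "user": "42", "id": "123", "hash": "bbb"},
]

archive = Archive()
decisions = [archive.add_if_new(archive_key(f)) for f in files]
print(decisions)
```

Since the key contains neither the date, the title, nor a revision ID, a file that reappears unchanged in a later revision maps to the same key and is skipped, regardless of which folder that revision would be written to.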

tezrilet commented 3 days ago

I might try that instead, now that I think about it. The majority of content shouldn't have that many revisions, so fixing the few duplicate folders might be easier.

Regarding the comment I just left, https://github.com/mikf/gallery-dl/issues/6415#issuecomment-2453554698, would this still be a viable feature to add?