mikf / gallery-dl

Command-line program to download image galleries and collections from several image hosting sites
GNU General Public License v2.0

deduplicating #6099

Closed: yggdrasil75 closed 1 week ago

yggdrasil75 commented 2 weeks ago

I am trying to download images from various boorus to train a LoRA, and I want to make sure that I don't get duplicates. I am using "filename": "{md5}.{extension}" but I am getting "none" instead of an md5. I tried checksummd5 as well, but still get none. How do I prevent duplicates between multiple boorus using a consistent format without getting "none"?

Hrxn commented 2 weeks ago

{md5} should be right, given that these values are provided by the booru. You can always check with gallery-dl -K YourURL though.

"none" means there's probably something wrong with your config here.

Assuming that all your targeted boorus actually provide MD5 sums, and that they are actually correct between different boorus, you should try using settings like these for the booru sites:

            "archive-prefix": "",
            "archive-format": "{md5|hash}",
            "archive": "~/gallery-dl/archives/single_archive_all_boorus.db",

That is, use a single archive file for all targeted boorus, and use the same archive format setting, here first trying md5, and if not available, then hash, should the case arise that different boorus use different metadata names here.
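To make the "first md5, then hash" fallback concrete, here is a minimal Python sketch of how such a `|`-separated list of field names could be resolved against a metadata dictionary. It is an illustration only, not gallery-dl's actual formatter: the function names `resolve_alternatives` and `lookup` are made up for this example, and the sketch assumes that missing, None, and empty values are all skipped.

```python
import re


def lookup(name, metadata):
    """Look up a plain key ("md5") or one level of nesting ("file[md5]")."""
    match = re.fullmatch(r"(\w+)(?:\[(\w+)\])?", name)
    if not match:
        return None
    outer, inner = match.groups()
    value = metadata.get(outer)
    if inner is not None:
        # Only descend if the outer value is itself a dictionary.
        value = value.get(inner) if isinstance(value, dict) else None
    return value


def resolve_alternatives(spec, metadata):
    """Return the first usable value among '|'-separated field names.

    Assumption for this sketch: missing, None, and empty values are skipped.
    """
    for name in spec.split("|"):
        value = lookup(name, metadata)
        if value:
            return value
    return None
```

With this sketch, a site that only exposes `hash` still produces an archive key, because the lookup falls through from `md5` to `hash`.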

The other, maybe even simpler and more robust solution (but hey, if this is just for training data, do some misses really matter?):

First download, and then deduplicate on your drive.
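One way to deduplicate on disk is to hash every downloaded file and group files by digest. The following is a rough sketch using only the Python standard library; `file_md5` and `find_duplicates` are names invented for this example. It only catches byte-identical files, so a re-encoded or resized copy of the same picture will not be detected.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Map each digest to the files under root that share it.

    Only digests shared by more than one file are returned.
    """
    groups = {}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            groups.setdefault(file_md5(path), []).append(path)
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}
```

From the returned groups you could keep the first path of each list and delete or move the rest, depending on how cautious you want to be.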

yggdrasil75 commented 2 weeks ago

Dedup on my own may be what I go with. I am also trying to merge the resulting tags, so that if an image has only 5 tags on one booru and 50 on another, I get 55 tags instead of just 5 (or just 50, whichever is last, I guess), unless there is a way to have the metadata field be appended to rather than replaced. I tried md5|hash in the archive prefix field and didn't get a difference.
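Merging tags across boorus can be done after the download as a separate step. The sketch below assumes you already have one metadata record per downloaded post, each with an "md5" field and a "tags" field that is either a list or a space-separated string. Both field names and the function name `merge_tags` are assumptions for illustration, since the actual metadata layout differs between sites.

```python
def merge_tags(records):
    """Union the tags of all records that share an MD5.

    Assumes each record has "md5" and "tags" keys; "tags" may be a list
    or a space-separated string (hypothetical layout for this sketch).
    """
    merged = {}
    for record in records:
        tags = record["tags"]
        if isinstance(tags, str):
            tags = tags.split()
        merged.setdefault(record["md5"], set()).update(tags)
    # Sort so the output is stable regardless of input order.
    return {md5: sorted(tags) for md5, tags in merged.items()}
```

Because this is a set union, a tag present on both sites is counted once, so 5 tags plus 50 tags only gives 55 when the two lists do not overlap.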

Hrxn commented 2 weeks ago

Whoops, sorry, I've made a mistake in my comment above (now corrected). The value stored in the archive is determined by "archive-format", while at the same time "archive-prefix" needs to be turned off (set to an empty string) for this "all sites in a single archive" thing to work.

mikf commented 2 weeks ago

I am using "filename": "{md5}.{extension}" but I am getting "none"

Nearly all *booru sites provide an {md5} value. The one exception I can think of is e621, where this value can be accessed as {file[md5]} or {filename}.

{md5|file[md5]|filename}

yggdrasil75 commented 2 weeks ago

is there a way to manually generate a hash upon user request? Whether that hashes the image or the file, it might allow more consistency. I know it will slow things down if you have good download speed but a slow CPU, but hashing the image content without the metadata might let me find the same image automatically and prevent duplicates between a source site and a booru, especially if a site provides an md5 that may be incorrect for whatever reason (database corruption or bad practices or whatever).

a84r7a3rga76fg commented 2 weeks ago

especially if a site provides an md5 that may be incorrect for whatever reason

Happens a lot on Danbooru, Gelbooru and all of the other Boorus that I've tested on.

mikf commented 1 week ago

is there a way to manually generate a hash upon user request?

https://github.com/mikf/gallery-dl/commit/ae9b0da75539c2e649e53d9308c40f87196a3e9f

"postprocessors": ["hash"] or -P hash will generate MD5 and SHA1 hash digests for downloaded files, but you can also use more sophisticated options for other hashes etc.
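For checking a site-reported MD5 against the file you actually received, independent of gallery-dl, a small standard-library sketch could look like this. It is not the post processor's implementation; `file_digests` and `matches_reported_md5` are hypothetical helper names, and the sketch simply computes MD5 and SHA1 in a single pass over the file.

```python
import hashlib


def file_digests(path, algorithms=("md5", "sha1"), chunk_size=1 << 20):
    """Compute several hex digests of a file in one pass."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


def matches_reported_md5(path, reported_md5):
    """Return True if the file's real MD5 equals the value a site reported."""
    return file_digests(path, ("md5",))["md5"] == reported_md5.lower()
```

Files for which `matches_reported_md5` returns False are the ones where the site's metadata cannot be trusted as a deduplication key.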