mikf / gallery-dl

Command-line program to download image galleries and collections from several image hosting sites
GNU General Public License v2.0

[Question] Creating a List of Tags to Ignore/Use for Booru sites #2446

Closed ShiroyukiX closed 2 years ago

ShiroyukiX commented 2 years ago

(Please excuse my inexperience with github. It's my first time ever posting here.)

How would I go about creating a blacklist of tags for booru sites (rule34us, gelbooru, danbooru, safebooru, etc) or any that use tags/tag-like systems in organizing artwork?

I checked the Issues page and configuration doc extensively and cannot find a solution. I know you can use --filter "'TAG' not in tag" but haven't seen any mention of a case with multiple tags besides --filter "'TAG1' not in tag and 'TAG2' not in tag". I tried image-filter but my understanding of Python and JSON is nonexistent. I tend to search these sites and download by artist or character, blocking any tags I dislike through a blacklist; I can't catch every single weird tag used for the same subject so the blacklist is big. I don't have any account with these websites, if that information is relevant.

Is this even possible to begin with? I want to believe there is a way to filter the results without having to chain multiple "and" conditions in a command.

mikf commented 2 years ago

I wanted to suggest using something like

"image-filter": "not any(t in tags for t in ('tag1', 'tag2', 'tag3', 'tag4'))"

and that would theoretically work, but it doesn't due to how Python handles variable look-ups. You get a "NameError: name 'tags' is not defined" if you try, even though it is defined.
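
The NameError described above can be reproduced in plain Python. This is a minimal sketch, assuming the filter expression is evaluated with eval() and the metadata is passed as the locals mapping (an assumption about the mechanism, not a quote of gallery-dl's code); the sample metadata values are made up:

```python
expr = "not any(t in tags for t in ('tag1', 'tag2'))"
metadata = {"tags": ["tag2", "tag5"]}  # made-up sample metadata

# A generator expression runs in its own nested scope. Names it uses that
# are not its own loop variables are looked up as globals (then builtins),
# so a name supplied only through eval()'s locals mapping is not found.
try:
    eval(expr, {}, metadata)
    outcome = "evaluated"
except NameError as exc:
    outcome = "NameError: " + str(exc)

# Supplying the same names as globals makes them visible inside the generator.
as_globals = eval(expr, dict(metadata))

print(outcome)
print(as_globals)
```

A plain membership test such as 'tag1' not in tags has no nested scope, which is why the chained "and" form from the original question is not affected.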

Fixing the root cause of this is possible, but complicated. Would it be OK to have a function that does this check, e.g. something like

"image-filter": "not contains(tags, ('tag1', 'tag2', 'tag3', 'tag4'))"

because adding just that is rather easy.

Hrxn commented 2 years ago

I think this is the best solution for such cases, yes..

ShiroyukiX commented 2 years ago

I tried to use the contains() filter for both rule34us and the base gelbooru module, but the program spits out the following errors:

gelbooru: FilterError: Evaluating filter expression failed (NameError: name 'contains' is not defined)
rule34us: FilterError: Evaluating filter expression failed (NameError: name 'contains' is not defined)

This is how I have it written in the config file, which is otherwise an unchanged copy of gallery-dl-example.conf. Let me know if you need the whole document. Maybe I'm placing it in the wrong spot?

NOTE: I tried these two tags with/without the underscore as a test for this.

        "rule34us":
        {
            "image-filter": "not contains(tags, ('azur_lane', 'genshin_impact'))"
        },

        "gelbooru":
        {
            "image-filter": "not contains(tags, ('azur_lane', 'genshin_impact'))"
        },

Hrxn commented 2 years ago

No, you're simply a bit too early, this function is not added yet! 😄

github-account1111 commented 2 years ago

Is this request for excluding certain tags from e.g. filenames or just straight up not downloading the files that contain certain tags?

ShiroyukiX commented 2 years ago

This is to avoid downloading files that contain certain tags.

mikf commented 2 years ago

What you tried in https://github.com/mikf/gallery-dl/issues/2446#issuecomment-1081744386 should now work (https://github.com/mikf/gallery-dl/commit/413b77757b13bd4670028eb8a5265dd0d2a86ac9), but be aware that different boorus have different tag structures. Some use underscores, some spaces, and for some it is called tag_string instead of tags.
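
Because the field name and layout differ between sites, it can help to normalize the metadata when inspecting it by hand. The following is only a sketch: tag_list is a hypothetical helper (not part of gallery-dl), and the sample posts are made up to illustrate the layouts mentioned above.

```python
def tag_list(metadata):
    """Return a post's tags as a list, whichever field the site uses.

    Hypothetical helper for inspecting metadata by hand; not part of gallery-dl.
    """
    raw = metadata.get("tags", metadata.get("tag_string", ""))
    if isinstance(raw, str):
        # Assumes a single string holds whitespace-separated tags.
        return raw.split()
    return list(raw)


# Made-up samples illustrating the different layouts.
post_a = {"tags": "azur_lane 1girl solo"}
post_b = {"tag_string": "genshin_impact 1girl"}
post_c = {"tags": ["azur lane", "solo"]}

print(tag_list(post_a))
print(tag_list(post_b))
print(tag_list(post_c))
```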

ShiroyukiX commented 2 years ago

Had to install the latest dev version to test it out and it appears to work. On that note, for the different boorus, what command do I use to determine the tag structure before downloading? I assume -j.

Also, for boorus that use tag_string, I replace (tags, ('tag1', 'tag2')) with (tag_string, ('tag1', 'tag2')), if I'm understanding you correctly.

ShiroyukiX commented 2 years ago

I may have run into an issue, though I cannot say if it's a filtering or tag reading problem (or whatever the heck it is).

While attempting to download shinano_(azur_lane) artwork from rule34us, it for some reason only grabs 10 files. I "sync" my image-filter and site blacklist to make sure both are up-to-date, and with my current setup the search should download about 90 files; it does not. I'm certain that what appears in my search downloaded in previous attempts, so I don't believe it's a tag I filtered out.

EDIT: It does appear that gallery-dl is skipping these files. I tried downloading a specific result from what the search gave me and it appears in neither the log nor my directory. I checked the JSON with -j and none of the tags listed are filtered out.

EDIT2: Checked a new search with an artist and, again, it's downloading fewer files than my search result shows (41 / 53). Commenting out the tags fixes the result, so I'm assuming it's a tagging issue of some sort; not sure if it's only rule34us that does this.

ShiroyukiX commented 2 years ago

I may have figured out the problem, though results may vary.

Gallery-dl skips all sub-categorized versions of a tag if it's filtered with image-filter. For example, if you blacklist animal, gallery-dl will skip any image with a related tag (animal_ears, animal_humanoid, etc), even if animal is not used for the image. You have to blacklist the sub-categorized versions only to avoid the issue.

This only appears to happen with rule34us as their tagging system is janky (character tags are under general tags, for instance), but it may happen for other booru sites. Also, defining the first argument as tags_general seems to work as well.

I'm still not sure if this problem is site-specific or program-specific, though my conjecture favors the former. You may want to test it out for other sites to confirm your search result numbers match the number of files downloaded, and make sure to define your filter as mentioned in https://github.com/mikf/gallery-dl/issues/2446#issuecomment-1083328571 to catch the tags accurately.
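
The behaviour described above matches what Python's in operator does when the right-hand side is one string rather than a list of tags: it tests for a substring. A small illustration with a made-up tag string; whether a given site reports its tags as one string is an assumption to verify with -j:

```python
tags = "animal_ears cat_girl solo"  # made-up tag string

# On a string, `in` is a substring test, so "animal" matches "animal_ears".
substring_hit = "animal" in tags

# On a list of whole tags, `in` only matches complete entries.
whole_tag_hit = "animal" in tags.split()
exact_hit = "animal_ears" in tags.split()

print(substring_hit, whole_tag_hit, exact_hit)  # True False True
```

Under that assumption, a filter written as 'animal' not in tags.split() would only reject posts that carry the exact tag.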

YooPita commented 1 year ago

I will leave this information here, since I could not find any similar topics on the Internet. Perhaps it will be useful to someone. I needed to exclude two tags at the same time on the e621 site and did not understand how. After researching the problem, I came up with the following solution for the filter:

--filter "not ('tag1' in tags['general'] and 'tag2' in tags['general'])"
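
To make the logic of this expression explicit: it rejects a post only when both tags are present. A sketch in plain Python, where keep is a hypothetical stand-in for the filter and the tags dictionaries are made-up data shaped like the expression implies:

```python
def keep(tags):
    """Mirror of the filter above: reject only when both tags are present."""
    return not ("tag1" in tags["general"] and "tag2" in tags["general"])


# Made-up sample metadata.
both = {"general": ["tag1", "tag2", "tag9"]}
only_one = {"general": ["tag1", "tag9"]}
neither = {"general": ["tag9"]}

results = [keep(both), keep(only_one), keep(neither)]
print(results)  # [False, True, True]
```

To reject a post when either tag is present, the "and" inside the parentheses would become "or".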

You can also use it in the config file gallery-dl.conf:

"image-filter": "not ('tag1' in tags['general'] and 'tag2' in tags['general'])"

or, as an additional example:

"image-filter": "not ('tag1' in tags['general'] and ('tag2' in tags['general'] or 'tag3' in tags['general']))"

rautamiekka commented 1 year ago

"image-filter": "not ('tag1' in tags['general'] and ('tag2' in tags['general'] or 'tag3' in tags['general']))"

Code like that insta-boils my piss no matter the language, instead try this untested adaptation of this StackOverflow answer:

"image-filter": "any(True for _e in ('tag1','tag2','tag3') if _e in tags['general'])"

Not sure if you need a list comprehension (any([...])) for this trick since I didn't test it; I sure hope not.

It should return immediately if a blacklisted keyword is found and, being a generator, should use minimal RAM.
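
On the list-comprehension question: any() accepts any iterable, including a generator, and stops at the first true item. A sketch in plain Python with made-up data; first_match_demo and is_listed are hypothetical names used only to record how far the check got:

```python
def first_match_demo(blacklist, general):
    """Return (found, checked) to show how far any() got before stopping."""
    checked = []  # blacklist entries that were actually examined

    def is_listed(tag):
        checked.append(tag)
        return tag in general

    # any() pulls items from the generator one at a time and stops at the
    # first true result, so no intermediate list is built.
    found = any(is_listed(t) for t in blacklist)
    return found, checked


# Made-up sample data.
found, checked = first_match_demo(("tag1", "tag2", "tag3"), ["tag9", "tag1"])
print(found, checked)  # True ['tag1']
```

Two caveats apply to the expression as posted. It evaluates to True when a listed tag is found, so a blacklist would need a leading "not". And because it uses a generator expression that refers to tags, it may run into the same NameError described earlier in this thread.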