mikf / gallery-dl

Command-line program to download image galleries and collections from several image hosting sites
GNU General Public License v2.0

Avoid downloading unnecessary metadata files from kemono #2246

Closed: lodx-xd closed this issue 2 years ago

lodx-xd commented 2 years ago

I am using the options below to save kemono posts that have external links on them.

"postprocessors": [{
    "name": "metadata",
    "event": "post",
    "filename": "{date:%Y-%m-%d}_{id}_{title}.json",
    "mode": "custom",
    "format": "{content}\n{embed[url]:?/\n/}",
    "directory": "metadata"
}]

My problem is the absurd number of files I end up with. Since the archive function does not archive metadata, every time I run my list of kemono galleries from a .txt file I end up with 20k+ .json files again and again. It would not be such a problem if only the posts with a URL were saved (that would generate just a few files), but it downloads everything, even the files that end up empty because of "format": "{content}{embed[url]:?/\n/}". I have no problem extracting the URLs with PowerShell; I just want to avoid creating 20k+ files every time I run my current list of galleries. Is there any way I can avoid or reduce this?

Sorry if this falls into "The XY Problem", but I have thought of only one solution in case this can't be avoided: use the metadata post processor only on artists that I know use external links.

However, I have no idea how to create artist-specific configurations in config.json (if that is even possible) or how to pass the metadata option I use on the CLI (if that is even possible). Other than that, I think I would need to run these galleries separately from the list in the .txt file, because I guess it is not possible to include additional command-line options beyond the URL in a list file, correct?

mikf commented 2 years ago

Each post processor can have a filter option (same as --filter) that determines whether it runs or not. In this case, you can let it check if there is an embed link:

    "filter": "embed",

You can use the same mechanism to only run the post processor for specific users:

    "filter": "user in ('123', '234', '345') and embed",

or you can specify a post processor under a specific name, as in the example below, and run gallery-dl with -P NAME:

{
    "extractor": {
        "#": ""
    },
    "postprocessor": {
        "embeds": {
            "name": "metadata",
            "event": "post",
            "filename": "{date:%Y-%m-%d}_{id}_{title}.json",
            "mode": "custom",
            "format": "{content}\n{embed[url]:?/\n/}",
            "filter": "embed",
            "directory": "metadata"
        }
    }
}
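
With that config in place, the named post processor can be enabled on demand. A minimal sketch (list.txt is a placeholder for your URL list, and the user URL reuses the ID from the examples later in this thread; -i/--input-file reads URLs from a file):

    # run only the "embeds" post processor for a single gallery
    gallery-dl -P embeds "https://kemono.party/patreon/user/562488"

    # or for the whole URL list
    gallery-dl -P embeds -i list.txt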
lodx-xd commented 2 years ago

Thanks for the answer, this is exactly what I need, but I'm having a problem; maybe you can help me further.

The filter for the users works just fine; now I can download metadata only for the galleries I want. But "filter": "embed", which should give me the best result, doesn't seem to catch all posts with URLs.

Example:

- Posts with this format (NSFW) https://kemono.party/patreon/user/562488/post/8349916: the filter downloads the metadata just fine.
- But posts with this format (NSFW) https://kemono.party/patreon/user/562488/post/61475292: the filter will not download the metadata.

Unfortunately, I couldn't figure out on my own what I should do here.

EDIT: Something I realized just now.

I just need to add content to the filter together with embed and it works as expected. I thought I was filtering for URLs inside the content, but I was fundamentally wrong. Just to confirm before I close the issue: is it possible to filter only content that contains URLs, or is that not possible?

mikf commented 2 years ago

You also meant URLs in content? I somehow assumed you only meant the URLs in {embed[url]}, which is why my examples only filtered by embed.

Checking for URLs in content should be possible with something like:

    'http://' in content or 'https://' in content

To combine the two:

    embed or 'http://' in content or 'https://' in content
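
Placed into the post processor configuration shown earlier, the combined expression would look something like this (same "filter" option as above, just with the longer expression):

    "filter": "embed or 'http://' in content or 'https://' in content",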

lodx-xd commented 2 years ago

Thank you so much, this is exactly what I was missing!

AlttiRi commented 2 years ago

Keep in mind that a URL may not be prefixed with https:// or http://. It's more reliable to save the content of all posts, then just concatenate all the files into one and visually scan it (with a search tool) for links.


I store each entry in a formatted HTML file, then I concatenate them with this Git Bash for Windows command:

cat *.html > "$TEMP/_temp-catahtml-result.html"; start "$TEMP/_temp-catahtml-result.html"; sleep 0; exit;

It creates a temporary HTML file and opens it in a browser.

It's also convenient to add an alias for this command to your .bashrc file:

alias catahtml='cat *.html > "$TEMP/_temp-catahtml-result.html"; start "$TEMP/_temp-catahtml-result.html"; sleep 0; exit;'

With it, you can just type catahtml in a directory to run the command.

AlttiRi commented 2 years ago

Also, since it's an HTML file opened in a browser, you can parse it for links with a JS script run in the browser's console.
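
A minimal sketch of such a console snippet (an illustration, not code from this thread); it collects both <a href> links and bare http(s) URLs found in the page text:

    // run in the browser console on the concatenated HTML page
    const links = new Set(
        [...document.querySelectorAll("a[href]")].map(a => a.href)
    );
    // also pick up plain-text URLs that are not wrapped in <a> tags
    for (const match of document.body.innerText.matchAll(/https?:\/\/[^\s"'<>]+/g)) {
        links.add(match[0]);
    }
    console.log([...links].join("\n"));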