mikf / gallery-dl

Command-line program to download image galleries and collections from several image hosting sites
GNU General Public License v2.0

Avoid downloading unnecessary metadata files from kemono #2246

Closed: lodx-xd closed this issue 2 years ago

lodx-xd commented 2 years ago

I am using the options below to save kemono posts that have external links on them.

"postprocessors": [{
    "name": "metadata",
    "event": "post",
    "filename": "{date:%Y-%m-%d}_{id}_{title}.json",
    "mode": "custom",
    "format": "{content}\n{embed[url]:?/\n/}",
    "directory": "metadata"
}]

My problem is the absurd number of files I end up with. Since the archive function does not archive metadata, every time I run my list of kemono galleries from a .txt file I end up with 20k+ .json files again and again. It would not be such a problem if only the posts with a URL were saved (that would generate just a few files), but it downloads everything, even the files that end up empty because of "format": "{content}{embed[url]:?/\n/}". I have no problem extracting the URLs with PowerShell; I just want to avoid creating 20k+ files every time I run my current list of galleries. Is there any way I can avoid or reduce this?

Sorry if this falls into "The XY Problem", but I have thought of only one solution in case this can't be avoided: use the metadata post processor only on artists that I know use external links.

However, I have no idea how to create artist-specific configurations in config.json (if that is even possible) or how to pass the metadata option I use on the CLI (if that is even possible). Other than that, I think I would need to run these galleries separately from the list in the .txt file, because I guess it is not possible to include additional command-line options beyond the URL in a list file, correct?

mikf commented 2 years ago

Each post processor can have a filter option (same as --filter) that determines whether it runs or not. In this case, you can let it check if there is an embed link:

    "filter": "embed",

You can use the same mechanism to only run the post processor for specific users:

    "filter": "user in ('123', '234', '345') and embed",

or you can specify a post processor under a specific name, as in the example below, and run gallery-dl with -P NAME:

{
    "extractor": {
        "#": ""
    },
    "postprocessor": {
        "embeds": {
            "name": "metadata",
            "event": "post",
            "filename": "{date:%Y-%m-%d}_{id}_{title}.json",
            "mode": "custom",
            "format": "{content}\n{embed[url]:?/\n/}",
            "filter": "embed",
            "directory": "metadata"
        }
    }
}
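
With that config in place, the named post processor can be enabled on demand. A minimal sketch (list.txt is a placeholder for your URL list, and the user URL reuses the ID from the examples later in this thread; -i/--input-file reads URLs from a file):

    # run only the "embeds" post processor for a single gallery
    gallery-dl -P embeds "https://kemono.party/patreon/user/562488"

    # or for the whole URL list
    gallery-dl -P embeds -i list.txt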
lodx-xd commented 2 years ago

Thanks for the answer, this is exactly what I need, but I'm having a problem; maybe you can help me further.

The filter for the users works just fine; now I can download metadata only for the galleries I want. But "filter": "embed", which should give me the best result, doesn't seem to catch all posts with URLs.

Example:

- Posts with this format (NSFW) https://kemono.party/patreon/user/562488/post/8349916: the filter downloads the metadata just fine.
- But posts with this format (NSFW) https://kemono.party/patreon/user/562488/post/61475292: the filter will not download the metadata.

Unfortunately, I couldn't figure out on my own what I should do here.

EDIT: Something I realized just now.

I just need to add content to the filter together with embed and it works as expected. I thought I was filtering for URLs inside the content, but I was fundamentally wrong. Just to confirm before I close the issue: is it possible to filter only content that contains URLs, or is that not possible?

mikf commented 2 years ago

You also meant URLs in content? I somehow assumed you only meant the URLs in {embed[url]}, which is why my examples only filtered by embed.

Checking for URLs in content should be possible with something like:

    'http://' in content or 'https://' in content

To combine the two:

    embed or 'http://' in content or 'https://' in content
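
Placed into the post processor configuration shown earlier, the combined expression would look something like this (same "filter" option as above, just with the longer expression):

    "filter": "embed or 'http://' in content or 'https://' in content",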

lodx-xd commented 2 years ago

Thank you so much, this is exactly what I was missing!

AlttiRi commented 2 years ago

Keep in mind that a URL may not be prefixed with https:// or http://. It's more reliable to save the content of all posts, then just concatenate all the files into one and visually scan it (with a search tool) for links.


I store each entry in a formatted HTML file, then I concatenate them with this Git Bash for Windows command:

cat *.html > "$TEMP/_temp-catahtml-result.html"; start "$TEMP/_temp-catahtml-result.html"; sleep 0; exit;

It creates a temporary HTML file and opens it in a browser.

It's also convenient to add an alias for this command to your .bashrc file:

alias catahtml='cat *.html > "$TEMP/_temp-catahtml-result.html"; start "$TEMP/_temp-catahtml-result.html"; sleep 0; exit;'

With it, you can just type catahtml in a directory to run the command.

AlttiRi commented 2 years ago

Also, since it's an HTML file opened in a browser, you can parse it for links with a JS script run in the browser's console.
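
A minimal sketch of such a console snippet (an illustration, not code from this thread); it collects both <a href> links and bare http(s) URLs found in the page text:

    // run in the browser console on the concatenated HTML page
    const links = new Set(
        [...document.querySelectorAll("a[href]")].map(a => a.href)
    );
    // also pick up plain-text URLs that are not wrapped in <a> tags
    for (const match of document.body.innerText.matchAll(/https?:\/\/[^\s"'<>]+/g)) {
        links.add(match[0]);
    }
    console.log([...links].join("\n"));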