Closed. lodx-xd closed this issue 2 years ago
Each post processor can have a filter option (same as --filter) that determines whether it runs or not. In this case, you can let it check if there is an embed link:
"filter": "embed",
You can use the same mechanism to only run the post processor for specific users:
"filter": "user in ('123', '234', '345') and embed",
Or you can define a post processor under a specific name, as in the config below, and run gallery-dl with -P NAME:
{
    "extractor": {
        "#": ""
    },
    "postprocessor": {
        "embeds": {
            "name": "metadata",
            "event": "post",
            "filename": "{date:%Y-%m-%d}_{id}_{title}.json",
            "mode": "custom",
            "format": "{content}\n{embed[url]:?/\n/}",
            "filter": "embed",
            "directory": "metadata"
        }
    }
}
Thanks for the answer, this is exactly what I need, but I'm still having a problem, if you can help me further.
The filter for users works just fine, so now I can restrict metadata downloads to the galleries I want. But "filter": "embed", which should give me the best result, doesn't seem to catch all posts with URLs.
Example:
- For posts with this format (NSFW) https://kemono.party/patreon/user/562488/post/8349916 the filter downloads the metadata just fine.
- But for posts with this format (NSFW) https://kemono.party/patreon/user/562488/post/61475292 the filter does not download the metadata.
Unfortunately, I couldn't figure out on my own what I should do here.
EDIT: Some things that I realized just now.
I just need to add content together with the filter and it works as expected. I thought I was filtering on the URLs in the content, but I was fundamentally wrong. Just to confirm before I close the issue: is it possible to filter only posts whose content contains URLs, or is that not possible?
You also meant URLs in content? I somehow assumed you only meant the URLs in {embed[url]}, which is why my examples only filtered by embed.
Checking for URLs in content should be possible with something like
'http://' in content or 'https://' in content
To combine the two:
embed or 'http://' in content or 'https://' in content
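For anyone unsure how that combined expression behaves, here is a rough sketch in plain Python. It assumes a filter is simply a Python expression evaluated with the post's metadata fields as variable names; this is a simplified model for illustration, not gallery-dl's actual implementation, and the sample posts are made up.

```python
# Simplified model of a filter: a Python expression evaluated with the
# post's metadata fields as variable names. Illustration only, not
# gallery-dl's real code.
FILTER = "embed or 'http://' in content or 'https://' in content"


def matches(expression, metadata):
    """Return True if the expression is truthy for this metadata dict."""
    return bool(eval(expression, {"__builtins__": {}}, dict(metadata)))


# Hypothetical sample posts; all field values are invented for the example.
posts = [
    {"id": "1", "content": "no links here", "embed": {}},
    {"id": "2", "content": "see https://example.org/file", "embed": {}},
    {"id": "3", "content": "", "embed": {"url": "https://example.org/e"}},
]

# Keep the ids of the posts the filter would let through.
selected = [p["id"] for p in posts if matches(FILTER, p)]
print(selected)
```

Post 1 is skipped because its embed is empty and its content has no scheme prefix; posts 2 and 3 pass through the content check and the embed check respectively.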
Thank you so much, this is exactly what I was missing!
Keep in mind that a URL may not be prefixed with https:// or http://.
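As a rough illustration of that point, the sketch below uses a loose regular expression that accepts links with or without a scheme. The pattern and the name find_links are my own assumptions for this example; the heuristic will produce false positives (for instance "file.txt"), so treat it as a starting point rather than a reliable detector.

```python
import re

# Heuristic: an optional http(s):// scheme, a dotted host name ending in an
# alphabetic TLD, then an optional path. Deliberately loose, so expect some
# false positives such as file names.
URL_LIKE = re.compile(
    r"(?:https?://)?(?:[a-z0-9-]+\.)+[a-z]{2,}(?:/[^\s<>\"']*)?",
    re.IGNORECASE,
)


def find_links(text):
    """Return every URL-like substring found in text, in order."""
    return URL_LIKE.findall(text)


# Hypothetical post content mixing a bare link and a prefixed one.
sample = "get it at mega.nz/folder/abc or https://example.org/x"
print(find_links(sample))
```

A check like `'https://' in content` would miss the first link in the sample, which is the case the comment above warns about.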
It's more reliable to save the content of all posts, then just concatenate all files into one and visually parse it (with a search tool) for links.
I store each entry in a formatted HTML file, then I concatenate them with this Git Bash for Windows command:
cat *.html > "$TEMP/_temp-catahtml-result.html"; start "$TEMP/_temp-catahtml-result.html"; sleep 0; exit;
It creates a temporary HTML file and opens it in a browser.
It's also better to add an alias for this command to your .bashrc file:
alias catahtml='cat *.html > "$TEMP/_temp-catahtml-result.html"; start "$TEMP/_temp-catahtml-result.html"; sleep 0; exit;'
With it, just type catahtml in a directory to run the command.
Also, since it's an HTML file opened in a browser, you can extract the links with a JS script run in the browser's console.
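If you would rather not use the browser console, the same idea can be sketched in Python with the standard library's html.parser. This is an alternative to the JS approach, not the script referred to above; LinkCollector and collect_links are names made up for this example, and it only collects href values from anchor tags, not bare links in the text.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def collect_links(html_text):
    """Return all anchor hrefs found in html_text, in document order."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


# Hypothetical stand-in for the concatenated HTML file.
sample = (
    '<p>Post 1 <a href="https://example.org/a">a</a></p>'
    '<p>Post 2 <a href="mega.nz/b">b</a></p>'
)
print(collect_links(sample))
```

To run it on the real output, read the concatenated file into a string and pass that to collect_links instead of the sample.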
I am using the options below to save kemono posts that have external links on them.
My problem is the absurd number of files I end up with. Since the archive function does not cover metadata, every time I run my list of kemono galleries (stored in a .txt file) I end up with 20k+ .json files again and again. It would not be much of a problem if it only wrote the posts with URLs (that would generate just a few files), but it writes a file for every post, even ones that end up empty because of
"format": "{content}{embed[url]:?/\n/}"
I have no problem extracting the URLs using PowerShell; I would just like to avoid creating 20k+ files every time I run my current list of galleries. Is there any way I can avoid or reduce this? Sorry if this falls into "the XY problem", but I can think of only one solution in case it can't be avoided: to use the metadata option only on artists that I know use external links.
However, I have no idea how to create artist-specific configurations in config.json (if that is even possible) or how to pass the metadata option I use on the command line (if that is even possible). Other than that, I think I would need to run these galleries separately from the list in the .txt file, because I guess it is not possible to include additional command-line options beyond the URL in a list file, correct?