@mikf For kemono.party, I wanted to use the edited field of the metadata, but it doesn't seem like %Y, %m, %d, etc. work on it. I think it's because it's formatted differently in the keywords:
date
  2020-04-06 03:04:29
edited
  Fri, 18 Dec 2020 23:39:27 GMT
extension
  png
filename
  287cbb42-026d-42dc-a045-fc945c91e8fa
id
  35689730
num
  1
published
  Mon, 06 Apr 2020 03:04:29 GMT
Is there a way to make this work, and possibly even fall back to date if for some reason the edited keyword is empty/null?
Is there a way to make this work
Not with the current string formatting options. There is a way to parse and format timestamps, but not textual date/time info.
fall back to date if for some reason the edited keyword is empty/null?
Theoretically with | ({edited|date}), but both fields must support any eventual format specifiers or it throws an exception. {edited|date:%Y%m%d} for example wouldn't work since the string value in edited can't be formatted with %Y%m%d.
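For anyone who wants to handle this outside of gallery-dl's format strings (for example in a post-processing script), here is a minimal Python sketch of parsing the textual edited value and falling back to date. It assumes both fields arrive as strings exactly as shown in the keyword listing above; pick_date is a hypothetical helper, not part of gallery-dl.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime


def pick_date(metadata):
    """Return a datetime parsed from 'edited', falling back to 'date'.

    Hypothetical helper: assumes 'edited' is an RFC 2822 style string
    and 'date' uses the 'YYYY-MM-DD HH:MM:SS' layout.
    """
    edited = metadata.get("edited")
    if edited:
        try:
            return parsedate_to_datetime(edited)
        except (TypeError, ValueError):
            # Unparseable value: fall through to 'date'
            pass
    return datetime.strptime(metadata["date"], "%Y-%m-%d %H:%M:%S")


meta = {
    "date": "2020-04-06 03:04:29",
    "edited": "Fri, 18 Dec 2020 23:39:27 GMT",
}
print(pick_date(meta).strftime("%Y%m%d"))  # 20201218

meta_without_edit = {"date": "2020-04-06 03:04:29", "edited": None}
print(pick_date(meta_without_edit).strftime("%Y%m%d"))  # 20200406
```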
What happened to --abort-on-skip? I can't use it anymore, ty
@wankio it got replaced with --abort (or -A) back in 2019 (6393b47db2b8daacd34c837fa31a4641fc908272, ed6592ea1aaa8661554cd3eab6a454fcd5083544). It now requires an argument that specifies how many files to skip before stopping. --abort-on-skip was the same as -A 0.
Is it possible to modify postprocessors options via command line using -o? I'm guessing I might have to rely on having an alternate config file instead, but I just want to make sure.
@ImportTaste not possible at the moment. Try the things mentioned in https://github.com/mikf/gallery-dl/discussions/1933.
I have a pretty annoying issue. For some reason, one of the patreon posts on kemonoparty has an invalid id and I'm getting this error with a 123456789 > id filter: ValueError: invalid literal for int() with base 10: 'PbFkhZdV'
I haven't been able to figure out a good workaround. I'm sure it is a patreon post and not one of the other services on the site, I triple-checked. I posted the URL here: https://snippet.host/adgg
Is there a way to set different options depending on the OS? For example set the path of the logfile to ~/gallery-dl/ if it's Linux or to D:\Images\gallery-dl\ if it's Windows
@ImportTaste you could use isdecimal() to check if an id is a number and convertible to int:
id.isdecimal() and 123456789 > int(id)
to exclude or
not id.isdecimal() or 123456789 > int(id)
to include posts with non-numeric IDs
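To see how the two expressions differ, here is a small Python sketch that applies them to a few sample IDs. The sample list and the function names are made up for illustration; only the two boolean expressions come from the reply above.

```python
# Sample IDs as strings; 'PbFkhZdV' is the non-numeric one from the error message
post_ids = ["35689730", "PbFkhZdV", "223456789"]


def exclude_non_numeric(post_id):
    # Non-numeric IDs short-circuit to False, so int() is never called on them
    return post_id.isdecimal() and 123456789 > int(post_id)


def include_non_numeric(post_id):
    # Non-numeric IDs short-circuit to True, so int() is never called on them
    return not post_id.isdecimal() or 123456789 > int(post_id)


print([p for p in post_ids if exclude_non_numeric(p)])  # ['35689730']
print([p for p in post_ids if include_non_numeric(p)])  # ['35689730', 'PbFkhZdV']
```

In both variants the short-circuit evaluation of and/or is what keeps int() away from the non-numeric value.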
@Fukitsu only by using separate config files.
You could also use 3: one with Windows specific paths, one with Linux specific paths, and a third with general settings for both Windows and Linux. Settings from multiple config files will just get merged together.
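As a rough illustration of what "merged together" can mean for three files, here is a Python sketch of a recursive merge where later files win on conflicts. This is an assumption about the merge semantics, not gallery-dl's actual code; the merge function and the log.txt file name are hypothetical.

```python
def merge(base, override):
    """Recursively merge 'override' into a copy of 'base' (later values win).

    Hypothetical helper that only illustrates the idea of merging configs.
    """
    result = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge(result[key], value)
        else:
            result[key] = value
    return result


# General settings shared by both systems
general = {"extractor": {"retries": 3}, "output": {"mode": "auto"}}
# OS-specific settings (paths follow the question above)
linux = {"output": {"logfile": "~/gallery-dl/log.txt"}}
windows = {"output": {"logfile": "D:\\Images\\gallery-dl\\log.txt"}}

print(merge(general, linux))
print(merge(general, windows))
```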
Is there any way to have a format string that surrounds a metadata value with curly brackets in the filename? I tried using escape characters as well as escaping the escape characters (once for json and again for python), and I even tried doing \u007B and \u007D (the unicode escape sequences for them), but no luck.
The main reason I wanted to use them was for titles, I never really see artists put curly brackets in the titles for their stuff, but they use parentheses and square brackets all the time, so I want to surround the titles with curly brackets.
EDIT: If this isn't currently possible, maybe the format string interpreter could be changed so that unicode escape sequences are only interpreted after the replacement fields are?
To use curly brackets as regular characters in format strings, you have to double them.
{key} -> <value of 'key'>
{{key}} -> {key}
https://docs.python.org/3/library/string.html#format-string-syntax
If you need to include a brace character in the literal text, it can be escaped by doubling: {{ and }}.
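The doubling rule can be tried out with Python's own str.format, which is the syntax the linked documentation describes. This is only a sketch of the Python behavior; gallery-dl uses its own format string parser, so treat the equivalence as an assumption.

```python
title = "Some Title"

# A single pair of braces is a replacement field
print("{title}".format(title=title))      # Some Title

# Doubled braces are literal braces, so no replacement happens
print("{{title}}".format(title=title))    # {title}

# Literal braces around a replacement field: doubled outer, single inner
print("{{{title}}}".format(title=title))  # {Some Title}
```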
But does this work within replacement fields too? i.e. {title:? {{/}}/}
It seems like double brackets aren't even needed after the :. Just {title:? {/}/} works for me for some reason.
Alternatively you could also use conditional filenames. One format string with a valid title and one without.
Conditional filenames? You have my attention, that sounds incredibly useful.
EDIT: Found the details: https://github.com/mikf/gallery-dl/issues/1394 Includes details on conditional directories too.
They are a thing since 1.18.0/.1: 4cf40434, 84d2e640, fd00d471
You can select a different format string depending on the result of a Python expression (same as --filter). This functionality is explained in extractor.*.filename and extractor.*.directory and also showcased in gallery-dl-example.conf.
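To make the idea concrete, here is a Python sketch that imitates the selection: each key is a Python expression, the first one that evaluates truthy picks its format string, and an empty key acts as the default. The select_format helper, the sample metadata and the use of str.format are all assumptions for illustration; see the linked documentation for the real behavior.

```python
def select_format(conditions, metadata):
    """Return the format string of the first condition that is truthy.

    Hypothetical stand-in for the conditional filename lookup; the empty
    key "" is treated as the default.
    """
    for expression, format_string in conditions.items():
        if expression and eval(expression, {}, dict(metadata)):
            return format_string
    return conditions[""]


conditions = {
    # Used when 'title' is non-empty; braces around the title are doubled
    "title": "{id} {{{title}}}.{extension}",
    # Default when no expression matches
    "": "{id}.{extension}",
}

with_title = {"id": 1, "title": "Sketch", "extension": "png"}
without_title = {"id": 2, "title": "", "extension": "png"}

print(select_format(conditions, with_title).format(**with_title))        # 1 {Sketch}.png
print(select_format(conditions, without_title).format(**without_title))  # 2.png
```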
Huh. That's really interesting. I have a question about filters actually. It seems like the way it currently works is that it extracts all the posts and then compares the filter against it, which is the main reason I don't use an id comparison filter for Twitter since it would have to comb 3000+ tweets for each user.
But what if the filter could instead be evaluated against each post as it comes up and then allows for an abort/terminate as soon as it encounters a post that evaluates False? That would also make it so that all the metadata.event post json files wouldn't be redownloaded each time.
I figured I'd just ask about it here instead of submitting a feature request since I've had an unfortunate habit of submitting feature requests that were already possible in ways I didn't expect.
But what if the filter could instead be evaluated against each post as it comes up and then allows for an abort/terminate as soon as it encounters a post that evaluates False?
You can call abort() or terminate() in all gallery-dl Python expressions (including filename conditions, even if that doesn't make much sense).
For example --filter "date >= datetime(2021, 1, 1) or terminate()"
The potential problem here is that this exits as soon as it finds a file with a date before 2021, even though there could be files/posts after that that are from 2021. As long as you can guarantee all files have the correct and expected order, that works fine, but this is Twitter we are talking about.
Also why not use the regular --abort or --terminate?
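The ordering caveat can be shown with a small Python simulation. Terminate and terminate() here are hypothetical stand-ins that only mimic "stop everything as soon as the expression calls it"; the sample posts are made up. Note that Python's datetime needs year, month and day, hence datetime(2021, 1, 1).

```python
from datetime import datetime


class Terminate(Exception):
    """Hypothetical stand-in for the signal raised by terminate()."""


def terminate():
    raise Terminate()


def run(posts):
    """Keep posts from 2021 onward; stop at the first older one."""
    kept = []
    try:
        for post in posts:
            # Same shape as the --filter expression above
            if post["date"] >= datetime(2021, 1, 1) or terminate():
                kept.append(post["id"])
    except Terminate:
        pass
    return kept


newest_first = [
    {"id": 3, "date": datetime(2021, 3, 1)},
    {"id": 2, "date": datetime(2021, 1, 15)},
    {"id": 1, "date": datetime(2020, 12, 1)},
]
# An old post at the top (e.g. a pinned one) ends the run immediately
old_post_first = [{"id": 0, "date": datetime(2019, 5, 1)}] + newest_first

print(run(newest_first))    # [3, 2]
print(run(old_post_first))  # []
```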
That would also make it so that all the metadata.event post json files wouldn't be redownloaded each time.
There's a filter for post processors that makes it only run if the expression is True (3cbbefd4)
Also why not use the regular --abort or --terminate?
I do for most services, but for other services it hasn't been so straightforward.
Right now I want to revise how I handle Twitter. Currently, to keep the amount of duplicates down due to retweets, I use {tweet_id} with extractor.twitter.retweets=original and put them into {author[id]} directories, then an exec script hardlinks them into {author[name]} directories. Unfortunately, this has the side effect of making retweets trigger against skip=abort and filters with great frequency, so I use abort:100. That was fine at first, but it makes Twitter take the longest to scrape by far.
Because retweets=original retrieves the metadata for the original tweet, the retweet_id will match the tweet_id. My current plan is to use this new information to have a filter check if the retweet_id is 0. If it isn't, then it'll do its comparison check against the tweet_id, effectively giving retweets a free pass. And, as soon as it hits the last recorded tweet_id from the user, it'll abort.
There's one problem with this idea though - pinned tweets. Any time a Twitter user has one of the old posts pinned to the top, it'll make it so it's the first retrieved tweet, and the extractor will abort immediately. Knowing that you can call abort() and terminate() in filters is helpful, but I would need the functionality of abort:2 to get past what's pinned.
Does twitter expose whether a tweet is a pinned tweet in their API? If so, then that can be remedied by having that added to the metadata and checking against that, too. I could also have my script retrieve the id of the first tweet from a user and change the filter to have that id be given a free pass too, which would be made a little easier with that new environment variable functionality you added.
There's one problem with this idea though - pinned tweets. Any time a Twitter user has one of the old posts pinned to the top.
There's now a 'pinned' option to disable pinned Tweets: 9156e90f
Does twitter expose whether a tweet is a pinned tweet in their API?
It does, but actually setting a flag only for the pinned version of that Tweet and not accidentally also for the real, regular Tweet entry might be more trouble than it's worth.
That works, thank you.
The configuration documentation says that downloader.progress outputs to stderr, but it actually seems to be outputting to stdout. Is the documentation out of date, or is this a bug?
The configuration documentation says that downloader.progress outputs to stderr
I don't think it does, but you are right in that the downloader progress should go to stderr and not stdout as it does right now.
It's on line 2714: https://github.com/mikf/gallery-dl/blob/master/docs/configuration.rst#L2714 (This doesn't jump to the line for some reason)
Here, the edit interface jumps to the line properly: https://github.com/mikf/gallery-dl/edit/master/docs/configuration.rst#L2714
But that only describes where logging messages go. It has nothing to do with any downloader or its progress bar.
I see, my mistake then.
For exhentai (or other sites), is it possible to make it so that, once the image limit is reached or the download is stopped due to some error, all the remaining URLs are logged/printed to the screen along with the one that was being downloaded?
Just curious, but there's currently no way to have gallery-dl download in ascending order, is there? It'd obviously have to gather every post first and then reverse the order, but there have been instances where I felt it more appropriate to start ascending from the oldest rather than descending from the newest.
Why does --ugoira-conv use VP8 by default instead of VP9?
@mikf Congrats on making it into the 5K stargazers club on GitHub! 🥂
@mikf I just noticed something from the configuration.rst docs..
The extractor.*.cookies option supports three types, gallery-dl's Path, as well as the built-ins object and list.
As I understand it, these types are used to distinguish the functionality, or rather the "source" of the cookies used.
--cookies-from-browser option from the CLI
But gallery-dl's Path can also be used in two different ways, because it also supports two different types: string and list (i.e. list of string values)
Here's the thing, what happens when extractor.*.cookies gets set to a list?
I assume this always enables the automatic cookies from browser functionality, then?
In other words, this is occluding the option to use a list as a Path here, correct?
I mean, this is not really a problem, but maybe it should be mentioned.. 😅
You are correct, that's something I missed when writing the code and/or docs. I didn't account for a cookies.txt path possibly being a list.
I wouldn't change anything about it though, except maybe update the type for cookies.txt paths.
Is there any way to create a txt/html file for each 4chan/thebarchive/etc. thread, and to download the thumbnail instead if the source file is not found?
No to both of your questions. You can try https://github.com/bibanon/BASC-Archiver or similar to archive everything, including HTML.
I am using gallery-dl to backup some booru sites, but shoving millions of files into one directory quickly turns it into a "no humans zone" that will hang any file manager for several minutes. Does anyone have advice for how to manage this situation? I was thinking of splitting them into folders with a few thousand files each based on post IDs (say 1-999, 1000-1999, etc), but I don't know how I would do that. Or maybe something like the first three characters of the MD5, which would be easier to implement (could probably use something like {md5:[0:2]} ) but that makes finding specific post numbers harder.
This is how I split every 5000 image IDs into their own folder:
"directory": [
    "\fE str(id // 5000 * 5000 + 5000)"
],
You can replace 5000 with some other amount if you want.
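The arithmetic in that expression can be checked in plain Python. The bucket function below is a hypothetical wrapper around the same formula; it names each folder after the upper bound of its range, so IDs 0-4999 land in "5000", IDs 5000-9999 in "10000", and so on.

```python
def bucket(post_id, size=5000):
    """Return the folder name for a post ID (same formula as the config)."""
    # Integer division drops the remainder, then the bucket size is added
    return str(post_id // size * size + size)


for post_id in (1, 4999, 5000, 12345):
    print(post_id, "->", bucket(post_id))
# 1 -> 5000
# 4999 -> 5000
# 5000 -> 10000
# 12345 -> 15000
```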
Add developer documentation on how to make an extractor, and add metadata to it.
Is there a way to download an Instagram profile using the user ID??
@Fukitsu As with Twitter, you can use id:<id> as username in input URLs.
https://www.instagram.com/instagram/
# is the same as
https://www.instagram.com/id:25025320/
Will the folders be saved with the id as the directory name or the username?
This depends on your "directory" setting..
The available metadata fields will be the same for both username and user ID (except tagged_username and tagged_full_name for the Tagged extractor)
I set up a postprocessor in my config to download metadata for Discord posts on kemono.party, but it is also activating the postprocessor I have set up for all other kemono.party downloads. They differ in filename, the generic one is not useful for Discord posts and vice versa. I saw in the documentation that this is intentional, but I can't figure out how to disable the generic kemono.party postprocessor only when downloading Discord. Can this be done or will I have to copy my default kemono.party postprocessor for every other subcategory and remove the sitewide one?
Add "filter": "subcategory != 'discord'" to the sitewide one and it should not trigger for Discord posts.
Is there a way to use range or chapter-range when specifying input files? Neither gallery-dl -i file.txt --range x-y nor gallery-dl -i file.txt --chapter-range x-y work
You mean selecting URLs from an --input-file with a --range-like option? Not possible with gallery-dl at the moment.
--range and --chapter-range in this case only affect the results of the input file URLs, but not the URLs themselves.
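Since the selection has to happen outside gallery-dl, one possible workaround is to slice the input file yourself and pass the result on. The Python sketch below assumes one URL per line and skips blank lines and lines starting with #; select_urls is a hypothetical helper and the example URLs are placeholders.

```python
from itertools import islice


def select_urls(lines, start, stop):
    """Return URLs number 'start' through 'stop' (1-based, inclusive).

    Hypothetical helper: blank lines and '#' comment lines are not counted.
    """
    stripped = (line.strip() for line in lines)
    urls = (line for line in stripped if line and not line.startswith("#"))
    return list(islice(urls, start - 1, stop))


lines = [
    "https://example.org/a\n",
    "# a comment\n",
    "\n",
    "https://example.org/b\n",
    "https://example.org/c\n",
]
print(select_urls(lines, 2, 3))
# ['https://example.org/b', 'https://example.org/c']
```

The selected URLs could then be written to a temporary file and given to gallery-dl with -i as usual.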
FYI
https://blog.gitter.im/2023/01/16/gitter-is-going-fully-native-matrix-in-feb-2023/
[..] replace the old Gitter app with a Gitter-customised version of Element during the week of Feb. 6th 2023.
They're planning to keep all content of existing Gitter rooms, but we'll see..
Element, this means that one should still be able to sign-in there with an existing GitHub account. Although jumping through some authorization hoops again, probably.
@mikf I also noticed there's no link to gallery-dl's Gitter at the moment? There's still a link in README.rst, but it seems it is not referenced (anymore)?
Not sure, Gitter seems pretty dead to me, anyway. Not just gallery-dl, even other, much bigger, projects don't have any activity in their Gitter rooms.
Dunno, maybe time to use Discord? What do you guys think?
How do I make gallery-dl retry 404s? I've put "retry-codes": [400, 404, 408] under the extractor and "http": { "retry-codes": [400, 404, 408] } under downloader, but it still doesn't seem to work
@Hrxn Regarding Gitter, I haven't actually visited the page for roughly a year, so I've removed its link in the README page for the time being.
Discord is something I was always against, but I guess that's the platform that's "in" at the moment so here is an invite link to some sort of gallery-dl Discord server: https://discord.gg/ZytBUdYq7n. It will eventually find its way into the README, but there is still a lot to do.
I'm also currently looking into registering #gallery-dl on libera.chat.
@Fukitsu
If you are using v1.24.4, retry-codes only works for file downloads.
v1.24.5 will add general support for all HTTP requests once it is out.
@mikf
Just wondered, if you set both "path-restrict" and "path-replace" to some "conflicting value" at the same time, e.g.
"path-restrict": {" ": "_"}
"path-replace": "."
does "path-restrict" always have precedence in such a scenario, even if "path-replace" is set below/later?
As I assume, if you set such a character->replacement character association in the object for "path-restrict", does the "path-replace" option even matter?
Unlike, maybe, if it's done like this?
"path-restrict": " "
"path-replace": "."
What happens if you set these options at the base level, and then use "path-replace" again at any "deeper"/more specific category level? Does it overwrite the replacement char then? Or if you use "path-restrict" again, can you update/overwrite specific replacement association options this way?
Sorry for asking such whacky questions.. 😄
PS: Maybe pinning the latest "Question, Feedback and Suggestions" thread at the top of the issues would be beneficial? What do you think? Or maybe too much of a distraction?
Continuation of the old issue as a central place for any sort of question or suggestion not deserving their own separate issue. There is also https://gitter.im/gallery-dl/main if that seems more appropriate.
Links to older issues: #11, #74