mikf opened 8 months ago
For most sites I'm able to sort files into year/month folders like this:
"directory": ["{category}", "{search_tags}", "{date:%Y}", "{date:%m}"]
However, for redgifs there doesn't seem to be a date keyword available for directory; there's only a date keyword available for filename. Is this an oversight?
Yep, that's a mistake that happened when adding support for galleries in 5a6fd802.
Will be fixed with the next git push.
edit: https://github.com/mikf/gallery-dl/commit/82c73c77b04fe21766c826852c68dde9b327dfbe
There's a typo in extractor.reddit.client-id & .user-agent: "I'm not a rebot"
There's also another typo in extractor.reddit.client-id & .user-agent: "reCATCHA"
Can you grab all the media from quoted tweets? Example.
Regarding typos, thanks for pointing them out. I would be surprised if there aren't at least 10 more somewhere in this file.
@biggestsonicfan
This is implemented as a search for quoted_tweet_id:… on Twitter's end.
I've added an extractor for it similar to the hashtags one (https://github.com/mikf/gallery-dl/commit/40c0553523bb28790de0e6a07a978a42e2be88c7), but it only does said search under the hood.
Normally %-encoded characters in the URL get converted nicely when running gallery-dl, e.g.
https://gelbooru.com/index.php?page=post&s=list&tags=nighthawk_%28circle%29
gives me a nighthawk_(circle) folder, but for this URL:
https://gelbooru.com/index.php?page=post&s=list&tags=shin%26%23039%3Bya_%28shin%26%23039%3Byanchi%29
I'm getting a shin&#039;ya_(shin&#039;yanchi) folder. Shouldn't I be getting a shin'ya_(shin'yanchi) folder instead?
EDIT: Actually, I think there's just something wrong with that URL. I had it saved for a long time, and searching that tag normally gives a different URL (https://gelbooru.com/index.php?page=post&s=list&tags=shin%27ya_%28shin%27yanchi%29). I still got valid posts from the weird URL, so I didn't think much of it.
%28 and so on are URL-escaped values, which do get resolved. &#039; is the HTML-escaped value for '. You could use {search_tags!U} to convert them.
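For example, applied to the directory format from the first question above (untested sketch):
"directory": ["{category}", "{search_tags!U}", "{date:%Y}", "{date:%m}"]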
Is there support to remove metadata like this?
gallery-dl -K https://www.reddit.com/r/carporn/comments/axo236/mean_ctsv/
...
preview['images'][N]['resolutions'][N]['height']
144
preview['images'][N]['resolutions'][N]['url']
https://preview.redd.it/mcerovafack21.jpg?width=108&crop=smart&auto=webp&s=f8516c60ad7fa17c84143d549c070738b8bcc989
preview['images'][N]['resolutions'][N]['width']
108
...
Post-processor:
"filter-metadata":
{
"name": "metadata",
"mode": "delete",
"event": "prepare",
"fields": ["preview[images][0][resolutions]"]
}
I've tried a few variations but no dice.
"fields": ["preview[images][][resolutions]"]
"fields": ["preview[images][N][resolutions]"]
"fields": ["preview['images'][0]['resolutions']"]
Hello, I left a comment in #4168. Does the _pagination method of the WeiboExtractor class in weibo.py return when data["list"] is an empty list?
When I used gallery-dl to batch download the album page of Weibo, the download also appeared incomplete.
Through testing on the web page, I found that Weibo's getImageWall API sometimes returns an empty list when the images have not completely loaded. I think this may be what causes gallery-dl to terminate the download early.
@taskhawk
fields selectors are quite limited and can't really handle lists. You might want to use a python post processor (example) and write some code that does this.
def remove_resolutions(metadata):
    for image in metadata["preview"]["images"]:
        del image["resolutions"]
(untested, might need some check whether preview and/or images exists)
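With such a check added, a minimal sketch (assuming the same metadata layout as the -K output above) could look like:
def remove_resolutions(metadata):
    # tolerate posts that lack "preview" or "images" entirely
    for image in metadata.get("preview", {}).get("images", ()):
        image.pop("resolutions", None)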
@YuanGYao Yes, the code currently stops when Weibo's API returns no more results (empty list). This is probably not ideal, as I've hinted at in https://github.com/mikf/gallery-dl/issues/4168#issuecomment-1589119191
@mikf
Well, I think for Weibo's album page, since_id should be used to determine whether the images are fully loaded.
I updated my comment in #4168 (comment) and attached the response returned by Weibo's getImageWall API.
I think this should help solve this problem.
Not sure if I'm missing something, but are directory specific configurations exclusive to running gallery-dl via the executable?
Basically, I have a directory for regular tags, and a directory for artist tags. For regular tags I use "directory": ["{category}", "{search_tags}", "{date:%Y}", "{date:%m}"] since the tag number is manageable. For artist tags though, there are way more of them, so "directory": ["{category}", "{search_tags[0]!u}", "{search_tags}", "{date:%Y}", "{date:%m}"] makes more sense.
So right now the only way I know to get this per-directory configuration to work, is to copy the gallery-dl executable everywhere I want to use a master configuration override. Am I missing something? It feels like there should be a better way.
Huh? No, the configuration always works in the same way. You're simply using different configuration files?
@Hrxn
From the readme:
When run as executable, gallery-dl will also look for a gallery-dl.conf file in the same directory as said executable.
It is possible to use more than one configuration file at a time. In this case, any values from files after the first will get merged into the already loaded settings and potentially override previous ones.
I want to override my master configuration %APPDATA%\gallery-dl\config.json in specific directories with a local gallery-dl.conf, but it seems like that's only possible with the standalone executable.
You can load additional configuration files from the console with:
-c, --config FILE Additional configuration files
You just need to specify the path to the file, and any options there will override your main configuration file.
Edit: From my understanding, yeah, automatic loading of local config files in each directory is only possible by having the standalone executable in each directory. Are different directory options the only thing you need?
@taskhawk
Thanks, that's exactly what I was looking for! Guess I didn't read the documentation thoroughly enough.
For now the only thing I'd want to override is the directory structure for artist tags. I don't think it's possible to determine from the metadata alone if a given tag is the name of an artist or not, so I thought the best way to go about it is to just have a separate directory for artists, and use a configuration override. So yeah, loading that override with the -c flag works great for that purpose, thanks again!
You kinda can, but you need to enable tags for Gelbooru in your configuration to get them, which will require an additional request:
"gelbooru": {
"directory": {
"search_tags in tags_artists": ["{category}", "{search_tags[0]!u}", "{search_tags}", "{date:%Y}", "{date:%m}"],
"" : ["{category}", "{search_tags}", "{date:%Y}", "{date:%m}"]
},
"tags": true
},
Set "tags": true
in your config and run a test with gallery-dl -K "https://gelbooru.com/index.php?page=post&s=list&tags=TAG"
so you can see the tags_*
keywords.
Of course, this depends on the artists being correctly tagged. Not sure if it happens on Gelbooru, but at least in other boorus and booru-like sites I've come across posts with the artist tagged as a general tag instead of an artist tag. Another limitation is that your search tag can only include one artist at a time; doing more will require a more complex expression to check that all tags are present in tags_artists.
What I do instead is inject a keyword to influence where it will be saved, like this:
gallery-dl -o keywords='{"search_tags_type":"artists"}' "https://gelbooru.com/index.php?page=post&s=list&tags=ARTIST"
And in my config I have
"gelbooru": {
"directory": ["boorus", "{search_tags_type}", "{search_tags}"]
},
You can have:
"gelbooru": {
"directory": {
"search_tags_type == 'artists'": ["{category}", "{search_tags[0]!u}", "{search_tags}", "{date:%Y}", "{date:%m}"],
"" : ["{category}", "{search_tags}", "{date:%Y}", "{date:%m}"]
}
},
You can do this for other tag types, like general, copyright, characters, etc.
Because it's a chore to type that option every time, I made a wrapper script, and since artists is my default I just call it like this:
~/script.sh "TAG"
For other tag types I can do:
~/script.sh --copyright "TAG"
~/script.sh --characters "TAG"
~/script.sh --general "TAG"
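A rough sketch of such a wrapper (hypothetical; the flag names and default simply mirror the calls above):
#!/bin/sh
# default tag type, overridden by an optional --TYPE flag
type="artists"
case "$1" in
    --copyright|--characters|--general) type="${1#--}"; shift ;;
esac
exec gallery-dl -o keywords="{\"search_tags_type\":\"$type\"}" \
    "https://gelbooru.com/index.php?page=post&s=list&tags=$1"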
Thanks for pointing out there's a tags option available for the gelbooru extractor. I already used it in the kemono extractor to get the name of the artist, but it didn't occur to me that gelbooru might also have such an option (and just accepted that the tags aren't categorized).
For artists I store all the URLs in their respective gelbooru.txt, rule34.txt, etc. files like so:
https://gelbooru.com/index.php?page=post&s=list&tags=john_doe
https://gelbooru.com/index.php?page=post&s=list&tags=blue-senpai
https://gelbooru.com/index.php?page=post&s=list&tags=kaneru
.
.
.
And then just run gallery-dl -c gallery-dl.conf -i gelbooru.txt. Since the search_tags ends up being the artist anyway, getting tags_artists is probably not worth the extra request. Same for general tags and copyright tags in their respective directories. With this workflow I can't immediately see where I'd be able to utilize keyword injection, but it's definitely a useful feature that I'll keep in mind.
When I'm making an extractor, what do I do if the site doesn't have different URL patterns for different page types? Every single page is just a numerical ID that could be a forum post, image, blog post, or something completely different.
@Wiiplay123 You handle everything with a single extractor and decide what type of result to return on the fly. The gofile code is a good example for this I think, or aryion.
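A rough sketch of that shape (hypothetical site, pattern, and helper functions; the real gofile/aryion extractors are more involved):
from gallery_dl import text
from gallery_dl.extractor.common import Extractor, Message

class ExampleSiteExtractor(Extractor):
    """Sketch: one extractor for a site where every URL is just /<numeric id>"""
    category = "examplesite"
    pattern = r"(?:https?://)?example\.site/(\d+)"

    def __init__(self, match):
        Extractor.__init__(self, match)
        self.page_id = match.group(1)

    def items(self):
        page = self.request(self.url).text
        data = {"id": self.page_id}
        if '<div class="forum-post"' in page:
            # forum post: hand embedded links off to other extractors
            for url in extract_links(page):   # hypothetical helper
                yield Message.Queue, url, data
        else:
            # image/blog page: yield the file itself
            url = extract_file_url(page)      # hypothetical helper
            yield Message.Directory, data
            yield Message.Url, url, text.nameext_from_url(url, data)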
Hi, what options should I use in my config file to change the format of dates in metadata files? I would like to use "%Y-%m-%dT%H:%M:%S%z" for the values of "date" and "published" (from coomer/kemono downloads).
And would it also be possible to do this for json files that ytdl creates? I downloaded some videos with gallery-dl but the dates got saved as "upload_date": "20230910" and "timestamp": 1694344011, so I think it might be better to convert the timestamp to a date to get a more precise upload time, but I'm not sure if it's possible to do that either.
If the field is simply called date:
{date:Olocal/%Y-%m-%dT%H:%M:%S}
Note: You cannot use something like %H:%M:%S in filenames, because : is not allowed (on Windows/NTFS). (Good practice to avoid this on Linux etc. too, because a) compat reasons and b) : is the entry separator in $PATH on Linux.)
You can also change the format options of a post-processor, yes, you don't have to keep the default JSON created by gallery-dl.
Timestamps in epoch format can be converted with something like datetime.datetime.fromtimestamp(ts, datetime.UTC), I think..
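For example, with the timestamp value from above (datetime.UTC is a Python 3.11+ alias; datetime.timezone.utc works everywhere):
import datetime

ts = 1694344011  # the "timestamp" value from the ytdl metadata above
dt = datetime.datetime.fromtimestamp(ts, datetime.timezone.utc)
print(dt.strftime("%Y-%m-%dT%H:%M:%S%z"))  # 2023-09-10T11:06:51+0000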
@Hrxn
You can also change the format options of a post-processor,
To do that, should I add {date:Olocal/%Y-%m-%dT%H:%M:%S} and datetime.datetime.fromtimestamp(ts, datetime.UTC) under "postprocessors": in my configuration file? How exactly should I do that? Sorry, I don't really know what I am doing.
You need to add it like this to your configuration file before the postprocessor for writing metadata:
"kemonoparty": {
"#": "...",
"postprocessors": [
{
"name": "metadata",
"mode": "modify",
"fields": {
"date": "{date:Olocal/%Y-%m-%dT%H:%M:%S}"
}
},
{
"name": "metadata",
"directory": ".metadata"
}
]
},
The event value for the postprocessors for modifying metadata and writing metadata should be the same. If you are just using the default values then there's no need to adjust that.
Isn't published already in the format you wanted? Also, where are upload_date and timestamp coming from? They don't seem to be default keywords for Kemono, I think.
Is there a way to skip links that redirect to a 404 page while still giving a 200 OK status? The 404 page is the same each time.
@taskhawk Thanks, I wanted to add %z to the end of %Y-%m-%dT%H:%M:%S to get information on the time zone offset, but no extra information got added when I included %z, so I'm guessing that kemono doesn't have any information on the time zone.
Isn't published already in the format you wanted?
Yes, sorry, I didn't realize that published and date were the same date.
Also, where are upload_date and timestamp coming from?
upload_date and timestamp were from the json file of a video that I downloaded from TikTok with the ytdl extractor. I think that upload_date doesn't include the upload time of a video (only the date), so I was hoping that I could use a gallery-dl postprocessor option to convert timestamp into %Y-%m-%dT%H:%M:%S%z.
@Wiiplay123 This can be done by assigning a function to a _http_validate field in a file's metadata (example), which then gets called to check the initial response. It should return True/False for a valid/invalid response. You can realistically only check status code, history, and headers, since accessing the response's content would have weird side effects.
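A minimal sketch of such a validator (hypothetical extractor-side code; it assumes the fake 404 page is reached via a redirect, which shows up in response.history):
def _validate(response):
    # an empty history means no redirect happened, so the file is real
    return not response.history

data["_http_validate"] = _validate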
@I-seah
to get information on the time zone offset,
All dates are in UTC/GMT and do not have any timezone information attached to them.
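Since every date value is UTC, an explicit offset could simply be hard-coded in a format string if one is wanted (untested sketch for a modify postprocessor field):
"date": "{date:%Y-%m-%dT%H:%M:%S}+00:00"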
I was hoping that I could use a gallery-dl postprocessor option to convert timestamp into %Y-%m-%dT%H:%M:%S%z.
This might work.
{
    "name": "metadata",
    "mode": "modify",
    "filter": "locals().get('timestamp')",
    "fields": {
        "date_from_timestamp": "{timestamp!d:%Y-%m-%dT%H:%M:%S}"
    }
},
Is there any way to download the announcements page posts for a per user search on kemono? Such as: https://kemono.su/fanbox/user/EXAMPLE/announcements
They're text posts but sometimes have info or similar that'd be nice to have backed up as well with the tool, apologies if it's possible and I'm missing it ^^:
Something I have run into quite a lot lately is twitter logging me out somewhere in the middle of the job, then making me do a bot check. Is there a way of making it halt when it reaches this error, or a way of avoiding getting kicked out? [twitter][error] 401 Unauthorized (Could not authenticate you)
When trying to download a batch of images from Twitter by tweet IDs, it has to be done one by one (in terms of requests), right?
I knew the classic v1.1 API has/had an endpoint to query tweets in batch by a list of IDs, but I assume that does not exist for the GraphQL API we're using?
@britefire Announcements aren't supported yet, only DMs and comments. I'll look into it.
@WarmWelcome
Maybe with the locked option (#5300), but it doesn't seem to work for some of these errors (#5370). I'll probably have to implement some form of "cursor" support like Instagram has.
@fireattack Yeah, there doesn't seem to be a way of fetching multiple Tweets by ID with a single API call using the GraphQL API. It only implements what's needed for browsing the site and I haven't seen it needing to fetch multiple Tweets that aren't in some timeline or feed.
@WarmWelcome Maybe with the locked option (#5300), but it doesn't seem to work for some of these errors (#5370). I'll probably have to implement some form of "cursor" support like Instagram has.
I was keeping an eye out for something like this for a week, and only skipped checking today and now it appears lol. That's exactly what I need. I'll have to check it out sometime soon. Thank you
What is the best way to use gdl as a module?
I currently came up with something like
import gallery_dl
import sys

def dump_json(obj):
    ...
    return path

options = { ... }
options_path = dump_json(options)
url = '...'
sys.argv = ["gallery-dl", url, '--config', options_path]
gallery_dl.main()
which is kinda ugly but gets the job done. Just wondering if there is a better way.
Edit: it actually does not work well. Running it once is fine, but then it either terminates the entire python process, or prints everything twice (??) sometimes.
import gallery_dl
import sys

def check_version():
    sys.argv = ["gallery-dl", '--version']
    print(sys.argv)
    gallery_dl.main()

def main():
    print('Let us check version...')
    check_version()
    # we never reach this point
    print('Let us check version again...')
    check_version()

if __name__ == "__main__":
    main()
Additional question -- how can I make the archive file relative to the download destination?
I currently have configuration of
{
    "extractor": {
        "base-directory": "./",
        "url-metadata": "gdl_url",
        "path-metadata": "gdl_path",
        "instagram": {
            "directory": [""],
            "cookies": ["chrome", null, null, null, ".instagram.com"],
            "skip": "abort:4",
            "archive": "_downloaded.sqlite3",
            "postprocessors": [
                {
                    "name": "metadata",
                    "event": "post-after",
                    "filename": "_metadata.jsonl",
                    "mode": "jsonl",
                    "open": "a"
                }
            ]
        }
    }
}
And when I download a certain instagram profile with gallery-dl https://instagram.com/USER/ -d "C:\mylocaltion\test\", it will download images/videos and _metadata.jsonl to that destination. However, the _downloaded.sqlite3 will be in CWD instead.
Take a look at #642. There are several examples in there. In your case, you should import gallery_dl.config and gallery_dl.job, set your config options, and run a Job with your input URL.
from gallery_dl import config, job

options = { ... }
url = '...'

config._config.update(options)
dl = job.DownloadJob(url)
dl.run()
but then it either terminates the entire python process
argparse raises SystemExit for --version.
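Catching that exception should let a script continue past it (a small sketch):
def check_version():
    sys.argv = ["gallery-dl", "--version"]
    try:
        gallery_dl.main()
    except SystemExit:
        pass  # argparse exits after printing the version; swallow it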
or prints everything twice (??) sometimes.
No idea either.
how can I make the archive file relative to the download destination?
All relative paths are always relative to CWD. You could change it beforehand, or you might be able to do this by enabling path-metadata and using it in the archive path as a format string: {gdl_path.realdirectory}.... This might result in some weird behavior, though.
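In config form, that suggestion would look something like this (untested sketch; it assumes "path-metadata": "gdl_path" as in the config above):
"archive": "{gdl_path.realdirectory}_downloaded.sqlite3"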
Thanks! That helps tremendously. I ended up just using dynamic
config.set((), 'base-directory', str(user_dir))
config.set(('extractor', 'instagram'), 'archive', str(downloaded))
to set paths for all these companion files.
Also, the double-print is related to output.initialize_logging() being called twice when I was using .main(); I've now ensured it's only called once globally.
I'm curious why, even if you don't call output.initialize_logging() at all and go straight to dl.run() (as in your example), it still generates INFO-level log output?
I'm curious why, even if you don't call output.initialize_logging() at all and go straight to dl.run() (as in your example), it still generates INFO-level log output?
It doesn't, at least not for me. I get WARNING and ERROR logging messages, since that seems to be the default level for logging.getLogger() objects, but not INFO or DEBUG. Maybe this is different for your stdlib implementation.
I'm curious why, even if you don't call output.initialize_logging() at all and go straight to dl.run() (as in your example), it still generates INFO-level log output?
It doesn't, at least not for me. I get WARNING and ERROR logging messages, since that seems to be the default level for logging.getLogger() objects, but not INFO or DEBUG. Maybe this is different for your stdlib implementation.
Maybe I didn't use the right word. I meant that even without initialize_logging(), it still prints the files you downloaded. I would assume that if you don't initialize, it wouldn't print anything at all, since no logger was set.
Downloaded file output is separate from logging messages. Their paths get written directly to sys.stdout and can be controlled with output.mode.
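When using gallery-dl as a module, that can be set like any other config option; "null" disables the per-file path output (a sketch):
from gallery_dl import config

config.set(("output",), "mode", "null")  # suppress downloaded-path output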
hi, how do I download files from oldest to newest?
I'm using this:
https://www.instagram.com/{my_user}/saved/all-posts/
and I need to start downloading from the oldest posts first, how do I do that?
Hi! Is it possible to download posts from Pixiv from a specified bookmark page? For example, I want to download not all bookmarks but only from page 2. I tried the URL /bookmarks/artworks?p=2, but gallery-dl still downloads all my bookmarks.
@mikf
Would a formatting option like "{title!t:?/__/R__//}" be legitimate? Would an order of operations like this be possible?
@throwaway26425
Not possible, especially not with how IG returns its results. You could theoretically grab all download links (-g) from newest to oldest, reverse their order, and then download those.
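A rough command-line sketch of that idea (assumes a Unix shell with tac; note that direct links downloaded this way won't carry the usual metadata-based filenames):
gallery-dl -g "https://www.instagram.com/USER/saved/all-posts/" | tac > urls.txt
gallery-dl -i urls.txt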
@JailSeed
Not really supported. You could use --range, but that selects by file count and not post count.
@Hrxn
This would work, but it probably crashes when there's no title. !t would need to be applied after ? or at least after some form of check that title is a string.
This could be more reliably done in an f-string:
\fF …{title.strip().replace("__", "") + "__" if title else ""}…
@mikf Thanks, that helps. Agree about the f-string part, but I think in this case the site always provides a title, so I don't see anything that speaks against continuing to use "{title!t:?/__/R__//}"..
@mikf The scenario: Submission on reddit, hosted on redgifs, but it's actually an image (yes, I know.. edge case. But I've seen it at least once)
I believe it should be possible to solve this with a conditional directory setting using what we already got in gallery-dl, but I'm not sure.
Accessing metadata coming from reddit can be done with locals().get('_reddit_'), but I'm unsure if we can proceed from there without breaking..
Example from -K on a reddit link:
is_video
False
but at the same time
media['oembed']['type']
video
and
post_hint
rich:video
which.. totally makes sense..
The easiest way would probably be something like this
"directory": {
"'_reddit_' in locals() and extension in ('mp4', 'webm')" : ["Video"],
"'_reddit_' in locals() and extension in ('gif', 'apng')" : ["Gif"],
"'_reddit_' in locals() and extension in ('jpg', 'png')" : ["Picture"],
using the extension from redgifs, which already exists! But it does not work for "directory", because it's a metadata entry for "file and filter". Would it be very complicated to make extension also available as a directory metadata value?
Wouldn't a classify post processor work here?
It wouldn't really be complicated to make extension available for directories, but it is kind of wrong given the current "directory" semantics.
classify would work, and I'm using it for everything in redgifs except for the image subcategory, so that I can differentiate between downloading a single item directly on redgifs vs. a submission on reddit hosted on redgifs, and I'm not sure how to achieve that otherwise.
config excerpt (a bit simplified), giving me the output paths I've been using for a while now and would like to keep:
"redgifs":
{
"image":
{
"directory": {
"'_reddit_' in locals()": ["+Clips"],
"locals().get('bkey')" : ["Redgifs", "Clips", "{bkey}"],
"" : ["Redgifs", "Clips", "Unsorted"]
}
}
}
I'm using "parent-directory": true
and "parent-metadata": "_reddit_"
for reddit, obviously, and the result is basically this:
input URL from.. | Output Destination
---|---
redgifs | "base-directory" / Redgifs / bkey \| Unsorted / \<filename with metadata from redgifs only>
reddit | "base-directory" / Reddit / Submissions / \<subreddit title> / bkey \| Unsorted / +Clips / [1]

[1] = \<filename with metadata from redgifs and from _reddit_>
This is an example with a direct submission link from reddit, but it works the same with different categories from reddit (with a different "prefix" name instead of Submissions, of course).
It wouldn't really be complicated to make extension available for directories, but it is kind of wrong given the current "directory" semantics.
Ah, okay. I thought this would be just one more metadata field, basically, without breaking anything. Best to forget this approach then, I'll see if I can come up with another one.
Wouldn't it be possible to use reddit>redgifs as category to distinguish Reddit-posted Redgifs links from regular ones and only use the post processor there?
"reddit>redgifs":
{
"image":
{
"directory": ["+Clips"],
"postprocessors": ["classify"]
}
},
"redgifs":
{
"image":
{
"directory": {
"locals().get('bkey')" : ["Redgifs", "Clips", "{bkey}"],
"" : ["Redgifs", "Clips", "Unsorted"]
}
}
}
Good idea. Almost forgot that this option exists. To be honest, I've never used this "new" extractor>child-extractor option syntax. It seems like it should be the right fit for such a task. But does this change anything with regard to how the "archive" option works? Or is it just an additional step, i.e. the options in "reddit>redgifs", for example, simply get added "on top", and everything else like archive options etc. is kept as is?
Continuation of the previous issue as a central place for any sort of question or suggestion not deserving their own separate issue.
Links to older issues: #11, #74, #146.