mikf / gallery-dl

Command-line program to download image galleries and collections from several image hosting sites
GNU General Public License v2.0
12k stars 978 forks

Questions, Feedback, and Suggestions #4 #5262

Open mikf opened 8 months ago

mikf commented 8 months ago

Continuation of the previous issue as a central place for any sort of question or suggestion not deserving their own separate issue.

Links to older issues: #11, #74, #146.

BakedCookie commented 8 months ago

For most sites I'm able to sort files into year/month folders like this:

"directory": ["{category}", "{search_tags}", "{date:%Y}", "{date:%m}"]

However for redgifs it doesn't look like there's a date keyword available for directory. There's only a date keyword available for filename. Is this an oversight?
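For context, a rough sketch of why a missing `date` keyword breaks such a directory setting. gallery-dl format strings build on Python's `str.format()` syntax, where the part after the colon is passed to the value's own formatter. The code below is plain Python only; `build_directory` and the sample metadata are hypothetical and are not gallery-dl code.

```python
from datetime import datetime

def build_directory(segments, metadata):
    """Fill each path segment with values from the metadata dict."""
    return [segment.format(**metadata) for segment in segments]

segments = ["{category}", "{search_tags}", "{date:%Y}", "{date:%m}"]
metadata = {
    "category": "gelbooru",
    "search_tags": "nighthawk_(circle)",
    "date": datetime(2024, 3, 5, 12, 30),  # made-up sample value
}
print(build_directory(segments, metadata))
```

If `date` is absent from the metadata, plain `str.format()` raises a `KeyError` for the last two segments.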

mikf commented 8 months ago

Yep, that's a mistake that happened when adding support for galleries in 5a6fd802. Will be fixed with the next git push.

edit: https://github.com/mikf/gallery-dl/commit/82c73c77b04fe21766c826852c68dde9b327dfbe

taskhawk commented 8 months ago

There's a typo in extractor.reddit.client-id & .user-agent:

"I'm not a rebot"

the-blank-x commented 8 months ago

There's also another typo in extractor.reddit.client-id & .user-agent, "reCATCHA"

biggestsonicfan commented 8 months ago

Can you grab all the media from quoted tweets? Example.

mikf commented 8 months ago

Regarding typos, thanks for pointing them out. I would be surprised if there aren't at least 10 more somewhere in this file.

@biggestsonicfan This is implemented as a search for quoted_tweet_id:…- on Twitter's end. I've added an extractor for it similar to the hashtags one (https://github.com/mikf/gallery-dl/commit/40c0553523bb28790de0e6a07a978a42e2be88c7), but it only does said search under the hood.

BakedCookie commented 8 months ago

Normally %-encoded characters in the URL get converted nicely when running gallery-dl, eg.

https://gelbooru.com/index.php?page=post&s=list&tags=nighthawk_%28circle%29 gives me a nighthawk_(circle) folder

but for this url: https://gelbooru.com/index.php?page=post&s=list&tags=shin%26%23039%3Bya_%28shin%26%23039%3Byanchi%29

I'm getting a shin&#039;ya_(shin&#039;yanchi) folder. Shouldn't I be getting a shin'ya_(shin'yanchi) folder instead?

EDIT: Actually, I think there's just something wrong with that URL. I had it saved for a long time and searching that tag normally gives a different URL (https://gelbooru.com/index.php?page=post&s=list&tags=shin%27ya_%28shin%27yanchi%29). I still got valid posts from the weird URL so I didn't think much of it.

mikf commented 8 months ago

%28 and so on are URL escaped values, which do get resolved. &#039; is the HTML escaped value for '.

You could use {search_tags!U} to convert them.
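The two escaping layers can be reproduced with the Python standard library. This is only an illustration of URL decoding versus HTML unescaping using the URL from the question; it is not a claim about how gallery-dl implements the conversion.

```python
import html
from urllib.parse import unquote

raw = "shin%26%23039%3Bya_%28shin%26%23039%3Byanchi%29"

url_decoded = unquote(raw)                 # resolves %XX escapes only
html_decoded = html.unescape(url_decoded)  # resolves &#039; and similar

print(url_decoded)
print(html_decoded)
```

The first step leaves the HTML entity in place, which matches the folder name reported above; only the second step turns it into an apostrophe.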

taskhawk commented 8 months ago

Is there support to remove metadata like this?

gallery-dl -K https://www.reddit.com/r/carporn/comments/axo236/mean_ctsv/

...
preview['images'][N]['resolutions'][N]['height']
  144
preview['images'][N]['resolutions'][N]['url']
  https://preview.redd.it/mcerovafack21.jpg?width=108&crop=smart&auto=webp&s=f8516c60ad7fa17c84143d549c070738b8bcc989
preview['images'][N]['resolutions'][N]['width']
  108
...

Post-processor:

"filter-metadata":
    {
      "name": "metadata",
      "mode": "delete",
      "event": "prepare",
      "fields": ["preview[images][0][resolutions]"]
    }

I've tried a few variations but no dice.

"fields": ["preview[images][][resolutions]"]
"fields": ["preview[images][N][resolutions]"]
"fields": ["preview['images'][0]['resolutions']"]

YuanGYao commented 8 months ago

Hello, I left a comment in #4168. Does the _pagination method of the WeiboExtractor class in weibo.py return when data["list"] is an empty list? When I used gallery-dl to batch download the album page of Weibo, the download also appeared incomplete. Through testing on the web page, I found that Weibo's getImageWall api sometimes returns an empty list when the image is not completely loaded. I think this may be what causes gallery-dl to terminate the download.

mikf commented 8 months ago

@taskhawk fields selectors are quite limited and can't really handle lists. You might want to use a python post processor (example) and write some code that does this.

def remove_resolutions(metadata):
    for image in metadata["preview"]["images"]:
        del image["resolutions"]

(untested, might need some check whether preview and/or images exists)
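A variant with the existence checks mentioned above might look like the following. The field layout is taken from the `-K` output earlier in the thread; the function has only been exercised on the sample dict shown here, not inside gallery-dl.

```python
def remove_resolutions(metadata):
    """Delete every 'resolutions' list under metadata['preview']['images']."""
    preview = metadata.get("preview")
    if not isinstance(preview, dict):
        return  # post has no preview data
    for image in preview.get("images") or ():
        # pop() with a default does not raise if the key is already absent
        image.pop("resolutions", None)

# Made-up sample in the shape of the -K output
sample = {"preview": {"images": [{"id": "x", "resolutions": [{"width": 108}]}]}}
remove_resolutions(sample)
print(sample)
```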

@YuanGYao Yes, the code currently stops when Weibo's API returns no more results (empty list). This is probably not ideal, as I've hinted at in https://github.com/mikf/gallery-dl/issues/4168#issuecomment-1589119191

YuanGYao commented 8 months ago

@mikf Well, I think for Weibo's album page, since_id should be used to determine whether the image is fully loaded. I updated my comment in #4168(comment) and attached the response returned by Weibo's getImageWall API. I think this should help solve this problem.
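The idea of stopping on `since_id` rather than on an empty list could be sketched like this. Everything here is an assumption for illustration: `fetch_page` is a hypothetical callable, and the response shape (a dict with "list" and "since_id", where a falsy `since_id` means the end) is inferred from the discussion, not taken from the extractor code.

```python
def paginate(fetch_page):
    """Yield items from every page until the API stops returning a since_id.

    fetch_page is a hypothetical callable: it takes the current since_id
    and returns a dict with "list" and "since_id" keys.
    """
    since_id = 0
    while True:
        data = fetch_page(since_id)
        yield from data.get("list") or ()
        since_id = data.get("since_id")
        if not since_id:
            break  # no cursor left: treat this as the last page


# Fake responses: the second page is empty but still carries a cursor.
pages = {
    0: {"list": ["a", "b"], "since_id": 10},
    10: {"list": [], "since_id": 20},
    20: {"list": ["c"], "since_id": 0},
}
print(list(paginate(pages.__getitem__)))
```

A loop that stopped at the first empty list would miss the last item in this example.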

BakedCookie commented 8 months ago

Not sure if I'm missing something, but are directory specific configurations exclusive to running gallery-dl via the executable?

Basically, I have a directory for regular tags, and a directory for artist tags. For regular tags I use "directory": ["{category}", "{search_tags}", "{date:%Y}", "{date:%m}"] since the tag number is manageable. For artist tags though, there's way more of them so this "directory": ["{category}", "{search_tags[0]!u}", "{search_tags}", "{date:%Y}", "{date:%m}"] makes more sense.

So right now the only way I know to get this per-directory configuration to work, is to copy the gallery-dl executable everywhere I want to use a master configuration override. Am I missing something? It feels like there should be a better way.

Hrxn commented 8 months ago

Huh? No, the configuration works always in the same way. You're simply using different configuration files?

BakedCookie commented 8 months ago

@Hrxn

From the readme:

When run as executable, gallery-dl will also look for a gallery-dl.conf file in the same directory as said executable.

It is possible to use more than one configuration file at a time. In this case, any values from files after the first will get merged into the already loaded settings and potentially override previous ones.

I want to override my master configuration %APPDATA%\gallery-dl\config.json in specific directories with a local gallery-dl.conf but it seems like that's only possible with the standalone executable.

taskhawk commented 8 months ago

You can load additional configuration files from the console with:

-c, --config FILE           Additional configuration files

You just need to specify the path to the file and any options there will overwrite your main configuration file.

Edit: From my understanding, yeah, automatic loading of local config files in each directory is only possible having the standalone executable in each directory. Are different directory options the only thing you need?

BakedCookie commented 8 months ago

@taskhawk

Thanks, that's exactly what I was looking for! Guess I didn't read the documentation thoroughly enough.

For now the only thing I'd want to override is the directory structure for artist tags. I don't think it's possible to determine from the metadata alone if a given tag is the name of an artist or not, so I thought the best way to go about it is to just have a separate directory for artists, and use a configuration override. So yeah, loading that override with the -c flag works great for that purpose, thanks again!

taskhawk commented 8 months ago

You kinda can, but you need to enable tags for Gelbooru in your configuration to get them, which will require an additional request:

    "gelbooru": {
      "directory": {
        "search_tags in tags_artists": ["{category}", "{search_tags[0]!u}", "{search_tags}", "{date:%Y}", "{date:%m}"],
        ""                           : ["{category}", "{search_tags}", "{date:%Y}", "{date:%m}"]
      },
      "tags": true
    },

Set "tags": true in your config and run a test with gallery-dl -K "https://gelbooru.com/index.php?page=post&s=list&tags=TAG" so you can see the tags_* keywords.

Of course, this depends on the artists being correctly tagged. Not sure if it happens on Gelbooru, but at least in other boorus and booru-like sites I've come across posts with the artist tagged as a general tag instead of an artist tag. Another limitation is that your search tag can only include one artist at a time, doing more will require a more complex expression to check all tags are present in tags_artists.

What I do instead is that I inject a keyword to influence where it will be saved, like this:

gallery-dl -o keywords='{"search_tags_type":"artists"}' "https://gelbooru.com/index.php?page=post&s=list&tags=ARTIST"

And in my config I have

    "gelbooru": {
      "directory": ["boorus", "{search_tags_type}", "{search_tags}"]
    },
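For scripting, the injected keyword can be built programmatically. A minimal sketch: `build_command` is a hypothetical helper, and only the `-o keywords=...` form shown above is assumed about gallery-dl's command line.

```python
import json
import shlex

def build_command(url, tag_type="artists"):
    """Return a gallery-dl argument list that injects search_tags_type."""
    keywords = json.dumps({"search_tags_type": tag_type}, separators=(",", ":"))
    return ["gallery-dl", "-o", f"keywords={keywords}", url]

cmd = build_command("https://gelbooru.com/index.php?page=post&s=list&tags=ARTIST")
print(shlex.join(cmd))  # shell-quoted form, for inspection
# subprocess.run(cmd) would start the download; omitted here
```

Passing the arguments as a list avoids having to quote the JSON by hand.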

You can have:

    "gelbooru": {
      "directory": {
        "search_tags_type == 'artists'": ["{category}", "{search_tags[0]!u}", "{search_tags}", "{date:%Y}", "{date:%m}"],
        ""                             : ["{category}", "{search_tags}", "{date:%Y}", "{date:%m}"]
      }
    },
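As a mental model, a conditional "directory" setting behaves like picking the first expression that evaluates to true, with the empty string as fallback. The sketch below illustrates that idea in plain Python; `pick_directory` is hypothetical and is not how gallery-dl is implemented.

```python
def pick_directory(conditions, metadata):
    """Return the segments of the first condition that evaluates truthy.

    An empty-string key acts as the fallback, as in the config above.
    """
    for expression, segments in conditions.items():
        if expression == "":
            continue  # fallback is handled after all real conditions
        # eval() is acceptable only because the expressions come from
        # the user's own configuration file
        if eval(expression, {}, dict(metadata)):
            return segments
    return conditions.get("")

conditions = {
    "search_tags_type == 'artists'": ["{category}", "{search_tags[0]!u}", "{search_tags}"],
    "": ["{category}", "{search_tags}"],
}
print(pick_directory(conditions, {"search_tags_type": "artists"}))
print(pick_directory(conditions, {"search_tags_type": "general"}))
```

In this sketch a metadata dict without `search_tags_type` raises a `NameError`, so the injected keyword has to be present on every run.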

You can do this for other tag types, like general, copyright, characters, etc.

Because it's a chore to type that option every time, I made a wrapper script. Since artists is my default, I just call it like this:

~/script.sh "TAG"

For other tag types I can do:

~/script.sh --copyright "TAG"
~/script.sh --characters "TAG"
~/script.sh --general "TAG"

BakedCookie commented 8 months ago

Thanks for pointing out there's a tags option available for the gelbooru extractor. I already used it in the kemono extractor to get the name of the artist, but it didn't occur to me that gelbooru might also have such an option (and just accepted that the tags aren't categorized).

For artists I store all the URLs in their respective gelbooru.txt, rule34.txt, etc. files like so:

https://gelbooru.com/index.php?page=post&s=list&tags=john_doe
https://gelbooru.com/index.php?page=post&s=list&tags=blue-senpai
https://gelbooru.com/index.php?page=post&s=list&tags=kaneru
.
.
.

And then just run gallery-dl -c gallery-dl.conf -i gelbooru.txt. Since the search_tags ends up being the artist anyway, getting tags_artists is probably not worth the extra request. Same for general tags, and copyright tags, in their respective directories. With this workflow I can't immediately see where I'd be able to utilize keyword injection, but it's definitely a useful feature that I'll keep in mind.

Wiiplay123 commented 8 months ago

When I'm making an extractor, what do I do if the site doesn't have different URL patterns for different page types? Every single page is just a numerical ID that could be a forum post, image, blog post, or something completely different.

mikf commented 8 months ago

@Wiiplay123 You handle everything with a single extractor and decide what type of result to return on the fly. The gofile code is a good example for this I think, or aryion.

I-seah commented 8 months ago

Hi, what options should I use in my config file to change the format of dates in metadata files? I would like to use "%Y-%m-%dT%H:%M:%S%z" for the values of "date" and "published" (from coomer/kemono downloads).

And would it also be possible to do this for json files that ytdl creates? I downloaded some videos with gallery-dl but the dates got saved as "upload_date": "20230910" and "timestamp": 1694344011, so I think it might be better to convert the timestamp to a date to get a more precise upload time, but I'm not sure if it's possible to do that either.

Hrxn commented 8 months ago

If the field is simply called date:

{date:Olocal/%Y-%m-%dT%H:%M:%S}

Note: You cannot use something like %H:%M:%S in filenames, because : is not allowed (on Windows/NTFS). (It's good practice to avoid this on Linux etc. as well: a) for compatibility reasons, and b) because : is the path entry separator on Linux.)

You can also change the format options of a post-processor, yes, you don't have to keep the default JSON created by gallery-dl.

Timestamps in epoch format can be converted with something like datetime.datetime.fromtimestamp(ts, datetime.UTC), I think..
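In plain Python, outside of any gallery-dl configuration, the conversion looks like this. The timestamp is the one quoted in the question; the output format is the one asked for.

```python
from datetime import datetime, timezone

timestamp = 1694344011  # value from the ytdl JSON mentioned above

# timezone.utc works on every Python 3 release; datetime.UTC is an
# alias for it that was added in Python 3.11
date = datetime.fromtimestamp(timestamp, timezone.utc)
formatted = date.strftime("%Y-%m-%dT%H:%M:%S%z")
print(formatted)  # 2023-09-10T11:06:51+0000
```

Because the datetime object is timezone-aware, `%z` produces an offset instead of an empty string.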

I-seah commented 8 months ago

@Hrxn

You can also change the format options of a post-processor,

To do that should I add {date:Olocal/%Y-%m-%dT%H:%M:%S} and datetime.datetime.fromtimestamp(ts, datetime.UTC) under "postprocessors": in my configuration file? How exactly should I do that? Sorry, I don't really know what I am doing.

taskhawk commented 8 months ago

You need to add it like this to your configuration file before the postprocessor for writing metadata:

"kemonoparty": {
  "#": "...",
  "postprocessors": [
    {
      "name": "metadata",
      "mode": "modify",
      "fields": {
        "date": "{date:Olocal/%Y-%m-%dT%H:%M:%S}"
      }
    },
    {
      "name": "metadata",
      "directory": ".metadata"
    }
  ]
},

The event value for the postprocessors for modifying metadata and writing metadata should be the same. If you are just using the default values then there's no need to adjust that.

Isn't published already in the format you wanted? Also, where are upload_date and timestamp coming from? They don't seem to be default keywords for Kemono, I think.

Wiiplay123 commented 8 months ago

Is there a way to skip links that redirect to a 404 page while still giving a 200 OK status? The 404 page is the same each time.

I-seah commented 8 months ago

@taskhawk Thanks, I wanted to add %z to the end of %Y-%m-%dT%H:%M:%S to get information on the time zone offset, but no extra information got added when I included %z, so I'm guessing that kemono doesn't have any information on the time zone.

Isn't published already in the format you wanted?

Yes, sorry I didn't realize that published and date were the same date.

Also, where are upload_date and timestamp coming from?

upload_date and timestamp were from the json file of a video that I downloaded from TikTok with the ytdl extractor. I think that upload_date doesn't include the upload time of a video (only the date), so I was hoping that I could use a gallery-dl postprocessor option to convert timestamp into %Y-%m-%dT%H:%M:%S%z.

mikf commented 8 months ago

@Wiiplay123 This can be done by assigning a function to a _http_validate field in a file's metadata (example), which then gets called to check the initial response. It should return True/False for a valid/invalid response. You can realistically only check status code, history, and headers, since accessing the response's content will have weird side effects.
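A validation function along those lines might look like the sketch below. The assumptions are labeled: `NOT_FOUND_URL` is a made-up address, the response is assumed to behave like a `requests.Response` (a `history` list whose entries have `headers`), and the redirect is assumed to use an absolute `Location` value.

```python
from types import SimpleNamespace

# Hypothetical address of the site's "not found" page
NOT_FOUND_URL = "https://example.org/404"

def validate(response):
    """Return False when any redirect pointed at the 404 page."""
    for hop in response.history:
        if hop.headers.get("Location") == NOT_FOUND_URL:
            return False
    return True

# Inside an extractor, the function would be attached to a file's
# metadata, e.g.  metadata["_http_validate"] = validate

# Stand-in objects with only the attributes used above
redirect = SimpleNamespace(headers={"Location": NOT_FOUND_URL})
ok = SimpleNamespace(history=[])
gone = SimpleNamespace(history=[redirect])
print(validate(ok), validate(gone))
```

Only the redirect history and headers are inspected, so the response body is never touched.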

@I-seah

to get information on the time zone offset,

All dates are in UTC/GMT and do not have any timezone information attached to them.

I was hoping that I could use a gallery-dl postprocessor option to convert timestamp into %Y-%m-%dT%H:%M:%S%z.

This might work.

    {
      "name": "metadata",
      "mode": "modify",
      "filter": "locals().get('timestamp')",
      "fields": {
        "date_from_timestamp": "{timestamp!d:%Y-%m-%dT%H:%M:%S}"
      }
    },

britefire commented 8 months ago

Is there any way to download the announcements page posts for a per user search on kemono? Such as: https://kemono.su/fanbox/user/EXAMPLE/announcements

They're text posts but sometimes have info or similar that'd be nice to have backed up as well with the tool, apologies if it's possible and I'm missing it ^^:

WarmWelcome commented 8 months ago

Something I have run into quite a lot lately is twitter logging me out somewhere in the middle of the job, then making me do a bot check. Is there a way of making it halt when it reaches this error, or a way of avoiding getting kicked out? [twitter][error] 401 Unauthorized (Could not authenticate you)

fireattack commented 8 months ago

When trying to download a batch of images from Twitter by tweet IDs, it has to be done one by one (in terms of requests), right?

I know the classic v1.1 API has/had an endpoint to query tweets in batch by a list of IDs, but I assume that does not exist for the GraphQL API we're using?

mikf commented 8 months ago

@britefire Announcements aren't supported yet, only DMs and comments. I'll look into it.

@WarmWelcome Maybe with the locked option (#5300), but it doesn't seem to work for some of these errors (#5370). I'll probably have to implement some form of "cursor" support like Instagram has.

@fireattack Yeah, there doesn't seem to be a way of fetching multiple Tweets by ID with a single API call using the GraphQL API. It only implements what's needed for browsing the site and I haven't seen it needing to fetch multiple Tweets that aren't in some timeline or feed.

WarmWelcome commented 8 months ago

@WarmWelcome Maybe with the locked option (#5300), but it doesn't seem to work for some of these errors (#5370). I'll probably have to implement some form of "cursor" support like Instagram has.

I was keeping an eye out for something like this for a week, and only skipped checking today and now it appears lol. That's exactly what I need. I'll have to check it out sometime soon. Thank you

fireattack commented 8 months ago

What is the best way to use gdl as a module?

I currently came up with something like

import gallery_dl
import sys

def dump_json(obj):
    ...
    return path

options = { ... }
options_path = dump_json(options)
url = '...'
sys.argv = ["gallery-dl", url, '--config', options_path]
gallery_dl.main()

which is kinda ugly but gets the job done. Just wondering if there is a better way.

Edit:

It actually does not work well. Running it once is fine, but then it either terminates the entire Python process, or sometimes prints everything twice (??).

import gallery_dl
import sys
def check_version():
    sys.argv = ["gallery-dl", '--version']
    print(sys.argv)
    gallery_dl.main()
def main():
    print('Let us check version...')
    check_version()
    # we never reach this point
    print('Let us check version again...')
    check_version()
if __name__ == "__main__":
    main()

fireattack commented 8 months ago

Additional question -- how can I make archive file relative to download destination?

I currently have configuration of

{
    "extractor": {
        "base-directory": "./",
        "url-metadata": "gdl_url",
        "path-metadata": "gdl_path",
        "instagram": {
            "directory": [""],
            "cookies": [
                "chrome",
                null,
                null,
                null,
                ".instagram.com"
            ],
            "skip": "abort:4",
            "archive": "_downloaded.sqlite3",
            "postprocessors": [
                {
                    "name": "metadata",
                    "event": "post-after",
                    "filename": "_metadata.jsonl",
                    "mode": "jsonl",
                    "open": "a"
                }
            ]
        }
    }
}

And when I download a certain Instagram profile with gallery-dl https://instagram.com/USER/ -d "C:\mylocaltion\test\", it will download images/videos and _metadata.jsonl to that destination.

However, the _downloaded.sqlite3 will be in the CWD instead.

mikf commented 8 months ago

Take a look at #642. There are several examples in there.

In your case, you should import gallery_dl.config and gallery_dl.job, set your config options, and run a Job with your input URL.

from gallery_dl import config, job

options = { ... }
url = '...'

config._config.update(options)
dl = job.DownloadJob(url)
dl.run()

but then it either terminates the entire Python process

argparse raises SystemExit for --version.
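The behavior can be reproduced with a small stand-in parser. This sketch does not call gallery-dl; `run_cli` and the "demo" parser are made up to show that `--version` ends in `SystemExit` and that catching it lets the calling code continue. Wrapping `gallery_dl.main()` in the same try/except would be the analogous step, which is untested here.

```python
import argparse

def run_cli(argv):
    """Call an argparse-based entry point and return its exit code."""
    parser = argparse.ArgumentParser(prog="demo")
    parser.add_argument("--version", action="version", version="demo 1.0")
    try:
        parser.parse_args(argv)
    except SystemExit as exc:
        # --version prints the version, then exits via SystemExit
        return exc.code
    return 0

print("first call:", run_cli(["--version"]))
print("second call:", run_cli(["--version"]))  # reached this time
```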

or sometimes prints everything twice (??).

No idea either.

how can I make archive file relative to download destination?

All relative paths are always relative to CWD.

You could change it beforehand, or you might be able to do this by enabling path-metadata and using it in the archive path as a format string: {gdl_path.realdirectory}.... This might result in some weird behavior, though.
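Changing the working directory beforehand could be done with a small context manager, so that the previous CWD is restored afterwards. This is a generic Python sketch; `working_directory` is a hypothetical helper and the temporary directory stands in for the download destination.

```python
import os
import tempfile
from contextlib import contextmanager
from pathlib import Path

@contextmanager
def working_directory(path):
    """Temporarily switch the CWD, restoring the old one afterwards."""
    previous = Path.cwd()
    os.chdir(path)
    try:
        yield Path.cwd()
    finally:
        os.chdir(previous)

with tempfile.TemporaryDirectory() as destination:
    with working_directory(destination):
        # a relative name such as "_downloaded.sqlite3" now resolves here
        archive = Path("_downloaded.sqlite3").resolve()
    print(archive.parent == Path(destination).resolve())
```

When gallery-dl is used as a module, the download job would be started inside the `with working_directory(...)` block.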

fireattack commented 8 months ago

Thanks! That helps tremendously. I ended up just using dynamic

config.set((), 'base-directory', str(user_dir))
config.set(('extractor', 'instagram'), 'archive', str(downloaded))

to set paths for all these companion files.

Also the double-print is related to output.initialize_logging() being called twice when I was using .main(), which now I ensured to only call it once globally.

I'm curious why even if you don't call output.initialize_logging() at all and straightly go dl.run() (as your example), it still generates INFO-level log?

mikf commented 8 months ago

I'm curious why even if you don't call output.initialize_logging() at all and straightly go dl.run() (as your example), it still generates INFO-level log?

It doesn't, at least not for me. I get WARNING and ERROR logging messages, since that seems to be the default level for logging.getLogger() objects, but not INFO or DEBUG. Maybe this is different for your stdlib implementation.
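The default can be checked with the standard library alone. The sketch assumes a fresh interpreter in which nothing has configured logging yet; the logger name "demo" is arbitrary.

```python
import logging

log = logging.getLogger("demo")

# A new logger has no level of its own and inherits the root logger's,
# which is WARNING unless something has configured logging
print(logging.getLevelName(log.getEffectiveLevel()))

# With no handlers configured, the "last resort" handler writes
# WARNING and above to stderr
log.info("not shown")
log.warning("shown on stderr")
```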

fireattack commented 8 months ago

I'm curious why even if you don't call output.initialize_logging() at all and straightly go dl.run() (as your example), it still generates INFO-level log?

It doesn't, at least not for me. I get WARNING and ERROR logging messages, since that seems to be the default level for logging.getLogger() objects, but not INFO or DEBUG. Maybe this is different for your stdlib implementation.

Maybe I don't use the word right.

I meant that even without initialize_logging(), it still prints the files you downloaded. I would assume if you don't initialize, it won't print anything at all since no logger was set.

mikf commented 8 months ago

Downloaded file output is separate from logging messages. Their paths get directly written to sys.stdout and can be controlled with output.mode.

throwaway242685 commented 7 months ago

hi, how do I download files from oldest to newest?

I'm using this:

https://www.instagram.com/{my_user}/saved/all-posts/

and I need to start downloading from the oldest posts first, how do I do that?

JailSeed commented 7 months ago

Hi! Is it possible to download posts from Pixiv from a specified bookmark page? For example I want to download not all bookmarks but only from page 2. I tried this URL /bookmarks/artworks?p=2 but gallery-dl still downloads all my bookmarks.

Hrxn commented 7 months ago

@mikf Would a formatting option like "{title!t:?/__/R__//}" be legitimate? Would an order of operations like this

  1. trim
  2. replace '__' with ''
  3. append '__' if title

be possible?

mikf commented 7 months ago

@throwaway26425 Not possible, especially not with how IG returns its results. You could theoretically grab all download links (-g) from newest to oldest, reverse their order, and then download those.
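The reversing step of that workaround is simple to script. In this sketch `oldest_first` is a hypothetical helper and the listing is made-up sample data standing in for saved `-g` output.

```python
def oldest_first(listing):
    """Reverse a newest-to-oldest list of URLs, one per line."""
    urls = [line.strip() for line in listing.splitlines() if line.strip()]
    return urls[::-1]

# Hypothetical output of `gallery-dl -g URL`, newest entry first
listing = """\
https://example.org/post3.jpg
https://example.org/post2.jpg
https://example.org/post1.jpg
"""
print("\n".join(oldest_first(listing)))
```

The reversed list could then be written to a text file and passed to gallery-dl with -i.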

@JailSeed Not really supported. You could use --range, but that selects by file count and not post count.

@Hrxn This would work, but it probably crashes when there's no title. !t would need to be applied after ? or at least after some form of check that title is a string.

This could be more reliably done in an f-string:

\fF …{title.strip().replace("__", "") + "__" if title else ""}…

Hrxn commented 7 months ago

@mikf Thanks, that helps. Agree about the f-string part, but I think in this case the site always provides a title, so I don't think there would be anything that contends against continuing to use "{title!t:?/__/R__//}"..
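For reference, the expression from the f-string suggestion behaves as follows in plain Python. `suffix_title` is just a wrapper for demonstration; the sample titles are made up.

```python
def suffix_title(title):
    """Strip the title, drop '__', and append '__' only if it is set."""
    # The conditional applies to the whole sum, so an empty or missing
    # title never reaches .strip()
    return title.strip().replace("__", "") + "__" if title else ""

print(repr(suffix_title("  My__Title ")))
print(repr(suffix_title("")))
print(repr(suffix_title(None)))
```

The last call shows why this form does not crash when there is no title.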

Hrxn commented 7 months ago

@mikf The scenario: Submission on reddit, hosted on redgifs, but it's actually an image (yes, I know.. edge case. But I've seen it at least once)

I believe it should be possible to solve this with a conditional directory setting using what we already got in gallery-dl, but I'm not sure.

Accessing metadata coming from reddit can be done with locals().get('_reddit_'), but I'm unsure if we can proceed from there on without breaking..

Example from -K on a reddit link:

is_video
  False

but at the same time

media['oembed']['type']
  video

and

post_hint
  rich:video

which.. totally makes sense..

The easiest way would probably be something like this

                "directory": {
                    "'_reddit_' in locals() and extension in ('mp4', 'webm')" : ["Video"],
                    "'_reddit_' in locals() and extension in ('gif', 'apng')" : ["Gif"],
                    "'_reddit_' in locals() and extension in ('jpg', 'png')"  : ["Picture"],

in using extension from redgif, which already exists! But it does not work for "directory", because it's a metadata entry for "file and filter". Would it be very complicated to make extension also available as a directory metadata value?

mikf commented 7 months ago

Wouldn't a classify post processor work here?

It wouldn't really be complicated to make extension available for directories, but it is kind of wrong given the current "directory" semantics.
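The idea behind sorting by extension can be sketched in a few lines. The mapping reuses the three groups from the conditions above; `classify` here is a hypothetical function for illustration and says nothing about the options or defaults of gallery-dl's classify post processor.

```python
# Mapping taken from the conditions in the comment above
GROUPS = {
    "Video": ("mp4", "webm"),
    "Gif": ("gif", "apng"),
    "Picture": ("jpg", "png"),
}

def classify(extension, default=""):
    """Return the subdirectory name for a file extension."""
    for directory, extensions in GROUPS.items():
        if extension.lower() in extensions:
            return directory
    return default

print(classify("webm"), classify("PNG"), repr(classify("zip")))
```

Because the decision is made per file, it fits a post processor better than the per-post "directory" setting.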

Hrxn commented 7 months ago

classify would work, and I'm using it for everything in redgifs except for the image subcategory so that I can differentiate between downloading a single item directly on redgifs vs a submission on reddit hosted on redgifs, and I'm not sure how to achieve that otherwise.

config excerpt (a bit simplified), giving me the output paths I'm using for a while now and would like to keep:

"redgifs":
{
    "image":
    {
        "directory": {
            "'_reddit_' in locals()": ["+Clips"],
            "locals().get('bkey')"  : ["Redgifs", "Clips", "{bkey}"],
            ""                      : ["Redgifs", "Clips", "Unsorted"]
        }
    }
}

I'm using "parent-directory": true and "parent-metadata": "_reddit_" for reddit, obviously, and the result is basically this:

| input URL from.. | Output Destination |
| --- | --- |
| redgifs | "base-directory" / Redgifs / bkey \| Unsorted / \<filename with metadata from redgifs only> |
| reddit | "base-directory" / Reddit / Submissions / \<subreddit title> / bkey \| Unsorted / +Clips / [1] |

[1] = \<filename with metadata from redgifs and from _reddit_>

This is an example with a direct submission link from reddit, but it works the same with different categories from reddit (with a different "prefix" name instead of Submissions, of course)

It wouldn't really be complicated to make extension available for directories, but it is kind of wrong given the current "directory" semantics.

Ah, okay. I thought this would be just one more metadata field, basically, without breaking anything. Best to forget this approach then, I'll see if I can come up with another one.

mikf commented 7 months ago

Wouldn't it be possible to use reddit>redgifs as category to distinguish Reddit-posted Redgifs links from regular ones and only use the post processor there?

"reddit>redgifs":
{
    "image":
    {
        "directory": ["+Clips"],
        "postprocessors": ["classify"]
    }
},
"redgifs":
{
    "image":
    {
        "directory": {
            "locals().get('bkey')"  : ["Redgifs", "Clips", "{bkey}"],
            ""                      : ["Redgifs", "Clips", "Unsorted"]
        }
    }
}

Hrxn commented 7 months ago

Good idea. Almost forgot that this option exists. To be honest, I've never used this "new" extractor>child-extractor option syntax.

Seems like it should be the right fit for such a task. But does this change anything with regard to how the "archive" option works? Or is this just an additional step, i.e. the options in "reddit>redgifs", for example, get simply added "on the top", and everything else like archive options etc. is kept as is?