mikf / gallery-dl

Command-line program to download image galleries and collections from several image hosting sites
GNU General Public License v2.0

Questions, Feedback and Suggestions #3 #146

Closed mikf closed 4 months ago

mikf commented 5 years ago

Continuation of the old issue as a central place for any sort of question or suggestion not deserving their own separate issue. There is also https://gitter.im/gallery-dl/main if that seems more appropriate.

Links to older issues: #11, #74

Jaid commented 1 year ago
  1. Are exit status codes documented somewhere? Got 20 once (seems to be for missing auth configuration) and a lot of 4 (seems to be for successful runs where some of the images are no longer available), but I can only guess.

Edit: Found the answer in the code. ✓ https://github.com/mikf/gallery-dl/blob/bd5d08abbcd5729c93a85d2189bef1561959b3b4/gallery_dl/exception.py#L51-L72

  2. Can I load cookies from environment variables that I reference in my config.json? Tried it, and either I did something wrong or this way is not supported.
  3. How do I use --option to inject more complex structures into the loaded configuration, like this whole block?
{
  "extractor": {
    "twitter": {
      "directory": {
        "'reply_to' in locals()": [
          "{category}",
          "{user[name]!l}",
          "replies",
          "{reply_to|'-'!l}",
          "{tweet_id}"
        ],
        "(some dynamic query)": [
          "{category}",
          "{user[name]!l}",
          "(some dynamic value)",
          "{tweet_id}"
        ],
        "": [
          "{category}",
          "{user[name]!l}",
          "tweets",
          "{tweet_id}"
        ]
      }
    }
  }
}

Assuming the values are too dynamic to use gallery-dl’s expression language. Also assuming I can’t store a temporary, dynamically generated config.json.
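For scripting around the exit codes from the first question, here is a minimal Python sketch. It assumes the exit status is a bitwise OR of the per-exception codes in the linked exception.py, so that 20 would be 16 | 4. The labels in the table are an unverified reading of that file, and `decode_status` is a hypothetical helper, not part of gallery-dl.

```python
# Assumed meanings of individual flag bits; verify against exception.py.
ASSUMED_FLAGS = {
    4: "extraction/HTTP error",
    16: "authentication/authorization error",
}


def decode_status(status):
    """Split an exit status into the power-of-two flags it contains."""
    flags = []
    bit = 1
    while bit <= status:
        if status & bit:
            flags.append((bit, ASSUMED_FLAGS.get(bit, "unknown")))
        bit <<= 1
    return flags
```

Under that assumption, `decode_status(20)` yields the 4 and 16 flags, which would match "some files failed" plus "missing auth".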

mikf commented 1 year ago

@Hrxn path-replace gets only applied when path-restrict is just a simple string and not an object.

path-replace was implemented after all the other path-* options in response to #755, so it might feel a bit off.

What happens if you set these options at the base level, and then use "path-replace" again at any "deeper"/more specific category level? Does it overwrite the replacement char then? Or if you use "path-restrict" again, can you update/overwrite specific replacement association options this way?

The same as with all other options, i.e. the general setting gets completely overwritten by the more specific one. It is not possible to update some replacements, you'll have to copy everything you want to keep.
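To illustrate "completely overwritten", here is a small Python sketch of a most-specific-wins lookup. It is not gallery-dl's actual config code; `lookup` and the sample values are made up for illustration.

```python
def lookup(config, path, key):
    """Return `key` from the most specific section along `path` that defines it.

    The first hit wins and is returned as-is; values from less specific
    sections are never merged into it.
    """
    sections = [config]
    node = config
    for name in path:
        node = node.get(name)
        if not isinstance(node, dict):
            break
        sections.append(node)
    for section in reversed(sections):
        if key in section:
            return section[key]
    return None


# Hypothetical config: a general mapping plus a more specific one for twitter.
config = {
    "extractor": {
        "path-restrict": {"/": "_", ":": "-", "?": ""},
        "twitter": {
            "path-restrict": {":": "."},
        },
    }
}
```

With this data, the twitter lookup returns only `{":": "."}`; the `/` and `?` replacements from the base level are gone unless they are copied into the twitter section.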

PS: Maybe pinning the latest "Question, Feedback and Suggestions" thread at the top of the issues would be beneficial? What do you think? Or maybe too much of a distraction?

That's a really good idea.

I wonder why this hadn't been done to the other threads before. Maybe pinning issues was not implemented back then, or maybe it just didn't occur to me for some reason.

mikf commented 1 year ago

@Jaid

Can I load cookies from environment variables that I reference in my config.json? Tried it, and either I did something wrong or this way is not supported.

That's not supported via config, but you could use -o "cookies.auth_token=${galleryDlTwitterAuthToken}" on command-line.
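When calling gallery-dl from a script rather than a shell, the same idea could look like the sketch below. The environment variable name is taken from the command above; `cookie_option` and `build_command` are hypothetical helpers.

```python
import os


def cookie_option(env_name, cookie_name="auth_token"):
    """Build the -o argument pair from an environment variable."""
    value = os.environ[env_name]  # raises KeyError if the variable is unset
    return ["-o", f"cookies.{cookie_name}={value}"]


def build_command(url, env_name="galleryDlTwitterAuthToken"):
    """Return an argument list suitable for subprocess.run()."""
    # A list (no shell) avoids quoting problems with characters in the token.
    return ["gallery-dl", *cookie_option(env_name), url]
```

The resulting list can be passed to `subprocess.run(build_command(url))`.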

How do I use --option to inject more complex structures into the loaded configuration, like this whole block?

The VALUE part for --option NAME=VALUE can be any complex data structure as long as it's JSON parsable. Try something like --option 'extractor.twitter=<all twitter options as JSON>'
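If the structure is generated by a script, one way to produce a JSON-parsable VALUE is to serialize it with `json.dumps`. This is only a sketch; `json_option` is a hypothetical helper, and the option values are copied from the question.

```python
import json


def json_option(name, value):
    """Return an --option argument whose VALUE part is JSON-encoded."""
    return ["--option", f"{name}={json.dumps(value)}"]


# Dynamically built options; the keys and format strings are from the question.
twitter_options = {
    "directory": {
        "'reply_to' in locals()": [
            "{category}", "{user[name]!l}", "replies",
            "{reply_to|'-'!l}", "{tweet_id}",
        ],
        "": ["{category}", "{user[name]!l}", "tweets", "{tweet_id}"],
    }
}

args = ["gallery-dl", *json_option("extractor.twitter", twitter_options)]
```

Passing `args` as a list to `subprocess.run` sidesteps shell quoting of the embedded quotes and braces.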

brsk93 commented 1 year ago

How do I make gallery-dl only output warnings and errors? I tried this from the example config

"output": {
        "log": {
            "level": "warning"
        }
    }

But it doesn't work. It still prints the name of every file it downloads and fills my terminal with too much info. I run a script that scrapes a lot of accounts, and I have to scroll through tens of thousands of lines. I just want it to print something in my terminal if something goes wrong...

mikf commented 1 year ago

@brsk93 Set output.mode to "null".

    "output": {
        "mode": "null",
        "log": {
            "level": "warning"
        }
    }
dajotim937 commented 1 year ago

Is it possible (I mean from a technical point of view) for you to modify gallery-dl to trigger parent postprocessor if parent extractor invokes sub-extractors? If so, should I create new issue for this feature request?

Also, there is no way to filter the fields that get dumped into JSON except to manually write a JSON-like format into metadata.content-format or remove all unnecessary fields, is there?

mikf commented 1 year ago

@dajotim937

It would theoretically be possible to trigger a parent post processor when spawning a child; the problem is that the necessary data structures aren't necessarily initialized at that point in time (or ever). You can create an issue if you want and I might be able to hack something together, but this is something more for v2.0.

To filter metadata fields, use a metadata post processor in delete mode before running another, normal one that writes data:

"postprocessor": [
    {
        "name": "metadata",
        "mode": "delete",
        "fields": ["user", "create_date", "foobar"]
    },
    {
        "name": "metadata",
        "mode": "jsonl",
        "filename": "-"
    }
]
dajotim937 commented 1 year ago

To filter metadata fields, use a metadata post processor in delete mode before running another, normal one that writes data:

Well, yeah, I remember that. It's just that Reddit has so many JSON fields that are useless in my case, so it would be easier to write "I need these 10 fields, you can delete the other 30" than to manually list each of the 30 fields to remove.
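Until something like a keep-list exists, the delete list could be generated instead of typed by hand. A rough sketch, assuming the full set of field names is already known (for example from a previous metadata dump); both helpers and all field names are hypothetical.

```python
def fields_to_delete(all_fields, keep):
    """Return every field not in `keep`, preserving the original order."""
    keep = set(keep)
    return [name for name in all_fields if name not in keep]


def delete_postprocessor(all_fields, keep):
    """Build a metadata post processor entry in delete mode."""
    return {
        "name": "metadata",
        "mode": "delete",
        "fields": fields_to_delete(all_fields, keep),
    }
```

The returned dict has the same shape as the first post processor entry in the config shown earlier in this thread.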

You can create an issue if you want and I might be able to hack something together, but this is something more for v2.0.

Okay. In my workflow I managed to trigger one postprocessor only if reddit metadata exists, but I will create an issue just in case you decide to implement it.

AKL55 commented 1 year ago

Is there a way to blacklist galleries that have more than 100 pages on exhentai etc?

mikf commented 1 year ago

@AKL55 --filter 'int(filecount) <= 100 or abort()' works on exh specifically, but it still has to fetch the first image. You can also use &advsearch=1&f_spt=100 when searching to have exh itself filter out large galleries.

AKL55 commented 1 year ago

Also, I tried --filter "lang in ('eng')","int(filecount) <= 100 or abort()" but only the page range works

mikf commented 1 year ago

Conditional statements can be combined with and and or: --filter "lang == 'en' and int(filecount) <= 100 or abort()"
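Since the filter is a Python-style expression, its precedence can be checked in isolation. The sketch below evaluates the combined expression against made-up metadata; `Abort`, `abort`, and `check` are stand-ins written for this example, not gallery-dl internals.

```python
class Abort(Exception):
    """Stand-in for whatever abort() triggers inside gallery-dl."""


def abort():
    raise Abort()


def check(expression, metadata):
    """Evaluate a filter expression with the metadata fields as local names."""
    return eval(expression, {"abort": abort}, dict(metadata))


EXPR = "lang == 'en' and int(filecount) <= 100 or abort()"
```

Because `and` binds tighter than `or`, the expression reads as `(lang == 'en' and int(filecount) <= 100) or abort()`. In plain Python semantics that means `abort()` is also reached when only the language check fails, which may or may not be what is wanted.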

But you should really use the site-internal search. It's much more efficient. https://exhentai.org/?f_search=language%3Aenglish%24&advsearch=1&f_spt=100
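A search URL like this can also be assembled programmatically. The sketch below uses only the three query parameters that appear in the link above; `search_url` and the name `max_pages` are assumptions made for this example.

```python
from urllib.parse import urlencode


def search_url(query, max_pages):
    """Build a search URL with the advanced-search page limit applied."""
    params = {"f_search": query, "advsearch": 1, "f_spt": max_pages}
    return "https://exhentai.org/?" + urlencode(params)
```

`search_url("language:english$", 100)` reproduces the URL given above, with `:` and `$` percent-encoded.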

omfgntsa commented 1 year ago

Hi, is it possible to set different download rates for different sites? If so can someone show an example config file, thank you.

dajotim937 commented 1 year ago

@omfgntsa downloader.*.rate. Replace * with whichever extractor you need.

Example of config file. It has only special setting for downloader.ytdl.module but you can figure out how to set rates.

omfgntsa commented 1 year ago

Hi, I'm sorry, I cannot find the line "downloader.ytdl.module" in the example config file you linked. I tried adding downloader.*.rate and replaced * with the name of the site I want to limit, but I can't get it to work.

account00001 commented 1 year ago

Are there any plans to provide a build for Intel/AMD - 64 Bit?

mikf commented 1 year ago

@omfgntsa I already answered your question in #3865, but no, you cannot set different download rates for different sites at the moment.


@account00001 "Nightly" builds provide executables for "64 Bit" (x86_64). The Windows .exe files on the releases page will most likely always be 32 Bit (x86) only.

github-userx commented 1 year ago

I have been wondering about this for a long time:

How can we specify a certain extractor/plugin to use? I am asking for this case:

A site uses a chan board (like 4chan, 8ch, etc.), but gallery-dl doesn’t support the domain/URL natively. If I remember correctly, gallery-dl does support these kinds of image boards?

dajotim937 commented 1 year ago

@github-userx https://github.com/mikf/gallery-dl#examples

github-userx commented 1 year ago

Thanks! I just noticed it as well after posting my (dumb) question :D

I tried the 4chan and 8ch extractors for an image board, but they did not work; it seems they weren’t compatible. The image board looks like the usual chan boards.

mikf commented 1 year ago

@github-userx The only two image boards that have a generic implementation and support a <boardname>: prefix are lynxchan and vichan.

github-userx commented 1 year ago

You’re the best, Mike! I tried it and it turned out to be a lynxchan board! Vichan it wasn’t and didn’t work.

This is great! I’ve been impressed by gallery-dl and all your efforts for many years! I always recommend gallery-dl to everyone and always mention your name to people when I talk about OpenSource projects and super responsive & supportive developers, Stay awesome! ;)

github-userx commented 1 year ago

I was wondering, can we combine the Remote source ("r:https://domain.com/ https://domain.com/……") feature with the -g parameter? And how would we handle it if we have to specify the extractor:

gallery-dl -g "lynxchan:https://imageboard-example.com/board/catalog.html"

In this one we can’t use the r: option to extract URLs, correct?

taskhawk commented 1 year ago

Having this in my config for Reddit:

"filename": "{filename}.{extension}",
"videos": "ytdl"

Why am I getting the filename _DASH720.mp4 instead of iiric8uvdtx91.mp4 for this video:

https://www.reddit.com/r/selfie/comments/ylh6bf/

But the correct one, p6gu760aftx91.mp4, for this other one?

https://www.reddit.com/r/croptopgirls/comments/ylhetd/

mikf commented 1 year ago

@github-userx

You can combine -g and r: by doing gallery-dl -g r:<URL>. You don't need to specify an extractor in this case, since all r: does is load the page from the specified URL and return everything that looks like a link, i.e. it starts with https:// or http://.

A better way to get all download URLs from an image board catalog would be using -G or -gg to go one level deeper and resolve all returned threads.
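The "everything that looks like a link" behaviour of r: described above can be approximated in a few lines. This is only a sketch of the idea with a simple regular expression; it is not how gallery-dl implements it, and `find_links` is a hypothetical name.

```python
import re

# Anything starting with http:// or https:// up to whitespace, a quote,
# or an angle bracket.
LINK = re.compile(r"https?://[^\s\"'<>]+")


def find_links(page_text):
    """Return all substrings of the page that look like http(s) URLs."""
    return LINK.findall(page_text)
```

Nothing in this step knows about boards or threads, which is why no extractor prefix is needed for it.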

This is great! I’ve been impressed by gallery-dl and all your efforts for many years! I always recommend gallery-dl to everyone and always mention your name to people when I talk about OpenSource projects and super responsive & supportive developers, Stay awesome! ;) 

Thanks!

@taskhawk

I'd guess this is because the format selected by ytdl is different for these two posts, resulting in a different download URL and {filename}.

taskhawk commented 1 year ago

Thank you, that clued me in and I started reading more.

I installed yt-dlp as a PIP module and checked the URLs with yt-dlp --dump-json to see what keys it offered and added this to my config:

"downloader": {
    "ytdl": {
        "outtmpl": "%(id)s.%(ext)s"
    }
}

Now it works as I wanted it to.
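For anyone unfamiliar with the `%(id)s.%(ext)s` syntax: it follows Python's %-style formatting with named fields, which yt-dlp extends. The sketch below shows only the plain Python part; the field values are made up.

```python
def render(template, info):
    """Fill a %-style template from a dict of fields (extra keys are ignored)."""
    return template % info


# Hypothetical values, shaped like the keys listed by `yt-dlp --dump-json`.
info = {"id": "abc123", "ext": "mp4", "title": "example"}
name = render("%(id)s.%(ext)s", info)
```

Here `name` becomes `abc123.mp4`, independent of which format URL was selected.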

taskhawk commented 1 year ago

Seems a few backticks got messed up in the description of extractor.redgifs.format in the config doc page.

9696neko commented 1 year ago

So I mistakenly used --write-info-json instead of --write-metadata, and I want to go back and correct this by fetching only the metadata. I am not sure what combination of -skip=, my archive file (I was thinking of editing it via SQL), and command I should run to minimize my footprint (Pixiv, FYI). I also have a copy of the logs/console output to put together a list of links in the worst case. >_<

taskhawk commented 1 year ago

@9696neko, check issue #220.

mikf commented 1 year ago

@9696neko in particular, you should use --no-download and disable skip and archive with --no-skip --download-archive "". You can also add some delay between API requests with --sleep-request 1-2.
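Put together for use from a script, the flags above plus --write-metadata from the question could be assembled like this. Only options mentioned in this thread are used; `metadata_only_command` is a hypothetical helper.

```python
def metadata_only_command(url):
    """Argument list for a metadata-only pass that ignores skip and archive."""
    return [
        "gallery-dl",
        "--write-metadata",
        "--no-download",
        "--no-skip",
        "--download-archive", "",  # empty value disables the archive
        "--sleep-request", "1-2",  # delay between API requests
        url,
    ]
```

The list form keeps the empty archive argument intact when passed to `subprocess.run`.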

9696neko commented 1 year ago

@taskhawk, @mikf: Thanks very much. Sorry, I forgot that issues search exists ^_^

github-userx commented 1 year ago

Does gallery-dl have an output Filename template that can include metadata like upload_date and username? Similar to yt-dlp:

-o '%(uploader)s_%(upload_date)s_%(title)s_%(resolution)s_[ID=%(id)s].%(ext)s'

dajotim937 commented 1 year ago

@github-userx Use -K key to check which metadata you can use for filename. gallery-dl.exe -K *link*

Documentation: https://github.com/mikf/gallery-dl/blob/master/docs/configuration.rst#extractorfilename
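The format strings are close to Python's `str.format` syntax, so the basic idea can be tried out without gallery-dl. This sketch uses plain `str.format`, which lacks gallery-dl's own extensions (such as the `!l` conversion seen earlier in this thread); the metadata values are made up, and real field names depend on the extractor and should be checked with -K.

```python
def render_filename(template, metadata):
    """Fill a {field}-style template, including nested {user[name]} access."""
    return template.format(**metadata)


# Hypothetical metadata, shaped like a subset of the -K output.
meta = {
    "category": "twitter",
    "user": {"name": "example"},
    "tweet_id": 123,
    "extension": "jpg",
}
result = render_filename("{category}_{user[name]}_{tweet_id}.{extension}", meta)
```

Here `result` is `twitter_example_123.jpg`.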

github-userx commented 1 year ago

Thanks @dajotim937!

taskhawk commented 1 year ago

Is it possible to add support for tilde expansion in exec.command when using a list?

Currently it has to be like this:

"command": ["/home/user/script.sh", "{_filename}"]

But I wish this could work:

"command": ["~/script.sh", "{_filename}"]
mikf commented 1 year ago

@taskhawk https://github.com/mikf/gallery-dl/commit/790dd365e15a21fc5caeabec9b9259167f4f6f12
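The requested behaviour can be sketched with `os.path.expanduser`. This is an illustration of tilde expansion for a command list, not a copy of the commit above; `expand_command` is a hypothetical name.

```python
import os.path


def expand_command(command):
    """Expand a leading ~ in the program path; leave the arguments untouched."""
    program, *args = command
    return [os.path.expanduser(program), *args]
```

Only the first element is expanded here, so a format field such as `{_filename}` is passed through unchanged.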

taskhawk commented 1 year ago

Thanks!

orangpelupa commented 1 year ago

Does anyone know the command to use gallery-dl to download private albums for which you have guest links?

taskhawk commented 1 year ago

Which site? If having the link is the only requisite to access the content maybe you don't need anything else? Are you getting some error?

orangpelupa commented 1 year ago

Yeah, the link is not supported. You can get the Flickr guest pass links from this mangaka's 5-dollar Patreon: https://www.patreon.com/motokamurakami

taskhawk commented 1 year ago

Don't know the structure of guest pass links but Flickr is a supported site so try to force the extractor and see if it works:

If a site's address is nonstandard for its extractor, you can prefix the URL with the extractor's name to force the use of a specific extractor:

gallery-dl "tumblr:https://sometumblrblog.example"

orangpelupa commented 1 year ago

unfortunately, it says the same thing, unsupported url.

lx30011 commented 1 year ago

Would it be possible for the flickr extractor to fetch aperture, shutter speed, focal length, ISO, camera model, lens, and perhaps the remaining EXIF data to save in the metadata?

simeonrobinson commented 1 year ago

Is it possible to scrape the text from a Patreon post and write it to a file?

taskhawk commented 1 year ago

Yes, check #4107.

9696neko commented 1 year ago

Would it be possible to have a summary at the end of a gallery-dl run? I.e. list the totals of what types of files were downloaded. This would help determine whether the last run was correct. I currently use a combination of steps and it is quite tedious removing duplicate counts.

Hrxn commented 1 year ago

Sounds good.. Although it's maybe not immediately obvious what constitutes a good summary, or what would be an appropriate definition of "correct" here.

You should definitely make use of the log file, though..

9696neko commented 1 year ago

Sounds good.. Although it's maybe not immediately obvious what constitutes a good summary, or what would be an appropriate definition of "correct" here.

You should definitely make use of the log file, though..

The log file is not very informative either, even at debug level. I also cannot confirm renamed files. Maybe it's just me. I would like to see something like:

Total downloaded: 100
Total failed/missing: 2
PNG's: 22
WEBM's: 4
JPG's: 50
GIF's: 24
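As a stopgap, part of such a summary can be computed from the download directory after a run. The sketch below counts files on disk by extension; it cannot see failed or missing downloads, and `summarize` and `format_summary` are hypothetical helpers, not gallery-dl features.

```python
from collections import Counter
from pathlib import Path


def summarize(directory):
    """Count regular files below `directory` by upper-cased extension."""
    counts = Counter()
    for path in Path(directory).rglob("*"):
        if path.is_file():
            counts[path.suffix.lstrip(".").upper() or "(none)"] += 1
    return counts


def format_summary(counts):
    """Render the counts as a total line plus one line per extension."""
    lines = [f"Total files: {sum(counts.values())}"]
    lines += [f"{ext}: {n}" for ext, n in counts.most_common()]
    return "\n".join(lines)
```

Comparing the output before and after a run gives per-type totals without counting duplicates by hand.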
Infinitay commented 1 year ago

Has anyone recently had issues with Instagram cookies expiring more frequently?

github-userx commented 1 year ago

I don’t have first hand experience with it myself but I’ve heard it’s gotten even more strict and difficult scraping Instagram.

A couple of years ago we could scrape dozens of feeds/accounts without issue. Good old times!