mikf / gallery-dl

Command-line program to download image galleries and collections from several image hosting sites
GNU General Public License v2.0

Questions, Feedback and Suggestions #3 #146

Closed mikf closed 4 months ago

mikf commented 5 years ago

Continuation of the old issue as a central place for any sort of question or suggestion not deserving their own separate issue. There is also https://gitter.im/gallery-dl/main if that seems more appropriate.

Links to older issues: #11, #74

mikf commented 3 years ago

@kattjevfel https://github.com/mikf/gallery-dl/commit/e300da14245376d32a45a92628b82d2563834e8b

Ghost-Terms commented 3 years ago

@mikf For kemono.party, I wanted to use the edited field of the metadata, but it doesn't seem like %Y %m %d etc. works on it. I think it's because it's formatted differently in the keywords:

date
  2020-04-06 03:04:29
edited
  Fri, 18 Dec 2020 23:39:27 GMT
extension
  png
filename
  287cbb42-026d-42dc-a045-fc945c91e8fa
id
  35689730
num
  1
published
  Mon, 06 Apr 2020 03:04:29 GMT

Is there a way to make this work, and possibly even fall back to date if for some reason the edited keyword is empty/null?

mikf commented 3 years ago

Is there a way to make this work

Not with the current string formatting options. There is a way to parse and format timestamps, but not textual date/time info.

fall back to date if for some reason the edited keyword is empty/null?

Theoretically with | ({edited|date}), but both fields must support any eventual format specifiers or it throws an exception. {edited|date:%Y%m%d}, for example, wouldn't work since the string value in edited can't be formatted with %Y%m%d.
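
For illustration only, here is a minimal sketch in plain Python, outside gallery-dl's format strings, of how the textual value could be parsed and how a fallback to date might look. The function name pick_date and the sample dictionaries are hypothetical; the two format patterns are read off the keyword listing above, and the weekday and month names assume an English/C locale.

```python
from datetime import datetime

# Patterns read off the keyword listing in this thread.
TEXTUAL = "%a, %d %b %Y %H:%M:%S GMT"   # e.g. "Fri, 18 Dec 2020 23:39:27 GMT"
PLAIN = "%Y-%m-%d %H:%M:%S"             # e.g. "2020-04-06 03:04:29"


def pick_date(metadata):
    """Return 'edited' as a datetime, falling back to 'date' if it is empty or missing."""
    edited = metadata.get("edited")
    if edited:
        return datetime.strptime(edited, TEXTUAL)
    return datetime.strptime(metadata["date"], PLAIN)


# Hypothetical metadata: one post with an 'edited' value, one without.
with_edit = {"date": "2020-04-06 03:04:29", "edited": "Fri, 18 Dec 2020 23:39:27 GMT"}
without_edit = {"date": "2020-04-06 03:04:29", "edited": None}

print(pick_date(with_edit).strftime("%Y%m%d"))
print(pick_date(without_edit).strftime("%Y%m%d"))
```

Once both values are datetime objects, the same %Y%m%d specifier applies to either, which is exactly what the two differently formatted strings cannot offer.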

wankio commented 2 years ago

What happened to --abort-on-skip? I can't use it anymore, ty

mikf commented 2 years ago

@wankio it got replaced with --abort (or -A) back in 2019 (6393b47db2b8daacd34c837fa31a4641fc908272, ed6592ea1aaa8661554cd3eab6a454fcd5083544). It now requires an argument that specifies how many files to skip before stopping. --abort-on-skip was the same as -A 0.

Ghost-Terms commented 2 years ago

Is it possible to modify postprocessors options via command line using -o? I'm guessing I might have to rely on having an alternate config file instead, but I just want to make sure.

mikf commented 2 years ago

@ImportTaste not possible at the moment. Try the things mentioned in https://github.com/mikf/gallery-dl/discussions/1933.

Ghost-Terms commented 2 years ago

I have a pretty annoying issue. For some reason, one of the patreon posts on kemonoparty has an invalid id and I'm getting this error with a 123456789 > id filter: ValueError: invalid literal for int() with base 10: 'PbFkhZdV'

I haven't been able to figure out a good workaround. I'm sure it is a patreon post and not one of the other services on the site, I triple-checked. I posted the URL here: https://snippet.host/adgg

Fukitsu commented 2 years ago

Is there a way to set different options depending on the OS? For example set the path of the logfile to ~/gallery-dl/ if it's Linux or to D:\Images\gallery-dl\ if it's Windows

mikf commented 2 years ago

@ImportTaste you could use isdecimal() to check if an id is a number and convertible to int:

    id.isdecimal() and 123456789 > int(id)

to exclude or

    not id.isdecimal() or 123456789 > int(id)

to include posts with non-numeric IDs
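
A small sketch of how these two expressions behave, assuming id arrives as a string. The sample IDs and the helper matches are hypothetical; evaluating the expression text with eval only imitates a filter and is not gallery-dl's own code.

```python
# Hypothetical sample IDs: two numeric strings and the non-numeric one from the error.
ids = ["35689730", "987654321", "PbFkhZdV"]

EXCLUDE = "id.isdecimal() and 123456789 > int(id)"
INCLUDE = "not id.isdecimal() or 123456789 > int(id)"


def matches(expression, post_id):
    """Evaluate a filter-style expression with 'id' bound to the post's ID string."""
    return bool(eval(expression, {}, {"id": post_id}))


for post_id in ids:
    print(post_id, matches(EXCLUDE, post_id), matches(INCLUDE, post_id))
```

In both variants int(id) is only reached when isdecimal() has already confirmed the string is numeric, so the ValueError can no longer occur.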

@Fukitsu only by using separate config files.

You could also use 3: one with Windows specific paths, one with Linux specific paths, and a third with general settings for both Windows and Linux. Settings from multiple config files will just get merged together.
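
To picture what merging several config files could mean, here is a minimal sketch of a recursive dictionary merge. This is not gallery-dl's actual code, only one plausible reading of "merged together"; the function merge, the option keys, and the paths are illustrative and not checked against the configuration reference.

```python
def merge(base, extra):
    """Recursively merge 'extra' into a copy of 'base'; values from 'extra' win."""
    result = dict(base)
    for key, value in extra.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge(result[key], value)
        else:
            result[key] = value
    return result


# Illustrative contents of a general file and a Linux-specific file.
general = {"extractor": {"retries": 4}, "output": {"mode": "auto"}}
linux = {"output": {"logfile": "~/gallery-dl/log.txt"}}

combined = merge(general, linux)
print(combined)
```

Under this reading, the OS-specific file only needs to contain the paths that differ, and everything else comes from the shared file.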

Ghost-Terms commented 2 years ago

Is there any way to have a format string that surrounds a metadata value with curly brackets in the filename? I tried using escape characters as well as escaping the escape characters (once for json and again for python), and I even tried doing \u007B and \u007D (the unicode escape sequences for them), but no luck.

The main reason I wanted to use them was for titles, I never really see artists put curly brackets in the titles for their stuff, but they use parentheses and square brackets all the time, so I want to surround the titles with curly brackets.

EDIT: If this isn't currently possible, maybe the format string interpreter could be changed so that unicode escape sequences are only interpreted after the replacement fields are?

mikf commented 2 years ago

To use curly brackets as regular characters in format strings, you have to double them.

https://docs.python.org/3/library/string.html#format-string-syntax

If you need to include a brace character in the literal text, it can be escaped by doubling: {{ and }}.
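
As a quick check of the doubling rule, this sketch uses Python's own str.format with a hypothetical title value. It demonstrates the Python behavior described in the linked documentation, not gallery-dl's extended format specifiers.

```python
title = "Example Title"  # hypothetical metadata value

# Doubled braces are literal; the inner {title} is still a replacement field.
wrapped = "{{{title}}}".format(title=title)

# With only doubled braces there is no replacement field left at all.
literal = "{{title}}".format(title=title)

print(wrapped)
print(literal)
```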

Ghost-Terms commented 2 years ago

To use curly brackets as regular characters in format strings, you have to double them.

  • {key} -> <value of 'key'>
  • {{key}} -> {key}

https://docs.python.org/3/library/string.html#format-string-syntax

If you need to include a brace character in the literal text, it can be escaped by doubling: {{ and }}.

But does this work within replacement fields too? i.e. {title:? {{/}}/}

mikf commented 2 years ago

But does this work within replacement fields too? i.e. {title:? {{/}}/}

It seems like double brackets aren't even needed after the :. Just {title:? {/}/} works for me for some reason.

Alternatively you could also use conditional filenames. One format string with a valid title and one without.

Ghost-Terms commented 2 years ago

But does this work within replacement fields too? i.e. {title:? {{/}}/}

It seems like double brackets aren't even needed after the :. Just {title:? {/}/} works for me for some reason.

Alternatively you could also use conditional filenames. One format string with a valid title and one without.

Conditional filenames? You have my attention, that sounds incredibly useful.

EDIT: Found the details: https://github.com/mikf/gallery-dl/issues/1394 Includes details on conditional directories too.

mikf commented 2 years ago

They are a thing since 1.18.0/.1: 4cf40434, 84d2e640, fd00d471

You can select a different format string depending on the result of a Python expression (same as --filter). This functionality is explained in extractor.*.filename and extractor.*.directory and also showcased in gallery-dl-example.conf.

Ghost-Terms commented 2 years ago

They are a thing since 1.18.0/.1: 4cf4043, 84d2e64, fd00d47

You can select a different format string depending on the result of a Python expression (same as --filter). This functionality is explained in extractor.*.filename and extractor.*.directory and also showcased in gallery-dl-example.conf.

Huh. That's really interesting. I have a question about filters actually. It seems like the way it currently works is that it extracts all the posts and then compares the filter against it, which is the main reason I don't use an id comparison filter for Twitter since it would have to comb 3000+ tweets for each user.

But what if the filter could instead be evaluated against each post as it comes up and then allows for an abort/terminate as soon as it encounters a post that evaluates False? That would also make it so that all the metadata.event post json files wouldn't be redownloaded each time.

I figured I'd just ask about it here instead of submitting a feature request since I've had an unfortunate habit of submitting feature requests that were already possible in ways I didn't expect.

mikf commented 2 years ago

But what if the filter could instead be evaluated against each post as it comes up and then allows for an abort/terminate as soon as it encounters a post that evaluates False?

You can call abort() or terminate() in all gallery-dl Python expressions. (including filename conditions, even if that doesn't make much sense)

For example --filter "date >= datetime(2021, 1, 1) or terminate()"

The potential problem here is that this exits as soon as it finds a file with a date before 2021, even though there could be files/posts after that that are from 2021. As long as you can guarantee all files have the correct and expected order, that works fine, but this is Twitter we are talking about.
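
The ordering caveat can be simulated with a short sketch. The Terminate exception, the terminate() stand-in, the run helper, and the sample posts are all hypothetical; they only mimic the "stop at the first non-matching item" behavior described here.

```python
from datetime import datetime


class Terminate(Exception):
    """Hypothetical stand-in for what terminate() triggers."""


def terminate():
    raise Terminate()


def run(posts):
    """Keep post IDs until the filter-style expression calls terminate()."""
    kept = []
    try:
        for post in posts:
            date = post["date"]
            if date >= datetime(2021, 1, 1) or terminate():
                kept.append(post["id"])
    except Terminate:
        pass
    return kept


# Newest-first order: everything from 2021 is collected before stopping.
ordered = [
    {"id": 3, "date": datetime(2021, 3, 1)},
    {"id": 2, "date": datetime(2021, 2, 1)},
    {"id": 1, "date": datetime(2020, 12, 1)},
]
# An old post appearing first (for example a pinned one) stops the run at once.
old_first = [{"id": 0, "date": datetime(2020, 5, 1)}] + ordered

print(run(ordered))
print(run(old_first))
```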

Also why not use the regular --abort or --terminate?

That would also make it so that all the metadata.event post json files wouldn't be redownloaded each time.

There's a filter for post processors that makes it only run if the expression is True (3cbbefd4)

Ghost-Terms commented 2 years ago

Also why not use the regular --abort or --terminate?

I do for most services, but for other services it hasn't been so straightforward.

Right now I want to revise how I handle Twitter. Currently, to keep the amount of duplicates down due to retweets, I use {tweet_id} with extractor.twitter.retweets=original and put them into {author[id]} directories, then an exec script hardlinks them into {author[name]} directories. Unfortunately, this has the side effect of making retweets trigger against skip=abort and filters with great frequency, so I use abort:100. That was fine at first, but it makes Twitter take the longest to scrape by far.

Because retweets=original retrieves the metadata for the original tweet, the retweet_id will match the tweet_id. My current plan is to use this new information to have a filter check if the retweet_id is 0. If it isn't, then it'll do its comparison check against the tweet_id, effectively giving retweets a free pass. And, as soon as it hits the last recorded tweet_id from the user, it'll abort.

There's one problem with this idea though - pinned tweets. Any time a Twitter user has one of the old posts pinned to the top, it'll make it so it's the first retrieved tweet, and the extractor will abort immediately. Knowing that you can call abort() and terminate() in filters is helpful, but I would need the functionality of abort:2 to get past what's pinned.

Does twitter expose whether a tweet is a pinned tweet in their API? If so, then that can be remedied by having that added to the metadata and checking against that, too. I could also have my script retrieve the id of the first tweet from a user and change the filter to have that id be given a free pass too, which would be made a little easier with that new environment variable functionality you added.

mikf commented 2 years ago

There's one problem with this idea though - pinned tweets. Any time a Twitter user has one of the old posts pinned to the top.

There's now a 'pinned' option to disable pinned Tweets: 9156e90f

Does twitter expose whether a tweet is a pinned tweet in their API?

It does, but actually setting a flag only for the pinned version of that Tweet and not accidentally also for the real, regular Tweet entry might be more trouble than it's worth.

Ghost-Terms commented 2 years ago

That works, thank you.

Ghost-Terms commented 2 years ago

The configuration documentation says that downloader.progress outputs to stderr, but it actually seems to be outputting to stdout. Is the documentation out of date, or is this a bug?

mikf commented 2 years ago

The configuration documentation says that downloader.progress outputs to stderr

I don't think it does, but you are right in that the downloader progress should go to stderr and not stdout as it does right now.

Ghost-Terms commented 2 years ago

It's on line 2714: https://github.com/mikf/gallery-dl/blob/master/docs/configuration.rst#L2714 (This doesn't jump to the line for some reason)

Here, the edit interface jumps to the line properly: https://github.com/mikf/gallery-dl/edit/master/docs/configuration.rst#L2714

mikf commented 2 years ago

But that only describes where logging messages go. It has nothing to do with any downloader or its progress bar.

Ghost-Terms commented 2 years ago

I see, my mistake then.

Fukitsu commented 2 years ago

For exhentai (or other sites), is it possible, once the image limit is reached or the download has stopped due to some error, to log/print to the screen all the remaining URLs along with the one that was being downloaded at the time?

Ghost-Terms commented 2 years ago

Just curious, but there's currently no way to have gallery-dl download in ascending order, is there? It'd obviously have to gather every post first and then reverse the order, but there have been instances where I felt it more appropriate to start ascending from the oldest rather than descending from the newest.

ghost commented 2 years ago

Why does --ugoira-conv use VP8 by default instead of VP9?

Hrxn commented 2 years ago

@mikf Congrats on making it into the 5K stargazers club on GitHub! 🥂

Hrxn commented 1 year ago

@mikf I just noticed something from the configuration.rst docs..

The extractor.*.cookies option supports three types, gallery-dl's Path, as well as the built-ins object and list.

As I understand it, these types are used to distinguish the functionality, or rather the "source" of the cookies used.

But gallery-dl's Path can also be used in two different ways, because it also supports two different types: string and list (i.e. list of string values)

Here's the thing, what happens when extractor.*.cookies gets set to a list? I assume this always enables the automatic cookies from browser functionality, then? In other words, this is occluding the option to use a list as a Path here, correct?

I mean, this is not really a problem, but maybe it should be mentioned.. 😅

mikf commented 1 year ago

You are correct, that's something I missed when writing the code and/or docs. I didn't account for a cookies.txt path possibly being a list.

I wouldn't change anything about it though, except maybe update the type for cookies.txt paths.

wankio commented 1 year ago

Is there any way to create a txt/html file for each 4chan/thebarchive/etc. thread, and to download the thumbnail instead if the source file is not found?

mikf commented 1 year ago

No to both of your questions. You can try https://github.com/bibanon/BASC-Archiver or similar to archive everything, including HTML.

clocklikewoz commented 1 year ago

I am using gallery-dl to backup some booru sites, but shoving millions of files into one directory quickly turns it into a "no humans zone" that will hang any file manager for several minutes. Does anyone have advice for how to manage this situation? I was thinking of splitting them into folders with a few thousand files each based on post IDs (say 1-999, 1000-1999, etc), but I don't know how I would do that. Or maybe something like the first three characters of the MD5, which would be easier to implement (could probably use something like {md5:[0:2]} ) but that makes finding specific post numbers harder.

Twi-Hard commented 1 year ago

This is how I split every 5000 image IDs into their own folder:

"directory":
[
    "\fE str(id // 5000 * 5000 + 5000)"
],

You can replace 5000 with some other amount if you want.
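
To see which folder a given ID lands in, the same arithmetic can be tried in plain Python. The function name bucket and the sample IDs are hypothetical; the expression itself is the one from the config line above.

```python
def bucket(post_id, size=5000):
    """Folder name for a post ID: the exclusive upper bound of its block of 'size' IDs."""
    return str(post_id // size * size + size)


for post_id in (1, 4999, 5000, 12345):
    print(post_id, "->", bucket(post_id))
```

Each folder is named after the first ID that no longer belongs in it, so IDs 0 through 4999 share one folder and ID 5000 starts the next.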

upintheairsheep commented 1 year ago

Add developer documentation on how to make an extractor, and add metadata to it.

Fukitsu commented 1 year ago

Is there a way to download an Instagram profile using the user ID??

mikf commented 1 year ago

@Fukitsu As with Twitter, you can use id:<id> as username in input URLs.

https://www.instagram.com/instagram/
# is the same as
https://www.instagram.com/id:25025320/

Infinitay commented 1 year ago

Will the folders be saved with the id as the directory name or the username?

Hrxn commented 1 year ago

This depends on your "directory" setting..

mikf commented 1 year ago

The available metadata fields will be the same for both username and user ID (except tagged_username and tagged_full_name for the Tagged extractor)

clocklikewoz commented 1 year ago

I set up a postprocessor in my config to download metadata for Discord posts on kemono.party, but it is also activating the postprocessor I have set up for all other kemono.party downloads. They differ in filename, the generic one is not useful for Discord posts and vice versa. I saw in the documentation that this is intentional, but I can't figure out how to disable the generic kemono.party postprocessor only when downloading Discord. Can this be done or will I have to copy my default kemono.party postprocessor for every other subcategory and remove the sitewide one?

mikf commented 1 year ago

Add

    "filter": "subcategory != 'discord'"

to the sitewide one and it should not trigger for Discord posts.

Fukitsu commented 1 year ago

Is there a way to use range or chapter-range when specifying input files? Neither gallery-dl -i file.txt --range x-y nor gallery-dl -i file.txt --chapter-range x-y works

mikf commented 1 year ago

You mean selecting URLs from an --input-file with a --range-like option? Not possible with gallery-dl at the moment. --range and --chapter-range in this case only affect the results of the input file URLs, but not the URLs themselves.
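
One possible workaround is to slice the input file before handing it to gallery-dl. The sketch below is a hypothetical helper, not a gallery-dl feature; select_urls and the sample lines are made up, and skipping blank lines and lines starting with # is an assumption about what should count as a URL.

```python
def select_urls(lines, start, stop):
    """Return URLs numbered start..stop (1-based, inclusive), ignoring blanks and '#' lines."""
    urls = [
        line.strip()
        for line in lines
        if line.strip() and not line.lstrip().startswith("#")
    ]
    return urls[start - 1:stop]


# Hypothetical input file contents.
lines = [
    "https://example.org/gallery/1\n",
    "\n",
    "# a comment\n",
    "https://example.org/gallery/2\n",
    "https://example.org/gallery/3\n",
]

print(select_urls(lines, 2, 3))
```

The selected URLs could then be written to a temporary file and passed with -i as usual.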

Hrxn commented 1 year ago

FYI

https://blog.gitter.im/2023/01/16/gitter-is-going-fully-native-matrix-in-feb-2023/

[..] replace the old Gitter app with a Gitter-customised version of Element during the week of Feb. 6th 2023.

They're planning to keep all content of existing Gitter rooms, but we'll see..

Since it will be Element, this means that one should still be able to sign in there with an existing GitHub account, although probably after jumping through some authorization hoops again.

@mikf I also noticed there's no link to gallery-dl's Gitter at the moment? There's still a link in README.rst, but it seems it is not referenced (anymore)?

Not sure, Gitter seems pretty dead to me, anyway. Not just gallery-dl, even other, much bigger, projects don't have any activity in their Gitter rooms.

Dunno, maybe time to use Discord? What do you guys think?

Fukitsu commented 1 year ago

How do I make gallery-dl retry 404s? I've put "retry-codes": [400, 404, 408] under the extractor and "http": { "retry-codes": [400, 404, 408] } under downloader, but it still doesn't seem to work

mikf commented 1 year ago

@Hrxn Regarding Gitter, I haven't actually visited the page for roughly a year, so I've removed its link in the README page for the time being.

Discord is something I was always against, but I guess that's the platform that's "in" at the moment so here is an invite link to some sort of gallery-dl Discord server: https://discord.gg/ZytBUdYq7n. It will eventually find its way into the README, but there is still a lot to do.

I'm also currently looking into registering #gallery-dl on libera.chat.

@Fukitsu If you are using v1.24.4, retry-codes only works for file downloads. v1.24.5 will add general support for all HTTP requests once it is out.

Hrxn commented 1 year ago

@mikf Just wondered, if you set both "path-restrict" and "path-replace" to some "conflicting value" at the same time, e.g.

does "path-restrict" always have precedence in such a scenario, even if "path-replace" is set below/later? As I assume, if you set such a character->replacement character association in the object for "path-restrict", does the "path-replace" option even matter?

Unlike, maybe, if it's done like this?

What happens if you set these options at the base level, and then use "path-replace" again at any "deeper"/more specific category level? Does it overwrite the replacement char then? Or if you use "path-restrict" again, can you update/overwrite specific replacement association options this way? Sorry for asking such whacky questions.. 😄

PS: Maybe pinning the latest "Question, Feedback and Suggestions" thread at the top of the issues would be beneficial? What do you think? Or maybe too much of a distraction?