mikf / gallery-dl

Command-line program to download image galleries and collections from several image hosting sites
GNU General Public License v2.0

Questions, Feedback and Suggestions #11

Closed Hrxn closed 6 years ago

Hrxn commented 7 years ago

A central place for these things might be a good idea.

This thread could serve as a starting point, results will eventually be collected in the project wiki, if appropriate and useful.

Edited 2017-04-15 For conciseness

Edited 2017-05-04 Removed nonsensical checklist thing


mikf commented 7 years ago

The cache list is iterated over sequentially when trying to map a URL to its Extractor class, so items at the beginning of the list are considered first.
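A minimal sketch of this first-match-wins lookup. The patterns, names and the `find` helper below are hypothetical and only illustrate the ordering; gallery-dl's actual list and function names may differ.

```python
import re

# Hypothetical (pattern, extractor name) pairs; order matters.
CANDIDATES = [
    (re.compile(r"https?://boards\.example\.org/(\w+)/thread/(\d+)"), "thread"),
    (re.compile(r"https?://boards\.example\.org/(\w+)"), "board"),
]


def find(url):
    """Return (name, match) for the first pattern that matches url, else None."""
    for pattern, name in CANDIDATES:
        match = pattern.match(url)
        if match:
            return name, match
    return None


# Both patterns match a thread URL, but the earlier entry is chosen.
print(find("https://boards.example.org/tg/thread/123")[0])
print(find("https://boards.example.org/tg")[0])
```

Because the more specific pattern comes first, a thread URL is never claimed by the more general board pattern.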

Hrxn commented 7 years ago

So I've been playing around with these sites a bit (from the release notes for v0.9.1.)

4plebs
archivedmoe
archiveofsins
desuarchive
fireden
loveisover
nyafuu
rbt
thebarchive

If I get this right, all of these are basically archive or backup sites for the bigger Chan sites (4chan, 8chan? I think mostly 4chan, apparently)

And they all seem to run the same board software. They have each their own extractors (.py), but all that is done there is inherited from a chan.FoolfuukaThreadExtractor base class.

There this is declared: category = "foolfuuka", which then gets overridden by the site-specific extractor file (and this is mostly the only thing specified there, it seems).

So, if I'm not wrong, all sites listed above use this: directory_fmt = ["{category}", "{board[shortname]}", "{thread_num} - {title}"]

I assume something like board[longname] is not supported by all sites, so this other variant gets used instead, right?

Okay, now I want to group all output from these sites together in one directory, so that my base-directory does not get too swamped with output from sites I probably barely use. What do I have to do? Only prepend the directory output with something like this? ["Chan-Archives", "{category}", "{board[shortname]}", "{thread_num} - {title}"] And set it here, I think: "extractor.chan.directory"? And this would leave all other default settings intact?
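To see what such a list of format strings expands to, here is a small sketch using plain `str.format`. The keyword values are made up for illustration, and whether gallery-dl formats the segments exactly this way is an assumption.

```python
directory_fmt = ["{category}", "{board[shortname]}", "{thread_num} - {title}"]

# Made-up metadata for a single thread (illustrative values only).
keywords = {
    "category": "4plebs",
    "board": {"shortname": "tg"},
    "thread_num": 12345,
    "title": "Example thread",
}


def build_segments(fmt_list, kwdict):
    """Expand every format string into one path segment."""
    return [fmt.format(**kwdict) for fmt in fmt_list]


default_segments = build_segments(directory_fmt, keywords)
grouped_segments = build_segments(["Chan-Archives"] + directory_fmt, keywords)

print(default_segments)
print(grouped_segments)
```

Prepending a literal string such as "Chan-Archives" simply adds one more leading segment; the remaining segments are unchanged.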

mikf commented 7 years ago

I assume something like board[longname] is not supported by all sites, so this other variant gets used instead, right?

The actual 4chan extractor doesn't provide a long name for its boards which is why the short name is used here. There appear to be long names in FoolFuuka's API responses, though:

Keywords for filenames:
-----------------------
board[name]
  Traditional Games
board[shortname]
  tg

What do I have to do?

You would have to set the directory value for each site individually (extractor.4plebs.directory, extractor.archivedmoe.directory, ...), which is probably not what you want to do. Maybe the -d cmdline option to overwrite the base-directory is useful here.
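If writing the same value nine times by hand is too tedious, the per-site entries could be generated. This is only a sketch: it assumes the config section names match the site names listed earlier in the thread and that the config file is JSON.

```python
import json

# Site names as listed above; assumed to double as config section names.
SITES = [
    "4plebs", "archivedmoe", "archiveofsins", "desuarchive", "fireden",
    "loveisover", "nyafuu", "rbt", "thebarchive",
]

DIRECTORY = ["Chan-Archives", "{category}", "{board[shortname]}",
             "{thread_num} - {title}"]

# One "directory" entry per site, i.e. extractor.<site>.directory
config = {"extractor": {site: {"directory": DIRECTORY} for site in SITES}}

print(json.dumps(config, indent=4))
```

The printed JSON can then be merged into an existing configuration file by hand.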

This is a pretty big limitation imposed by the config system, but there is not too much that can be done here, I think, except changing the way that these extractors do their config-lookup so that you indeed could, for example, specify extractor.chan.directory to set defaults for all of them.

(Also: #18)

Hrxn commented 7 years ago

Yeah, a bit later after sending the comment I realized that this is not how the extractor config could possibly work, alone for the fact that chan.py has two base classes (ChanThreadExtractor and FoolfuukaThreadExtractor) and I could not have set directory outside of them..

You would have to set the directory value for each site individually (extractor.4plebs.directory, extractor.archivedmoe.directory, ...) [...]

No big deal, in my opinion. I wouldn't mind it, just writing the config once to get it right, and it's done.


What I probably had in mind was something discussed earlier in here, e.g. https://github.com/mikf/gallery-dl/issues/11#issuecomment-291249858

[..] If none is found, the config system falls back to the default value.

But yeah, falling back to a default does not work if it's not really the same extractor..

But what about this new commit here? : https://github.com/mikf/gallery-dl/commit/60a888a1e4707de4c3ba4870e05bad792d601543

(Also: #18)

Should we continue over there?

mikf commented 7 years ago

This commit changes the config-lookup for extractors inheriting from FoolfuukaThreadExtractor (all the classes you listed above) and allows you to set default config values in the extractor.foolfuuka.* tree.

Each of these extractors is going to first look at its own config (e.g. extractor.4plebs.*), then at the foolfuuka one (extractor.foolfuuka.*) and will only then fall back to its defaults.
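The lookup order described here can be sketched roughly as follows. The `lookup` function and its parameters are hypothetical and not the actual code from the commit.

```python
def lookup(config, category, key, default=None, basecategory="foolfuuka"):
    """Check extractor.<category>, then extractor.<basecategory>, then default."""
    extractor_conf = config.get("extractor", {})
    for section in (category, basecategory):
        values = extractor_conf.get(section, {})
        if key in values:
            return values[key]
    return default


config = {
    "extractor": {
        "foolfuuka": {"directory": ["Chan-Archives", "{category}"]},
        "4plebs": {"directory": ["4plebs-only"]},
    }
}

print(lookup(config, "4plebs", "directory"))       # site-specific value wins
print(lookup(config, "desuarchive", "directory"))  # falls back to foolfuuka
print(lookup(config, "desuarchive", "filename", default="built-in default"))
```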

Should we continue over there?

I've only linked this issue because it might be thematically fitting. It's perfectly fine to talk about this here.

Hrxn commented 7 years ago

Okay, so all I have to do now is this? (In "extractor":..)

"foolfuuka":
        {
            "directory": ["Chans-Archives", "{category}", "{board[shortname]}", "{thread_num} - {title}"]
        },

To group the output of all 9 archive site extractors into the same directory, while keeping all other settings ( category being the specific site/extractor) etc., right?

mikf commented 7 years ago

Yes, that does what you want.

You might even remove "{category}", since all these sites are archiving the same source material and one board/thread-num combination would refer to the same thread (and therefore the same images) among all of them ... or at least among the sites that archive the same boards.

Hrxn commented 7 years ago

Might be a good idea, I'll try that. They are probably not exactly in sync when doing their archiving, but that is not really a problem.

By the way, warosu seems to be an archive as well, does board-id/thread-id also match here?

mikf commented 7 years ago

Seems that way. The /g/ threads on warosu even link to their counterparts on archived.moe and rbt.asia, and they all share the same thread-id.

llelf commented 7 years ago

There’s a good deal of keyword∢value info, 👍 on that. It’s a pity it all will vanish, because currently there’s no way to save it somewhere [xattrs!] What do you all think?

Hrxn commented 7 years ago

Question regarding https://github.com/mikf/gallery-dl/commit/f3fbaa5c3eda2bc04ff9ce6c6c5dbb7903253507

[reddit] allow users to override the API User-Agent

For setting extractor.reddit.user-agent: Maybe it's just me, but I think the rules they list contradict themselves a bit. Not sure what they really expect. But I think the important part is not to pretend to be a browser.. 😄 Or lie in any other way.. So what would you suggest? Just follow the given example (Example: User-Agent: android:com.example.myredditapp:v1.2.3 (by /u/kemitche)) and emulate that a bit, basically?

mikf commented 7 years ago

@llelf you can currently store any metadata in JSON format by passing -j and redirecting the output to a file, but that doesn't work very well and also doesn't download any images while doing that. Using xattrs seems like a good idea so I'll be looking into that.

@Hrxn they state that every user-agent string should look something like <platform>:<app ID>:<version string> (by /u/<reddit username>), which is currently set to Python:gallery-dl:0.8.4 (by /u/mikf1). Take this and replace gallery-dl with the name of your registered application and mikf1 with your own username, but just modifying the given example a bit should work as well. I'm not sure how strict they are about all of this ("NEVER lie about your user-agent" is written in bold ...) but I wanted to avoid a situation where multiple "applications" use the same user-agent as gallery-dl and they block all of them.
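As a small illustration of that format, the string could be assembled like this. The helper name is made up, and "my-registered-app" and "your_username" are placeholders rather than real values.

```python
def reddit_user_agent(platform, app_id, version, username):
    """Build '<platform>:<app ID>:<version string> (by /u/<reddit username>)'."""
    return "{}:{}:{} (by /u/{})".format(platform, app_id, version, username)


# The value mentioned above, rebuilt from its parts.
current = reddit_user_agent("Python", "gallery-dl", "0.8.4", "mikf1")
print(current)

# Placeholder values; this is what would go into extractor.reddit.user-agent.
custom = reddit_user_agent("Python", "my-registered-app", "1.0", "your_username")
config = {"extractor": {"reddit": {"user-agent": custom}}}
print(config)
```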

ezagarskaya commented 6 years ago

Guys, first thanks for a great project, it helps me a lot!

The question is: How to add a delay between download requests? My speed is too high, I am afraid safebooru will block me soon.

I have used "safebooru": { "wait-min": 6, "wait-max": 10, "timeout": 30, "filename": "{id}.{extension}" }, but nothing changed

Please, help me

mikf commented 6 years ago

There is currently no way to add a delay between downloads or limit download speeds, but I guess I will be looking into that next.

wait-min and wait-max are only available for exhentai and chan.sankakucomplex, because they would either actively block you or respond with "429 Too Many Requests" status codes if you didn't wait between requests to their sites, but so far these have been the only two where this was necessary (I doubt safebooru is going to block you).

In the meantime you could collect a few image URLs from safebooru by using -g and use another program that supports these features (aria2, wget, etc) to download them:

# get the first 500 image URLs and download them at 500kb/s, waiting 5s after each download
$ gallery-dl -g --range 1-500 "http://safebooru.org/..." > url_file
$ wget -i url_file --limit-rate=500k --wait=5

# get the next 500
$ gallery-dl -g --range 501-1000 "http://safebooru.org/..." > url_file
...

(The timeout option only works for the HTTP downloader and has a default value of 30, so setting it there doesn't do much)
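If wget or aria2 is not at hand, the same idea (take the URL list produced by -g and pause between downloads) can be sketched with the Python standard library. The function names are hypothetical, and there is no rate limiting here, only a fixed wait between requests.

```python
import time
import urllib.request


def fetch_url(url):
    """Download one URL and return its content as bytes."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def download_all(urls, wait=5.0, fetch=fetch_url, sleep=time.sleep):
    """Fetch each URL in order, sleeping `wait` seconds between downloads."""
    results = []
    for index, url in enumerate(urls):
        if index:  # no need to wait before the very first download
            sleep(wait)
        results.append((url, fetch(url)))
    return results


def read_urls(path):
    """Read one URL per line from a file such as url_file, skipping blanks."""
    with open(path, encoding="utf-8") as file:
        return [line.strip() for line in file if line.strip()]
```

Calling `download_all(read_urls("url_file"), wait=5)` would then return a list of (url, content) pairs that still have to be written to disk.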

Hrxn commented 6 years ago

@mikf Some extractors don't specify directory_fmt in their source (for example gfycat.py)

What to do? Manually setting another value for category? I.e., this one: extractor.gfycat.category? Because that is apparently the default directory that gets always used. Or is it better to use this? extractor.gfycat.directory

Which allows to use some sub-dirs, i.e. ["Gfy", "In", "Here"]?

mikf commented 6 years ago

If you want to change an extractor's target directory, you should set its directory value (here extractor.gfycat.directory).

(Extractor) classes will use the values specified in their base class if these aren't specified in the class itself, which in this case means that gfycat extractors are using the value set in the Extractor class (see Extractor.directory_fmt). There is nothing special about not specifying a directory_fmt value. All it does is basically saving 1 line of code.

It is also not possible to overwrite an extractor's category. extractor.gfycat.category is not a value that gets recognized.
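The inheritance part is ordinary Python class-attribute lookup. A stripped-down sketch, where the class bodies and the default value of directory_fmt are assumptions for illustration and not the real gallery-dl classes:

```python
class Extractor:
    category = ""
    directory_fmt = ["{category}"]  # assumed default, for illustration only


class GfycatExtractor(Extractor):
    category = "gfycat"  # no directory_fmt here, so the base value is used


class ThreadExtractor(Extractor):
    category = "example"
    directory_fmt = ["{category}", "{board}"]  # overrides the base value


print(GfycatExtractor.directory_fmt)
print("directory_fmt" in vars(GfycatExtractor))
print(ThreadExtractor.directory_fmt)
```

Leaving the attribute out of a subclass changes nothing about how it is resolved; the lookup simply continues in the base class.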

Hrxn commented 6 years ago

Thanks, got it. Made some new targets for some directory prefs, can confirm, all seems to work fine! 😄

Hrxn commented 6 years ago

@mikf There's some unusual behaviour, although I don't think it's a real issue, maybe a cosmetic one, and I assume something like this is specific to Windows as well. I hope it's not too much of a nitpick, probably just a question of different ways to implement it in detail..

For each directory option (extractor.*.directory) we can set a list of strings to specify a target directory for the extraction process, where each string in this list results in its own path segment. This happens by using Python format strings, and by virtue of Python's excellent cross-platform support (at least that's what they say, right?), defining a target directory like this: ["Extractor", "Example", "Subdir", "{title}"] Will give us the following result:

But here's the thing: It does not work in the same way for the base directory. Consider this as my value for base-directory in gallery-dl.conf: "D:/Download/Pictures" What happens now, when using the extractor from the example here, the output messages printed to the console window appear like this (again, Windows):

D:/Download/Pictures\Extractor\Example\Subdir\{title}\filename_id_1.ext
D:/Download/Pictures\Extractor\Example\Subdir\{title}\filename_id_2.ext
(and so on)

Alternatively, setting base-directory to this: "D:\Download\Pictures" Results in an error message, improperly escaped sequence etc. pp. This is maybe not really a surprise, considering that \ is usually a standard escape character. Understandably, setting base-directory to this: "D:\\Download\\Pictures" seems to work then, giving output messages like this:

D:\Download\Pictures\Extractor\Example\Subdir\{title}\filename_id_1.ext
D:\Download\Pictures\Extractor\Example\Subdir\{title}\filename_id_2.ext
(and so on)
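The three spellings can be checked against Python's json module, assuming gallery-dl.conf is read as JSON (which the config snippets in this thread suggest). The `try_parse` helper is only for this demonstration.

```python
import json


def try_parse(text):
    """Return the parsed base-directory value, or the JSON error message."""
    try:
        return json.loads(text)["base-directory"]
    except json.JSONDecodeError as exc:
        return "error: " + exc.msg


# Raw strings, so the backslashes reach the JSON parser exactly as typed.
forward = try_parse(r'{"base-directory": "D:/Download/Pictures"}')
escaped = try_parse(r'{"base-directory": "D:\\Download\\Pictures"}')
single = try_parse(r'{"base-directory": "D:\Download\Pictures"}')

print(forward)
print(escaped)
print(single)
```

Only the single-backslash variant is rejected, because \D and \P are not valid JSON escape sequences.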

Okay, so it appears that, and please correct me if my conclusion is wrong, the base-directory property does not utilize the same Python format string as the directory options. Is there any specific reason for that? I'm not sure, but I just assumed that all parts rely on the same format string, which then gets joined together to the final output format string, and that is the end result we see.

I did a quick code search, I think this is the relevant result:

https://github.com/mikf/gallery-dl/blob/a1980b16f31a9a8952adb64f1cd37bcfabc3072c/gallery_dl/util.py#L356-L359

Or maybe these functions? https://github.com/mikf/gallery-dl/blob/a1980b16f31a9a8952adb64f1cd37bcfabc3072c/gallery_dl/util.py#L410

https://github.com/mikf/gallery-dl/blob/a1980b16f31a9a8952adb64f1cd37bcfabc3072c/gallery_dl/util.py#L379

mikf commented 6 years ago

The value for base-directory is supposed to be just a static string that gets put in front of all paths generated during runtime. Its environment variables get expanded, but it doesn't go through any string formatting and its path separators (/, \) are left alone by os.path.join.

The full path gets built by something like os.path.join(base_directory, format(segment1), format(segment2), ..., format(filename)) which concatenates all parts using either / or \ depending on your OS, but anything inside these parts stays the way it is. So if you put any forward slashes into your base directory, they will still be there afterwards.

You can actually use a list of strings as directory segments for base-directory, which will be joined with the "correct" slashes, but thanks to how os.path.join works, you would still have to manually put a slash after the drive letter: ["D:\\", "Download", "Pictures"]. So that doesn't really help ...
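This behaviour can be reproduced on any OS with ntpath, the Windows implementation of os.path in the standard library. The segment values are just the ones from the example above.

```python
import ntpath  # the Windows flavour of os.path, importable everywhere

# Forward slashes inside the first part are left untouched.
mixed = ntpath.join("D:/Download/Pictures", "Extractor", "Example")

# A bare drive letter does not get a separator added after it ...
drive_only = ntpath.join("D:", "Download", "Pictures")

# ... so the slash has to be part of the first segment.
with_slash = ntpath.join("D:\\", "Download", "Pictures")

print(mixed)       # D:/Download/Pictures\Extractor\Example
print(drive_only)  # D:Download\Pictures
print(with_slash)  # D:\Download\Pictures
```

The second result is a path relative to the current directory on drive D:, which is why the list form still needs "D:\\" as its first element.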

As for a reason why it works the way it does: In the earlier versions of this project I wanted a way to direct all downloads to a common base-path which is how this option came to be; and it has stayed like this ever since. There is a static part + a dynamic part + a filename, which seems reasonable to me.

To solve your "slash" problem: I guess I could just replace all forward- with backward-slashes on Windows which should result in a consistent use of \ as path separator. (edit: https://github.com/mikf/gallery-dl/commit/d241a0fb6022535efe0401ab3bcc1960e082239e)
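The idea amounts to something like the following sketch. The helper is hypothetical and takes the platform check as a parameter so it can be tried anywhere; the linked commit may do it differently.

```python
import ntpath


def normalize_separators(path, windows=True):
    """Replace forward slashes with backslashes, but only for Windows paths."""
    return path.replace("/", "\\") if windows else path


base = normalize_separators("D:/Download/Pictures")
full = ntpath.join(base, "Extractor", "Example")

print(base)
print(full)
```

On other systems (windows=False) the path is returned unchanged, since \ is a regular filename character there.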

Hrxn commented 6 years ago

Interesting to know, thank you for the explanation.

In summary, we could say the true cause of this "issue" is the Python interpreter and its implementation itself, right? Depending on the OS, of course, but apparently the functionality of os.path etc. just takes any basic string and doesn't bother further. I assume that Python (on Windows) itself then uses some standard Windows API function for the output directory, and the Windows API doesn't care either about proper path separators, if I recall that correctly. In the end, I guess we can only speculate whether this is all a design decision or simply a small lapse. But okay, I digress..

Thank you anyway for addressing this very specific nuisance.. 😄

But with the latest commit, what is the one true way to write my gallery-dl.conf? Or does it really matter, because the path separators now always get replaced, either way?

mikf commented 6 years ago

we could say the true cause of this "issue" is the Python interpreter and its implementation itself, right?

Well, not really. The functionality is well documented, so I could have somehow worked around this, but I didn't realize that forward slashes in Windows could be an issue ... doesn't help that I'm not using Windows myself.

But with the latest commit, what is the one true way to write my gallery-dl.conf?

As you said yourself, it doesn't really matter. Both work (/ or \\), so just use what looks best.

Hrxn commented 6 years ago

@mikf Happy New Year! 🍾 🎆 🎇

If I may inquire, are you currently planning on adding support for some new sites? Or already something in the pipeline? Other plans in that regard?

Because I'd like to make a suggestion, basically, and maybe get some other opinions and feedback in here 😄

mikf commented 6 years ago

Happy New Year to you, too.

There are no plans on adding support for new sites from my side, but I have been thinking about adding a few features - an equivalent of YoutubeDL's --download-archive and (maybe) a way of executing external processes after each image download (post-processing, writing metadata, etc.) - as well as finally adding some necessities like GitHub issue templates and a contributing guide.

If you have an idea or suggestion about improvements, (new) features, site support, etc., just open a new issue and let me know.

Hrxn commented 6 years ago

I presume that something like --download-archive would be a useful feature, agreed. Good idea, actually.

Not sure if templates for GitHub are really that necessary, considering the rather low amount of opened issues. If the tracker gets flooded with new issues, this would be a different story. But if you think that the repository would feel like something's missing, for lack of a better description right now, don't mind my comment on this 😉

I will definitely open a new issue for a new site, but I wanted to gather some feedback first, and since this thread is already in existence [1], I thought it would be a good idea to simply ask first. Dunno, I would really like to see some other users chiming in here, but so far there aren't that many, unfortunately.

Okay, everyone reading this, please let me know: What do you think of adding support for ArtStation, for example?

[1] Although I admit, I am not too happy about it. Because, technically, this is not a real issue, rather a "meta-issue", and this rubs my OCD in the wrong way, because it goes a bit against the principles of consistency and purity, and is kind of a conceptual issue in itself 😄 But I don't know what would work better instead right now. I think something like a #gallery-dl channel on IRC would be nice to have, and I would totally come and hang out there, but off-site solutions are usually less than ideal solutions.

Maybe this Projects feature on GitHub would be a good alternative? This one here: https://github.com/mikf/gallery-dl/projects Maybe some kind of Note can be opened, as a quick stop for any kind of discussion or something, not sure.

Hrxn commented 6 years ago

Anyone? Please?

Bfgeshka commented 6 years ago

The functions covered by GH projects and the issue tracker are virtually the same. The only important factor here is the personal preference of the main maintainer. I think that the common tracker is much more straightforward.

mikf commented 6 years ago

The projects page doesn't seem particularly suited to fulfill a similar role as this meta-issue here does. Having an issue for general discussion is a lot more accessible/visible than a "meta-project" on the projects page and, as Bfgeshka said, much more straightforward for the average user.

But you are right, there should probably be another way and place for general questions and discussion. An IRC channel (on freenode?) would be nice and all but it would most likely require some sort of logging bot to be useful. Another alternative might be Gitter, which is used by quite a few other GitHub projects. I've played around with it a bit and registered a "community" and room there: https://gitter.im/gallery-dl/main . Maybe that is something to use.

Hrxn commented 6 years ago

This Gitter thing is pretty nice.. especially the integration with GitHub, definite advantage over a normal vanilla IRC channel.

As I understand it, the projects feature offers better visualization and organization of all related matters, in the form of boards, kanban style. I personally like these, but it might take some time to get used to it for any novices, and at the current state of the project in general, primarily activity, it might be a bit overkill right now. And you are right, accessibility and visibility should be the main concern here. I mean, any board/notes whatever in Project can be mentioned (and linked) in README.md, thus appearing directly on the "front" page, but on the other hand, the majority of users on GitHub is already familiar with the Issues tab, and that is therefore the place where they go/search first, I assume.

In the meantime, the meta-issue is definitely fine with me, no complaints here. Although on my end, not sure if you are affected as well, I can notice a small delay when opening this issue, it's not slow or anything, but noticeable, in my opinion. And as #11 here continues to grow, I guess at some point we'd have to close it and open a new one 😄

But okay, I think we're already in bike-shedding territory here. So, what do you think of ArtStation: 👍 or 👎

rachmadaniHaryono commented 6 years ago

@mikf can you recommend a way to cache the result of the extractor?

  1. Can you explain the message types in https://github.com/mikf/gallery-dl/blob/master/gallery_dl/extractor/message.py? What should the keywords look like? How does gallery-dl handle each type of message?

  2. I tried the gist you wrote:

from gallery_dl import job

j = job.UrlJob("http://example.org/image")
j.run()  # prints "http://example.org/img.jpg"
print(j.extractor)

This takes a long time for something like a link to a reddit thread, because it finds other links and extracts them directly. So I'm trying a custom UrlJob which handles messages of type Message.Queue the same way as Message.Url:

from gallery_dl import job
from gallery_dl.extractor.message import Message

class CustomUrlJob(job.UrlJob):

    def run(self):
        try:
            log = self.extractor.log
            for msg in self.extractor:
                if msg[0] == Message.Queue:
                    _, url, kwds = msg
                    self.update_kwdict(kwds)
                    self.handle_url(url, kwds)
                else:
                    self.dispatch(msg)
                ...

Is there a better way to do it?

mikf commented 6 years ago

Caching extractor results (and a bit more) is what the DataJob class does, but you can have this a lot easier than that. Extractor results are just tuples where the first element is one of these message-type identifiers from message.py which determines the type and meaning of the other elements.

To just copy all of these tuples for later use, try this: https://gist.github.com/mikf/052916c25a9bda7d6876a355cacbe88f

And the UrlJob thing is a bit of a mistake on my part and will be fixed in one of the next commits. For the time being, set UrlJob.maxdepth to 1 and it should pass Queue messages to its handle_url() method.

edit: updated the gist code to use extend() instead of append()
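To illustrate the "tuples with a type identifier first" idea without depending on gallery-dl itself, here is a self-contained sketch. The identifiers, the tuple layouts and the fake extractor are stand-ins made up for this example; the real values live in message.py and the real code is in the gist.

```python
# Hypothetical stand-ins for the message-type identifiers in message.py.
DIRECTORY, URL, QUEUE = "directory", "url", "queue"


def fake_extractor():
    """Yield message tuples the way an extractor might (assumed layouts)."""
    yield (DIRECTORY, {"category": "example"})
    yield (URL, "http://example.org/img.jpg", {"extension": "jpg"})
    yield (QUEUE, "http://example.org/other-gallery", {})


# Caching is just keeping the tuples around for later use.
cached = list(fake_extractor())


def collect_urls(messages, include_queue=False):
    """Pick the URLs out of cached messages, optionally treating Queue as Url."""
    wanted = {URL, QUEUE} if include_queue else {URL}
    return [msg[1] for msg in messages if msg[0] in wanted]


print(collect_urls(cached))
print(collect_urls(cached, include_queue=True))
```

The cached list can be iterated as often as needed without running the extractor (and its network requests) again.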

mikf commented 6 years ago

@Hrxn: before I forget, I'm also noticing a considerable delay when opening this issue, so closing this and creating a new one might be in order. ArtStation gets a 👍 from me, but I would like to have this as a separate issue with example URLs and all that.

Hrxn commented 6 years ago

Roger that, closing this and opening issue for new site soon. 👍