mikf / gallery-dl

Command-line program to download image galleries and collections from several image hosting sites
GNU General Public License v2.0

Questions, Feedback and Suggestions #11

Closed Hrxn closed 6 years ago

Hrxn commented 7 years ago

A central place for these things might be a good idea.

This thread could serve as a starting point, results will eventually be collected in the project wiki, if appropriate and useful.

Edited 2017-04-15 For conciseness

Edited 2017-05-04 Removed nonsensical checklist thing


mikf commented 7 years ago

This is actually a really good idea, especially since I'm very hesitant/lazy about documenting things or writing text in general.

edit: The more I think about it, the less satisfied I am with the previous explanation, so here is version 2.

Keywords for filenames:

    age_limit: all-age
    artist-id: 11
    artist-name: pixiv事務局
    artist-nick: pixiv
    book_style: right_to_left
    ...
    category: pixiv
    content_type: None
    created_time: 2017-03-31 13:50:53
    extension: jpg
    favorite_id: 0
    height: 865
    id: 62178245
    ...

- These key-value pairs are used to generate directory- and filenames by plugging them into [format strings](https://docs.python.org/3/library/string.html#formatstrings). For directories this is a list of format strings to work around the different path segment separators in Windows and UNIX systems (backslash `\` or slash `/`).

- Each extractor has a default format string for directory- and filenames. For `pixiv` this is
directory_fmt = ["{category}", "{artist-id}-{artist-nick}"]
filename_fmt = "{category}_{artist-id}_{id}{num}.{extension}"
- The default values can be overridden in your configuration file by setting the appropriate `directory` and `filename` values.

{
    "extractor":
    {
        "pixiv":
        {
            "directory": ["my pixiv images", "{artist-id}"],
            "filename": "{id}.{extension}"
        }
    }
}



- The `category` of each extractor is a keyword supplied in every key-value pair collection. It can therefore be used in every format string and has been chosen to be the first segment of every default format string for directory names, but that can, as explained above, be changed.
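As a rough illustration of the points above, the sketch below plugs a few of the listed keywords into the default pixiv format strings using Python's built-in `str.format_map`. The `keywords` dictionary is a hand-picked subset of the listing above, and the empty `num` value is an assumption made for this example; gallery-dl's own code may build its paths differently.

```python
import os.path

# Subset of the keywords shown above; "num" is assumed to be empty here.
keywords = {
    "category": "pixiv",
    "artist-id": 11,
    "artist-nick": "pixiv",
    "id": 62178245,
    "num": "",
    "extension": "jpg",
}

directory_fmt = ["{category}", "{artist-id}-{artist-nick}"]
filename_fmt = "{category}_{artist-id}_{id}{num}.{extension}"

# One format string per path segment, joined with the OS-specific separator.
segments = [fmt.format_map(keywords) for fmt in directory_fmt]
directory = os.path.join(*segments)
filename = filename_fmt.format_map(keywords)

print(os.path.join(directory, filename))
```

Keeping the directory as a list of segments means the same configuration works with both `\` and `/` as path separators.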

(edit end)

If something still doesn't make sense, just tell me and I will try to explain this a bit better.

Hrxn commented 7 years ago

Very good to know, thank you.

Checked some profiles with --list-keywords, very useful, and returns exactly what expected. Everything according to plan, at least on the extraction side :)

I realized what caused the slight confusion (for me): The default format string set by the extractor gets overwritten by the output format defined in gallery-dl.conf, got that, all working as expected so far.

What put me a bit off was this: https://github.com/mikf/gallery-dl/blob/master/gallery-dl.conf#L17-L30

Because pixiv seems to be a bit of a special case here. It defines two different formats for directory, because pixiv makes use of two different "sub-extractors" (for lack of a better word): "user": {..}, and "bookmark": {..}

I think these are called objects in JSON parlance..

Now, if I want to use my own directory and filename values in gallery-dl.conf, along the lines of your given example:

{
    "extractor":
    {
        "pixiv":
        {
            "directory": ["my pixiv images", "{artist-id}"],
            "filename": "{id}.{extension}"
        }
    }
}

I put these two definitions into the "pixiv" object, that is, one level above the "user" and "bookmark" objects, right? This way, both definitions from each object get overwritten with the customized output format. A bit non-obvious, but this might just be me. And as long as it's working, nothing to complain about here ;-)

mikf commented 7 years ago

Because pixiv seems to be a bit of a special case here.

What you have discovered here is true for all extractors, especially those with more than one extractor per module, and not just pixiv. In general, the configuration value located "deepest" inside the dictionary- or object-tree is used. If none is found, the config system falls back to the default value.

An example:

{
    "extractor":
    {
        "pixiv":
        {
            "user": { "filename": "A" },
            "filename": "B"
        },
        "deviantart":
        {
            "image": { "filename": "C"}
        },
        "filename": "D"
    }
}

With a configuration file like the one above, the following is going to happen:

I put these two definitions into the "pixiv" object, that is, one level above the "user" and "bookmark" objects, right? This way, both definitions from each object get overwritten with the customized output format.

Yes, if you have those two definitions at this place, then all pixiv extractors (there are 4 in total) will use these instead of their default format strings.

If you want to dig even deeper, take a look at the inner loop of the config.interpolate function. For example for the pixiv.user extractor this function gets called like so:

    directory = config.interpolate(["extractor", "pixiv", "user", "directory"], default)

This function first searches the top-most level for a value with key "directory" and stores this value if it finds it. It then descends into the "extractor" object and, again, searches this level for a value with key "directory". The same goes on with "pixiv" and "user" until it finally reaches the end. If at any point something goes wrong and an exception gets thrown, which happens if for example the "pixiv" object doesn't exist, then the value stored up to this point gets returned.
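The lookup described above can be sketched roughly as follows. This is not the actual `config.interpolate` code, only a minimal re-implementation of the behaviour as explained in this comment; passing the configuration in as an explicit `config` argument is an assumption made to keep the example self-contained. It reuses the example configuration with the values "A" to "D" from earlier in this comment.

```python
def interpolate(config, path, default=None):
    """Walk `path` through nested dicts and return the deepest value found
    for the final key, or `default` if no level defines it."""
    *scopes, key = path
    value = default
    node = config
    try:
        if key in node:                 # top-most level
            value = node[key]
        for scope in scopes:
            node = node[scope]          # KeyError if the object is missing
            if key in node:
                value = node[key]
    except (KeyError, TypeError):
        pass                            # keep whatever was stored so far
    return value


config = {
    "extractor": {
        "pixiv": {"user": {"filename": "A"}, "filename": "B"},
        "deviantart": {"image": {"filename": "C"}},
        "filename": "D",
    }
}

for category, subcategory in [("pixiv", "user"), ("pixiv", "bookmark"),
                              ("deviantart", "image"), ("gelbooru", "post")]:
    path = ["extractor", category, subcategory, "filename"]
    print(category, subcategory, interpolate(config, path, "default"))
```

With this sketch, `pixiv`/`user` resolves to "A", `pixiv`/`bookmark` to "B", `deviantart`/`image` to "C", and an extractor without its own entry (here `gelbooru`/`post`) to "D".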

Hrxn commented 7 years ago

Okay, got it. Also, found all 4 pixiv extractors ;-)

Very nice, and very flexible. Ultimately, every possible variant can be customized. Excellent.

Just threw some pixiv URLs at the program, can confirm everything works indeed as described! (Including these multiple images per entry/"work", I did a manual recount ;-)

On to the next one..

Okay, this probably is a newbie question, but it looks like exhentai isn't a real site? There is e-hentai, seems like they are related (sister sites?). And you apparently need an e-hentai account first (and some dark magic, probably) before you can use exhentai. I will read a bit into this first.

Pretty sure that is the first time I've ever encountered something like this.

But this theory has a little flaw: If these two sites are indeed related, I'd assume that they don't differ much on the technical side, if at all. But trying some gallery links from e-hentai got me this:

C:\Users\Hrxn>gallery-dl "https://e-hentai.org/g/1047429/525823ef87/"
[gallery-dl][error] No suitable extractor found for 'https://e-hentai.org/g/1047429/525823ef87/'

C:\Users\Hrxn>gallery-dl "https://e-hentai.org/g/1047407/f00ba6d6cf/"
[gallery-dl][error] No suitable extractor found for 'https://e-hentai.org/g/1047407/f00ba6d6cf/'

C:\Users\Hrxn>gallery-dl "https://e-hentai.org/g/1047272/a003dfb22b/"
[gallery-dl][error] No suitable extractor found for 'https://e-hentai.org/g/1047272/a003dfb22b/'

C:\Users\Hrxn>gallery-dl "https://e-hentai.org/g/1047010/d8b62a3c87/"
[gallery-dl][error] No suitable extractor found for 'https://e-hentai.org/g/1047010/d8b62a3c87/'

C:\Users\Hrxn>gallery-dl "https://e-hentai.org/g/1047424/0218b04f9c/"
[gallery-dl][error] No suitable extractor found for 'https://e-hentai.org/g/1047424/0218b04f9c/'

C:\Users\Hrxn>gallery-dl --version
0.8.1-dev

C:\Users\Hrxn>

Or is there another specific reason for this?

mikf commented 7 years ago

exhentai is basically the "dark" version of e-hentai with all the non-advertiser-friendly stuff enabled.

You should be able to access this site by doing this:

In the past the domain of the regular site was g.e-hentai.org and I haven't updated the extractor to also accept e-hentai.org. You can just change the URLs a bit and replace the - with an x or put a g. in front. It all falls back to the same code that relies on having access to exhentai.

Hrxn commented 7 years ago

Okay, made an account, will frequent the site a bit and see how it works out then..

Can't test it before, because https://github.com/mikf/gallery-dl/commit/b603b592cfc17c036f2e7fbbee8f7c7ed4be98ec changed the expression pattern, and that part works, but it's still the exhentai extractor and therefore requires credentials for authentication. Which is not really an issue, don't get me wrong.

I will test some other sites in the meantime, and will update my initial post accordingly.

mikf commented 7 years ago

I don't know if visiting the regular site and so on is even necessary, that is just what I did when I created an account for unit testing and couldn't access exhentai immediately.

Speaking of which: I didn't want to make my unit testing accounts any more public than necessary (for, I hope, obvious reasons), but I should probably just share them with you. Take a look at this.

Hrxn commented 7 years ago

I don't know if visiting the regular site and so on is even necessary, that is just what I did when I created an account for unit testing and couldn't access exhentai immediately.

I'm not sure, but other random sources on the Internet indicate that this is actually the case.

Speaking of which: I didn't want to make my unit testing accounts any more public than necessary (for, I hope, obvious reasons), but I should probably just share them with you. [...]

Yes, obviously. That is nice, but it won't be necessary, I've already made an account and started using it a bit. Besides, creating and using different accounts for different sites and services doesn't really bother me at all. If there is some longer gap between my responses, it's only because I'm busy with something else ;-) I use Keepass for handling this stuff, which is a really great program, as you probably know. It's so good, they should invent a new word for it (great cross-platform alternative: KeepassXC)


Another thing, which I think belongs here, because it's not an issue or bug, but maybe a possible suggestion:

There is another feature on DeviantArt I wasn't aware of before: The Journal.

I noticed it while using gallery-dl with this profile: http://inkveil-matter.deviantart.com/

The site states: 190 deviations. gallery-dl download: 155 files.

Luckily, there is a statistics page which explains this: http://inkveil-matter.deviantart.com/stats/gallery/

InkVeil-Matter has 93,840 pageviews in total; their 35 journals and the 155 deviations in their gallery were viewed 733,738 times.

35 Journal entries, so 190 in total.

Shamelessly copied from the DeviantArt Wikipedia page:

Journals are like personal blogs for the member pages, and the choice of topic is up to each member; some use it to talk about their personal or art-related lives, others use it to spread awareness or marshal support for a cause.

Not sure if that is useful at all. I clicked around a bit, and saw nothing I would consider as missing. Embeds from their own gallery, or from any other, and some links to some drawing feature of DeviantArt I also didn't know of before: Muro, can be seen when visiting sta.sh for example, which also belongs to them as it seems.

I don't know, not sure if I even really understand this feature yet.

Anyway, forgive me my wall of text here, I just wanted to let you know, just in case this is news to you as well ;-)

mikf commented 7 years ago

I use Keepass for handling this stuff, which is a really great program, as you probably know. It's so good, they should invent a new word for it (great cross-platform alternative: KeepassXC)

Thank you for the suggestion but I'm going to stay with my trusty GPG-encrypted plain text file :)

Another thing, which I think belongs here, because it's not an issue or bug, but maybe a possible suggestion

Even if this platform here is called an issue tracker, feel free (and even encouraged) to create new "issues" if you want to suggest or request a feature or support for a new site.

There is another feature on DeviantArt I wasn't aware of before: The Journal.

This seems to be just a collection of blog posts, which might contain references to other deviantart- or sta.sh images. There shouldn't be any images missing: 190 deviations consisting of 155 real deviations and 35 journal entries seems about right to me. I could add an extractor to fetch those references and download all the images of a journal entry if you want me to.

Anyway, forgive me my wall of text here, I just wanted to let you know, just in case this is news to you as well ;-)

No worries, I don't mind walls of text and actually wasn't aware of the journal or muro, so thanks for telling me.

ghost commented 7 years ago

Sorry for asking, but does this tool have a feature to remember which images have already been downloaded, without checking the local directory? IIRC the package already includes an SQLite DLL, right?

Thank you.

mikf commented 7 years ago

No, I am sorry, but such a feature does not currently exist. gallery-dl only skips downloads if a file with the same name already exists, but at this time there is no other way of "remembering" whether an image has been downloaded before. SQLite, as you have noted, is already being used, but only to cache login sessions and the like across separate gallery-dl invocations.

Feel free to open a separate issue if you want a feature like this implemented, but please explain in greater detail what you actually want to do and/or need this feature for.

Hrxn commented 7 years ago

Just saw the new commit adding options for skipping files.

A change from https://github.com/mikf/gallery-dl/commit/fc9223c072ae7bf6d3809704710c7dd8f6a9984b#diff-283aceda91c5f7f10981253611f9f950

    def _exists_abort(self):
        if self.has_extension and os.path.exists(self.realpath):
            raise exception.StopExtraction()
        return False

Current extractor run, in this context, means just the 'active' URL, right? Because I'm not sure yet what the expected behaviour would be if gallery-dl is used like this: gallery-dl --input-file FILE

Maybe a case for an additional option. Or rather not, I'm still not sure about it, need to make up my mind first probably.

mikf commented 7 years ago

Current extractor run, in this context, means just the 'active' URL, right?

Yes. Each URL gets its own extractor, so the --abort-on-skip option works for each URL independently. Aborting the run of one URL has no effect on any other URLs.

Because I'm not sure yet what the expected behaviour would be if gallery-dl is used like this

The -i/--input-file FILE option just appends the URLs inside of FILE to the end of the list of all URLs. gallery-dl -i FILE URL1 is equivalent to gallery-dl URL1 URL2 URL3 if FILE contains URL2 and URL3. Even if, for example, the download for URL1 gets canceled, URL2 and URL3 will still be processed normally.
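To make the two answers above concrete, here is a small sketch of the described behaviour. All names in it (`collect_urls`, `run_all`, `fake_job`, and the local `StopExtraction` class) are hypothetical stand-ins written for this example and are not taken from gallery-dl's code.

```python
class StopExtraction(Exception):
    """Stand-in for the exception raised when a run is aborted."""


def collect_urls(cli_urls, file_lines):
    """Append the URLs read from an input file to those given directly."""
    from_file = [line.strip() for line in file_lines if line.strip()]
    return list(cli_urls) + from_file


def run_all(urls, run_job):
    """Run one job per URL; aborting one job leaves the others untouched."""
    finished = []
    for url in urls:
        try:
            run_job(url)
        except StopExtraction:
            continue  # only this URL's run is aborted
        finished.append(url)
    return finished


def fake_job(url):
    # Hypothetical job that aborts for URL1 only.
    if url == "URL1":
        raise StopExtraction()


urls = collect_urls(["URL1"], ["URL2\n", "URL3\n"])
print(urls)
print(run_all(urls, fake_job))
```

Even though the job for URL1 is aborted, URL2 and URL3 are still processed, which matches the per-URL behaviour described above.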

Maybe a case for an additional option

An --exit-on-skip option that just exits the program on any download-skip would certainly be possible.

Hrxn commented 7 years ago

An --exit-on-skip option that just exits the program on any download-skip would certainly be possible.

Yes, for example. I think the current behaviour is just right as the default, we'll see when someone asks for other variants.

HASTJI commented 7 years ago

Do you plan to add a graphical interface for the program? At least input fields and pause/continue buttons. I'm also interested in multi-threading and in being able to plan the uploads one by one via a GUI. Yes, I know that it can be done through the console, but still ...

Hrxn commented 7 years ago

Well, I don't know, but if I may, let me add just this: I wish people would realize how much programming work implementing a GUI actually is. And the thing is, that means actual code, lots of lines of code, only for the GUI, and this never gets used outside of the GUI again. So this is just additional work on top, without any benefit for the actual underlying code.

mikf commented 7 years ago

No, there are no plans for a graphical user interface, mainly because of the reasons @Hrxn listed. A lot of the features you mentioned can already be done via a (reasonable) terminal and shell plus the (GNU) coreutils that usually come with them and I don't really want to re-implement this. I do realize that the CLI "experience" on Windows is terrible, so maybe, if there is a big enough demand, I might add some sort of GUI in the future, but that will always be low priority.

HASTJI commented 7 years ago

No, there are no plans for a graphical user interface, mainly because of the reasons @Hrxn listed.

Hmmm... I can try to build a graphical shell in C# for the Windows version that passes commands to and from gallery-dl, but I'm not sure it will be quick. But I will try my best.

Hrxn commented 7 years ago

Yes, I think that's a good idea. Also, in my opinion, using the CLI on Windows isn't too bad. For many use cases, standard batch scripts (*.bat/*.cmd) should be enough, for example starting gallery-dl with multiple/dozens/hundreds of URLs, and if you need plenty of scripting capabilities, you can use gallery-dl within PowerShell.

I wouldn't even know what to use a GUI program for, to be honest. If the program is running, there isn't much to see, because what actually takes the most time is just transferring data across the net, aka downloading. You could add some fancy progress bars, but this doesn't really change anything, in my opinion. Besides, progress bar support can also be done in CLI, via simple text output written to the terminal, like wget and curl for example.

The only thing I can really think of right now is managing your personal usage history of the program, so to speak. That means having all in one central place, a queue for all URLs that are yet to be processed, and an archive of all URLs that already have been done. This would be more of a meta-program, if you think about it, because all this can be done completely independent of gallery-dl. You could also use this program to write the script files for the CLI then 😄

As a starting point, writing processed URLs to archive file(s) would be a good idea, I think. Something along the lines of the --download-archive option of youtube-dl, for example.
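A minimal sketch of such an archive is shown below. It is not an existing gallery-dl feature; the `UrlArchive` class and its table layout are hypothetical. youtube-dl's --download-archive records entries in a plain text file; this sketch uses Python's `sqlite3` module instead, simply because SQLite already came up earlier in this thread.

```python
import sqlite3


class UrlArchive:
    """Remember processed URLs in an SQLite database (hypothetical helper)."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS archive (url TEXT PRIMARY KEY)")

    def __contains__(self, url):
        row = self.db.execute(
            "SELECT 1 FROM archive WHERE url = ?", (url,)).fetchone()
        return row is not None

    def add(self, url):
        self.db.execute(
            "INSERT OR IGNORE INTO archive (url) VALUES (?)", (url,))
        self.db.commit()


archive = UrlArchive()  # pass a file path to keep the archive between runs
queue = ["https://example.org/a", "https://example.org/b",
         "https://example.org/a"]
processed = []
for url in queue:
    if url in archive:
        continue  # already handled in this or an earlier run
    processed.append(url)  # a real tool would start the download here
    archive.add(url)
print(processed)
```

The queue plus the archive table would cover both halves of the "usage history" idea: URLs yet to be processed and URLs already done.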

Hrxn commented 7 years ago

Interesting, although these sites seem so similar (and the Gelbooru site even states "concept by Danbooru"), they are yet so different in terms of implementation and functionality.

I just checked again, Gelbooru support for pools may be pretty much irrelevant, at least for now. Because unlike on Danbooru, where pools are used quite extensively, Gelbooru only seems to have 25 pools in total right now, and there is not really much activity. At least that is what I see here, even with an account on Gelbooru. An account does let you create your own pools (public and private), though, so you can collect different posts there which could then be downloaded. So this might be relevant to potential gallery-dl users, maybe..

mikf commented 7 years ago

Gelbooru only seems to have 25 pools in total right now

There seem to be up to 44500 pools if you take a look at the id parameter of one of the pool URLs, but the pagination controls for gelbooru's pool- and tag pages seem to be missing. You can get to the next pool page by setting the pid parameter in the URL (the same mechanism is used on their posts-page): page2 page3 and so on.

Hrxn commented 7 years ago

What made you check that? Was 25 inconceivably low for a big site? 😉 But you're right, of course. Incidentally, I found the pagination on Gelbooru! It was blocked by uBlock Origin, which I use on Chrome. Well, not just on Chrome, I use it wherever I can, actually. That means that some entry in one of the filter lists breaks the site... Edit: Not sure which tag page exactly, but apart from pools, pagination seems to work for me.

mikf commented 7 years ago

This one: https://gelbooru.com/index.php?page=tags&s=list

AdblockPlus + filter list seems to be causing the same issue.

Hrxn commented 7 years ago

Ah, okay. Yes, pagination also broken for me on that listing.

Hrxn commented 7 years ago

Small suggestion:

Add a column to Supported Sites to indicate status of user authentication. Not sure, just a simple "Supported" Yes or No. Or "Required", Yes or No. Or "Required", "Optional"...

By the way, does this case even exist currently? We have extractors that require authentication: Pixiv, Exhentai, and some others. And other extractors that don't authenticate at all, right?

mikf commented 7 years ago

Thanks for the suggestion -> done fb1904dd59a95fe7728158666648de5d1dafc52d

And yes, there are two modules with optional authentication: bato.to and exhentai. bato.to only offers a very limited selection of manga chapters if you are not logged in, but it is still usable. Exhentai tries to fall back to the e-hentai version of the site, which only works for some galleries, and original image downloads aren't available either. (I added the fallback mechanism for exhentai only after our discussion, btw. af56887a47c44a6042b0787ba6ced9341ab169a5)

jtara1 commented 7 years ago

Any chance we can get a manga downloader that groups by volume (and chapter)?

What's the best site for downloading manga? MangaFox and MangaReader leave watermarks. MangaHere leaves an ugly watermark on each chapter (MangaScreener). MangaPanda leaves a MangaReader watermark.

mikf commented 7 years ago

I've been thinking about implementing a way to filter by metadata (something like youtube-dl's --match-filter option), so maybe, at some point in the future, there might be something like that. Right now there is only the --chapters option, which lets you select chapters by index, which is not necessarily the chapter number.

I don't really know which manga site is actually good, but I'd probably suggest kissmanga. Pick your poison, I guess.

Hrxn commented 7 years ago

Small suggestion:

Clarify usage of extractors, sub-extractors and their options in the Readme or the configuration documentation.

I think the general usage of gallery-dl is pretty straightforward, but a novice user might think that it's not immediately obvious.

Here's what is happening, to my understanding: The user submits one (or more) URL(s) to gallery-dl .

  1. gallery-dl matches each URL argument against a regular expression to determine the correct extractor. (Or returns an error for a URL without a matching extractor)
  2. gallery-dl matches the URL path (etc.) against another regular expression to determine the correct sub-extractor variant for each extractor of each URL.

Examples can be seen by running gallery-dl --list-extractors, but this should maybe be mentioned more explicitly. Maybe by including it in the information message printed when running gallery-dl without any arguments at all.

Furthermore, all extractors and variants should be properly documented somewhere, I think by either having a complete list or by explaining the proper syntax of the extractor options. Because right now there are only the 2-3 examples mentioned in the gallery-dl.conf for demonstration. Everything else can be figured out from there on, but again, this is maybe not really obvious (enough).

The underlying principle is already in configuration.rst, so far, so good, but

extractor.*.filename and extractor.*.directory etc.

is only the basic part, and it doesn't reflect that you can do more with gallery-dl, by setting the configuration like this:

extractor.*.<sub-extractor>.filename and extractor.*.<sub-extractor>.directory etc.

Again, this is probably not immediately obvious, so the usage of the two variables (extractor, sub-extractor) should be more clear.

So far this can only be figured out by looking at the code, at least that is what I did so far. All the extractors are listed here: https://github.com/mikf/gallery-dl/tree/master/gallery_dl/extractor

The actual name of each extractor can be inferred from the filenames, but if I'm not mistaken, this is more of a coincidence, and the actual name of each extractor is defined by the variable category inside the Python source files.

Along those lines, for every extractor that has sub-extractor variants, these names are defined by the variable subcategory inside the source.

Thankfully, we're not stuck here, but can use the search on Github: Searching for category Searching for subcategory

mikf commented 7 years ago

I've finally gotten around to actually writing some documentation that should make the use of categories and subcategories more obvious and hopefully addresses some of the issues you mentioned. Now to answer some of your points:

Here's what is happening, to my understanding

  1. ...
  2. ...

It's actually simpler than that: Each extractor class has a list of regular expressions (usually containing only 1) which match the whole URL. A match is found by going through all regexps from all extractor classes and applying the regular expressions until one matches the URL. The upper level function for that is only 5 lines of actual code. There is no difference in how the choice between two entirely different extractors or between two related (sub-)extractors is made.
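The matching described above can be sketched in a few lines. The classes and regular expressions below are simplified, hypothetical examples written for this illustration (the real patterns live in the extractor modules), and `find` is only an approximation of the upper-level function mentioned above.

```python
import re


class Extractor:
    pattern = []  # regular expressions that match complete URLs

    def __init__(self, match):
        self.match = match


class PixivUserExtractor(Extractor):
    # Hypothetical pattern, written for this example only.
    pattern = [r"https?://(?:www\.)?pixiv\.net/member_illust\.php\?id=(\d+)$"]


class PixivMeExtractor(Extractor):
    # Hypothetical pattern, written for this example only.
    pattern = [r"https?://pixiv\.me/([^/?#]+)$"]


def find(url, extractor_classes):
    """Return an instance of the first extractor whose pattern matches."""
    for cls in extractor_classes:
        for pattern in cls.pattern:
            match = re.match(pattern, url)
            if match:
                return cls(match)
    return None


classes = [PixivUserExtractor, PixivMeExtractor]
user = find("http://www.pixiv.net/member_illust.php?id=173530", classes)
me = find("https://pixiv.me/del_shannon", classes)
unknown = find("https://e-hentai.org/g/1047429/525823ef87/", classes)
print(type(user).__name__, type(me).__name__, unknown)
```

Note that the two pixiv classes are treated exactly like two unrelated extractors would be: the loop has no notion of "sub-extractors", and a URL that matches nothing simply yields `None`.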

Furthermore, all extractors and variants should be properly documented somewhere ... So far this can only be figured out by looking at the code

There is the list of supported sites and the capabilities listed therein, or the output of gallery-dl --list-extractors, which provides a list of all extractor names and, as a result, all category-subcategory pairs. I would think that this is pretty much enough. Where do you see a problem with that?

Hrxn commented 7 years ago

Thanks for clearing that up and extending the documentation, really appreciated! 👍

You're right, gallery-dl --list-keywords URL provides the correct names of category and subcategory for that URL/Extractor.

And you added this to the documentation, so no problem here:

Each extractor name is structured as CategorySubcategoryExtractor. An extractor called PixivUserExtractor has therefore the category pixiv and the subcategory user.

What I initially had in mind was expanding the output of --list-extractors a bit.

Right now, snipped:

PinterestPinitExtractor
Extractor for images from a pin.it URL
Example: https://pin.it/Hvt8hgT

PixhostImageExtractor
Extractor for single images from pixhost.org

PixivBookmarkExtractor
Extractor for all favorites/bookmarks of your own account

PixivFavoriteExtractor
Extractor for all favorites/bookmarks of a pixiv-user
Example: http://www.pixiv.net/bookmark.php?id=173530

PixivMeExtractor
Extractor for pixiv.me URLs
Example: https://pixiv.me/del_shannon

PixivUserExtractor
Extractor for works of a pixiv-user
Example: http://www.pixiv.net/member_illust.php?id=173530

PixivWorkExtractor
Extractor for a single pixiv work/illustration
Example: http://www.pixiv.net/member_illust.php?mode=medium&illust_id=966412

Which could then not only list an Example URL, but also something along the lines of

Example: ...
Category: ...
Subcategory: ..

Which ends up as the default directory structure, hence it maybe would make sense to mention this more explicitly. But you're right, this is also somewhat redundant now, with the good documentation for the options, and I feel we're already in bike-shedding territory now 😄

It's just that if you keep directory and filename settings intact, you can easily do incremental updates with gallery-dl, which is nice, obviously. I just wanted to make sure that other users also realize this and maybe make use of it 😉

But there's also something which I stumbled upon which led me to all this:

  1. Excerpt from --list-extractors

    ThreedeebooruPoolExtractor
    Extractor for image-pools from behoimi.org
    Example: http://behoimi.org/pool/show/27

    ThreedeebooruPostExtractor
    Extractor for single images from behoimi.org
    Example: http://behoimi.org/post/show/140852

    ThreedeebooruTagExtractor
    Extractor for images from behoimi.org based on search-tags
    Example: http://behoimi.org/post?tags=himekawa_azuru dress

  2. Excerpt from `--list-keywords http://behoimi.org/pool/show/27`

    Keywords for directory names:

    category 3dbooru
    pool 27
    subcategory pool

    Keywords for filenames:

    author darkgray
    category 3dbooru
    change 597709
    created_at[json_class] Time
    created_at[n] 183101000
    [...]

mikf commented 7 years ago

Yeah, there are a few exceptions to that extractor-name rule for 3dbooru, 4chan and 8chan, as you can't use a digit as the first character of an identifier/class name. I've updated the output of the --list-extractors option to include (sub)categories, like you suggested, to have a consistent way of getting these values (06c4cae05b60a24b2f3f89c1aeda41a24c4be9ff). Class names and --list-keywords do not work all the time, so I hope this is a better solution, even if that made some parts of the documentation useless or redundant.
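The following sketch shows why guessing from class names is unreliable. `split_name` is a hypothetical helper that applies the CategorySubcategoryExtractor naming rule quoted earlier; it is not part of gallery-dl.

```python
import re

# Category = first capitalized word, subcategory = everything up to "Extractor".
NAME_RE = re.compile(r"([A-Z][a-z0-9]*)([A-Z][A-Za-z0-9]*)Extractor")


def split_name(class_name):
    """Guess (category, subcategory) from a CategorySubcategoryExtractor name."""
    match = NAME_RE.fullmatch(class_name)
    if match is None:
        raise ValueError("unexpected extractor name: " + class_name)
    return match.group(1).lower(), match.group(2).lower()


print(split_name("PixivUserExtractor"))
print(split_name("ThreedeebooruPoolExtractor"))
```

For `PixivUserExtractor` the guess is `pixiv` and `user`, which is correct. For `ThreedeebooruPoolExtractor` the guess is `threedeebooru`, while --list-keywords reports the category as `3dbooru`, so reading the values from the --list-extractors output is the safer route.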

jtara1 commented 7 years ago

Are there any guidelines for contributing if I wanted to add support for another site? Should I just follow similarly to what's already done?

mikf commented 7 years ago

There are currently no actual contributing guidelines, but following what is already there seems fine. You might as well just copy an existing module and modify its code. Ask if you have any questions about how things work or should work.

Some general rules would be:

Hrxn commented 7 years ago

Small suggestion:

Scrolled a bit through the Python documentation, and it seems that pip always explicitly distinguishes between 'installing' and 'upgrading'. The example here mentions modules specifically, but it's the same mechanism, so I don't think this is any different.

Normally, if a suitable module is already installed, attempting to install it again will have no effect. Upgrading existing modules must be requested explicitly: [..]

The pip-specific documentation doesn't mention it explicitly, but doesn't say otherwise either.

Could be OS dependent, I tested this on Windows and you need to specify the upgrading option here. I probably didn't stumble over this before because I always used the --upgrade flag intuitively.

So, in conclusion, it would probably make sense to mention this in README.rst. By either extending point 1.1 with pip install --upgrade gallery-dl and pip install --upgrade https://github.com/mikf/gallery-dl/archive/master.zip or by using these examples in a new inserted 1.2 "Updating via pip" or something..

Now don't ask me about manual installation via python setup.py install etc. If I remember correctly, I've done this in the past (with another package) and updating should work this way, but only if you keep using this method. Because I wouldn't be surprised if this doesn't work any longer as soon as someone used pip install .., because manual installation probably doesn't overwrite then..

rachmadaniHaryono commented 7 years ago

Hi, I just made a server feature here https://github.com/rachmadaniHaryono/gallery-dl/tree/feature/server, because I want to use gallery-dl together with hydrus.

I'm not quite sure if this feature is within gallery-dl's scope, so I made as few changes as possible to the gallery-dl package.

I don't quite understand what the DataJob data should contain. If there is an example for it, that would help me.

mikf commented 7 years ago

I would like for gallery-dl to just stay as a command-line program (like youtube-dl), but having a separate `gallery-dl-server` project/package and adding some features that are needed by that to gallery-dl itself would be fine.

The data member of the DataJob class just holds a list of all "Messages" that an extractor emits. These usually cause the handle_url(), etc. methods to be called, but in this case they just get stored in the data list and later written to a file or stdout. You should probably just create your own Job subclass and overwrite the handle_... methods. In your case it is probably enough to focus on the "Url" messages. ("Directory" is there to create the target directory for the following images; "Queue" is supposed to offload its URL to another extractor; in the past there were also "Headers" and "Cookies" messages, but these have become obsolete)


from gallery_dl.job import Job

class CustomJob(Job):
    def handle_url(self, url, metadata):
        # url is the download URL of the image as a string
        # metadata is a dictionary
        print(url)

job = CustomJob(input_url)  # input_url: the gallery URL to process
job.run()  # prints all image URLs

rachmadaniHaryono commented 7 years ago

I would like for gallery-dl to just stay as a command-line program (like youtube-dl), but having a separate `gallery-dl-server` project/package and adding some features that are needed by that to gallery-dl itself would be fine.

i think i will keep it in the fork for now, because no one is asking for it yet. creating another repo invites fragmentation problems.

is getting only the url faster than also getting its metadata?

is there a defined structure for how this list is built?

from what i know, the 1st element is (1,1), the 2nd is (2, gallery_data_dict), and the 3rd and following are (index, url, url_data_dict). if there is an error, there is only (error_name, error_str).

You should probably just create your own Job subclass and override the handle_... methods. In your case it is probably enough to focus on the "Url" messages. ("Directory" is there to create the target directory for the following images; "Queue" is supposed to offload its URL to another extractor;

i still don't understand this paragraph.

right now i have no idea how to present the gallery list and the gallery itself in html. you said to focus on the url, but having the metadata is more helpful.

the current server still passes the data in the url, because the models are not good enough yet (see below)

/gallery?data=%5B3%2C%20"https%3A//cdnio.luscious.net/AwronZizao/289557/tumblr_n8yyhlc7jf1tfyunpo1_1280_01BPMTYQ38Q78CHCVTC1YTE3EH.jpg"%2C%20%7B"artist"%3A%20null%2C%20"count"%3A%20"78"%2C%20"extension"%3A%20"jpg"%2C%20"image-id"%3A%20null%2C%20"lang"%3A%20null%2C%20"language"%3A%20null%2C%20"name"%3A%20"N8Yyhlc7Jf1Tfyunpo1%201280"%2C%20"num"%3A%201%2C%20"section"%3A%20"Hentai"%2C%20"tags"%3A%20null%2C%20"title"%3A%20"Socks/Stockings"%7D%5D#

gallery-dl - chromium_035
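
The `data` query parameter in the URL above is a percent-encoded JSON list of the form `[3, url, metadata]`. A minimal sketch of how such a URL could be built and decoded again with Python's standard library. `decode_data_param` and the example.org address are hypothetical names used for illustration only; they are not part of gallery-dl or of the fork.

```python
import json
from urllib.parse import parse_qs, quote, urlsplit


def decode_data_param(url):
    """Return the JSON value stored in the 'data' query parameter of url."""
    query = urlsplit(url).query
    raw = parse_qs(query)["data"][0]  # parse_qs undoes the percent-encoding
    return json.loads(raw)


# Hypothetical URL-message: [message id, image URL, metadata dict]
message = [3, "https://example.org/image.jpg",
           {"num": 1, "extension": "jpg", "tags": None}]

# Build a request path the same way the server URL above is structured
request_path = "/gallery?data=" + quote(json.dumps(message))

decoded = decode_data_param(request_path)
print(decoded[1])  # the image URL
print(decoded[2])  # the metadata dictionary
```

Passing whole messages through the query string works for a quick prototype, but URLs have practical length limits, so larger metadata dictionaries would likely need a different transport.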

E: also, i'm not quite sure about subclassing the bare Job class. Right now I am only using DataJob because it is the only one that gives just the metadata and the picture URL.

It would be very helpful if the data-extraction part and the output part of the DataJob class's run method were separated.

mikf commented 7 years ago

i think i will keep it in the fork for now, because no one is asking for it yet. creating another repo invites fragmentation problems.

Can't you just depend on the gallery-dl package and have the GUI or server as its own separate package like, for example, youtube-dl-gui does with youtube-dl?

is getting only the url faster than also getting its metadata?

No, it isn't. The extractors always provide both, but getting the metadata usually takes no extra time for them (i.e. no extra HTTP requests).

is there a defined structure for how this list is built?

The list just contains the message-tuples that the extractor emits. The first item in each tuple is always the message-identifier (one of these constants here: message.py), and the additional items are the arguments for this particular message type. The (1,1) originated from a `yield Message.Version, 1` here, the (2, gallery_data_dict) came from a `yield Message.Directory, gallery_metadata`, and so on. You shouldn't rely on these tuples being in any particular order, as this can vary per extractor-class.
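
To illustrate the tuple layout described above, here is a minimal sketch that picks the URL-messages out of such a list without relying on their position. The constants `VERSION`, `DIRECTORY` and `URL` are hypothetical stand-ins for the identifiers in message.py; their values 1, 2 and 3 are simply the ones visible in the tuples quoted in this thread, and the sample messages are made up.

```python
# Hypothetical stand-ins for the message identifiers defined in message.py;
# the numeric values are taken from the tuples quoted in this thread.
VERSION, DIRECTORY, URL = 1, 2, 3


def collect_urls(messages):
    """Return (url, metadata) pairs for every URL-message in the list."""
    results = []
    for msg in messages:
        # dispatch on the identifier, never on the position in the list
        if msg[0] == URL:
            _, url, metadata = msg
            results.append((url, metadata))
    return results


# Made-up sample of what a DataJob data list might look like
messages = [
    (VERSION, 1),
    (DIRECTORY, {"title": "example gallery"}),
    (URL, "https://example.org/1.jpg", {"num": 1, "extension": "jpg"}),
    (URL, "https://example.org/2.jpg", {"num": 2, "extension": "jpg"}),
]

for url, metadata in collect_urls(messages):
    print(metadata["num"], url)
```

Filtering by identifier keeps the code independent of the order in which a particular extractor emits its messages.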

you said to focus on the url, but having the metadata is more helpful.

I said to focus on URL-messages, which contain the actual URL and its metadata (see the code example above).


edit: i've written some simple example code that might be helpful: https://gist.github.com/mikf/0e591c7ef290097f29adb662ae730424 Bear in mind that calling job.run() might take a really long time.

rachmadaniHaryono commented 7 years ago

thanks for the snippet. that looks easier than using the current subclass of DataJob

Can't you just depend on the gallery-dl package and have the GUI or server as its own separate package like, for example, youtube-dl-gui does with youtube-dl?

if there is any interest in this i will make a repo out of it.

about gallery_data_dict and url_data_dict, i suppose they depend on each parser and there are no fixed keys or value types?

I said to focus on URL-messages, which contain the actual URL and its metadata (See the code-example above)

you are right. i think i confused the url with the url-message

mikf commented 7 years ago

about gallery_data_dict and url_data_dict, i suppose they depend on each parser and there are no fixed keys or value types?

Yes, as there doesn't seem to be a good way of fitting all the different metadata variants of each parser/extractor into a single schema. Similar extractors usually have the same metadata-keys, but you can always check these with `gallery-dl --list-keywords <URL>`, or you can just print the key-value pairs of the metadata dict directly.
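
Printing the key-value pairs directly could look like the following minimal sketch. `format_keywords` is a hypothetical helper, not a gallery-dl function, and the sample values are a subset of the luscious metadata quoted earlier in this thread.

```python
def format_keywords(metadata):
    """Return one 'key: value' line per metadata entry, sorted by key."""
    return ["{}: {}".format(key, metadata[key]) for key in sorted(metadata)]


# Subset of the metadata dict from the URL-message quoted earlier
metadata = {"extension": "jpg", "count": "78", "artist": None, "num": 1}

for line in format_keywords(metadata):
    print(line)
```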

Hrxn commented 7 years ago

What about my last suggestion? Not a good idea? To all you Python experts here this may be painfully obvious, I know, but what about other (hypothetical) users? I just want to avoid possible pitfalls, such as issue reports unwittingly made by someone with an outdated version..

rachmadaniHaryono commented 7 years ago

@mikf then how does one define the key? what if someone uses tag and someone else uses tags?

also, is the dict always one-dimensional? i mean, for luscious the value of the tags key could be a list instead of a csv string.

@Hrxn like this?

e: ignore it. you are right, it is already in the readme and i skipped that section

$ wget https://github.com/mikf/gallery-dl/archive/master.zip
$ unzip master.zip
# or
$ git clone https://github.com/mikf/gallery-dl.git

$ cd gallery-dl  # the directory is named gallery-dl-master if you unpacked master.zip
$ python setup.py install
# or
$ pip install .
# or if you want to upgrade it
$ pip install --upgrade .

pip can also install directly from github

$ pip install git+git://github.com/mikf/gallery-dl.git

reference https://stackoverflow.com/questions/8247605/configuring-so-that-pip-install-can-work-from-github

Hrxn commented 7 years ago

Yea, not only directly from GitHub, pip can also install from any source archive, e.g. `pip install https://github.com/mikf/gallery-dl/archive/master.zip`

That is already mentioned in README.rst, but updating doesn't work this way, at least on Windows. To be fair, it's mentioned in the info message, but it is easy to overlook, in my opinion, especially for users not used to this stuff..

mikf commented 7 years ago

What about my last suggestion?

Oh. I apologize. I was a bit busy last week and then I kind of forgot about this. After some tests it seems that pip install --upgrade uses a reasonable default behavior for installing and upgrading, so I've changed README.rst to that (0b576cc131f48a86062bb6349054344fd611d23d). It may also be necessary to mention the use of sudo, the admin-console in Windows, the --user flag or even how to install pip itself for older Python versions, but I think this may be a bit too much and the pip documentation, which is linked in the README, mentions basically everything a user needs to know.

@rachmadaniHaryono I'm not entirely sure what you mean by "define the key", but if this is about choosing an appropriate name, then just use something reasonable that describes what the value is about and try to be consistent with other similar extractors, so don't use tag if tags is used everywhere else.

The dictionary can also be multidimensional (for example pixiv, deviantart, flickr) and the tags value could as well be a list, but there hasn't been a need for that specific thing up till now.
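
Code that consumes the metadata therefore has to cope with nested dictionaries and with `tags` being either a string or a list. A minimal sketch of one way to handle both; `flatten` and `split_tags` are hypothetical helpers, not part of gallery-dl, and the sample dictionary and the dotted key names are made up for illustration.

```python
def flatten(metadata, prefix=""):
    """Yield (dotted_key, value) pairs for a possibly nested dict."""
    for key, value in metadata.items():
        name = prefix + key
        if isinstance(value, dict):
            # descend into nested dicts, joining the keys with a dot
            yield from flatten(value, name + ".")
        else:
            yield name, value


def split_tags(value):
    """Turn a comma-separated tag string into a list; copy lists as-is."""
    if value is None:
        return []
    if isinstance(value, str):
        return [tag.strip() for tag in value.split(",") if tag.strip()]
    return list(value)


# Made-up metadata with one nested dict and a csv-style tags string
metadata = {
    "id": 1,
    "user": {"name": "example", "id": 42},
    "tags": "socks, stockings",
}
metadata["tags"] = split_tags(metadata["tags"])

for key, value in flatten(metadata):
    print(key, value)
```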

rachmadaniHaryono commented 7 years ago

@mikf is it possible to add extractor from outside of the program?

Hrxn commented 7 years ago

^ Could you please elaborate a bit on what you mean by "outside of the program"? Some kind of extractor from another Python project? Some other 3rd-party program?

rachmadaniHaryono commented 7 years ago

Some kind of extractor from another Python project? Some other 3rd-party program?

yes. afaik to add a new extractor, someone has to create a PR against the program.

is it possible to programmatically add an extractor and use gallery-dl's downloader methods?

mikf commented 7 years ago

It is possible to add "outside"-extractors to gallery-dl, although this method is not particularly clean. Take a look at this: https://gist.github.com/mikf/94199249d1eb0b9d82726f178661d831

rachmadaniHaryono commented 7 years ago

Looking at the code, does the priority depend on the index within 'cache'?