Hrxn closed this issue 6 years ago
This is actually a really good idea, especially since I'm very hesitant/lazy about documenting things or writing text in general.
edit: The more I think about it, the less satisfied I am with the previous explanation, so here is version 2.
- The keywords available for a given URL can be shown with the `--list-keywords` option:
$ gallery-dl --list-keywords http://www.pixiv.net/member_illust.php?id=11
Keywords for directory names:
artist-id: 11
artist-name: pixiv事務局
artist-nick: pixiv
category: pixiv
subcategory: user
Keywords for filenames:
age_limit: all-age
artist-id: 11
artist-name: pixiv事務局
artist-nick: pixiv
book_style: right_to_left
...
category: pixiv
content_type: None
created_time: 2017-03-31 13:50:53
extension: jpg
favorite_id: 0
height: 865
id: 62178245
...
- These key-value pairs are used to generate directory- and filenames by plugging them into [format strings](https://docs.python.org/3/library/string.html#formatstrings). For directories this is a list of format strings to work around the different path segment separators in Windows and UNIX systems (backslash `\` or slash `/`).
- Each extractor has a default format string for directory- and filenames. For `pixiv` this is
directory_fmt = ["{category}", "{artist-id}-{artist-nick}"]
filename_fmt = "{category}_{artist-id}_{id}{num}.{extension}"
- The default values can be overwritten in your configuration file by setting the appropriate `directory` and `filename` values.
{
    "extractor": {
        "pixiv": {
            "directory": ["my pixiv images", "{artist-id}"],
            "filename": "{id}.{extension}"
        }
    }
}
- The `category` of each extractor is a keyword supplied in every key-value pair collection. It can therefore be used in every format string, and it has been chosen as the first segment of every default format string for directory names, but that can, as explained above, be changed.
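To make the mechanism concrete, here is a small sketch of how such a keyword dictionary is plugged into the format strings. The values are a subset of the `--list-keywords` output above; `str.format_map()` is used in the sketch because keys like `artist-id` are not valid Python identifiers, so this illustrates the idea rather than the exact internal code:

```python
import os

# keyword dict as produced by the extractor (subset of the output above)
kwdict = {"category": "pixiv", "artist-id": 11, "artist-nick": "pixiv",
          "id": 62178245, "num": "", "extension": "jpg"}

filename_fmt = "{category}_{artist-id}_{id}{num}.{extension}"
directory_fmt = ["{category}", "{artist-id}-{artist-nick}"]

# format_map() accepts mapping keys that are not valid identifiers,
# which plain "...".format(**kwdict) would reject
filename = filename_fmt.format_map(kwdict)

# each directory segment is formatted separately and then joined with the
# platform's path separator -- this is why directory_fmt is a list
directory = os.path.join(*(fmt.format_map(kwdict) for fmt in directory_fmt))
```

With the values above, `filename` becomes `pixiv_11_62178245.jpg` and `directory` becomes `pixiv/11-pixiv` (with a backslash on Windows).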
(edit end)
If something still doesn't make sense, just tell me and I will try to explain this a bit better.
Very good to know, thank you.
Checked some profiles with `--list-keywords`, very useful, and it returns exactly what I expected. Everything according to plan, at least on the extraction side :)
I realized what caused the slight confusion (for me): the default format string set by the extractor gets overwritten by the output format defined in `gallery-dl.conf`. Got that, all working as expected so far.
What put me a bit off was this: https://github.com/mikf/gallery-dl/blob/master/gallery-dl.conf#L17-L30
Because pixiv seems to be a bit of a special case here.
Two different formats are defined for `directory`, because pixiv makes use of two different "sub-extractors" (for lack of a better word): "user": {..} and "bookmark": {..}. I think these are called objects in JSON parlance.
Now, if I want to use my own `directory` and `filename` values in `gallery-dl.conf`, along the lines of your given example:
{
"extractor":
{
"pixiv":
{
"directory": ["my pixiv images", "{artist-id}"],
"filename": "{id}.{extension}"
}
}
}
I put these two definitions into the "pixiv" object, that is, one level above the "user" and "bookmark" objects, right? This way, the definitions from both objects get overwritten with the customized output format. A bit non-obvious, but this might just be me. And as long as it's working, nothing to complain about here ;-)
Because pixiv seems to be a bit of a special case here.
What you have discovered here is true for all extractors, not just pixiv, especially those with more than one extractor per module. In general, the configuration value located "deepest" inside the dictionary- or object-tree is used. If none is found, the config system falls back to the default value.
An example:
{
"extractor":
{
"pixiv":
{
"user": { "filename": "A" },
"filename": "B"
},
"deviantart":
{
"image": { "filename": "C"}
},
"filename": "D"
}
}
With a configuration file like the one above, the following is going to happen:
- the pixiv.user extractor will use "A"
- all other pixiv extractors will use "B"
- the deviantart.image extractor will use "C"
- everything else will use "D"
I put these two definitions into the "pixiv" object, that is, one level above the "user" and "bookmark' objects, right? This way, both definitions from each object get overwritten with the customized output format.
Yes, if you have those two definitions at this place, then all pixiv extractors (there are 4 in total) will use these instead of their default format strings.
If you want to dig even deeper, take a look at the inner loop of the config.interpolate function. For the pixiv.user extractor, for example, this function gets called like so:
directory = config.interpolate(["extractor", "pixiv", "user", "directory"], default)
This function first searches the top-most level for a value with key "directory" and stores this value if it finds it. It then descends into the "extractor" object and, again, searches this level for a value with key "directory". The same goes on with "pixiv" and "user" until it finally reaches the end.
If at any point something goes wrong and an exception gets thrown, which happens if, for example, the "pixiv" object doesn't exist, then the value stored up to this point gets returned.
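As a simplified sketch (this is not the actual implementation, and the signature is condensed), the descent described above could look like this:

```python
def interpolate(config, path, default=None):
    """Walk down the object tree along `path`, remembering the value for
    the final key at every level -- the deepest one found wins."""
    key = path[-1]
    value = config.get(key, default)        # top-most level first
    try:
        for name in path[:-1]:
            config = config[name]           # descend one level
            if key in config:
                value = config[key]         # a deeper value overrides
    except (KeyError, TypeError):
        pass  # e.g. the "pixiv" object doesn't exist: keep what we have
    return value
```

With the example configuration from above, `interpolate(cfg, ["extractor", "pixiv", "user", "filename"])` returns "A", while a lookup for an extractor without any specific setting falls through to "D".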
Okay, got it. Also, found all 4 pixiv extractors ;-)
Very nice, and very flexible. Ultimately, every possible variant can be customized. Excellent.
Just threw some pixiv URLs at the program, can confirm everything works indeed as described! (Including these multiple images per entry/"work", I did a manual recount ;-)
On to the next one..
Okay, this probably is a newbie question, but it looks like exhentai isn't a real site? There is e-hentai; it seems they are related (sister sites?). And you apparently need an e-hentai account first (and some dark magic, probably) before you can use exhentai. I will read a bit into this first.
Pretty sure that is the first time I've ever encountered something like this.
But this theory has a little flaw: if these two sites are indeed related, I'd assume they don't differ much on the technical side, if at all. But trying some gallery links from e-hentai got me this:
C:\Users\Hrxn>gallery-dl "https://e-hentai.org/g/1047429/525823ef87/"
[gallery-dl][error] No suitable extractor found for 'https://e-hentai.org/g/1047429/525823ef87/'
C:\Users\Hrxn>gallery-dl "https://e-hentai.org/g/1047407/f00ba6d6cf/"
[gallery-dl][error] No suitable extractor found for 'https://e-hentai.org/g/1047407/f00ba6d6cf/'
C:\Users\Hrxn>gallery-dl "https://e-hentai.org/g/1047272/a003dfb22b/"
[gallery-dl][error] No suitable extractor found for 'https://e-hentai.org/g/1047272/a003dfb22b/'
C:\Users\Hrxn>gallery-dl "https://e-hentai.org/g/1047010/d8b62a3c87/"
[gallery-dl][error] No suitable extractor found for 'https://e-hentai.org/g/1047010/d8b62a3c87/'
C:\Users\Hrxn>gallery-dl "https://e-hentai.org/g/1047424/0218b04f9c/"
[gallery-dl][error] No suitable extractor found for 'https://e-hentai.org/g/1047424/0218b04f9c/'
C:\Users\Hrxn>gallery-dl --version
0.8.1-dev
C:\Users\Hrxn>
Or is there another specific reason for this?
exhentai is basically the "dark" version of e-hentai, with all the non-advertiser-friendly stuff enabled.
You should be able to access this site by doing this:
In the past the domain of the regular site was g.e-hentai.org, and I haven't updated the extractor to also accept e-hentai.org.
You can just change the URLs a bit and replace the `-` with an `x`, or put a `g.` in front. It all falls back to the same code, which relies on having access to exhentai.
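To illustrate, a single pattern can accept all three URL variants. The regexp below is only illustrative, not the extractor's actual expression:

```python
import re

# illustrative pattern accepting exhentai.org, g.e-hentai.org and
# e-hentai.org gallery URLs in one expression
GALLERY = re.compile(
    r"(?:https?://)?(?:g\.)?(?:ex|e-)hentai\.org"
    r"/g/(\d+)/([\da-f]{10})")

def gallery_ids(url):
    """Return (gallery-id, token) if the URL matches, else None."""
    match = GALLERY.match(url)
    return match.groups() if match else None
```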
Okay, made an account, will frequent the site a bit and see how it works out then..
Can't test it before that, because https://github.com/mikf/gallery-dl/commit/b603b592cfc17c036f2e7fbbee8f7c7ed4be98ec changed the expression pattern, and that part works, but it's still the exhentai extractor and therefore requires credentials for authentication. Which is not really an issue, don't get me wrong.
I will test some other sites in the meantime, and will update my initial post accordingly.
I don't know if visiting the regular site and so on is even necessary, that is just what I did when I created an account for unit testing and couldn't access exhentai immediately.
Speaking of which: I didn't want to make my unit-testing accounts any more public than necessary (for, I hope, obvious reasons), but I should probably just share them with you. Take a look at this.
I don't know if visiting the regular site and so on is even necessary, that is just what I did when I created an account for unit testing and couldn't access exhentai immediately.
I'm not sure, but other random sources on the Internet indicate that this is actually the case.
Speaking of which: I didn't want to make my unit testing accounts any more public than necessary (for, I hope, obvious reasons), but I should probably just share them with you. [...]
Yes, obviously. That is nice, but it won't be necessary, I've already made an account and started using it a bit. Besides, creating and using different accounts for different sites and services doesn't really bother me at all. If there is a longer gap between my responses, it's only because I'm busy with something else ;-) I use KeePass for handling this stuff, which is a really great program, as you probably know. It's so good they should invent a new word for it (great cross-platform alternative: KeePassXC).
Another thing, which I think belongs here, because it's not an issue or bug, but maybe a possible suggestion:
There is another feature on DeviantArt I wasn't aware of before: The Journal.
I noticed it while using gallery-dl with this profile: http://inkveil-matter.deviantart.com/
The site states: 190 deviations. gallery-dl download: 155 files.
Luckily, there is a statistics page which explains this: http://inkveil-matter.deviantart.com/stats/gallery/
InkVeil-Matter has 93,840 pageviews in total; their 35 journals and the 155 deviations in their gallery were viewed 733,738 times.
35 Journal entries, so 190 in total.
Shamelessly copied from the DeviantArt Wikipedia page:
Journals are like personal blogs for the member pages, and the choice of topic is up to each member; some use it to talk about their personal or art-related lives, others use it to spread awareness or marshal support for a cause.
Not sure if that is useful at all. I clicked around a bit and saw nothing I would consider missing. Embeds from their own gallery, or from any other, and some links to a drawing feature of DeviantArt I also didn't know of before: Muro. It can be seen when visiting sta.sh, for example, which also belongs to them, as it seems.
I don't know, not sure if I even really understand this feature yet.
Anyway, forgive me my wall of text here, I just wanted to let you know, just in case this is news to you as well ;-)
I use Keepass for handling this stuff, which is a really great program, as you probably know. It's so good, they should invent a new word for it (great cross-platform alternative: KeepassXC)
Thank you for the suggestion but I'm going to stay with my trusty GPG-encrypted plain text file :)
Another thing, which I think belongs here, because it's not an issue or bug, but maybe a possible suggestion
Even if this platform here is called an issue tracker, feel free (and even encouraged) to create new "issues" if you want to suggest or request a feature or support for a new site.
There is another feature on DeviantArt I wasn't aware of before: The Journal.
This seems to be just a collection of blog posts, which might contain references to other deviantart or sta.sh images. There shouldn't be any images missing: 190 deviations consisting of 155 real deviations and 35 journal entries seems about right to me. I could add an extractor to fetch those references and download all the images of a journal entry if you want me to.
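For illustration, the core of such an extractor could be little more than a reference scan over the journal's HTML. The pattern below is made up and would need to be matched against the real markup:

```python
import re

# rough sketch: collect deviantart/sta.sh references from a journal's
# HTML, which could then be handed off to the matching extractors
REFERENCE = re.compile(
    r"https?://(?:[\w-]+\.deviantart\.com/art/[\w-]+|sta\.sh/\w+)")

def journal_references(html):
    # deduplicate while keeping first-seen order
    return list(dict.fromkeys(REFERENCE.findall(html)))
```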
Anyway, forgive me my wall of text here, I just wanted to let you know, just in case this is news to you as well ;-)
No worries, I don't mind walls of text and actually wasn't aware of the journal or muro, so thanks for telling me.
Sorry for asking about this, but does this tool have a feature to remember which images have already been downloaded, without checking the local directory? IIRC the package already includes a SQLite DLL, right?
Thank you.
No, I am sorry, but such a feature does not currently exist. gallery-dl only skips downloads if a file with the same name already exists; there is at this time no other way of "remembering" if an image has been downloaded before. SQLite, as you have noted, is already being used, but only to cache login sessions and the like across separate gallery-dl invocations.
Feel free to open a separate issue if you want a feature like this to be implemented, but please explain in greater detail what you actually want to do and/or need this feature for.
Just saw the new commit adding options for skipping files.
A change from https://github.com/mikf/gallery-dl/commit/fc9223c072ae7bf6d3809704710c7dd8f6a9984b#diff-283aceda91c5f7f10981253611f9f950
def _exists_abort(self):
    # stop the whole extractor run as soon as an already-existing file is found
    if self.has_extension and os.path.exists(self.realpath):
        raise exception.StopExtraction()
    return False
Current extractor run, in this context, means just the 'active' URL, right?
Because I'm not sure yet what the expected behaviour would be if gallery-dl is used like this:
gallery-dl --input-file FILE
Maybe a case for an additional option. Or rather not, I'm still not sure about it, need to make up my mind first probably.
Current extractor run, in this context, means just the 'active' URL, right?
Yes.
Each URL gets its own extractor, so the --abort-on-skip option works for each URL independently. Aborting the run of one URL has no effect on any other URLs.
Because I'm not sure yet what the expected behaviour would be if gallery-dl is used like this
The -i/--input-file FILE option just appends the URLs inside FILE to the end of the list of all URLs. gallery-dl -i FILE URL1 is equivalent to gallery-dl URL1 URL2 URL3 if FILE contains URL2 and URL3.
Even if, for example, the download for URL1 gets canceled, URL2 and URL3 will still be processed normally.
Maybe a case for an additional option
An --exit-on-skip option that just exits the program on any download-skip would certainly be possible.
An --exit-on-skip option that just exits the program on any download-skip would certainly be possible.
Yes, for example. I think the current behaviour is just right as the default, we'll see when someone asks for other variants.
Do you plan to add a graphical interface to the program? At least input fields and pause/continue buttons. I'm also interested in the possibility of multi-threading and the ability to queue downloads one by one via a GUI. Yes, I know it can be done through the console, but still...
Well, I don't know, but if I may, let me add just this: I wish people would realize how much programming work implementing a GUI actually is. And the thing is, that means actual code, lots of lines of code, only for the GUI, and none of it gets used outside of the GUI again. So it is just additional work on top, without any benefit for the actual underlying code.
No, there are no plans for a graphical user interface, mainly because of the reasons @Hrxn listed. A lot of the features you mentioned can already be done via a (reasonable) terminal and shell plus the (GNU) coreutils that usually come with them and I don't really want to re-implement this. I do realize that the CLI "experience" on Windows is terrible, so maybe, if there is a big enough demand, I might add some sort of GUI in the future, but that will always be low priority.
No, there are no plans for a graphical user interface, mainly because of the reasons @Hrxn listed.
Hmmm... I can try to build a graphical shell in C# for the Windows version that will pass commands to and from gallery-dl, but I'm not sure it will take only a little time. But I will try my best.
Yes, I think that's a good idea.
Also, in my opinion, using the CLI on Windows isn't too bad. For many use cases, standard batch scripts (*.bat/*.cmd) should be enough, for example starting gallery-dl with multiple/dozens/hundreds of URLs, and if you need more scripting capabilities, you can use gallery-dl within PowerShell.
I wouldn't even know what to use a GUI program for, to be honest. If the program is running, there isn't much to see, because what actually takes the most time is just transferring data across the net, aka downloading. You could add some fancy progress bars, but this doesn't really change anything, in my opinion. Besides, progress bar support can also be done in CLI, via simple text output written to the terminal, like wget and curl for example.
The only thing I can really think of right now is managing your personal usage history of the program, so to speak. That means having all in one central place, a queue for all URLs that are yet to be processed, and an archive of all URLs that already have been done. This would be more of a meta-program, if you think about it, because all this can be done completely independent of gallery-dl. You could also use this program to write the script files for the CLI then 😄
As a starting point, writing processed URLs to archive file(s) would be a good idea, I think. Something along the lines of the --download-archive option of youtube-dl, for example.
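As a sketch of how such an archive could work (the file format and key scheme here are assumptions, mirroring youtube-dl's one-line-per-entry archive file):

```python
# hypothetical download archive: one line per finished file, keyed by
# something unique such as "{category}{id}"
def load_archive(path):
    """Read previously recorded keys; an absent file means an empty archive."""
    try:
        with open(path, encoding="utf-8") as file:
            return {line.strip() for line in file}
    except FileNotFoundError:
        return set()

def mark_downloaded(path, key, archive):
    """Skip-check and record in one step; returns True if already known."""
    if key in archive:
        return True
    archive.add(key)
    with open(path, "a", encoding="utf-8") as file:
        file.write(key + "\n")
    return False
```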
Interesting, although these sites seem so similar (and the Gelbooru site even states "concept by Danbooru"), they are yet so different in terms of implementation and functionality.
I just checked again; Gelbooru support for pools may be pretty much irrelevant, at least for now. Because unlike on Danbooru, where pools are used quite extensively, Gelbooru only seems to have 25 pools in total right now, and there is not much activity. At least that is what I see here, even with an account on Gelbooru. Although an account enables you to create your own pools (public and private), allowing you to collect different posts there, which could then be downloaded. So this might be relevant to potential gallery-dl users, maybe..
Gelbooru only seems to have 25 pools in total right now
There seem to be up to 44500 pools if you take a look at the id parameter of one of the pool URLs, but the pagination controls for gelbooru's pool and tag pages seem to be missing. You can get to the next pool page by setting the pid parameter in the URL (the same mechanism is used on their posts page): page 2, page 3, and so on.
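A small sketch of that pid arithmetic; the items-per-page value and the URL form are assumptions here:

```python
# gelbooru paginates with the `pid` offset parameter: assuming PER_PAGE
# items per listing page, page N starts at offset N * PER_PAGE
PER_PAGE = 25
BASE = "https://gelbooru.com/index.php?page=pool&s=list"

def pool_page_url(page):
    """URL of the n-th pool listing page (0-based)."""
    return "{}&pid={}".format(BASE, page * PER_PAGE)
```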
What made you check that? Was 25 inconceivably low for a big site? 😉 But you're right, of course. Incidentally, I found the pagination on Gelbooru! It was blocked by uBlock Origin, which I use on Chrome. Well, not just on Chrome, I use it wherever I can, actually. That means some entry in one of the filter lists breaks the site... Edit: Not sure which tag page exactly, but apart from pools, pagination seems to work for me.
This one: https://gelbooru.com/index.php?page=tags&s=list
AdblockPlus + filter list seems to be causing the same issue.
Ah, okay. Yes, pagination also broken for me on that listing.
Small suggestion:
Add a column to Supported Sites to indicate the status of user authentication. Not sure how exactly: just a simple "Supported", Yes or No. Or "Required", Yes or No. Or "Required"/"Optional"...
By the way, does the optional case even exist currently? We have extractors that require authentication (Pixiv, Exhentai, and some others) and other extractors that don't authenticate at all, right?
Thanks for the suggestion -> done fb1904dd59a95fe7728158666648de5d1dafc52d
And yes, there are two modules with optional authentication: bato.to and exhentai.
bato.to only offers a very limited selection of manga chapters if you are not logged in, but it is still usable.
Exhentai tries to fall back to the e-hentai version of the site, which only works for some galleries, and original image downloads aren't available either.
(I added the fallback mechanism for exhentai only after our discussion, btw. af56887a47c44a6042b0787ba6ced9341ab169a5)
Any chance we can get a manga downloader that groups by volume (and chapter)?
What's the best site for downloading manga? MangaFox and MangaReader leave watermarks. MangaHere leaves an ugly watermark on each chapter (MangaScreener). MangaPanda leaves a MangaReader watermark.
I've been thinking about implementing a way to filter by metadata (something like youtube-dl's --match-filter option), so maybe, at some point in the future, there might be something like that.
Right now there is only the --chapters option, which lets you select chapters by index, which is not necessarily the chapter number.
I don't really know which manga site is actually good, but I'd probably suggest kissmanga. Pick your poison, I guess.
Small suggestion:
Clarify usage of extractors, sub-extractors and their options in the Readme or the configuration documentation.
I think the general usage of gallery-dl is pretty straightforward, but a novice user might think that it's not immediately obvious.
Here's what is happening, to my understanding: The user submits one (or more) URL(s) to gallery-dl.
Examples can be seen by running gallery-dl --list-extractors, but this should maybe be mentioned more explicitly, perhaps by including it in the information message printed when running gallery-dl without any arguments at all.
Furthermore, all extractors and variants should be properly documented somewhere, I think by either having a complete list or by explaining the proper syntax of the extractor options. Because right now there are only the 2-3 examples mentioned in the gallery-dl.conf for demonstration. Everything else can be figured out from there on, but again, this is maybe not really obvious (enough).
The underlying principle is already in configuration.rst, so far, so good, but
`extractor.*.filename` and `extractor.*.directory` etc.
is only the basic part, and it doesn't reflect that you can do more with gallery-dl, by setting the configuration like this:
`extractor.*.<sub-extractor>.filename` and `extractor.*.<sub-extractor>.directory` etc.
Again, this is probably not immediately obvious, so the usage of the two variables (extractor, sub-extractor) should be more clear.
So far this can only be figured out by looking at the code; at least that is what I did. All the extractors are listed here: https://github.com/mikf/gallery-dl/tree/master/gallery_dl/extractor
The actual name of each extractor can be inferred from the filenames, but if I'm not mistaken, this is more of a coincidence, and the actual name of each extractor is defined by the variable `category` inside the Python source files.
Along those lines, for every extractor that has sub-extractor variants, these names are defined by the variable `subcategory` inside the source.
Thankfully, we're not stuck here, but can use the search on GitHub: searching for category, searching for subcategory.
I've finally gotten around to actually writing some documentation that should make the use of categories and subcategories more obvious and hopefully addresses some of the issues you mentioned. Now to answer some of your points:
Here's what is happening, to my understanding
- ...
- ...
It's actually simpler than that: Each extractor class has a list of regular expressions (usually containing only 1) which match the whole URL. A match is found by going through all regexps from all extractor classes and applying the regular expressions until one matches the URL. The upper level function for that is only 5 lines of actual code. There is no difference in how the choice between two entirely different extractors or between two related (sub-)extractors is made.
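A sketch of that lookup loop, with made-up class names (the real classes keep their regular expressions in a `pattern` list, as described above):

```python
import re

# try every pattern of every extractor class until one matches the URL;
# the first match decides which extractor handles it
def find(url, extractor_classes):
    for cls in extractor_classes:
        for pattern in cls.pattern:
            match = re.match(pattern, url)
            if match:
                return cls(match)
    return None
```

There is no special casing for related sub-extractors: two pixiv extractors compete for a URL exactly the same way a pixiv and a deviantart extractor would.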
Furthermore, all extractors and variants should be properly documented somewhere ... So far this can only be figured out by looking at the code
There is the list of supported sites and the capabilities listed therein, or the output of gallery-dl --list-extractors, which provides a list of all extractor names and, as a result, all category-subcategory pairs. I would think that this is pretty much enough. Where do you see a problem with that?
Thanks for clearing that up and extending the documentation, really appreciated! 👍
You're right, gallery-dl --list-keywords URL provides the correct names of category and subcategory for that URL/extractor.
And you added this to the documentation, so no problem here:
Each extractor name is structured as CategorySubcategoryExtractor. An extractor called PixivUserExtractor has therefore the category pixiv and the subcategory user.
What I initially had in mind was expanding the output of --list-extractors a bit.
Right now, snipped:
PinterestPinitExtractor
Extractor for images from a pin.it URL
Example: https://pin.it/Hvt8hgT
PixhostImageExtractor
Extractor for single images from pixhost.org
PixivBookmarkExtractor
Extractor for all favorites/bookmarks of your own account
PixivFavoriteExtractor
Extractor for all favorites/bookmarks of a pixiv-user
Example: http://www.pixiv.net/bookmark.php?id=173530
PixivMeExtractor
Extractor for pixiv.me URLs
Example: https://pixiv.me/del_shannon
PixivUserExtractor
Extractor for works of a pixiv-user
Example: http://www.pixiv.net/member_illust.php?id=173530
PixivWorkExtractor
Extractor for a single pixiv work/illustration
Example: http://www.pixiv.net/member_illust.php?mode=medium&illust_id=966412
Which could then not only list an example URL, but also something along the lines of:
Example: ...
Category: ...
Subcategory: ..
This ends up as the default directory structure, so it might make sense to mention this more explicitly. But you're right, this is also somewhat redundant now, with the good documentation for the options, and I feel we're already in bike-shedding territory 😄
It's just that if you keep directory and filename settings intact, you can easily do incremental updates with gallery-dl, which is nice, obviously. I just wanted to make sure that other users also realize this and maybe make use of it 😉
But there's also something which I stumbled upon which led me to all this:
1. Excerpt from `--list-extractors`:
ThreedeebooruPoolExtractor
Extractor for image-pools from behoimi.org
Example: http://behoimi.org/pool/show/27
ThreedeebooruPostExtractor
Extractor for single images from behoimi.org
Example: http://behoimi.org/post/show/140852
ThreedeebooruTagExtractor
Extractor for images from behoimi.org based on search-tags
Example: http://behoimi.org/post?tags=himekawa_azuru dress
2. Excerpt from `--list-keywords http://behoimi.org/pool/show/27`:
Keywords for directory names:
category: 3dbooru
pool: 27
subcategory: pool
Keywords for filenames:
author: darkgray
category: 3dbooru
change: 597709
created_at[json_class]: Time
created_at[n]: 183101000
[...]
Yeah, there are a few exceptions to that extractor-name rule for 3dbooru, 4chan and 8chan, as you can't use a digit as the first character of an identifier/class name.
I've updated the output of the --list-extractors option to include (sub)categories, like you suggested, to have a consistent way of getting these values (06c4cae05b60a24b2f3f89c1aeda41a24c4be9ff).
Class names and --list-keywords do not work all the time, so I hope this is a better solution, even if that made some parts of the documentation useless or redundant.
Are there any guidelines for contributing if I wanted to add support for another site? Should I just follow similarly to what's already done?
There are currently no actual contributing guidelines, but following what is already there seems fine. You might as well just copy an existing module and modify its code. Ask if you have any questions about how things work or should work.
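If it helps as orientation, here is a self-contained sketch of the shape such a module takes. The names, URL, and pattern are all made up, and in a real module Extractor and Message come from the package's common/message modules rather than being defined inline:

```python
class Message:
    # stand-in for the real message module; identifiers are illustrative
    Version, Directory, Url = 1, 2, 3

class ExampleImageExtractor:
    # in a real module this would subclass the common Extractor base class
    category = "example"        # also the first default directory segment
    subcategory = "image"
    pattern = [r"(?:https?://)?example\.org/image/(\d+)"]

    def __init__(self, match):
        self.image_id = match.group(1)

    def items(self):
        # the generator emits the message tuples the Job classes consume
        yield Message.Version, 1
        yield Message.Directory, {"category": self.category,
                                  "subcategory": self.subcategory}
        url = "https://example.org/images/{}.jpg".format(self.image_id)
        yield Message.Url, url, {"id": self.image_id, "extension": "jpg"}
```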
Some general rules would be:
- use the extractor's `session` object instead of actually importing requests

Small suggestion:
Scrolled a bit through the Python documentation, and it seems that pip always explicitly distinguishes between 'installing' and 'upgrading'. The example here mentions modules specifically, but it's the same mechanism, so I don't think this is any different.
Normally, if a suitable module is already installed, attempting to install it again will have no effect. Upgrading existing modules must be requested explicitly: [..]
The pip-specific documentation doesn't mention it explicitly, but doesn't say otherwise either.
Could be OS-dependent; I tested this on Windows and you need to specify the upgrade option here. I probably didn't stumble over this before because I always used the --upgrade flag intuitively.
So, in conclusion, it would probably make sense to mention this in README.rst. Either by extending point 1.1 with pip install --upgrade gallery-dl and pip install --upgrade https://github.com/mikf/gallery-dl/archive/master.zip, or by inserting a new point 1.2, "Updating via pip" or something..
Now don't ask me about manual installation via python setup.py install etc. If I remember correctly, I've done this in the past (with another package) and updating should work this way, but only if you keep using this method. I wouldn't be surprised if it no longer works as soon as someone uses pip install, because a manual installation probably doesn't get overwritten then..
Hi, I just made a server feature here: https://github.com/rachmadaniHaryono/gallery-dl/tree/feature/server, because I want to use gallery-dl together with hydrus.
I'm not quite sure if this feature is within gallery-dl's scope, so I made as few changes as possible to the gallery-dl package.
I don't quite understand what a DataJob's data should contain. An example of it would help me.
I would like gallery-dl to just stay a command-line program (like youtube-dl), but having a separate `gallery-dl-server` project/package and adding some features to gallery-dl itself that are needed by it would be fine.
The data member of the DataJob class just holds a list of all "messages" that an extractor emits. These usually cause the handle_url(), etc. methods to be called, but in this case they just get stored in the data list and later written to a file or stdout.
You should probably just create your own Job subclass and override the handle_... methods.
In your case it is probably enough to focus on the "Url" messages. ("Directory" is there to create the target directory for the following images; "Queue" is supposed to offload its URL to another extractor; in the past there were also "Headers" and "Cookies" messages, but these have become obsolete)
class CustomJob(Job):
    def handle_url(self, url, metadata):
        # url is the download-URL of the image as a string
        # metadata is a dictionary
        print(url)

job = CustomJob(input_url)
job.run()  # prints all image URLs
I would like gallery-dl to just stay a command-line program (like youtube-dl), but having a separate `gallery-dl-server` project/package and adding some features to gallery-dl itself that are needed by it would be fine.
I think I will keep it on the fork for now, because no one is demanding it yet. Having to create another repo asks for fragmentation problems.
Is getting only the URL faster than also getting its metadata?
Is there a structure to how this list is made? From what I know, the 1st element is (1, 1), the 2nd is (2, gallery_data_dict), and the 3rd and onward are (index, url, url_data_dict). If there is any error, there is only (error_name, error_str).
You should probably just create your own Job subclass and overwrite the handle_... methods. In your case it is probably enough to focus on the "Url" messages. ("Directory" is there to create the target directory for the following images; "Queue" is supposed to offload its URL to another extractor;
I still don't understand this paragraph.
Right now I have no idea how to present the gallery list and the gallery itself as HTML. You said to focus on URLs, but with metadata it is more helpful.
The current server still presents the data in the URL, because the models are still not good enough (see below):
/gallery?data=%5B3%2C%20"https%3A//cdnio.luscious.net/AwronZizao/289557/tumblr_n8yyhlc7jf1tfyunpo1_1280_01BPMTYQ38Q78CHCVTC1YTE3EH.jpg"%2C%20%7B"artist"%3A%20null%2C%20"count"%3A%20"78"%2C%20"extension"%3A%20"jpg"%2C%20"image-id"%3A%20null%2C%20"lang"%3A%20null%2C%20"language"%3A%20null%2C%20"name"%3A%20"N8Yyhlc7Jf1Tfyunpo1%201280"%2C%20"num"%3A%201%2C%20"section"%3A%20"Hentai"%2C%20"tags"%3A%20null%2C%20"title"%3A%20"Socks/Stockings"%7D%5D#
Edit: I'm also not quite sure about subclassing the bare Job class. Right now I am only using DataJob because it is the only one that gives just the metadata and image URLs.
It would be very helpful if the data-extraction part and the output part of the DataJob class' run method were separated.
I think I will keep it in the fork for now, since nobody has asked for it yet. Creating another repo invites fragmentation problems.
Can't you just depend on the gallery-dl package and have the GUI or server as its own separate package like, for example, youtube-dl-gui does with youtube-dl?
Is getting only the URL faster than getting its metadata?
No, it isn't. The extractors always provide both, but getting the metadata usually takes no extra time for them (i.e. no extra HTTP requests).
Is there a defined structure for how this list is made?
The list just contains the message-tuples that the extractor emits. The first item in each tuple is always the message-identifier (one of the constants in message.py), and the additional items are the arguments for this particular message type.
The `(1, 1)` originated from a `yield Message.Version, 1` here, the `(2, gallery_data_dict)` came from a `yield Message.Directory, gallery_metadata`, and so on.
You shouldn't rely on these tuples being in any particular order as this can vary per extractor-class.
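To make the tuple layout concrete, here is a small sketch of how such a message list could be consumed. The integer identifiers (1 for Version, 2 for Directory, 3 for Url) match the `(1, 1)` and `(2, gallery_data_dict)` examples above, but the sample data itself is made up:

```python
# Message identifiers as discussed above: Version is 1, Directory is 2,
# Url is 3. The sample messages below are invented for illustration.
VERSION, DIRECTORY, URL = 1, 2, 3

messages = [
    (VERSION, 1),
    (DIRECTORY, {"title": "Some Gallery", "count": 2}),
    (URL, "https://example.com/1.jpg", {"num": 1, "extension": "jpg"}),
    (URL, "https://example.com/2.jpg", {"num": 2, "extension": "jpg"}),
]

# Collect only the URL messages, which carry the URL plus its metadata.
urls = []
for msg in messages:
    if msg[0] == URL:
        url, metadata = msg[1], msg[2]
        urls.append(url)

print(urls)
```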
you said that focus on url but with metadata it is more helpful.
I said to focus on URL-messages, which contain the actual URL and its metadata (see the code example above).
edit: i've written some simple example code that might be helpful: https://gist.github.com/mikf/0e591c7ef290097f29adb662ae730424
Bear in mind that calling `job.run()` might take a really long time.
Thanks for the snippet. That looks easier than using the current subclass of DataJob.
Can't you just depend on the gallery-dl package and have the GUI or server as its own separate package like, for example, youtube-dl-gui does with youtube-dl?
if there is any interest on this i will make a repo out of it.
About gallery_data_dict and url_data_dict: I suppose they depend on each parser and there are no definite key and value types?
I said to focus on URL-messages, which contain the actual URL and its metadata (see the code example above).
You are right. I think I confused the URL with the URL-message.
About gallery_data_dict and url_data_dict: I suppose they depend on each parser and there are no definite key and value types?
Yes, as there doesn't seem to be a good way of fitting all the different metadata variants of each parser/extractor into a single schema. Similar extractors usually have the same metadata keys, but you can always check these with `gallery-dl --list-keywords <URL>`, or you just print the key-value pairs of the metadata dict directly.
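Printing the key-value pairs directly could look like this; the metadata dict below is invented for illustration, since the real keys depend on the extractor:

```python
# Invented metadata dict; actual keys and values vary per extractor.
metadata = {
    "category": "pixiv",
    "artist-id": 11,
    "extension": "jpg",
}

# Print each key-value pair, sorted for stable output.
for key, value in sorted(metadata.items()):
    print(f"{key}: {value}")
```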
What about my last suggestion? Not a good idea? To all the Python experts here this is maybe painfully obvious, I know, but what about other (hypothetical) users? I just want to avoid possible pitfalls, for example issue reports made by someone unwittingly running an outdated version.
@mikf Then how does one define the keys? What if someone uses `tag` and someone else uses `tags`?
Also, is the dict always one-dimensional? I mean, on luscious the `tag` key's value could be made into a list instead of a CSV string.
@Hrxn Like this?
Edit: ignore that. You are right, it is already in the README and I skipped reading that section.
$ wget https://github.com/mikf/gallery-dl/archive/master.zip
$ unzip master.zip
# or
$ git clone https://github.com/mikf/gallery-dl.git
$ cd gallery-dl
$ python setup.py install
# or
$ pip install .
# or if you want to upgrade it
$ pip install --upgrade .
Also, pip can install directly from GitHub:
$ pip install git+git://github.com/mikf/gallery-dl.git
reference https://stackoverflow.com/questions/8247605/configuring-so-that-pip-install-can-work-from-github
Yea, not only directly from GitHub, pip can also install from any source archive, i.e.
pip install https://github.com/mikf/gallery-dl/archive/master.zip
That is already mentioned in README.rst, but updating doesn't work this way, at least on Windows. To be fair, it's mentioned in the info message, but in my opinion it is easy to overlook, especially for users not used to this stuff.
What about my last suggestion?
Oh. I apologize. I was a bit busy last week and then I kind of forgot about this.
After some tests it seems that `pip install --upgrade` uses a reasonable default behavior for installing and upgrading, so I've changed README.rst to that (0b576cc131f48a86062bb6349054344fd611d23d). It may also be necessary to mention the use of `sudo`, the admin console on Windows, the `--user` flag, or even how to install pip itself for older Python versions, but I think this may be a bit too much, and the pip documentation, which is linked in the README, mentions basically everything a user needs to know.
@rachmadaniHaryono
I'm not entirely sure what you mean by "define the key", but if this is about choosing an appropriate name, then just use something reasonable that describes what the value is about and try to be consistent with other similar extractors, so don't use `tag` if `tags` is used everywhere else.
The dictionary can also be multidimensional (for example `pixiv`, `deviantart`, `flickr`) and the `tags` value could as well be a list, but there hasn't been a need for that specific thing up till now.
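As an illustration of a multidimensional metadata dict, nested values and a list-valued tags key could be accessed like this; the dict below is invented, not actual gallery-dl output:

```python
# Invented nested metadata, loosely modeled on extractors that
# return multidimensional dicts.
metadata = {
    "user": {"id": 11, "name": "pixiv"},
    "tags": ["socks", "stockings"],  # a list instead of a CSV string
}

# Nested keys are reached by chained lookups ...
artist = metadata["user"]["name"]

# ... and a list-valued key can still be flattened when needed.
tags_csv = ",".join(metadata["tags"])

print(artist, tags_csv)
```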
@mikf is it possible to add extractor from outside of the program?
^ Could you please elaborate a bit on what you mean by "outside of the program"? Some kind of extractor from another Python project? Some other 3rd-party program?
Some kind of extractor from another Python project? Some other 3rd-party program?
Yes. AFAIK, to add a new extractor, someone has to create a PR to the program.
Is it possible to programmatically add an extractor and use gallery-dl's downloader methods?
It is possible to add "outside" extractors to gallery-dl, although this method is not particularly clean. Take a look at this: https://gist.github.com/mikf/94199249d1eb0b9d82726f178661d831
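The general idea is a URL-pattern registry: each extractor class carries a regex, and the frontend picks the first registered class whose pattern matches the input URL. Here is a minimal, stand-alone sketch of that mechanism; the class and function names are made up for illustration and are not gallery-dl's actual API:

```python
import re

# Made-up registry illustrating the pattern-matching idea only;
# gallery-dl's real mechanism lives in its extractor module.
_registry = []

def register(cls):
    """Add an extractor class to the registry (usable as a decorator)."""
    _registry.append(cls)
    return cls

def find_extractor(url):
    """Return an instance of the first extractor whose pattern matches."""
    for cls in _registry:
        if re.match(cls.pattern, url):
            return cls(url)
    return None

@register
class ExampleExtractor:
    pattern = r"https?://example\.com/gallery/\d+"

    def __init__(self, url):
        self.url = url

print(type(find_extractor("https://example.com/gallery/123")).__name__)
```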
Looking at the code, does an extractor's priority depend on its index in `cache`?
A central place for these things might be a good idea.
This thread could serve as a starting point, results will eventually be collected in the project wiki, if appropriate and useful.
Edited 2017-04-15 For conciseness
Edited 2017-05-04 Removed nonsensical checklist thing