Closed. mikf closed this 4 months ago.
Simple snippet to turn gallery-dl into an API:

```python
from types import SimpleNamespace
from unittest.mock import patch

import click
from flask.cli import FlaskGroup
from flask import Flask, jsonify, request

from gallery_dl import main, option
from gallery_dl.job import DataJob


def get_json():
    data = None
    parser = option.build_parser()
    args = parser.parse_args()
    args.urls = request.args.getlist('url')
    if not args.urls:
        return jsonify({'error': 'No url(s)'})
    args.list_data = True

    class CustomClass:
        data = []

        def run(self):
            # uses the real DataJob from the import above, not the patched one
            dj = DataJob(*self.data_job_args, **self.data_job_kwargs)
            dj.run()
            self.data.append({
                'args': self.data_job_args,
                'kwargs': self.data_job_kwargs,
                'data': dj.data,
            })

        def DataJob(self, *args, **kwargs):
            # record the arguments main() would pass to DataJob
            self.data_job_args = args
            self.data_job_kwargs = kwargs
            retval = SimpleNamespace()
            retval.run = self.run
            return retval

    c1 = CustomClass()
    with patch('gallery_dl.option.build_parser') as m_bp, \
            patch('gallery_dl.job.DataJob', side_effect=c1.DataJob) as m_jt:
        m_bp.return_value.parse_args.return_value = args
        m_jt.__name__ = 'DataJob'
        main()
        data = c1.data
    return jsonify({'data': data, 'urls': args.urls})


def create_app(script_info=None):
    """Create the Flask app."""
    app = Flask(__name__)
    app.add_url_rule('/api/json', 'gallery_dl_json', get_json)
    return app


@click.group(cls=FlaskGroup, create_app=create_app)
def cli():
    """Entry point for the application script."""


if __name__ == '__main__':
    cli()
```
Edit: this could be simpler when using DataJob directly to handle the URLs, but I haven't checked whether anything has to be done before initializing a DataJob instance.

> this could be simpler when using DataJob directly to handle the URLs, but I haven't checked whether anything has to be done before initializing a DataJob instance.

You don't need to do anything before initializing any of the Job classes:
```python
>>> from gallery_dl import job
>>> j = job.DataJob("https://imgur.com/0gybAXR")
>>> j.run()
[ ... ]
```
You can initialize anything logging-related if you want logging output, or call `config.load()` and `config.set(...)` if you want to load a config file and set some custom options, but none of that is necessary.
@rachmadaniHaryono what does that code do?
Simpler API (based on the suggestion above):
```python
#!/usr/bin/env python
import click
from flask.cli import FlaskGroup
from flask import Flask, jsonify, request

from gallery_dl import option
from gallery_dl.job import DataJob
from gallery_dl.exception import NoExtractorError


def get_json():
    data = []
    parser = option.build_parser()
    args = parser.parse_args()
    args.urls = request.args.getlist('url')
    if not args.urls:
        return jsonify({'error': 'No url(s)'})
    args.list_data = True
    for url in args.urls:
        url_res = None
        error = None
        try:
            job = DataJob(url)
            job.run()
            url_res = job.data
        except NoExtractorError as err:
            error = err
        data_item = [url, url_res, {'error': str(error) if error else None}]
        data.append(data_item)
    return jsonify({'data': data, 'urls': args.urls})


def create_app(script_info=None):
    """Create the Flask app."""
    app = Flask(__name__)
    app.add_url_rule('/api/json', 'gallery_dl_json', get_json)
    return app


@click.group(cls=FlaskGroup, create_app=create_app)
def cli():
    """Entry point for the application script."""


if __name__ == '__main__':
    cli()
```
GUG for Hydrus (port 5013)
@rachmadaniHaryono any instructions on using this GUG and combining it with Hydrus? Any pre-configuration besides `pip3 install gallery-dl`?
1. Save the code above as `script.py`.
2. `pip3 install flask gallery-dl` (add `--user` if needed).
3. `python3 script.py --port 5013`
@rachmadaniHaryono add that to the Wiki in https://github.com/CuddleBear92/Hydrus-Presets-and-Scripts if you can, it sounds like a really good solution. Also, why port 5013, is that port specifically used for something?

> Also, why port 5013, is that port specifically used for something?

No real technical reason. I just use it because the default port is used by another program of mine.
> add that to the Wiki in CuddleBear92/Hydrus-Presets-and-Scripts if you can

I will consider it, because I'm not sure where to put it. Another plan is to fork (or create a PR for) a server command, but I'm not sure if @mikf wants a PR for this.
@rachmadaniHaryono https://github.com/CuddleBear92/Hydrus-Presets-and-Scripts/wiki Also I would like @mikf to have a look at this, since it is pretty useful. BTW, what is the speed overhead of using this over having a separate txt file like the one in https://github.com/Bionus/imgbrd-grabber/issues/1492 ?

> BTW, what is the speed overhead of using this over having a separate txt file like the one in Bionus/imgbrd-grabber#1492 ?

That depends on Hydrus vs. imgbrd-grabber download speed. From my tests, gallery-dl gives direct links, so Hydrus doesn't have to process the links anymore.
> another plan is to fork (or create a PR for) a server command, but I'm not sure if @mikf wants a PR for this

I've already had something similar to this in mind (implementing a (local) server infrastructure to (remotely) send commands/queries: `gallery-dl --server`), so I would be quite in favor of adding functionality like this.

But I'm not so happy about adding flask as a dependency, even an optional one. I just generally dislike adding dependencies if they aren't absolutely necessary. I was thinking of using stuff from the `http.server` module in Python's standard library if possible.

Also: the script you posted here could be simplified quite a bit further. For example, there is no need to build a command-line option parser. I'll see if I can get something to work on my own.
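A minimal sketch of what a dependency-free endpoint could look like with only the standard library's `http.server`, mirroring the Flask script above. This is an illustration, not gallery-dl's actual server branch; the `/api/json` route and response shape are copied from the script in this thread, and the `gallery_dl` import is deferred into the handler.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs


def extract_urls(path):
    """Return the list of 'url' query parameters from a request path."""
    query = parse_qs(urlparse(path).query)
    return query.get('url', [])


class GalleryDLHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if urlparse(self.path).path != '/api/json':
            self.send_error(404)
            return
        urls = extract_urls(self.path)
        if not urls:
            payload = {'error': 'No url(s)'}
        else:
            # deferred import so the module can be loaded without gallery_dl
            from gallery_dl import job
            results = []
            for url in urls:
                j = job.DataJob(url)
                j.run()
                results.append([url, j.data])
            payload = {'data': results, 'urls': urls}
        body = json.dumps(payload).encode('utf-8')
        self.send_response(200)
        self.send_header('Content-Type', 'application/json')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=5013):
    """Run the server (blocking)."""
    HTTPServer(('127.0.0.1', port), GalleryDLHandler).serve_forever()
```

Calling `serve()` would then expose the same `host:port/api/json?url=...` interface without any third-party packages.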
A few questions from me concerning Hydrus.

> But I'm not so happy about adding flask as a dependency, even if optional. I just generally dislike adding dependencies if they aren't absolutely necessary. I was thinking of using stuff from the http.server module in Python's standard library if possible.

This still depends on how big this will be: will it just be an API, or will there be an HTML interface for it? An existing framework would make it easier, though, and the framework's plugins would let other developers create the features they want. Of course there are better frameworks than flask, e.g. sanic or django, but I actually doubt that using the standard library would be better than those.

> Also: the script you posted here should be simplified quite a bit further. For example there is no need to build a command line option parser.

That is a modified version of the flask CLI example. Flask can do it more simply, but that requires setting an environment variable, which adds another command.
The whole thing is written in Python, even version 3 since the last update. Isn't there a better way of coupling it with another Python module than an HTTP server? As in, is it possible to add a native "hook" to make it call another Python function?
The Hydrus dev plans to add an API for this in the next milestone. There is also another Hydrus user who made an unofficial API, but he hasn't made one for downloads yet. So either wait for it or use an existing Hydrus parser.
Is there any documentation for the request and response data formats Hydrus sends to and expects from GUGs? I've found this, but it doesn't really explain how Hydrus interacts with other things.
Hydrus expects either HTML or JSON and tries to extract data based on the parsers the user made/imported. I made this one for HTML, but it may change in a future version: https://github.com/CuddleBear92/Hydrus-Presets-and-Scripts/blob/master/guide/create_parser_furaffinity.md

If someone wants to make one, they can try building an API similar to the 4chan API: copy the structure and use a modified parser from the existing 4chan API. My best recommendation is to try a Hydrus parser directly and see what options are there. Ask in the Hydrus Discord channel if anything is unclear.
Can gallery-dl support weibo? I found https://github.com/nondanee/weiboPicDownloader but it takes too long to scan and doesn't have the ability to skip already-downloaded files.
@rachmadaniHaryono I opened a new branch for API-server-related stuff. The first commit there implements the same functionality as your script, but without external dependencies. Go take a look at it if you want.

And when I said your script "should be simplified ... further", I didn't mean it should use fewer lines of code, but fewer resources in terms of CPU and memory. Python might not be the right language to use when caring about things like that, but there is still no need to call functions that effectively do nothing, command-line argument parsing for example.
Will it be only an API, or will there be an HTML interface, @mikf?

Edit: I will comment on the code in the commit.
I don't think there should be an HTML interface directly inside of gallery-dl. I would prefer it to have a separate front-end (HTML or whatever) communicating with the API back-end that's baked into gallery-dl itself. It is a more general approach and would allow for any programming language and framework to more easily interact with gallery-dl, not just Python.
- `host:port/api/json/1` endpoint with album and tag data
- the (error) description is not None, or none
- still on port 5013

Edit: related issue https://github.com/CuddleBear92/Hydrus-Presets-and-Scripts/issues/69
About the twitter extractor: requests are limited depending on how many tweets a user has, right? If a user has over 2k+ media, 99% of the time it can't download all of them.
@wankio The Twitter extractor gets the same tweets you would get by visiting a timeline in your browser and scrolling down until no more tweets get dynamically loaded. I don't know how many tweets you can access like that, but Twitter's public API has a similar restriction:
https://developer.twitter.com/en/docs/tweets/timelines/api-reference/get-statuses-user_timeline.html

> This method can only return up to 3,200 of a user's most recent Tweets. Native retweets of other statuses by the user is included in this total, regardless of whether include_rts is set to false when requesting this resource.
You could try ripme. It uses the public API instead of a "hidden", browser-only API like gallery-dl. Maybe you can get more results with that.
But if I remember correctly, ripme rips all tweets/retweets, not just the user's own tweets.
For some reason, logging in with OAuth and App Garden tokens or the -u/-p options doesn't work with flickr, which makes images that require a login to view not downloadable. But otherwise an amazing tool, thank you so much!
Today when I checked e-hentai/exhentai, it just got stuck forever. Maybe my ISP is the problem, because I can't access e-hentai while exhentai is still OK. So I think OAuth should help: using cookies instead of id+password to bypass it.
Is there a way to download files directly into a specified folder instead of subfolders? For example, for the pictures to be downloaded into F:\Downloaded\ I tried using `gallery-dl -d "F:\Downloaded\" https://imgur.com/a/xcEl2WW` but instead they get downloaded to `F:\Downloaded\imgur\xcEl2WW - Inklings`. Is there an argument I could add to the command to fix that?
@Mattlau04
Short answer: set extractor.directory to an empty string: `-o directory=""`

Long answer: The path for downloaded files is built from three components:

- `base-directory`: that's what you set with `-d/--dest`
- `directory`: a list of format strings; one for each path segment
- `filename`: another format string

You can configure all three of them to fit your needs in your config file, but specifying a format string on the command line can be rather cumbersome, so there is no extra command-line argument for it. You can however use `-o/--option` to set any option value, and removing the dynamic `directory` part should do what you want.
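The same settings could also live in a config file. A sketch (the `imgur` scoping is just an example here, and I'm assuming an empty `directory` list behaves like the empty string above):

```json
{
    "extractor": {
        "base-directory": "F:/Downloaded",
        "imgur": {
            "directory": []
        }
    }
}
```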
thanks a lot for the help!
Huh, sorry to ask so much stuff in so little time, but in a batch file I have this command: `gallery-dl -o directory="" -o filename="{id}_{tags}" -d "%~dp0\gallery-dl\images\hypnohub" https://hypnohub.net/post?tags=splatoon` and it downloads the first 4 files fine, but then it gives me `OSError: [Errno 22] Invalid argument`. Here is the verbose output:

```
[gallery-dl][debug] Version 1.8.2-dev
[gallery-dl][debug] Python 3.6.7 - Windows-10-10.0.17134-SP0
[gallery-dl][debug] requests 2.20.1 - urllib3 1.24.1
[gallery-dl][debug] Starting DownloadJob for 'https://hypnohub.net/post?tags=splatoon'
[hypnohub][debug] Using HypnohubTagExtractor for 'https://hypnohub.net/post?tags=splatoon'
[urllib3.connectionpool][debug] Starting new HTTPS connection (1): hypnohub.net:443
[urllib3.connectionpool][debug] https://hypnohub.net:443 "GET /post.json?tags=splatoon&limit=50&page=1 HTTP/1.1" 200 None
# F:\Auto upload full splatoon doujin colection\\gallery-...l_eyes splatoon symbol_in_eyes taka-michi topless towel wet
# F:\Auto upload full splatoon doujin colection\\gallery-... nintendo splatoon tech_control tentacles tongue tongue_out
# F:\Auto upload full splatoon doujin colection\\gallery-...ndo splatoon tech_control tentacles tongue tongue_out visor
# F:\Auto upload full splatoon doujin colection\\gallery-... nintendo splatoon tech_control tentacles tongue tongue_out
# F:\Auto upload full splatoon doujin colection\\gallery-... nintendo splatoon tech_control tentacles tongue tongue_out
[urllib3.connectionpool][debug] https://hypnohub.net:443 "GET //data/image/b30b984c7e231cd2ad5d55aaa533cad6.jpg HTTP/1.1" 200 137174
F:\Auto upload full splatoon doujin colection\\gallery-...ch_control tentacles thighhighs tongue tongue_out underwear
[hypnohub][error] Unable to download data: OSError: [Errno 22] Invalid argument: '\\\\?\\F:\\Auto upload full splatoon doujin colection\\gallery-dl\\images\\hypnohub\\77610_ahegao blush bottomless breasts breasts_outside callie_(splatoon) civibes cum cum_in_pussy dazed earrings elf_ears empty_eyes female_only femsub gloves hypnotic_accessory large_breasts lying mole nintendo open_clothes open_mouth panties pussy shirt_lift splatoon splatoon_2 spread_legs sunglasses sweat tank_top tech_control tentacles thighhighs tongue tongue_out underwear.part'
[hypnohub][debug] Traceback (most recent call last):
  File "c:\users\mattl\appdata\local\programs\python\python36\lib\site-packages\gallery_dl\job.py", line 55, in run
    self.dispatch(msg)
  File "c:\users\mattl\appdata\local\programs\python\python36\lib\site-packages\gallery_dl\job.py", line 99, in dispatch
    self.handle_url(url, kwds)
  File "c:\users\mattl\appdata\local\programs\python\python36\lib\site-packages\gallery_dl\job.py", line 210, in handle_url
    if not self.download(url):
  File "c:\users\mattl\appdata\local\programs\python\python36\lib\site-packages\gallery_dl\job.py", line 279, in download
    return downloader.download(url, self.pathfmt)
  File "c:\users\mattl\appdata\local\programs\python\python36\lib\site-packages\gallery_dl\downloader\common.py", line 43, in download
    return self.download_impl(url, pathfmt)
  File "c:\users\mattl\appdata\local\programs\python\python36\lib\site-packages\gallery_dl\downloader\common.py", line 106, in download_impl
    with pathfmt.open(mode) as file:
  File "c:\users\mattl\appdata\local\programs\python\python36\lib\site-packages\gallery_dl\util.py", line 509, in open
    return open(self.temppath, mode)
OSError: [Errno 22] Invalid argument: '\\\\?\\F:\\Auto upload full splatoon doujin colection\\gallery-dl\\images\\hypnohub\\77610_ahegao blush bottomless breasts breasts_outside callie_(splatoon) civibes cum cum_in_pussy dazed earrings elf_ears empty_eyes female_only femsub gloves hypnotic_accessory large_breasts lying mole nintendo open_clothes open_mouth panties pussy shirt_lift splatoon splatoon_2 spread_legs sunglasses sweat tank_top tech_control tentacles thighhighs tongue tongue_out underwear.part'
```
There are too many tags and the filename got too long (> 255 bytes).
You can shorten the tags string to, for example, 200 characters with `{tags[:200]}`, or you can use `{tags:L200/too many tags/}` to replace the content of `{tags}` with `too many tags` if it exceeds 200 characters.
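For anyone wondering what `{tags[:200]}` does: it is gallery-dl's own format-string extension, but the truncation itself is ordinary Python slicing. A plain-Python sketch (the tag string here is made up for illustration):

```python
# Plain-Python equivalent of gallery-dl's {tags[:200]} truncation.
tags = "splatoon nintendo tech_control tentacles tongue " * 10  # deliberately long
filename = "77610_" + tags[:200] + ".jpg"
print(len(filename.encode("utf-8")))  # prints 210, comfortably below the 255-byte limit
```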
You should also consider using a config file. It's a lot more readable than packing everything into command-line arguments.
is there no way to remove the 255 bytes limit?
No, there isn't. This is an inherent limitation of most filesystems (see Comparison of file systems (*)).
Instead of saving an image's tags in its filename, you could store them in a separate file with `--write-tags`.

(*) NTFS has a limit of 255 UTF-16 code units, not bytes, but that doesn't make much of a difference here.
@mikf after almost two years of using gallery-dl, I finally decided to use the archive function. I added the parameter to my configuration file, but only newly downloaded media are written to the file; previously downloaded media are checked one by one before the command finishes. Is it possible to force previously downloaded media to be written? And do you recommend archiving globally or per extractor? Thank you!

Edit 1: I believe that setting "skip" to "false" is a solution, but I would like one that does not need to download the media files again. Edit 2: "abort" is another solution, but that would not force writing to the archive file, just work around the problem.
> Is it possible to force previously downloaded media to be written?

See #261

> And do you recommend me to archive globally or archive per extractor?

It will work either way, but maybe several smaller SQLite3 database files are better/faster than one massive one ... not sure. I'm not using an archive file myself, but I'd probably have a general archive file as well as individual ones for my most used sites.
Comments in the JSON config file: a simple comment syntax, like the sublime config has (`//`).
Is there any way for the extractor to support search urls from Twitter? The normal download doesn't reach far enough into old tweets, but on the website you can search between specific dates.
How can I download tweet texts into separate txt files? I have enabled the "content" property, using the following config (I don't think it's correct, but I have no idea...):
```json
{
    "extractor":
    {
        "twitter":
        {
            "content": true,
            "postprocessors": [{
                "name": "metadata",
                "mode": "custom",
                "extension": "txt",
                "format": "{content}\n"
            }]
        }
    }
}
```
As a result, txt files are created, but they contain only the word "None".
Also, is it possible to save the text of those tweets in which there are no media?
> Comment out JSON file. A simple comment, like sublime config does (//).

Renaming the keys you want to be ignored isn't an option? (e.g. `"option"` -> `"_option"`)
> Is there any way for the extractor to support search urls from Twitter?
Sure, when I get to it. The Twitter extractors will have to be adjusted to the new layout etc. at some point and I might as well add support for searches then as well.
> How can I download tweet texts in separate txt files? I have enabled the "content" property, using the following config (don't think its correct, but I have no idea...)
Hmm, your config file looks OK and does exactly what it's supposed to on my end:
`gallery-dl -c your_config.json https://twitter.com/supernaturepics/status/604341487988576256` produces a text file with `Big Wedeene River, Canada` in it, like it should.

Is there a `content` field in the output of `gallery-dl -j <tweet-url>`? And what version are you running?
Okay, I got it :) I used the old 1.8.7 version (Windows executable, downloaded from the main page in July). Now I have replaced it with the new 1.10.1 exe and everything works fine. Thanks!

And returning to my previous message, is it possible to save the text of those tweets which have no media?
Is it possible to add to the ehentai / exhentai extractor the ability to submit an archive download to the Hentai@Home downloader? That way people can avoid spending image limits and GP?
@inthebrilliantblue care to elaborate?
> @inthebrilliantblue care to elaborate?
On ehentai/exhentai, you can host a cache server called H@H (Hentai@Home). This gives you the ability, when downloading an album archive through their website, to have your H@H server do it. This would avoid having to load each image through gallery-dl and instead allow an ehentai user to just submit download requests through H@H.

There is an "Archive Download" link on each image album. Clicking it pulls up a popup with some options for downloading. At the bottom are the H@H links for multiple quality versions; to the right is the "Original" upload selection.

So my question is: using an ehentai login and a search term, would it be possible to trigger the Original archive download link so that H@H downloads the album instead of gallery-dl?
Is there any way to add an extractor option for Reddit to save filenames as the names of their posts? I've tried the `--list-keywords` option, but there was no suitable keyword for the filename in the output. How should I configure the extractor so that it downloads posts with the filename being the name of the post?
So I have two issues, but they both pretty much revolve around file and/or directory names.

Under `extractor.flickr` in my config.json, I have these options set:

```json
"directory": [
    "{user[username]}",
    "{album[title]}"
],
"filename": "{user[username]}-{album[title]}_{id}.{extension}",
```

which works fine if I'm downloading someone's albums. But sometimes I want to download images which aren't in an album, and in that case the album title is just `None` in both the directory name and the file name. Is it possible to omit the `{album[title]}` specifier (and the trailing underscore in the filename) completely if and when its value is `None`?
Second, until recently I've been using another, more awkward downloader for getting images from Twitter. Despite the awkwardness I have a LOT of images downloaded with it, such that if I can get gallery-dl to download using the same filename pattern, that would actually be easier than renaming all the files I already have to match gallery-dl.

The pattern used by the other downloader is roughly equal to `"{author[name]}-{tweet_id}-{date:%Y%m%d_%H%M%S}-{'vid' if extension=='mp4' else 'img'}{num}.{extension}"` (not-so-coincidentally, this is the pattern I've been testing as `extractor.twitter.filename` in my config.json). Which is to say, if the downloaded media is an MP4, then it will have the text `vid` in front of the index, whereas if it's an image, it'll read `img`. For example:

NekoNicoKig-1185043490742460416-20191018_050239-vid1.mp4
NekoNicoKig-1177421053926223873-20190927_041349-img1.jpg

are each names of files I have downloaded already.

Now, the filename pattern I'm using above is, I'm given to understand, valid Python syntax when used in something called an f-string (I'm using Python 3.8.2, for the record), but apparently the filename isn't an f-string in gallery-dl. That, or I'm doing something wrong. Is there anything I can do from this end? Am I doing something wrong?
@DaWrecka There are two possible ways to go about your first problem:

You could set another pair of filename/directory format strings for the `image` subcategory:

```json
"directory": ["{user[username]}", "{album[title]}"],
"filename": "{user[username]}-{album[title]}_{id}.{extension}",
"image": {
    "directory": ["{user[username]}"],
    "filename": "{user[username]}-{id}.{extension}"
},
```

or you specify the `album[title]` field as optional, for example `{album[title]:?/-/}` (more about "special" formatting options here).

Having a different filename for videos might be a bit more involved. Either go the youtube-dl route (https://github.com/mikf/gallery-dl/issues/533), or chain a couple of replace operations to transform `mp4` into `vid` and everything else into `img`: `{extension:Rmp4/vid/Rjpg/img/Rpng/img/Rgif/img/}`

f-strings would be really nice here, I agree, but dynamic user-specified f-strings aren't possible as far as I know, so they aren't really an option here.
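To make the intent of that pattern concrete, the vid/img naming logic can be expressed in plain Python (this is a sketch outside gallery-dl; the field values are taken from the example filenames above):

```python
def media_label(extension):
    """Return 'vid' for mp4 files and 'img' for everything else."""
    return 'vid' if extension == 'mp4' else 'img'

# Rebuild one of the example filenames with ordinary str.format
name = "{author}-{id}-{date}-{label}{num}.{ext}".format(
    author="NekoNicoKig", id=1185043490742460416,
    date="20191018_050239", label=media_label("mp4"), num=1, ext="mp4")
print(name)  # NekoNicoKig-1185043490742460416-20191018_050239-vid1.mp4
```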
Is there a way to recursively download external content that's linked in Patreon posts? Many link to drive.google.com/drive/folders/URL, drive.google.com/file/d/URL, Imgur etc., especially since the new policies were added. I tried it with "r:patreon.com/URL" and it does follow URLs, but not the right ones. Apologies if this was already answered elsewhere.
For Patreon: a method of extracting and sorting posts by their tags into tag folders would be nice. Currently the extractor doesn't actually save the tags.
So I've noticed that a lot of the bugs and feature suggestions are essentially asking for more control when manipulating the metadata for directory or filename generation.

Have you looked at using Jinja2 templating and its custom filters?

A lot of the current custom formatting could be implemented as simple functions that operate on input strings or lists of strings. This could be further extended by allowing users to submit filters to be included in gallery-dl (also possible is specifying a run-time import of a user-defined `*.py` file with their own filters).

The biggest benefit would be being able to push a dictionary into the Jinja2 template, where custom filters could operate on the entire object or just on its attributes, as well as calling any built-in Python function within the template.
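A sketch of what that suggestion would look like with Jinja2's documented custom-filter API (this is not part of gallery-dl; the filter name and metadata fields are made up for illustration):

```python
from jinja2 import Environment

env = Environment()

def shorten(value, limit):
    """Truncate a string to at most `limit` characters."""
    return value[:limit]

# Register the custom filter, then render a filename template
# directly from a metadata dictionary.
env.filters['shorten'] = shorten
template = env.from_string(
    "{{ id }}_{{ tags | join(' ') | shorten(20) }}.{{ extension }}")
metadata = {'id': 77610,
            'tags': ['splatoon', 'nintendo', 'tech_control'],
            'extension': 'jpg'}
print(template.render(metadata))  # 77610_splatoon nintendo te.jpg
```

This is roughly the same truncation that gallery-dl's `{tags:L200/...}` syntax performs today, but expressed as an ordinary, user-replaceable function.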
> is there no way to remove the 255 bytes limit?

To expand on what @mikf said: both Linux and Window$ (IIRC; no clue about Mac) will fail to write overly long names with a massively unhelpful and confusing error about corruption or nonexistence, despite neither being the case.

And since he mentioned keywords in filenames: it's a bad idea to put keywords into filenames, because unless the website has write-protected them you're risking the keywords being changed, which potentially changes the resulting filename. Take Derpibooru, where each upload gets an incremental integer: if you opt to download with the keywords included and the keywords change => dupe file.

Same for anything else that can change. For example, if a piece is about a sunset and thus named "beautiful sunset in X.png", then gets renamed "Beautiful sunset in X.png" => dupe file on non-Window$, and depending on the website it will become an invalid filename on their end. Thus an [U]UID, combined with a correct folder structure, is best.
I don't know if this has been mentioned before, but a download progress bar would be great to keep track of how much you have downloaded, especially for big files.
The above is an example from PixivUtil.
Not sure about that, to be honest. This might give you an ETA for a single file that is currently downloading, but if you use gallery-dl the way it's probably used by most, i.e. for downloading big galleries/collections or entire user profiles/accounts, this won't help you at all: gallery-dl simply downloads everything returned by the site, in the order it is returned by the site's API for example, so a somewhat accurate prediction of the time it takes to finish the entire process is not really possible. So I'm not entirely convinced about the usefulness here.
On the contrary, I'd like a mode where it shows even less. I'd like it to list files downloaded, and completely skip ones already completed. I should probably make this a feature request though.
Continuation of the old issue as a central place for any sort of question or suggestion not deserving its own separate issue. There is also https://gitter.im/gallery-dl/main if that seems more appropriate.
Links to older issues: #11, #74