mikf opened 9 months ago
Yep, this is just an additional step. It will still load options from "redgifs" when they are not specified for "reddit>redgifs".
@mikf I would like to use base-directory as a keyword within the config file, to use relative paths within the base directory without it breaking when I use --directory, as in the example below:
"pixiv": {
"postprocessors": [
{
"name": "python",
"event": "prepare",
"function": "{base-directory}/utils.py:pixiv_tags"
}
]
}
is it possible to do so?
@mikf Congrats for making it into the GitHub 10k stars club! 🔥
@AyHa1810
The path for function (paths in general, really) does not support {…} replacement fields, only environment variables and home directories (~). Otherwise you'd be able to access base-directory by enabling metadata-path. It would probably be best to define an environment variable and use it for both base-directory and function.
Also, --directory overrides your config's base-directory value, so accessing it then wouldn't even result in the same value as the one specified in your config.
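For example, with an environment variable exported before running gallery-dl (GDL_BASE is only a placeholder name here), the config from above could become something like:

"base-directory": "${GDL_BASE}/",
"pixiv": {
    "postprocessors": [
        {
            "name": "python",
            "event": "prepare",
            "function": "${GDL_BASE}/utils.py:pixiv_tags"
        }
    ]
}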
@Hrxn Thank you.
Correct me if I'm wrong, but it looks like when using things like "skip": "abort:1" together with "archive", it counts not only items existing in the archive as "skipped" (so they count towards the abort limit), but also the ones whose files already exist.
Is there a way to make it only count existing items in the "archive" as skipped, but not the ones that have existing files (while preferably still not redownloading these)?
Basically, what I want to accomplish is to find a way to periodically download all posts until I reach the last downloaded record (hence abort:1). But between two download sessions, I may have already downloaded some of these posts manually and put the files into the folder. I don't want these to terminate my download session prematurely.
@fireattack Suggestion: use different "archive-format" settings for different sub-extractors; this way you can download some posts manually and entire user profiles etc. independently of each other.
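For example, for Twitter (a hypothetical setup; the subcategory names and keywords depend on the extractor you actually use), single-tweet downloads and full-profile runs could write to separate archive namespaces:

"twitter": {
    "archive": "~/gallery-dl/twitter-archive.sqlite3",
    "tweet": {
        "archive-format": "manual_{tweet_id}_{num}"
    },
    "timeline": {
        "archive-format": "{tweet_id}_{num}"
    }
}

With something like this, posts grabbed via their direct URL and posts grabbed as part of a profile run no longer share archive entries, so one doesn't count as "skipped" for the other.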
Hello, I don't really understand what the difference is between sleep and sleep-request. Could someone ELI5 please, particularly in the context of downloading a Twitter profile?
@mikf
I mean, before it gets converted to a path it's just a string, right? So it should be possible, IMO.
Also, yeah, I do want it to get overridden by the --directory option :P
@fireattack
Files skipped by archive or existing files are treated the same and there is currently no way to separate them.
@docholllidae
--sleep causes gallery-dl to sleep before each file download.
--sleep-request causes gallery-dl to sleep before each non-download HTTP request like loading a webpage, API calls, etc.
It is usually the latter that gets restricted by some sort of rate limit, as is the case for Twitter.
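For example, to apply both while downloading a Twitter profile (the values are only illustrative, not a recommendation):

gallery-dl --sleep 1-3 --sleep-request 6-12 https://twitter.com/USER/media

Here --sleep waits 1-3 seconds before each file is downloaded, and --sleep-request waits 6-12 seconds before each API call that fetches the next batch of tweets.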
@mikf the env var method works, thanks for the suggestion!
I think I haven't come across any ugoira using PNGs for its images. Does anyone have an example they could share?
how do I prevent myself from getting banned on Instagram?
I'm currently using:
--sleep 2-10
--sleep-request 15-45
should I increase those numbers?? how much?
(are there any other parameters that I can use to prevent myself from being banned on IG?)
How do I put the artist name in the file path for e-hentai? The "artist:xxx" value is in tags, and I can't find a variable for "directory": ["{artist}"].
@taskhawk
I slightly modified the Danbooru extractor to have it go through all ugoira posts uploaded there (https://danbooru.donmai.us/posts?tags=ugoira), and none of them had .png frames.
I'm aware that this is just a small subset, but at least its data can be accessed a lot faster than on Pixiv itself.
@throwaway26425
Using the same --user-agent string as the browser you got your cookies from might help.
Updating the HTTP headers sent during API requests is also something that needs to be done again ...
@Immueggpain See https://github.com/mikf/gallery-dl/discussions/2117
Using the same --user-agent string as the browser you got your cookies from might help.
I'm using -o browser=firefox, is that the same?
Or do I need to use both?
-o browser=firefox --user-agent "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) Gecko/20100101 Firefox/125.0"
Updating the HTTP headers sent during API requests is also something that needs to be done again.
I don't understand this, can you please explain it better? :(
I'm using -o browser=firefox, is that the same?
browser=firefox overrides your user agent to Firefox 115 ESR, regardless of your --user-agent setting. It also sets a bunch of extra HTTP headers and TLS cipher suites to somewhat mimic a real browser, but maybe you're better off without this option.
I don't understand this, can you please explain it better? :(
The instagram code sends specific HTTP headers when making API requests, which might now be out-of-date, meaning I should update them again. The last time I did this was October 2023 (969be65d0b886a956d9b6ac84d315ff38b228b65).
I'm pretty sure this has been asked before, but I can't find it.
My goal is to run gallery-dl as a module to download, while also getting a record of processed posts (URLs, post IDs) so I can use that info in some custom functions.
I've read #642, but I still don't quite get it. It looks like you have to use DownloadJob for downloading, but in parallel use DataJob (or even a customized Job) to get the data?
My current code is pretty simple, just
def load_config():
    ....

def set_config(user_id):
    ....

def update(user_id):
    load_config()
    profile_url = set_config(user_id)
    job.DownloadJob(profile_url).run()
I tried to patch DownloadJob's handle_url so I can save the URLs and metadata into something like self.mydata, but that isn't enough, because in handle_queue it creates a new job with job = self.__class__(extr, self) for the actual downloading, which makes it more complicated than I'd like to pass the data back to the "parent" instance.
So I'm curious if there is an easier way to do it, other than writing a whole new Job? Thanks in advance.
I have a suggestion, though I'm not sure how feasible or practical it would be.
Current behavior:
- on some sites, num starts at 1 for all posts
- on some, num starts at 0 for all posts
- on some, num starts at 1 for posts containing multiple images and is 0 for posts containing one
Could the behavior for indices be made consistent across all sites?
Hi! Is it possible to provide a "best practices" of sorts for using gallery-dl with Instagram? Things like using the same user-agent as that of the browser the cookies are extracted from (source), which values to use for sleep and sleep-request, the best parameters for the config, etc.
To that end, are there any plans on updating the HTTP headers that are sent for Instagram API calls? Is this something an end-user could update, and if so, where could we find such headers if we wanted to change this for ourselves?
Unrelatedly, is there any way to set up the running of gallery-dl so that it stops whenever an error occurs? I know there are errors and warnings when gallery-dl is run, so I'm wondering if there's a way, either via Bash or perhaps Python, to stop if it encounters an error (applicable when I'm passing a list of URLs).
Thank you so, so much for taking the time to read this!
Is there a flag you can set in a "prepare" post-processor to stop a different "prepare" post-processor from occurring?
is it possible to download video thumbnails/preview pictures for twitter (and perhaps other sites too)?
is it possible to download video thumbnails/preview pictures for twitter (and perhaps other sites too)?
There are extractor.instagram.previews and extractor.artstation.previews, but I can't seem to find a way for Twitter.
I'm using --sleep-request 8.0 and --sleep 1.0-2.0 to download posts from Instagram profiles. My account still got flagged by Instagram for scraping, despite me doing just one profile every other day or so. Should I increase sleep-request, sleep, or both?
I use a JSON config rather than the command line, but when I use "sleep-request": [15,45] and "sleep": [2,10], I generally don't get marked for spam super often and can download a few accounts within a day with no issue. I got these numbers from a different IG issue on here, per a soft recommendation by mikf.
Now IG still does detect automation afoot even with these numbers, but you can generally get around that by just switching accounts.
Thanks. Another question: can I download/extract my followed users list? If so, how?
Edit: I asked this because I'm going to make another two accounts with the same followed people, and just to have a backup list of who I follow on there.
I'm not aware of a way to do that; I just have a list of IG accounts that I read from and append new accounts to. What you could do is visit your following list and scroll down until there aren't any more accounts to render, then read the HTML and extract all of the profile links that way.
Hello, can someone please explain how to filter out values from lists, for example tags[]? I tried something like this, but it didn't work out: "filter": "any(tag in 'tags' for tag in ['tag1', 'tag2', 'tag3'])"
EDIT: my original suggestion was unnecessarily complicated; also, as pointed out by mikf, the correct parameter is "image-filter".
"image-filter": "any(tag in tags for tag in ['tag1', 'tag2', 'tag3'])"
should work, provided that tags is the name of the metadata field containing tags for the extractor you are using (check with gallery-dl -K).
@fireattack
I'd create a new Job class that inherits from DownloadJob, extends dispatch() to store all messages, and overwrites handle_queue() to get the data collected by children.
There really isn't an easier way, but you can more or less just copy-paste the relevant code parts and delete whatever you don't really need.
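A rough, untested sketch of what that could look like (the signatures follow the current job module but may differ between versions; instead of overriding handle_queue(), this variant simply shares one list with the child jobs that handle_queue() spawns):

from gallery_dl import job

class RecordingJob(job.DownloadJob):
    # DownloadJob that records every dispatched message (URLs, metadata, ...)

    def __init__(self, url, parent=None):
        super().__init__(url, parent)
        if isinstance(parent, RecordingJob):
            # handle_queue() creates children as self.__class__(extr, self),
            # so sharing the parent's list collects everything in one place
            self.records = parent.records
        else:
            self.records = []

    def dispatch(self, msg):
        self.records.append(msg)
        super().dispatch(msg)

After j = RecordingJob(profile_url); j.run(), the collected messages would be available as j.records.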
@climbTheStairs I do plan on doing this for v2.0 as well as adding options that control enumeration behavior.
@Vetches I did look into IG's headers some time ago and at least for the API endpoints used by gallery-dl, nothing seems to have changed. The potential problem is that IG now uses a different set of endpoints with query parameters I have no idea what they mean ...
You can find endpoints, headers, and parameters by opening your browser's dev tools (F12), selecting XHR in its network monitor, and browsing the site.
Stopping gallery-dl on errors is possible with the actions option:
"actions": {"error": "exit"}
@biggestsonicfan
There isn't, but couldn't you use "event": "init" instead? It triggers only once before the first file.
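For example (a sketch; the name and function are whatever your existing post processor already uses):

{
    "name": "python",
    "event": "init",
    "function": "/path/to/script.py:setup"
}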
@noshii117 You can get a list of an account's followed users with
gallery-dl -g https://www.instagram.com/USER/following
where USER is your account's name. You can write them to a file by redirecting stdout with >.
gallery-dl -g https://www.instagram.com/USER/following > followed_users.txt
@pt3rrorduck
The config file option for --filter is image-filter, for ... reasons.
Also, 'tags' should probably be a variable name instead of a string:
"image-filter": "any(tag in tags for tag in ['tag1', 'tag2', 'tag3'])"
Thank you,
"image-filter": "any(tag in 'tags' for tag in ['tag1', 'tag2', 'tag3'])"
solved it, but it only works with ' ' around 'tags'. Without it I get the following error:
FilterError: Evaluating filter expression failed (NameError: name 'tags' is not defined)
Thank you so much as always for the incredibly helpful reply!
I did look into IG's headers some time ago and at least for the API endpoints used by gallery-dl, nothing seems to have changed. The potential problem is that IG now uses a different set of endpoints with query parameters I have no idea what they mean ...
You can find endpoints, headers, and parameters by opening your browser's dev tools (F12), selecting XHR in its network monitor, and browsing the site.
Oh wow, that's quite interesting! So how can gallery-dl function if IG uses different endpoints with unknown query parameters? Or do you mean that there are effectively two sets of endpoints, the ones used by gallery-dl and the ones with the unknown query parameters? Does this potentially mean that the gallery-dl endpoints could become deprecated at some point, at which point we'd have to figure out what those query parameters do?
Stopping gallery-dl on errors is possible with the actions option:
Amazing, this is just what I was looking for, thank you so much!
@pt3rrorduck
What site are you trying to use this on? Only some provide a list of tags; for some, "tags" are available under a different name; and most of the time there are no tags available at all and you'd end up with a NameError exception when trying to access an undefined tags value.
With ' ' around 'tags', this only checks whether any of your tags can be found as a substring of the literal string 'tags'.
@Vetches Yep, IG has multiple ways / API endpoints to access its data, which does mean that the ones currently used by gallery-dl could get deprecated or outright removed.
Hmm, alright. How about passing the entire JSON dump to a pre/post-processor, is that possible?
@mikf
I tried Redgifs and Newgrounds.
Here is an example from the metadata JSON:
"tags": [ "arlong", "banbuds", "east-blue", "eastblue", "fanart", "krieg", "kuro", "luffy", "morgan", "one-piece" ],
@biggestsonicfan Do you mean all collected metadata of every file/post? What exactly do you want to accomplish?
Whatever it is, you're most likely best off using a python post processor to run a custom Python function, where you could theoretically also set a flag that could prevent any further post processors from running.
@pt3rrorduck
As it turns out, there has been a similar issue in the past (#2446), where I've added a contains() function as a workaround:
"image-filter": "contains(tags, ['tag1', 'tag2', 'tag3'])"
Python can't see local variables like tags inside a generator expression like any(…), it seems.
Do you mean all collected metadata of every file/post? What exactly do you want to accomplish?
Whatever it is, you're most likely best off using a python post processor to run a custom Python function, where you could theoretically also set a flag that could prevent any further post processors from running.
Yes, that's what I intend to do. I have a lot of kemono data without metadata, and a lot of incorrectly named files because of that. Currently I download the metadata for all revisions of a post (which can run upwards of 20,000 JSON files per post) and then try to parse based on the partial filename and hash from the JSON metadata. I then delete the 19,999 unused JSON metadata files. It would be nice to just send the JSON to Python, check if it's the file I am looking for, and if so, rename it and dump the JSON to the folder. If not, it doesn't download the JSON in the first place.
EDIT: I'm just passing data to a Python script via a command post-processor, but I can't seem to find the right format string. Would it be something like {extractor.post}?
EDIT2: I can see the JSON metadata can be output to stdout, but I don't see how this can be combined with an exec name.
For Instagram, is there a way to use -o include=all but at the same time exclude tagged posts?
Just use include=posts,reels,stories,highlights,avatar? I agree it's weird to include tagged in all, but it's just a hardcoded list, so it's very easy to work around.
So, there's no exclude option?
BTW, is this the right order that all uses?
include=avatar,stories,highlights,posts,reels,tagged
I have another question about how the actions option works, as per @mikf's post here!
So if I just want to have gallery-dl stop processing when it runs into any error (for me, that would mostly just be Instagram's HttpError), would I do something like the following?
"actions": {
    "error:/HttpError: '/g": "exit 0"
}
If so, I'm curious about the following situations:
1. Would it still print the [instagram][info] Use '-o cursor=... message, since the error message is rendered / returned first?
2. If something from a post itself (say, a file name) contained HttpError: ' and it was printed to the console, would that match the above actions property? Or is gallery-dl smart enough to only check against its own "system" messages and ignore anything from the post itself?
Any insight on this would be greatly appreciated! Thank you so much!
Is there a way to use gallery-dl to download image or video URLs even if they're unsupported? Because I want to make a script that uses gallery-dl's archive and its post-processors like mtime.
@Vetches
It's just a plain regex (or search text in this case) without any / or flags:
"actions": {
    "error:HttpError: '": "exit 0"
}
1): This exit is a "controlled" shutdown, so it will still run all finalize() methods and print the current cursor value.
2): The regex can only ever match logging messages ("system" messages), which do not include the names of downloaded files.
@AdamSaketume26 gallery-dl supports downloading from direct links to a bunch of known filename extensions
$ gallery-dl https://upload.wikimedia.org/wikipedia/commons/2/2f/Hubble_ultra_deep_field.jpg
upload.wikimedia.org_wikipedia_commons_2_2f_Hubble_ultra_deep_field.jpg
In my opinion, you'd be better off using curl, wget, aria2c, etc, but you do you.
Could you add m3u8 to the known extensions please, so it can get video from an m3u8 when using a direct link? pls
gallery-dl can't download from m3u8 manifests. Use yt-dlp (or even ffmpeg directly) for those.
@lambdadeltakay
It is possible to add arbitrary Python symbols (functions, classes, etc.) to the global filter namespace using the globals option.
With that you can import a custom function and call it with locals() as argument to give it access to all available metadata fields.
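For example (a sketch; the module path, function name, and tag names are placeholders, and it assumes globals is pointed at that module file):

# /path/to/filterhelpers.py
def wanted(env):
    # env is the mapping of all metadata fields, passed in as locals()
    tags = env.get("tags") or ()
    return any(t in tags for t in ("tag1", "tag2", "tag3"))

together with a config like

"globals": "/path/to/filterhelpers.py",
"image-filter": "wanted(locals())"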
I'm writing a post-processor that extracts Google Drive download links from the post contents. I could simply let the Python script handle downloading the file, but it would be nice to be able to provide the direct download link to gallery-dl and let it handle the download.
Is there currently any mechanism for a post-processor to provide additional download URLs to gallery-dl? One thing that's currently particularly difficult is determining the correct output directory for files associated with the post, since gallery-dl does not provide that as a parameter.
I'm writing a post-processor that extracts google drive download links from the post contents. I could simply let the python script handle downloading the file, but it would be nice to be able to provide the direct download link to gallery-dl and let it handle the download.
That sounds like it could be out of scope for this project, tbh. To handle download links from Google Drive, Mega, Dropbox, MediaFire and others extracted from posts I use JDownloader2. It's cross-platform, gets updated constantly to keep up with site changes and it's handy because it lets you add account details for sites that need it.
I use the Folder Watch addon and what I do with gallery-dl is that in the post-processor where it extracts download links I write a .crawljob file with details like the download link, the target folder, if archive files should be automatically extracted, etc. The .crawljob files are written to a specific folder which JDownloader checks periodically and it manages the downloads on its own. Check out this link for more info on the Folder Watch addon: https://support.jdownloader.org/en/knowledgebase/article/folder-watch-basic-usage
Is there currently any mechanism for a post-processor to provide additional download URLs to gallery-dl?
This one I don't know. I think it does not. You can call gallery-dl again, of course, but it will be a separate instance.
One thing that's currently particularly difficult is determining the correct output directory for files associated with the post since gallery-dl does not provide that as a parameter.
When using command in an exec post-processor you have access to {_path} and {_directory}, which should contain the output directory for the post according to your configuration. Check: https://gdl-org.github.io/docs/configuration.html#exec-command
I'd simply rather have gallery-dl be my one-stop downloader for everything that I configured it to crawl. The way I handle it is by using the official download API for each of those services, that way I don't have to worry about site changes. For Patreon it's pretty simple most of the time and I just do this:
{
    "name": "exec",
    "event": "post",
    "filter": "embed and 'mega.nz' in embed['url']",
    "command": ["mega-get", "{embed['url']}", "{_directory}{id}/"]
}
However, some posts have links in the post content and often have multiple links, so for the post content I run a python post-processor instead. Except the python post-processor doesn't have access to the {_directory} parameter. Though I guess I could just run Python through exec and pass {_directory} as a parameter that way.
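Something along those lines, for example (the script path is a placeholder, and {content} assumes the extractor exposes the post body under that name; check with -K):

{
    "name": "exec",
    "event": "post",
    "command": ["python3", "/path/to/extract_links.py", "{_directory}", "{content}"]
}

The script then receives the post's output directory as its first argument and can download into it (or drop a .crawljob file there).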
Continuation of the previous issue as a central place for any sort of question or suggestion not deserving their own separate issue.
Links to older issues: #11, #74, #146.