mikf / gallery-dl

Command-line program to download image galleries and collections from several image hosting sites
GNU General Public License v2.0

Questions, Feedback, and Suggestions #4 #5262

Open mikf opened 9 months ago

mikf commented 9 months ago

Continuation of the previous issue as a central place for any sort of question or suggestion that doesn't deserve its own separate issue.

Links to older issues: #11, #74, #146.

mikf commented 7 months ago

Yep, this is just an additional step. It will still load options from "redgifs" when they are not specified for "reddit>redgifs".
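
For illustration, a hedged sketch of what that layering can look like in a config file (the filename values here are made up):

"extractor": {
    "redgifs": {
        "filename": "{id}.{extension}"
    },
    "reddit>redgifs": {
        "filename": "reddit_{id}.{extension}"
    }
}

Redgifs files reached through reddit would use the second filename; any option not set in the "reddit>redgifs" block falls back to the plain "redgifs" one.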

AyHa1810 commented 7 months ago

@mikf I would like to use base-directory as a keyword within the config file, so I can use relative paths within the base directory without them breaking when I use --directory, as in the example below:

"pixiv": {
    "postprocessors": [
        {
            "name": "python",
            "event": "prepare",
            "function": "{base-directory}/utils.py:pixiv_tags"
        }
    ]
}

Is it possible to do so?

Hrxn commented 7 months ago

@mikf Congrats for making it into the GitHub 10k stars club! 🔥

mikf commented 7 months ago

@AyHa1810 The path for function (paths in general, really) does not support {…} replacement fields, only environment variables and ~ home-directory expansion. Otherwise you'd be able to access base-directory by enabling metadata-path. It would probably be best to define an environment variable and use it for both base-directory and function.

Also, --directory overrides your config's base-directory value, so accessing it then wouldn't even result in the same value as the one specified in your config.
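
For example, a rough sketch of the environment-variable approach (GDL_BASE is a hypothetical variable you'd export yourself before running gallery-dl):

"extractor": {
    "base-directory": "${GDL_BASE}",
    "pixiv": {
        "postprocessors": [
            {
                "name": "python",
                "event": "prepare",
                "function": "${GDL_BASE}/utils.py:pixiv_tags"
            }
        ]
    }
}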

@Hrxn Thank you.

fireattack commented 7 months ago

Correct me if I'm wrong, but it looks like when using something like "skip": "abort:1" together with "archive", it counts not only items existing in the archive as "skipped" (so they count toward the abort threshold), but also the ones whose files already exist.

Is there a way to make it only count existing items in the "archive" as skipped, but not the ones that already have existing files (while preferably still not redownloading those)?

Basically, what I want to accomplish is to periodically download all posts until the last downloaded record is reached (hence abort:1). But between two download sessions, I may have already downloaded some of these posts manually and put the files into the folder. I don't want those to terminate my download session prematurely.

Hrxn commented 7 months ago

@fireattack Suggestion: use different "archive-format" settings for different sub-extractors; this way you can download individual posts manually and entire user profiles etc. independently of each other.
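
For instance, something along these lines (the twitter subcategory names and format fields are only an example; they differ per site):

"extractor": {
    "twitter": {
        "timeline": {"archive-format": "timeline_{tweet_id}_{num}"},
        "tweet": {"archive-format": "manual_{tweet_id}_{num}"}
    }
}

With separate formats, a file grabbed manually via a single-tweet URL doesn't create the archive entry that a later timeline run would be stopped by.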

docholllidae commented 7 months ago

Hello, I don't really understand what the difference is between sleep and sleep-request. Could someone ELI5, please? Particularly in the context of downloading a Twitter profile.

AyHa1810 commented 7 months ago

> The path for function (paths in general, really) does not support {…} replacement fields, only environment variables and ~ home-directory expansion. Otherwise you'd be able to access base-directory by enabling metadata-path. It would probably be best to define an environment variable and use it for both base-directory and function.
>
> Also, --directory overrides your config's base-directory value, so accessing it then wouldn't even result in the same value as the one specified in your config.

I mean, before it gets converted to a path, it's just a string, right? So it should be possible, IMO.

Also, yeah, I do want it to get overridden by the --directory option :P

mikf commented 7 months ago

@fireattack Files skipped because of the archive and existing files are treated the same; there is currently no way to separate them.

@docholllidae --sleep causes gallery-dl to sleep before each file download. --sleep-request causes gallery-dl to sleep before each non-download HTTP request like loading a webpage, API calls, etc.

It is usually the latter that gets restricted by some sort of rate limit, as is the case for Twitter.
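
In config form that could look like the following (the intervals are illustrative, not tested recommendations):

"extractor": {
    "twitter": {
        "sleep": [1.0, 3.0],
        "sleep-request": [6.0, 12.0]
    }
}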

AyHa1810 commented 7 months ago

@mikf the env var method works, thanks for the suggestion!

taskhawk commented 7 months ago

I think I haven't come across any ugoira using PNGs for its images. Does anyone have an example they could share?

throwaway242685 commented 7 months ago

How do I prevent myself from getting banned on Instagram?

I'm currently using:

--sleep 2-10
--sleep-request 15-45

Should I increase those numbers? By how much?

(Are there any other parameters I can use to prevent myself from being banned on IG?)

gwttk commented 7 months ago

How do I put the artist name in the file path for e-hentai? The "artist:xxx" value is inside tags, and I can't find a variable for "directory": ["{artist}"].

mikf commented 7 months ago

@taskhawk I slightly modified the Danbooru extractor to have it go through all ugoira posts uploaded there (https://danbooru.donmai.us/posts?tags=ugoira), and none of them had .png frames. I'm aware that this is just a small subset, but at least its data can be accessed a lot faster than on Pixiv itself.

@throwaway242685 Using the same --user-agent string as the browser you got your cookies from might help. Updating the HTTP headers sent during API requests is also something that needs to be done again ...

@Immueggpain See https://github.com/mikf/gallery-dl/discussions/2117

throwaway242685 commented 7 months ago

> Using the same --user-agent string as the browser you got your cookies from might help.

I'm using -o browser=firefox, is that the same?

or, do I need to use both?

-o browser=firefox
--user-agent "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) Gecko/20100101 Firefox/125.0"

> Updating the HTTP headers sent during API requests is also something that needs to be done again.

I don't understand this, can you please explain it better? :(

mikf commented 7 months ago

> I'm using -o browser=firefox, is that the same?

browser=firefox overrides your user agent to Firefox 115 ESR, regardless of your --user-agent setting. It also sets a bunch of extra HTTP headers and TLS cipher suites to somewhat mimic a real browser, but maybe you're better off without this option.

> I don't understand this, can you please explain it better? :(

The Instagram code sends specific HTTP headers when making API requests, which might now be out-of-date, meaning I should update them again. The last time I did this was in October 2023 (969be65d0b886a956d9b6ac84d315ff38b228b65).

fireattack commented 7 months ago

I'm pretty sure this has been asked before but can't find it.

My goal is to run gallery-dl as a module to download files, while also getting a record of the processed posts (URLs, post IDs) so I can use that info in some custom functions.

I've read #642, but I still don't quite get it. It looks like you have to use DownloadJob for downloading, but in parallel use DataJob (or even a custom Job) to get the data?

My current code is pretty simple, just:

from gallery_dl import job

def load_config():
    ...

def set_config(user_id):
    ...

def update(user_id):
    load_config()
    profile_url = set_config(user_id)
    job.DownloadJob(profile_url).run()

I tried to patch DownloadJob's handle_url so I can save the URLs and metadata into something like self.mydata, but that isn't enough: in handle_queue, it creates a new job with job = self.__class__(extr, self) for the actual downloading, which makes passing the data back to the "parent" instance more complicated than I'd like.

So I'm curious if there is an easier way to do this, short of writing a whole new Job? Thanks in advance.

climbTheStairs commented 6 months ago

I have a suggestion, though I'm not sure how feasible or practical it would be.

Current behavior:

Could the behavior for indices be made consistent across all sites?

Vetches commented 6 months ago

Hi! Is it possible to provide a "best practices" guide of sorts for using gallery-dl with Instagram? Things like using the same user-agent as that of the browser the cookies are extracted from (source), which values to use for sleep and sleep-request, the best parameters for the config, etc.

To that end, are there any plans on updating the HTTP headers that are sent for Instagram API calls? Is this something an end-user could update, and if so, where could we find such headers if we wanted to change this for ourselves?

Unrelatedly, is there any way to make a gallery-dl run stop whenever an error occurs? I know there are errors and warnings when gallery-dl is run, so I'm wondering if there's a way, either via Bash or perhaps Python, to stop when it encounters an error (applicable when I'm passing a list of URLs).

Thank you so, so much for taking the time to read this!

biggestsonicfan commented 6 months ago

Is there a flag you can set in a "prepare" post-processor to stop a different "prepare" post-processor from occurring?

fbck commented 6 months ago

is it possible to download video thumbnails/preview pictures for twitter (and perhaps other sites too)?

fireattack commented 6 months ago

> is it possible to download video thumbnails/preview pictures for twitter (and perhaps other sites too)?

There are extractor.instagram.previews and extractor.artstation.previews, but I can't seem to find a way for Twitter.

noshii117 commented 6 months ago

I'm using --sleep-request 8.0 and --sleep 1.0-2.0 to download posts from Instagram profiles. My account still got flagged by Instagram for scraping, despite me doing just one profile every other day or two. Should I increase sleep-request, sleep, or both?

Vetches commented 6 months ago

> I'm using --sleep-request 8.0 and --sleep 1.0-2.0 to download posts from Instagram profiles. My account still got flagged by Instagram for scraping, despite me doing just one profile every other day or two. Should I increase sleep-request, sleep, or both?

I use a JSON config rather than the command line, but when I use "sleep-request": [15,45] and "sleep": [2,10], I generally don't get marked for spam super often and can download a few accounts within a day with no issue. I got these numbers from a different IG issue on here, per a soft recommendation by mikf.

Now IG still does detect automation afoot even with these numbers, but you can generally get around that by just switching accounts.

noshii117 commented 6 months ago

> I use a JSON config rather than the command line, but when I use "sleep-request": [15,45] and "sleep": [2,10], I generally don't get marked for spam super often and can download a few accounts within a day with no issue. I got these numbers from a different IG issue on here, per a soft recommendation by mikf.
>
> Now IG still does detect automation afoot even with these numbers, but you can generally get around that by just switching accounts.

Thanks. Another question: can I download/extract my followed-users list? If so, how?

Edit: I asked this because I'm gonna make another 2 accounts with the same followed people, and just to have a backup list of who I follow on there.

Vetches commented 6 months ago

> Thanks. Another question: can I download/extract my followed-users list? If so, how?
>
> Edit: I asked this because I'm gonna make another 2 accounts with the same followed people, and just to have a backup list of who I follow on there.

I'm not aware of a way to do that; I just keep a list of IG accounts that I read from and append new accounts to. What you could do is visit your following list and scroll down until there aren't any more accounts to render, then read the HTML and extract all of the profile links that way.
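
If anyone wants to script that last step, here's a rough sketch (it assumes beautifulsoup4, a saved following.html, and that profile links look like "/username/" — all of which may vary):

from bs4 import BeautifulSoup

with open("following.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

# keep hrefs that look like profile links, e.g. "/username/"
profiles = {a["href"] for a in soup.find_all("a", href=True)
            if a["href"].startswith("/") and a["href"].count("/") == 2}

for path in sorted(profiles):
    print("https://www.instagram.com" + path)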

pt3rrorduck commented 6 months ago

Hello, can someone please explain how to filter on values from lists, for example tags[]? I tried something like this, but it didn't work out: "filter": "any(tag in 'tags' for tag in ['tag1', 'tag2', 'tag3'])"

Vrihub commented 6 months ago

> Hello, can someone please explain how to filter on values from lists, for example tags[]? I tried something like this, but it didn't work out: "filter": "any(tag in 'tags' for tag in ['tag1', 'tag2', 'tag3'])"

EDIT: my original suggestion was unnecessarily complicated; also, as pointed out by mikf, the correct parameter is "image-filter":

"image-filter": "any(tag in tags for tag in ['tag1', 'tag2', 'tag3'])"

should work, provided that tags is the name of the metadata field containing the tags for the extractor you are using (check with gallery-dl -K).

mikf commented 6 months ago

@fireattack I'd create a new Job class that inherits from DownloadJob, extends dispatch() to store all messages, and overrides handle_queue() to get the data collected by children.

There really isn't an easier way, but you can more or less copy-paste the relevant code parts and delete whatever you don't need.
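
A minimal, untested sketch of that idea (it assumes the internal Job API described above; this variant shares one list through __init__ instead of overriding handle_queue()):

from gallery_dl import job

class CollectingJob(job.DownloadJob):
    """DownloadJob that records every dispatched message on the root job."""

    def __init__(self, url, parent=None):
        super().__init__(url, parent)
        # children created by handle_queue() receive the root job's list
        self.collected = parent.collected if parent is not None else []

    def dispatch(self, msg):
        self.collected.append(msg)  # message tuples, e.g. (Message.Url, url, metadata)
        super().dispatch(msg)

After CollectingJob(profile_url).run(), everything the job and its children saw should be in .collected.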

@climbTheStairs I do plan on doing this for v2.0 as well as adding options that control enumeration behavior.

@Vetches I did look into IG's headers some time ago, and at least for the API endpoints used by gallery-dl, nothing seems to have changed. The potential problem is that IG now uses a different set of endpoints, with query parameters whose meaning I have no idea about ...

You can find endpoints, headers, and parameters by opening your browser's dev tools (F12), selecting XHR in its network monitor, and browsing the site.

Stopping gallery-dl on errors is possible with the actions option:

"actions": {"error": "exit"}
mikf commented 6 months ago

@biggestsonicfan There isn't, but couldn't you use "event": "init" instead? It triggers only once before the first file.

@noshii117 You can get a list of an account's followed users with

gallery-dl -g https://www.instagram.com/USER/following

where USER is your account's name. You can write them to a file by redirecting stdout with >.

gallery-dl -g https://www.instagram.com/USER/following > followed_users.txt

@pt3rrorduck The config file name for --filter is image-filter for ... reasons. Also, 'tags' should probably be a variable name instead of a string.

"image-filter": "any(tag in tags for tag in ['tag1', 'tag2', 'tag3'])"
pt3rrorduck commented 6 months ago

Thank you, "image-filter":"any(tag in 'tags' for tag in ['tag1', 'tag2', 'tag3'])", solved it,
but it only works with ' ' around 'tags'. Without it i get following error: FilterError: Evaluating filter expression failed (NameError: name 'tags' is not defined)

Vetches commented 6 months ago

Thank you so much as always for the incredibly helpful reply!

> I did look into IG's headers some time ago, and at least for the API endpoints used by gallery-dl, nothing seems to have changed. The potential problem is that IG now uses a different set of endpoints, with query parameters whose meaning I have no idea about ...
>
> You can find endpoints, headers, and parameters by opening your browser's dev tools (F12), selecting XHR in its network monitor, and browsing the site.

Oh wow, that's quite interesting! So how can gallery-dl function if IG uses different endpoints with unknown query parameters? Or do you mean that there are effectively two sets of endpoints, the ones used by gallery-dl and the ones with the unknown query parameters? Does this potentially mean that the gallery-dl endpoints could become deprecated at some point, at which point we'd have to figure out what those query parameters do?

> Stopping gallery-dl on errors is possible with the actions option:

Amazing, this is just what I was looking for, thank you so much!

mikf commented 6 months ago

@pt3rrorduck What site are you trying to use this on? Only some provide a list of tags, for some "tags" are available under a different name, and most of the time there are no tags available at all and you'd end up with a NameError exception when trying to access an undefined tags value.

With ' ' around tags, this only checks whether any of your tags can be found inside the literal string 'tags'.

@Vetches Yep, IG has multiple ways / API endpoints to access its data, which does mean that the ones currently used by gallery-dl could get deprecated or outright removed.

biggestsonicfan commented 6 months ago

Hmm, alright. How about passing the entire JSON dump to a pre-/post-processor, is that possible?

pt3rrorduck commented 6 months ago

@mikf I tried Redgifs and Newgrounds. Here is an example from the metadata JSON: "tags": ["arlong", "banbuds", "east-blue", "eastblue", "fanart", "krieg", "kuro", "luffy", "morgan", "one-piece"]

mikf commented 6 months ago

@biggestsonicfan Do you mean all collected metadata of every file/post? What exactly do you want to accomplish?

Whatever it is, you're most likely best off using a python post processor to run a custom Python function, where you could theoretically also set a flag that could prevent any further post processors from running.
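
As a hedged sketch of that flag idea (the pp.py file, the mark() function, the skip_rest key, and the ID set are all hypothetical):

# ~/pp.py
ALREADY_HANDLED = {111, 222}  # e.g. post IDs already handled elsewhere

def mark(metadata):
    # always define the flag so later "filter" expressions can test it safely
    metadata["skip_rest"] = metadata.get("id") in ALREADY_HANDLED

paired with something like:

"postprocessors": [
    {"name": "python", "event": "prepare", "function": "~/pp.py:mark"},
    {"name": "exec", "event": "after", "filter": "not skip_rest", "command": ["echo", "{id}"]}
]

The second post processor then only runs for files where the flag is false.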

@pt3rrorduck As it turns out, there has been a similar issue in the past (#2446) where I've added a contains() function as a workaround:

"image-filter": "contains(tags, ['tag1', 'tag2', 'tag3'])"

Python can't see local variables like tags inside a generator expression like any(…), it seems.
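
A quick demonstration of that quirk in plain Python, independent of gallery-dl:

ns = {"tags": ["tag1", "tag4"]}

try:
    eval("any(tag in tags for tag in ['tag1', 'tag2'])", {}, ns)
except NameError as exc:
    print(exc)  # name 'tags' is not defined

# a helper injected via globals (like contains()) sidesteps the problem
funcs = {"contains": lambda values, items: any(i in values for i in items)}
print(eval("contains(tags, ['tag1', 'tag2'])", funcs, ns))  # True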

biggestsonicfan commented 6 months ago

> Do you mean all collected metadata of every file/post? What exactly do you want to accomplish?
>
> Whatever it is, you're most likely best off using a python post processor to run a custom Python function, where you could theoretically also set a flag that could prevent any further post processors from running.

Yes, that's what I intend to do. I have a lot of kemono data without metadata, and a lot of incorrectly named files because of that. Currently I download the metadata for all revisions of a post (which can reach upwards of 20,000 JSON files per post) and then try to parse it based on the partial filename and hash from the JSON metadata. I then delete the 19,999 unused JSON metadata files. It would be nice to just send the JSON to Python, check if it's the file I am looking for, and if so, rename it and dump the JSON to the folder. If not, it doesn't download the JSON in the first place.

EDIT: I'm just passing data to a Python script via an exec post-processor, but I can't seem to find the right format string. Would it be something like {extractor.post}?

EDIT2: I can see the JSON metadata can be output to stdout, but I don't see how this can be combined with an exec command.

throwaway242685 commented 6 months ago

for Instagram, is there a way to use -o include=all but at the same time exclude tagged posts?

fireattack commented 6 months ago

Just use include=posts,reels,stories,highlights,avatar? I agree it's weird to include tagged in all, but it's just a hardcoded list, so it's very easy to work around.

throwaway242685 commented 6 months ago

so, there's no exclude option?

btw, is this the right order that all uses?

include=avatar,stories,highlights,posts,reels,tagged?

fireattack commented 6 months ago

It's in this order: https://github.com/mikf/gallery-dl/blob/831f922c1c0d2702cdb9fc205115debeee985cec/gallery_dl/extractor/instagram.py#L418-L426

Vetches commented 5 months ago

I have another question about how the actions option works, as per @mikf's post here:

So if I just want to have gallery-dl stop processing when it runs into any error (for me, that would mostly just be Instagram's HttpError), would I do something like the following?:

"actions":
  "error:/HttpError: '/g": "exit 0"
}

If so, I'm curious about the following situations:

  1. If this pattern is matched, does this mean the program won't show the [instagram][info] Use '-o cursor=... message, since the error message is rendered / returned first?
  2. Would there ever be a scenario where the pattern matches with something like a filename or post content from a user? Or does it only match against gallery-dl's prefixing of rendered messages? For example, if by some miraculous chance someone made a post whose content contained HttpError: ', and it was printed to the console, would that match the above actions property? Or is gallery-dl smart enough to only check against its own "system" messages, and ignores anything from the post itself?

Any insight on this would be greatly appreciated! Thank you so much!

AdamSaketume26 commented 5 months ago

Is there a way to use gallery-dl to download image or video URLs even if they're unsupported? I want to make a script that uses gallery-dl's archive and its post-processors, like mtime.

mikf commented 5 months ago

@Vetches It's just a plain regex (or search text in this case), without any / delimiters or flags:

"actions": {
  "error:HttpError: '": "exit 0"
}

1): This exit is a "controlled" shutdown, so it will still run all finalize() methods and print the current cursor value.

2): The regex can only ever match logging messages ("system" messages), which do not include the names of downloaded files.

@AdamSaketume26 gallery-dl supports downloading from direct links with a bunch of known filename extensions:

$ gallery-dl https://upload.wikimedia.org/wikipedia/commons/2/2f/Hubble_ultra_deep_field.jpg
upload.wikimedia.org_wikipedia_commons_2_2f_Hubble_ultra_deep_field.jpg

In my opinion, you'd be better off using curl, wget, aria2c, etc, but you do you.

AdamSaketume26 commented 5 months ago

> gallery-dl supports downloading from direct links with a bunch of known filename extensions:
>
> $ gallery-dl https://upload.wikimedia.org/wikipedia/commons/2/2f/Hubble_ultra_deep_field.jpg
> upload.wikimedia.org_wikipedia_commons_2_2f_Hubble_ultra_deep_field.jpg
>
> In my opinion, you'd be better off using curl, wget, aria2c, etc, but you do you.

Could you please add m3u8 to the known extensions, so it can get video from an m3u8 when using a direct link?

mikf commented 5 months ago

gallery-dl can't download from m3u8 manifests. Use yt-dlp (or even ffmpeg directly) for those.

mikf commented 5 months ago

@lambdadeltakay It is possible to add arbitrary Python symbols (functions, classes, etc) to the global filter namespace using the globals option.

With that, you can import a custom function and call it with locals() as its argument to give it access to all available metadata fields.
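
A hedged sketch of that setup (assuming globals accepts a path to a Python file — check the configuration docs — and with a made-up keep() helper):

"extractor": {
    "globals": "~/gdl_globals.py",
    "image-filter": "keep(locals())"
}

# ~/gdl_globals.py
def keep(metadata):
    # receives locals(), i.e. a mapping of all metadata fields for this file
    return "banned_tag" not in metadata.get("tags", ())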

ZenoArrows commented 5 months ago

I'm writing a post-processor that extracts Google Drive download links from the post contents. I could simply let the Python script handle downloading the file, but it would be nice to be able to provide the direct download link to gallery-dl and let it handle the download.

Is there currently any mechanism for a post-processor to provide additional download URLs to gallery-dl? One thing that's particularly difficult right now is determining the correct output directory for files associated with the post, since gallery-dl does not provide that as a parameter.

taskhawk commented 5 months ago

> I'm writing a post-processor that extracts Google Drive download links from the post contents. I could simply let the Python script handle downloading the file, but it would be nice to be able to provide the direct download link to gallery-dl and let it handle the download.

That sounds like it could be out of scope for this project, TBH. To handle download links from Google Drive, Mega, Dropbox, MediaFire, and others extracted from posts, I use JDownloader2. It's cross-platform, gets updated constantly to keep up with site changes, and it's handy because it lets you add account details for sites that need them.

I use the Folder Watch add-on, and what I do with gallery-dl is that in the post-processor that extracts download links, I write a .crawljob file with details like the download link, the target folder, whether archive files should be automatically extracted, etc. The .crawljob files are written to a specific folder which JDownloader checks periodically, and it manages the downloads on its own. Check out this link for more info on the Folder Watch add-on: https://support.jdownloader.org/en/knowledgebase/article/folder-watch-basic-usage
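
For reference, a minimal .crawljob sketch in the folder-watch property format (field names per the JDownloader docs linked above; all values are examples):

text=https://mega.nz/file/XXXXXXXX
downloadFolder=/downloads/artist/post-123
autoStart=TRUE
extractAfterDownload=TRUE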

> Is there currently any mechanism for a post-processor to provide additional download URLs to gallery-dl?

This one I don't know; I think it does not. You can call gallery-dl again, of course, but it will be a separate instance.

> One thing that's particularly difficult right now is determining the correct output directory for files associated with the post, since gallery-dl does not provide that as a parameter.

When using command in an exec post-processor, you have access to {_path} and {_directory}, which should contain the output location for the post according to your configuration. Check: https://gdl-org.github.io/docs/configuration.html#exec-command

ZenoArrows commented 5 months ago

I'd simply rather have gallery-dl be my one-stop downloader for everything I've configured it to crawl. The way I handle it is by using the official download API for each of those services; that way I don't have to worry about site changes. For Patreon it's pretty simple most of the time, and I just do this:

{
    "name": "exec",
    "event": "post",
    "filter": "embed and 'mega.nz' in embed['url']",
    "command": ["mega-get", "{embed['url']}", "{_directory}{id}/"]
}

However, some posts have links in the post content, and often multiple links, so for the post content I run a python post-processor instead. Except the python post-processor doesn't have access to the {_directory} parameter. Though I guess I could just run Python through exec and pass {_directory} as a parameter that way.
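
A sketch of that exec-based variant (extract_links.py is a hypothetical script; whether a {content} field exists depends on the extractor — check with gallery-dl -K):

{
    "name": "exec",
    "event": "post",
    "command": ["python", "extract_links.py", "{_directory}", "{content}"]
}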