mikf commented 6 years ago

Continuation of the old issue as a central place for any sort of question or suggestion not deserving their own separate issue.

#11 had gotten too big, took several seconds to load, and was closed as a result.

There is also https://gitter.im/gallery-dl/main if that seems more appropriate.

Hrxn commented 6 years ago

A bit of feedback: I've been downloading some stuff from Tumblr now, and everything seems to be working like a charm so far. Kudos to you. That includes --write-unsupported FILE, which detects and saves embedded links from Vine and Instagram as well as links to other external sites. It's probably safe to assume it'll work just the same with embeds from YouTube and Vimeo, which I thought to be pretty common on Tumblr, but it's entirely possible that the blogs processed so far didn't have a single one. I'll try that later on my own test blog, just to be sure.

This also led me to another idea/suggestion: How about a --write-log FILE feature? I'm aware that there is something like tee, and even something similar for PowerShell, but I think the purpose for the logging feature (at least at default setting) would not be to replicate the full output printed to the console, no, only any relevant errors or warnings, i.e. like 404, 403 and also if there occurs some timeout or something, which can be caused easily by connectivity issues, not by gallery-dl itself.

Not sure, just an idea...

mikf commented 6 years ago

Sounded like a useful feature, so I tried to put something together: https://github.com/mikf/gallery-dl/commit/97f4f15ec0f15b6bca10bf33bf83f3fcd50c1e12. It should more or less behave like one would expect, but there are at least two things that might be better handled otherwise:

Edit: Never mind the points below, I thought about it and decided to change its behavior. It is no longer persistent and stores exactly the same log messages as shown on screen (https://github.com/mikf/gallery-dl/commit/c9a9664a65b4160857834355d234ffe46107559d).

- Log files are currently persistent across gallery-dl invocations and can grow indefinitely. Other options would be deleting its old content each time or even log rotations (Python has built-in support for that as well).
- Debug log messages will never be written to a log file, even when using '--verbose'. It would be easier to copy the --verbose output from a text file instead of a console window, but writing all of the debug output to an otherwise concise log file didn't seem like a good idea.
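A minimal sketch of the non-persistent behavior described in the edit above, using only Python's standard logging module. The helper name add_log_file, the logger name and the message format are assumptions made for illustration; this is not the code from the linked commits.

```python
import logging
import os
import tempfile


def add_log_file(logger, path, fmt="[%(name)s][%(levelname)s] %(message)s"):
    """Attach a file handler that truncates the file on every run."""
    # mode="w" discards old content, so the log is not persistent
    handler = logging.FileHandler(path, mode="w", encoding="utf-8")
    handler.setFormatter(logging.Formatter(fmt))
    logger.addHandler(handler)
    return handler


log = logging.getLogger("write-log-sketch")
log.setLevel(logging.INFO)

log_path = os.path.join(tempfile.mkdtemp(), "gallery-dl.log")
handler = add_log_file(log, log_path)
log.warning("HTTP request failed: 403")

# detach and close the handler so the file is flushed to disk
log.removeHandler(handler)
handler.close()

with open(log_path, encoding="utf-8") as fp:
    content = fp.read()
```

If rotation were preferred over truncation, logging.handlers.RotatingFileHandler from the standard library could be swapped in for FileHandler.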

Hrxn commented 6 years ago

One small addition to logging behaviour:

Not entirely sure without a test case at hand now, but how's the current output for images on Tumblr relying on the fallback mechanism? I don't know if I remember that correctly, but it appeared that trying to download some specific image (e.g. some old images, old URL scheme etc.) with s3.amazonaws which then resulted in an error (403) prints the corresponding messages to the terminal, but the next successful download had the same Post ID in the name, so that would be the fallback URL used here. I hope you know what I mean, if not I'll try some random blogs again and wait for that error to appear and copy the messages here. Because the question is, should the error be printed to the output/logged to file in this case here? Because even if the "raw" URL does not work and we run into an actual 403, the fallback URL/mechanism still works and downloads that image instead successfully, so it's a bit debatable if that would really qualify as a real "error" 😄

Edit:

Had the opportunity to think about this again, and I'm not sure if it's actually worth to bother. Sure, it may be less than ideal (Think of your average end-user©® and the reaction: "OMG It says there's some error"), but this can be solved by simply, uh, explaining the stuff. And if this really warrants to deal with errors that are actually not so much of an error, by having to handle different error types, error classes, error codes and whatever, just for the sake of what, actually, consistency (?), I'm not really sure anymore.

mikf commented 6 years ago

I think it is actually worth bothering with. With the way things were, it was impossible to tell whether all files had been downloaded successfully by just looking at a log file, and error messages from an image with a fallback URL were kind of misleading as well, since the image download in question did succeed in the end.

I added two more logging messages to hopefully remedy this (https://github.com/mikf/gallery-dl/commit/db7f04dd975af53e120cfe3151b88f0bc57371cf):

- Failed to download <filename> when an image could not be downloaded, even when using fallback URLs.
- Trying fallback URL #<number> to indicate that the last error message is not fatal.

Maybe it would be better to categorize all HTTP errors as warnings and only show the Failed to download … message as definite error?
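The proposal above could look roughly like the following sketch: individual request failures are logged as warnings, and only the final "Failed to download" message is an error. The function download_with_fallback, the fake_fetch stand-in and the example URLs are hypothetical, and the log level used for the fallback message is an assumption.

```python
import logging

log = logging.getLogger("fallback-sketch")


def download_with_fallback(filename, urls, fetch):
    """Try each URL in order; return the first result or None."""
    for num, url in enumerate(urls):
        if num:
            # tells the reader that the previous message was not fatal
            log.info("Trying fallback URL #%d", num)
        try:
            return fetch(url)
        except OSError as exc:
            # a single failed request is only a warning
            log.warning("%s", exc)
    # the definite error: every URL has failed
    log.error("Failed to download %s", filename)
    return None


def fake_fetch(url):
    """Stand-in for an HTTP request; 'raw' URLs always fail."""
    if "/raw/" in url:
        raise OSError("403 Forbidden: " + url)
    return b"image-bytes"


result = download_with_fallback(
    "photo.jpg",
    ["https://example.org/raw/photo.jpg", "https://example.org/photo_1280.jpg"],
    fake_fetch,
)
```

With this split, a log file containing no error-level lines would mean that every file was downloaded, even if some warnings were recorded along the way.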

Hrxn commented 6 years ago

Maybe it would be better to categorize all HTTP errors as warnings and only show the Failed to download … message as definite error?

Yeah, sounds good. 👍

rachmadaniHaryono commented 6 years ago

@mikf,

which extractor class from common module should i use to make CustomExtractor? is there any requirement for each extractor class?
can you explain the extractor class' attribute? (e.g. from BooruExtractor have 'basecategory', 'filename_fmt', 'api_url', 'per_page', 'page_start', 'page_limit', etc)
is there any rule on how metadata attribute on extractor class should be built, or is it up to each extractor?

mikf commented 6 years ago

Generally you should use the basic Extractor class, but, as always, it depends. There are some general extractor sub-classes (BooruExtractor, FoolslideExtractor, FoolfuukaExtractor, ...) and it might also be helpful to just copy an existing extractor module and adjust it to your needs. As for requirements: set the category, subcategory, filename_fmt, directory_fmt and pattern class attributes to some reasonable values (see, for example, slideshare.py).
category and subcategory are essentially an extractor's name and are used for config-lookup. directory_fmt and filename_fmt are default values for the directory and filename options. pattern is a list of regex-strings. An extractor is used if one of them matches the given URL. The resulting match-object is the second parameter to an extractor's __init__() method. basecategory has to do with shared config values, just ignore it.

The other attributes you listed are BooruExtractor-specific:
- api_url: URL to send API requests to
- per_page: number of post-entries per page
- page_start: the first page (0 or 1 depending on site)
- page_limit: largest valid page number
You kind of asked the same thing before. It is up to each extractor, but similar ones should use the same key-names. For image-metadata, you should always provide the filename extension as extension or at least set it to None.
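To make the roles of these attributes concrete, here is a self-contained sketch. The Extractor class below is a stand-in written for this example, not gallery-dl's real base class, and ExampleUserExtractor, find() and the example.org URL are all hypothetical.

```python
import re


class Extractor:
    """Stand-in base class for this sketch (NOT gallery-dl's real class)."""
    category = ""
    subcategory = ""
    directory_fmt = ["{category}"]
    filename_fmt = "{name}.{extension}"
    pattern = []

    def __init__(self, match):
        # the match object of the pattern that matched the URL
        self.match = match


class ExampleUserExtractor(Extractor):
    category = "example"        # used together with subcategory
    subcategory = "user"        # for config lookup
    directory_fmt = ["{category}", "{user}"]
    filename_fmt = "{id}.{extension}"
    pattern = [r"(?:https?://)?example\.org/user/([^/?#]+)"]

    def __init__(self, match):
        Extractor.__init__(self, match)
        # any attribute name works for values taken from the match
        self.user = match.group(1)


def find(url, classes):
    """Return an instance of the first class with a matching pattern."""
    for cls in classes:
        for pattern in cls.pattern:
            match = re.match(pattern, url)
            if match:
                return cls(match)
    return None


extr = find("https://example.org/user/alice", [ExampleUserExtractor])
```

Here extr.user holds the captured user name, and a URL that matches no pattern makes find() return None.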

rachmadaniHaryono commented 6 years ago

is there any rule on how metadata attribute on extractor class should be built, or is it up to each extractor?

You kind of asked the same thing before. It is up to each extractor, but similar ones should use the same key-names. For image-metadata, you should always provide the filename extension as extension or at least set it to None.

actually i got the wrong impression from chan.py. i thought that to use the match object later on items method it should be stored on class' attribute metadata. but after i look at slideshare.py, any name can be used (e.g. user and presentation)

here is what i can come up with https://gist.github.com/rachmadaniHaryono/e7d40fcc5b9cd6ecc1f9151c4f0f5d84

full code https://github.com/rachmadaniHaryono/RedditImageGrab/blob/master/redditdownload/api.py

this module will not download a file, but it will only extract from url

rachmadaniHaryono commented 6 years ago

@mikf can you give example for https://github.com/mikf/gallery-dl/commit/6a07e3836603407ee7fc17305f0ae7165b76f83c ?

mikf commented 6 years ago

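from gallery_dl import extractor
from gallery_dl.extractor.common import Extractor
from my_project import module_with_extractors

class SomeExtractor(Extractor):
    ...

# register a single extractor class
extractor.add(SomeExtractor)
# register all extractor classes from a module
extractor.add_module(module_with_extractors)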

You should use these functions instead of manually manipulating extractor._cache and relying on implementation details.

ChiChi32 commented 6 years ago

Am I doing something wrong? I also tried the option in the config file, but it doesn't work.

[Screenshot: 2018-02-25_000317]

Hrxn commented 6 years ago

Which version of gallery-dl is that? Can you run gallery-dl -v please?

Bfgeshka commented 6 years ago

Can we have percent-encoding conversions for saved files? I.e. replacing %20 in filename with whitespace, %22 with ", etc.

ChiChi32 commented 6 years ago

@Hrxn, I'm a bit embarrassed ... I found a strange thing. I have 2 folders, gallery_dl and gallery_dln. The first is the old version 1.1.2, the second is 1.2.1. Both are in the same directory. When I run any command using the bat file from the folder with the new version, the modules are taken from the old one. When I run -version from the 1.2.1 folder, 1.1.2 is displayed. I do not think this is a problem with the program, but rather with Windows or Python. I apologize for the disturbance.

mikf commented 6 years ago

@ChiChi32 the __main__.py file expects to sit inside a directory named gallery_dl. In your specific case it adds F:\Python to its PYTHONPATH environment and then imports the gallery_dl package, which is the older 1.1.2 version. If you want to use multiple versions at the same time, you could try a directory-structure like

Python
|- gallery-dl-1.1.2
|  \- gallery_dl
|     |- __main__.py
|     |- ...
\- gallery-dl-1.2.1
   \- gallery_dl
      |- __main__.py
      |- ...
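The lookup described above can be traced with a small sketch. It assumes that the directory added to the module search path is the parent of the directory containing __main__.py, which is how the explanation reads; package_parent is a hypothetical helper, and ntpath is used so that Windows path rules apply on any platform.

```python
import ntpath  # Windows path rules, usable on any platform


def package_parent(main_file):
    """Parent of the directory that contains __main__.py."""
    return ntpath.dirname(ntpath.dirname(main_file))


# __main__.py of the newer version sits in F:\Python\gallery_dln
search_dir = package_parent(r"F:\Python\gallery_dln\__main__.py")

# "import gallery_dl" then looks for a package with exactly that name
imported_from = ntpath.join(search_dir, "gallery_dl")
```

Because the import goes by the package name gallery_dl, the folder named gallery_dln is never considered, and the older copy next to it is picked up instead.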

@Bfgeshka Sure, I think I'll add another conversion option for format strings to let users unquote the "offending" parts of a filename. These percent-encoding conversions (and similar) for each metadata-field are usually already handled as necessary. Where did you find something that hasn't been properly converted?
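For reference, the conversion itself is available in Python's standard library as urllib.parse.unquote. The sketch below only shows that call; unquote_filename and the example name are hypothetical, and how such a conversion would be exposed in format strings is not shown here.

```python
from urllib.parse import unquote


def unquote_filename(name):
    """Replace percent-encoded sequences such as %20 or %22."""
    return unquote(name)


decoded = unquote_filename("my%20file%20%22final%22.jpg")
```

Note that a decoded double quote is not a valid character in Windows filenames, so it would still need to be replaced or removed there.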

Bfgeshka commented 6 years ago

@mikf I encountered it in direct link download.

Hrxn commented 6 years ago

Some small thing I've noticed. Not a real issue deserving of a ticket, I presume. But still curious what it means, or what is the cause behind it.

PS E:\> gallery-dl.exe -v 'https://gfycat.com/distortedmemorableibizanhound'
[gallery-dl][debug] Version 1.3.1
[gallery-dl][debug] Python 3.4.4 - Windows-10-10.0.16299
[gallery-dl][debug] requests 2.18.4 - urllib3 1.22
[gallery-dl][debug] Starting DownloadJob for 'https://gfycat.com/distortedmemorableibizanhound'
[gfycat][debug] Using GfycatImageExtractor for 'https://gfycat.com/distortedmemorableibizanhound'
[urllib3.connectionpool][debug] Starting new HTTPS connection (1): gfycat.com
[urllib3.connectionpool][debug] https://gfycat.com:443 "GET /cajax/get/distortedmemorableibizanhound HTTP/1.1" 200 None
[urllib3.connectionpool][debug] Starting new HTTPS connection (1): giant.gfycat.com
[urllib3.connectionpool][debug] https://giant.gfycat.com:443 "GET /DistortedMemorableIbizanhound.webm HTTP/1.1" 200 18240849
* E:\Transfer\INPUT\GLDL\Anims\\fluid in an invisible box750-720p DistortedMemorableIbizanhound.webm
PS E:\>

The download seems to work, as is apparent above. But the output is a bit different from what I'm used to seeing; just look at this output path: E:\Transfer\INPUT\GLDL\Anims\\fluid in an invisible box750- [...]

The extra backslash, as if some directory is missing in between.

I post the configuration used here:

The general part (keywords, and keywords-default)

{
"base-directory": "E:\\Transfer\\INPUT\\GLDL",
"netrc": false,

"downloader":
{
    "part": true,
    "part-directory": null,
    "http":
    {
        "rate": null,
        "retries": 5,
        "timeout": 30,
        "verify": true
    }
},
"extractor":
{
    "keywords": {"bkey": "", "ckey": "", "tkey": "", "skey": "", "mkey": ""},
    "keywords-default": "",
    "archive": "E:\\Transfer\\INPUT\\GLDL\\_Archives\\gldl-archive-global.db",
    "skip": true,
    "sleep": 0,
[...]

Gfycat

    "gfycat":
    {
        "directory": ["Anims", "{bkey}", "{ckey}", "{tkey}", "{skey}", "{mkey}"],
        "filename": "{title:?/ /}{gfyName}.{extension}",
        "format": "webm"
    },

But it does not happen here, for example:

Imgur

    "imgur":
    {
        "image":
        {
            "directory": ["{bkey}", "{ckey}", "{tkey}", "{skey}", "{mkey}", "Images"],
            "filename": "{title:?/ /}{hash}.{extension}"
        },
        "album":
        {
            "directory": ["{bkey}", "{ckey}", "{tkey}", "{skey}", "{mkey}", "Albums", "{album[title]:?/ /}{album[hash]}"],
            "filename": "{album[hash]}_{num:>03}_{hash}.{extension}"
        },
        "archive": "E:\\Transfer\\INPUT\\GLDL\\_Archives\\gldl-archive-imgur.db",
        "mp4": true
    },

Single Image:


PS E:\> gallery-dl.exe -v 'https://imgur.com/5m4CFZS'
[gallery-dl][debug] Version 1.3.1
[gallery-dl][debug] Python 3.4.4 - Windows-10-10.0.16299
[gallery-dl][debug] requests 2.18.4 - urllib3 1.22
[gallery-dl][debug] Starting DownloadJob for 'https://imgur.com/5m4CFZS'
[imgur][debug] Using ImgurImageExtractor for 'https://imgur.com/5m4CFZS'
[urllib3.connectionpool][debug] Starting new HTTPS connection (1): imgur.com
[urllib3.connectionpool][debug] https://imgur.com:443 "GET /5m4CFZS HTTP/1.1" 200 49800
[urllib3.connectionpool][debug] Starting new HTTPS connection (1): i.imgur.com
[urllib3.connectionpool][debug] https://i.imgur.com:443 "GET /5m4CFZS.png HTTP/1.1" 200 1940189

* E:\Transfer\INPUT\GLDL\Images\Caley Rae Pavillard 5m4CFZS.png
PS E:\>

Album:

PS E:\> gallery-dl.exe -v 'https://imgur.com/a/jQxtc'
[gallery-dl][debug] Version 1.3.1
[gallery-dl][debug] Python 3.4.4 - Windows-10-10.0.16299
[gallery-dl][debug] requests 2.18.4 - urllib3 1.22
[gallery-dl][debug] Starting DownloadJob for 'https://imgur.com/a/jQxtc'
[imgur][debug] Using ImgurAlbumExtractor for 'https://imgur.com/a/jQxtc'
[urllib3.connectionpool][debug] Starting new HTTPS connection (1): imgur.com
[urllib3.connectionpool][debug] https://imgur.com:443 "GET /a/jQxtc/all HTTP/1.1" 200 62847
[urllib3.connectionpool][debug] Starting new HTTPS connection (1): i.imgur.com
[urllib3.connectionpool][debug] https://i.imgur.com:443 "GET /t9CD48N.jpg HTTP/1.1" 200 126079
* E:\Transfer\INPUT\GLDL\Albums\jQxtc\jQxtc_001_t9CD48N.jpg
[urllib3.connectionpool][debug] https://i.imgur.com:443 "GET /VoGBS4N.jpg HTTP/1.1" 200 148669
* E:\Transfer\INPUT\GLDL\Albums\jQxtc\jQxtc_002_VoGBS4N.jpg
[urllib3.connectionpool][debug] https://i.imgur.com:443 "GET /svbJXyy.jpg HTTP/1.1" 200 146013
* E:\Transfer\INPUT\GLDL\Albums\jQxtc\jQxtc_003_svbJXyy.jpg
[urllib3.connectionpool][debug] https://i.imgur.com:443 "GET /kDjvkrD.jpg HTTP/1.1" 200 130492
* E:\Transfer\INPUT\GLDL\Albums\jQxtc\jQxtc_004_kDjvkrD.jpg
[urllib3.connectionpool][debug] https://i.imgur.com:443 "GET /GxPVJSw.jpg HTTP/1.1" 200 154477
* E:\Transfer\INPUT\GLDL\Albums\jQxtc\jQxtc_005_GxPVJSw.jpg
[urllib3.connectionpool][debug] https://i.imgur.com:443 "GET /tUIUbSL.jpg HTTP/1.1" 200 194268
* E:\Transfer\INPUT\GLDL\Albums\jQxtc\jQxtc_006_tUIUbSL.jpg
[urllib3.connectionpool][debug] https://i.imgur.com:443 "GET /vcvv1r0.jpg HTTP/1.1" 200 193132
* E:\Transfer\INPUT\GLDL\Albums\jQxtc\jQxtc_007_vcvv1r0.jpg
[urllib3.connectionpool][debug] https://i.imgur.com:443 "GET /YBQddcB.jpg HTTP/1.1" 200 147301
* E:\Transfer\INPUT\GLDL\Albums\jQxtc\jQxtc_008_YBQddcB.jpg
[urllib3.connectionpool][debug] https://i.imgur.com:443 "GET /FkuxOXZ.jpg HTTP/1.1" 200 169420
* E:\Transfer\INPUT\GLDL\Albums\jQxtc\jQxtc_009_FkuxOXZ.jpg
[urllib3.connectionpool][debug] https://i.imgur.com:443 "GET /MB30wRC.jpg HTTP/1.1" 200 223108
* E:\Transfer\INPUT\GLDL\Albums\jQxtc\jQxtc_010_MB30wRC.jpg
[urllib3.connectionpool][debug] https://i.imgur.com:443 "GET /TGnsGoh.jpg HTTP/1.1" 200 147744
* E:\Transfer\INPUT\GLDL\Albums\jQxtc\jQxtc_011_TGnsGoh.jpg
PS E:\>

Will add some more tests eventually, to see if I can get any different results with various input file options.

But so far, it seems to have something to do with "directory": ["{bkey}", .... ..., output path beginning with my custom keyword vs. "directory": ["Anims", "{bkey}",... ... output path starts with a fixed directory.

mikf commented 6 years ago

This happens because of os.path.join()'s behavior when using an empty string as the last argument:

>>> from os.path import join
>>> join("", "d1", "", "d2")
'd1/d2'
>>> join("", "d1", "", "d2", "")
'd1/d2/'

It adds a slash (or back-slash on Windows) to the end if the last argument is an empty string.

I've been using path = directory + separator + filename to build the final complete path with the assumption that all directories don't have a path-separator at the end, which, in your case, resulted in two of them ("...\Anims\" + "\" + "fluid in an invisible box...").
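The effect can be reproduced on any platform with ntpath, which applies Windows path rules. This is a sketch with a made-up base directory and filename; it contrasts the manual concatenation described above with letting join() place the separator.

```python
import ntpath  # Windows path rules, usable on any platform

# trailing empty components leave a separator at the end
directory = ntpath.join("E:\\base", "Anims", "", "")

# manual concatenation adds a second separator
naive = directory + "\\" + "file.webm"

# join() only inserts a separator where one is missing
safer = ntpath.join(directory, "file.webm")
```

Here directory is E:\base\Anims\ with a trailing backslash, naive contains two consecutive backslashes before the filename, and safer contains one.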

Hrxn commented 6 years ago

Ah, thanks. Makes sense. This behaviour of os.path.join() is again something which makes me wonder if such behavior is intentional, or if it is just some quirk. This time at least it's a quirk that affects all platforms in the same way, right? 😄

Edit: Ha, my mistake. It's probably intentional, I see where this could be useful.

Hrxn commented 6 years ago

BTW, everything works with the latest commit: the output path for the example URL above is correct now, and I did not encounter the issue anywhere else!

reversebreak commented 6 years ago

Just something quick I noticed - gallery-dl appears to be unable to handle certain emoji appearing in captions on tumblr (and maybe elsewhere??). (Warning - the post I was able to trigger this with is rather nsfw.) Running --list-keywords on an offending post with --verbose and piping the error output with 2> to a file gets me

[gallery-dl][debug] Version 1.3.2
[gallery-dl][debug] Python 3.4.4 - Windows-7-6.1.7601-SP1
[gallery-dl][debug] requests 2.18.4 - urllib3 1.22
[gallery-dl][debug] Starting KeywordJob for 'http://aurahack18.tumblr.com/post/172338300565'
[tumblr][debug] Using TumblrPostExtractor for 'http://aurahack18.tumblr.com/post/172338300565'
[urllib3.connectionpool][debug] Starting new HTTPS connection (1): api.tumblr.com
[urllib3.connectionpool][debug] https://api.tumblr.com:443 "GET /v2/blog/aurahack18.tumblr.com/info?api_key=O3hU2tMi5e4Qs5t3vezEi6L0qRORJ5y9oUpSGsrWu8iA3UCc3B HTTP/1.1" 200 371
[urllib3.connectionpool][debug] https://api.tumblr.com:443 "GET /v2/blog/aurahack18.tumblr.com/posts?reblog_info=true&id=172338300565&api_key=O3hU2tMi5e4Qs5t3vezEi6L0qRORJ5y9oUpSGsrWu8iA3UCc3B&offset=0&limit=50 HTTP/1.1" 200 1374
[tumblr][error] An unexpected error occurred: UnicodeEncodeError - 'cp932' codec can't encode character '\u2661' in position 3: illegal multibyte sequence. Please run gallery-dl again with the --verbose flag, copy its output and report this issue on https://github.com/mikf/gallery-dl/issues .
[tumblr][debug] Traceback
Traceback (most recent call last):
  File "E:\gallery-dl\gallery_dl\job.py", line 64, in run
  File "E:\gallery-dl\gallery_dl\job.py", line 117, in dispatch
  File "E:\gallery-dl\gallery_dl\job.py", line 131, in handle_urllist
  File "E:\gallery-dl\gallery_dl\job.py", line 236, in handle_url
  File "E:\gallery-dl\gallery_dl\job.py", line 279, in print_keywords
UnicodeEncodeError: 'cp932' codec can't encode character '\u2661' in position 3: illegal multibyte sequence

This is using the executable download, btw.

Hrxn commented 6 years ago

You are running this via CMD.exe I presume?

What happens if you run chcp 65001 in CMD first, and then run gallery-dl?

Edit:

Or try to use Powershell. I've moved completely to Powershell by now as well..

mikf commented 6 years ago

This is a more general problem with the interaction between Windows, the Python interpreter, Unicode, code pages and so on.

As @Hrxn mentioned, you should be able to work around this yourself by changing the default code page to UTF-8 via chcp 65001. Another way is to set the PYTHONIOENCODING environment variable to utf-8 before running gallery-dl:

E:\>set PYTHONIOENCODING=utf-8
E:\>py -3.4 -m gallery_dl -K http://aurahack18.tumblr.com/post/172338300565
...

Python 3.6 and above also doesn't have this problem (it implements PEP 528), so using this instead of the standalone exe might be another option.

I tried to implement a simple workaround in https://github.com/mikf/gallery-dl/commit/0381ae53184108d2c6985d7200244b402a5c3b55 by setting the default error handler for stdout and co. to replace all non-encodable characters with a question mark. Tested this on my Windows 7 VM with Python3.3 to 3.6 and it seems to work.
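The idea of a "replace" error handler can be sketched with the standard library alone. This is not the code from the linked commit; make_replacing_stream is a hypothetical helper, an in-memory buffer stands in for the console, and ASCII stands in for a limited console code page such as cp932.

```python
import io


def make_replacing_stream(raw, encoding):
    """Wrap a binary stream; non-encodable characters become '?'."""
    return io.TextIOWrapper(raw, encoding=encoding, errors="replace")


# in-memory stand-in for a console with a limited encoding
buffer = io.BytesIO()
stream = make_replacing_stream(buffer, "ascii")

# '\u2661' (white heart) cannot be encoded as ASCII
stream.write("caption \u2661")
stream.flush()

written = buffer.getvalue()
```

With the default errors="strict" the same write would raise UnicodeEncodeError, which is the kind of failure shown in the traceback above.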

reversebreak commented 6 years ago

Thanks. I got Python 3.6, installed it with PIP as per instructions and it doesn't crash anymore. I put Python on my PATH and it's the same user experience as using the EXE anyways.

Now, I might be missing something, but is there any way to extract "gallery order" for use in the filenames from Tumblr?
Right now, I can point gallery-dl at a tumblr post, and it gets all the pictures fine - but sometimes that means for short two-to-ten page comic posts the files are downloaded out of order.
After doing some testing and poking around it seems that the order that the files display in a photo post isn't necessarily the same as the order of their filenames (or the 'name' parameter).

You can test this yourself by creating a photo post, uploading say four photos one at a time, then saving the post.
Go back and edit the post, and drag photo 2 before photo 1, and save it again.
If you use CTRL+RIGHTCLICK+"View Image" on each in turn your tabs should go to filenames based off of _o2, _o1, _o3, _o4.
If you point gallery-dl at it it'll download it just fine, but the end result will sort as _o1, _o2, _o3, _o4. This will show the gallery in the wrong order, which is terrible for comic posts. Unusually, gallery-dl seems to download them in _o2, _o1, _o3, _o4 order, (according to the on-screen status), but I can't see it exposing that order to the user in parameters anywhere.

In addition to this, I don't see a way to extract the 'core name' of a file for use in the extractor.*.filename parameter.
Tumblr filenames when downloaded without a filename parameter come out something like tumblr_fakeusername_12345678912o1.jpg However, using the filename parameter to add extra stuff to the filename means you can't get that clean end anymore.
The closest parameter is 'name', which comes out something like tumblr_12345678912o1_r1_1280 or similar, when all you really want is the 12345678912o1 that only gallery-dl's default naming scheme seems to get access to.

Hrxn commented 6 years ago

Right now, I can point gallery-dl at a tumblr post, and it gets all the pictures fine - but sometimes that means for short two-to-ten page comic posts the files are downloaded out of order. After doing some testing and poking around it seems that the order that the files display in a photo post isn't necessarily the same as the order of their filenames (or the 'name' parameter).

Yes, I know what you mean. This was never a problem for me so far, because I've only downloaded picture sets that are just, well, a set of pictures, apparently, so the order was actually not relevant. But I agree, it's entirely different for something like a comic strip.

The filenames in a set post do not reflect the displayed order of the elements, as you already said. You also stated the reason for this, if you make a picture post and upload some files, they get generated names in this order. But this can be rearranged now, changing the order of the displayed items. What happens is that the structure you see in the end (in HTML) has the rearranged order as done by the creator of the post, but the filenames keep being the same as they were at the upload.

If you point gallery-dl at it it'll download it just fine, but the end result will sort as _o1, _o2, _o3, _o4. This will show the gallery in the wrong order, which is terrible for comic posts.

I assume what you mean with end result here is the order of the actual downloaded files. Yes, that is how they are sorted by the filesystem, in "natural" order.

Unusually, gallery-dl seems to download them in _o2, _o1, _o3, _o4 order, (according to the on-screen status), but I can't see it exposing that order to the user in parameters anywhere.

Now this bit is really interesting. Because gallery-dl just takes what it gets from the API, and this seems to indicate that the API returns the single elements in the correct order, i.e. as rendered in a browser. This would be good, because I think this is something which could be fixed without jumping through any hoops. So gallery-dl can avoid fetching the HTML for a post entry and extracting the order from there..

In addition to this, I don't see a way to extract the 'core name' of a file for use in the extractor.*.filename parameter.

Not exactly sure what you mean here. The filename (standard) is this: https://github.com/mikf/gallery-dl/blob/ffc0c67701ee0902c86d27f334d51a132cc475e2/gallery_dl/extractor/tumblr.py#L54 Compared to your example at the end, the number in the filename is {id}, and that oX part is {offset}.

Small side note: This is a bit confusing within the source code, because "offset" is also used as the name for the parameter to retrieve the posts from the API.

reversebreak commented 6 years ago

Ah, yes I did mean the final order as sorted by the filesystem - since there's no way right now to get 'gallery order' from gallery-dl, then the only ordering the filesystem has to go off of is the filename with the oX at the end (the "offset", as you said).

Just as a note, I haven't done thorough testing on all cases of the reordered gallery, so I haven't proven the ordering comes out like that in all cases.
Assuming it does, then I expect it'd be pretty easy to put a counter on the loop that downloads photos in posts.

Ah, sorry, I was using several different posts for testing and got confused about the outputs. I didn't notice the ID parameter was being used for the default - I thought I was getting a short form of the name parameter at the end of the default filename.

mikf commented 6 years ago

Another assumption that turned out to be false ... I assumed the _o1, _o2, ... parts of image URLs were assigned by the order in which they are returned by Tumblr's API. Instead these "offsets" are determined by upload order and the API returns images as the blog owner has arranged them, which can be completely arbitrary.

Here is a test-post if you want a simple example for that: https://mikf123.tumblr.com/post/172687798174/photo-order-test . Images were uploaded 1 to 4 and I re-arranged them afterwards. gallery-dl will use "o1" for image number 4, which also has an o4 in its URL, and so on.

The offset value from gallery-dl is just a simple counter that goes up by 1 for every item returned, which would work if images always stayed in their original order and it also functions as a generic solution for audio and video files which don't have an oX part.

https://github.com/mikf/gallery-dl/commit/4a26ae32df4cadee82959ae390a815dd7685c07e adds an option which, if enabled, sorts photos of a photoset by the oX parts in their URL. This should assign the correct offset value for each image (except if one is deleted ...) and have them in the right order. (The default value is false for backward compatibility's sake)
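A rough sketch of what sorting by the oX part could look like. It is not the code from the commit: url_offset, the regular expression and the example URLs are assumptions, with the URL shape modeled on the filenames quoted earlier in this thread.

```python
import re


def url_offset(url):
    """Return N from the 'oN' part of a photo URL, or 0 if absent."""
    match = re.search(r"o(\d+)(?:_r\d+)?_\d+\.\w+$", url)
    return int(match.group(1)) if match else 0


# order as returned by the API, i.e. as arranged by the blog owner
api_order = [
    "https://example.org/tumblr_p6x2abcde1uo4_1280.jpg",
    "https://example.org/tumblr_p6x2abcde1uo2_r1_1280.jpg",
    "https://example.org/tumblr_p6x2abcde1uo1_1280.jpg",
    "https://example.org/tumblr_p6x2abcde1uo3_1280.jpg",
]

# sorting by the oX part restores the upload order
upload_order = sorted(api_order, key=url_offset)
offsets = [url_offset(url) for url in upload_order]
```

A plain counter over upload_order then matches the oX part of each URL, as long as no image of the set has been deleted.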

Another thing that might be useful is the "middle-part" of a Tumblr filename, which is now available for format strings as hash (https://github.com/mikf/gallery-dl/commit/6b72be8ee6d66259c28367dc6d963ac6c9988fec). The post ID (and not something like this middle-part) is used in the default filename because of the order when viewing these files in a file manager or image viewer. Post IDs are nice increasing numbers while these hash values are basically random.

(... and if any of you have better names for sort or hash, please let me know. I'm terrible at naming things)

Hrxn commented 6 years ago

(... and if any of you have better names for sort or hash, please let me know. I'm terrible at naming things)

No, I think the names are fine. My remark was just that "offset" is used in two different contexts in tumblr.py, I wanted to just mention this as a possible caveat, to prevent misunderstandings. But both are used (and named) correctly, I'd say. The Tumblr API Doc explicitly mentions "offset" as the parameter name for the /posts endpoint. As for the other "offset" (from within a single post, not offset between multiple posts), I don't really know, this could just be named {object-..} [order|number|count], whatever, or just {order} 😄

And sorry for adding a bit of confusion myself here: The {id} used in the filename is created by gallery-dl, and that is the ID of the post, as mikf correctly pointed out. And this is what you want to use, because the order of downloaded files in the filesystem is now the same as the order of the posts on the blog as displayed via browser.

The filename used/generated by Tumblr itself is something entirely different, and that number (before the oN) is, as mentioned above, simply some sort of hash value based on the file. So @reversebreak can now use {hash} for the output of the files handled by gallery-dl, yes. But, to be honest, I don't know why someone would use this. I can't think of any possible use case right now. Because it is just a random hash value. Useful for dumping many files into the same directory, sure, but gallery-dl could already handle this before, because post ID + "object-offset" is already unique. But if @reversebreak wants to use it, sure, why not.

https://github.com/mikf/gallery-dl/blob/4a26ae32df4cadee82959ae390a815dd7685c07e/gallery_dl/extractor/tumblr.py#L98-L101

Do I get this here right? The default value being false, https://github.com/mikf/gallery-dl/blob/4a26ae32df4cadee82959ae390a815dd7685c07e/gallery_dl/extractor/tumblr.py#L66

unless some user explicitly sets this option, the sort function (together with _get_tumblr_offset()) never gets used.

So, what this sort here actually does, is sorting the URLs to be processed in the order of the filename as generated by Tumblr ("filename-offset"), right?

So, in conclusion, without calling this function, gallery-dl already uses the order as returned by the API - unless I'm totally blind and miss some other change in that commit?

But that already is the order that we actually want, right? The description in configuration.rst:

+ Otherwise these photos will be returned in the order the blog owner
+ arranged them in.

I mean, yes, that is the preferred order we want. Again, I don't understand the use case? Why would you not want the order the blog owner arranged them in, and instead the order of the upload? I doubt that anyone does this upload in a specific order. They just dump some files on Tumblr, and they generate the "filename-offset", just to keep the filenames unique. But there is not any planned order here, or am I totally confused now? The planned order is the one as arranged by the blog owner, why would anyone want to use another order?

But the arranged order is obviously relevant, and @reversebreak correctly noticed this and made a good request here. I simply never noticed this before, although I've probably downloaded thousands of photosets. But apparently not one comic (or anything else where I immediately would notice the order), so this is why I did not stumble upon this yet.

Small additional note here: I think that I remember having both a browser open with a certain Tumblr blog and an image viewer at the same time, but the image viewer was set to sort/display by last modification date, so this would explain why I did not observe this behaviour yet. This would also be a quick workaround for @reversebreak , just display by modification date. But don't pin me down on this, basically just a recollection out of hazy memory.

mikf commented 6 years ago

I mean, yes, that is the preferred order we want. Again, I don't understand the use case? Why would you not want the order the blog owner arranged them in, and instead the order of the upload? ...

gallery-dl has always just returned images in the same order the API returned them in, which is also the same as the order the owner arranged them in, but apparently that is the wrong order for some comics:

This will show the gallery in the wrong order, which is terrible for comic posts.

The sort option (as well as the hash value) are basically an attempt to allow users to sort images in the "original" order, which could solve @reversebreak's problem.

Do I get this here right? The default value being false, unless some user explicitly sets this option, the sort function (together with _get_tumblr_offset()) never gets used.

As I said: "backward compatibility". This will most likely change if this option actually sticks, but only on bigger version updates. It affects which image gets assigned which offset value and that is not something that should just be altered.

edit: That is the same reason I don't want to rename {offset} right now. It would break existing configs. I guess I could add a bit of redundancy and the same value twice: one as (deprecated) offset and another one as {num} or whatever

mikf commented 6 years ago

Also, I kind of missed this before:

Unusually, gallery-dl seems to download them in _o2, _o1, _o3, _o4 order, (according to the on-screen status)

That doesn't happen, ever. The oX values for default filenames are basically a loop counter, as you have called it before, and only increase.

Hrxn commented 6 years ago

That doesn't happen, ever. The oX values for default filenames are basically a loop counter, as you have called it before, and only increase.

Ah, okay. I guess that would explain this misunderstanding. It has no further significance that this item number being ultimately used by gallery-dl is just a loop counter, and that the filename produced here resembles (at least at this part) the filename format used by Tumblr is just coincidence, more or less. Or your choice to follow the naming convention at least partially..

edit: That is the same reason I don't want to rename {offset} right now. It would break existing configs. I guess I could add a bit of redundancy and the same value twice: one as (deprecated) offset and another one as {num} or whatever

Yes, that is a very good reason to keep it, obviously. I'm not even suggesting to change it at all, just a thought to mention this potential "pitfall", just in case. Preemptive concern... I'm starting to regret that I've even brought it up in the first place.. 😄

gallery-dl has always just returned images in the same order the API returned them in, which is also the same as the order the owner arranged them in, but apparently that is the wrong order for some comics

Glad to hear, because the API basically has always returned the correct order, and gallery-dl kept that. I think I see the culprit now, that this is some sort of edge case @reversebreak discovered where this all breaks down. By the way, at this point an example link to an affected post would be a good idea. Otherwise this can't be tested. I put the blame on Tumblr's theming capabilities. The display of the blog can be customized pretty freely with HTML, CSS, etc. so this might be the cause why the actual order of the displayed elements has been rearranged again here..

reversebreak commented 6 years ago

Apologies, all. I think I may have led us all on a wild goose chase here.

I think this all stems from my assumption that the oX tag at the end of the default filename that gallery-dl creates when downloading tumblr galleries was the same as the oX that appears at the end of the name parameter. So in my efforts to introduce words to the filenames and not just IDs, my filename parameter was initially something like {id}_{slug}_{name}.{extension}. However, that means that the resulting files would come out reflecting the tumblr root filenames. As a result of that, it meant that the resulting on-disk galleries would be in the wrong order as they are naturally sorted by the filename. This also explains what I was seeing in the output logs with stuff coming out in _o2, _o1, _o3, _o4 order, as that reflects the tumblr name parameter.

I turned off all filename parameters, and just went with the default tonight to test it, and it actually comes out in the right order on-disk, even using a rearranged gallery as the source. The default filename comes out with o1 o2 o3 in the order they appear in the gallery, even though the source filenames are _o2, _o1, _o3.

When looking through -K to build the filename string I saw offset in that list - but there wasn't any indication that that meant "this is the index number of a picture in the gallery in a particular post".
I looked at the tumblr API and saw 'offset' there as a parameter to a search/filter and thought it was something to do with that.

So I guess in summary:

Tumblr images are named with an increasing numerical tag in upload order, o1 o2 o3 etc
Tumblr image galleries can be in an arbitrary order, completely independent of filename order
The Tumblr API returns images in gallery order, not filename order
gallery-dl default filenames are tagged with a label on the end o1 o2 o3 etc. This number is completely independent of the o1 o2 o3 that may be on the tumblr filename, and is instead dependent on the order of the images returned by the API. gallery-dl is actually working perfectly fine by default.
It is very much not obvious what the offset keyword does when using -K, which leaves the name or filename keywords as the only obvious sources of numbering for someone constructing an altered filename. offset sharing a name with a real tumblr API parameter doesn't help

So in reality, the only thing that needs fixing out of all this is making it clear what offset does to a casual user of the program. Either rename it to something like gallery_index or decorate the output from -K to provide an explanation for it (and maybe other parameters). Make it clear that if you want gallery order in your filenames that that's the parameter to include.
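The difference between gallery order and filename order summarized above can be illustrated with a small sketch. The file names and the `upload_index` helper are hypothetical and only mimic the shape of Tumblr's names; this is not gallery-dl's actual code.

```python
import re

# Hypothetical file names in the order the API returns them (gallery order).
# The "oN" tag before the size suffix stands for Tumblr's upload-order index.
api_order = [
    "tumblr_abc123o2_1280.png",
    "tumblr_abc123o1_1280.png",
    "tumblr_abc123o3_1280.png",
]


def upload_index(filename):
    """Return the N of the trailing 'oN' tag, or 0 if there is none."""
    match = re.search(r"o(\d+)_[^_]+\.\w+$", filename)
    return int(match.group(1)) if match else 0


# Numbering by position in the API response keeps the owner's arrangement.
gallery_numbered = list(enumerate(api_order, 1))

# Sorting by the name's own tag follows upload order and loses the arrangement.
upload_sorted = sorted(api_order, key=upload_index)

for num, name in gallery_numbered:
    print(num, name)
```

A filename format built from the position counter therefore sorts correctly on disk, while one built from the Tumblr name does not.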

mikf commented 6 years ago

Thanks for investigating and reporting back; my apologies for the confusing offset name and default filename structure.

The intent was to somewhat mimic Tumblr's filenames by adding an oX at the end of every downloaded filename, which was supposed to correspond to the oX index assigned by Tumblr. Turns out this doesn't work, since images can be re-ordered, and just causes confusion. I will rename offset and assign a better default filename format in 1.4.0.

As for

decorate the output from -K to provide an explanation for it (and maybe other parameters)

It would be a major undertaking to add explanations for all keywords (if you start with one, you might as well explain them all), as there are just too many. These values are usually named in a self-explanatory way or come straight from a site's API results, which would then be explained by their API docs.

reversebreak commented 6 years ago

Thank you. :smiley: I only put in the decoration suggestion as an option - if you're going to change the name of the keyword, then that solves the problem too.

Now, as for something different: I'm trying to get the subset of posts from my mega list of URLs that are reblogs from another post. I try calling gallery-dl -i .\targets.txt -g --filter "reblogged==True" (in PowerShell) and I do get a list of URLs being output. Yet when I copy a couple of URLs from the output list, they're not reblogs at all! I'm not sure whether I'm using the wrong parameter, or whether I've done something wrong in the filter expression (having no experience with Python myself). Can you see what I'm doing wrong here?

mikf commented 6 years ago

Hmm, I'm not sure why this doesn't work. Your filter is fine and should only let images from reblogged posts through. (You could even shorten it to --filter "reblogged", but that is beside the point)

Information about a post being a reblog is, again, coming from Tumblr's API, although the reblogged parameter is generated by gallery-dl itself: https://github.com/mikf/gallery-dl/blob/a1fa4b43b07f09a56feb916e02a923522bf0ae20/gallery_dl/extractor/tumblr.py#L83-L86 reblogged is only set to True if the reblogged_from_id key is available, which should only happen for reblogs.

I tested this again on my few test posts and it works as it should:

# original post
$ gallery-dl -K https://mikf123.tumblr.com/post/167623548569/
...
reblogged
  False

# reblog of the above post
$ gallery-dl -K https://mikf123.tumblr.com/post/169341068404/
...
reblogged
  True
reblogged_from_id
  167623548569

Maybe you could post one of the posts/URLs which gets categorized as reblog even though it isn't?
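The rule described above (reblogged is True exactly when the reblogged_from_id key is present) can be sketched as follows. `mark_reblog` is a hypothetical helper written for illustration, not the code behind the linked lines, and the dictionaries only imitate API post objects.

```python
def mark_reblog(post):
    """Add a 'reblogged' flag based on the presence of 'reblogged_from_id'."""
    post["reblogged"] = "reblogged_from_id" in post
    return post


# Made-up post objects using the IDs from the two test posts above.
original = mark_reblog({"id": 167623548569})
reblog = mark_reblog({"id": 169341068404, "reblogged_from_id": 167623548569})

print(original["reblogged"], reblog["reblogged"])
```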

reversebreak commented 6 years ago

This one is quite weird. Using http://aruurara.tumblr.com/post/172577442134 as an example, if you call gallery-dl http://aruurara.tumblr.com/post/172577442134 -K the response list will list it as having reblogged as false, as is expected and correct.

Put that same URL in a text file as the only contents, named say single.txt, and call gallery-dl -i .\single.txt -g --filter "reblogged==True" and you get blank output, as you expect.

However, using a file in which that URL is only one of the many URLs in that file, it shows up in the output!
For example, I've created a minimal testcase file multi.txt which contains two URLs, both of which respond as false for reblogged when queried individually. Get it at https://pastebin.com/PbAAMZJr Call gallery-dl -i .\multi.txt -g --filter "reblogged==True" and you get both URLs as output.

I'm not sure if there's a case where you have multiple input from a file and non-reblog posts don't show up - I haven't found it yet.

mikf commented 6 years ago

Oh, you are confusing the progress indicator with the downloaded files. When there are 2 or more URLs as input, gallery-dl by default prints the input URL to show its progress in its input-list. You can disable this feature in your config file or by using -o output.progress=false.

mikf commented 6 years ago

And before another misunderstanding happens: the Tumblr extractor provides multiple URLs for each image and when you print them with -g you get something like this:

https://s3.amazonaws.com/data.tumblr.com/4085165f7ead78cb63de964cf48fb6cd/tumblr_p6rrvu8UHo1s3sz4ho1_raw.png
| https://s3.amazonaws.com/data.tumblr.com/4085165f7ead78cb63de964cf48fb6cd/tumblr_p6rrvu8UHo1s3sz4ho1_500.png
| https://78.media.tumblr.com/4085165f7ead78cb63de964cf48fb6cd/tumblr_p6rrvu8UHo1s3sz4ho1_1280.png

Only the first one (without a | in front) is the "main" URL and the other two are its fallback, in case the main URL doesn't work.
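For anyone post-processing this output, here is a minimal sketch of how the lines could be grouped into a main URL plus its fallbacks. `group_urls` is a hypothetical helper and assumes every fallback line starts with a | as in the listing above.

```python
def group_urls(lines):
    """Group '-g' output lines into (main_url, [fallback_urls]) pairs."""
    groups = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("|"):
            # A fallback belongs to the most recent main URL.
            if groups:
                groups[-1][1].append(line[1:].strip())
        else:
            groups.append((line, []))
    return groups


output = """\
https://s3.amazonaws.com/data.tumblr.com/4085165f7ead78cb63de964cf48fb6cd/tumblr_p6rrvu8UHo1s3sz4ho1_raw.png
| https://s3.amazonaws.com/data.tumblr.com/4085165f7ead78cb63de964cf48fb6cd/tumblr_p6rrvu8UHo1s3sz4ho1_500.png
| https://78.media.tumblr.com/4085165f7ead78cb63de964cf48fb6cd/tumblr_p6rrvu8UHo1s3sz4ho1_1280.png
"""

for main, fallbacks in group_urls(output.splitlines()):
    print(main, len(fallbacks))
```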

reversebreak commented 6 years ago

Ah, I see.

I just saw that output whilst you wrote the second post.

Hmm, so the URLs it gives you aren't the URLs of the posts, it's the direct URLs of the images themselves... I guess I can't use -g for my purposes then.
I was trying to use the filters in combination with -g to get a list of the post URLs that are reblogs of something else.

Thanks for the info.

mikf commented 6 years ago

Try --filter "reblogged and print(post_url)"

reversebreak commented 6 years ago

Abusing side-effects for fun and profit! gallery-dl -i .\targets.txt -g --filter "reblogged and print(post_url)" -o output.progress=false works! Also suppresses the image URL output for some reason!?!?? (Which is exactly what I wanted to happen, just not what I expected to happen)

mikf commented 6 years ago

Image URLs are suppressed because the filter expression always evaluates to a false value. Either reblogged is false and the whole thing is considered false without touching the print() part or reblogged is true and print() gets evaluated, which returns None which is also considered false.
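The short-circuit behaviour can be reproduced in plain Python. This sketch assumes the filter is evaluated as an ordinary Python expression with the metadata keys available as names; the `evaluate` helper and the example.org URLs are made up for illustration.

```python
def evaluate(expression, metadata):
    """Evaluate a filter-style expression with metadata keys as variables."""
    return eval(expression, {}, dict(metadata))


expr = "reblogged and print(post_url)"

# reblogged is False: 'and' stops early and print() is never called.
skipped = evaluate(expr, {"reblogged": False,
                          "post_url": "https://example.org/post/1"})

# reblogged is True: print() runs for its side effect and returns None.
printed = evaluate(expr, {"reblogged": True,
                          "post_url": "https://example.org/post/2"})

# Both results are falsy, so neither image would pass the filter.
print(repr(skipped), repr(printed))
```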

reversebreak commented 6 years ago

That makes sense.

I'm hitting some sort of API limit, which seems odd as with the filtering I should be pulling down less than my real mass-download. I downloaded a few thousand pictures a couple of days ago, but after a few hundred requests using gallery-dl -i .\targets.txt -g --filter "reblogged and print(post_url)" -o output.progress=false I get [tumblr][error] {'meta': {'status': 429, 'msg': 'Limit Exceeded'}, 'response': [], 'errors': [{'title': 'Limit Exceeded', 'detail': 'It did not work'}]} repeatedly. Does the gallery-dl API key have a global limit, or is it associated with my tumblr account? Should I apply for my own OAuth key? And how do I see how long I have to wait anyway?

Actually - now that I think about it, not having to download the image files in between requests may really increase the rate at which gallery-dl hits the server for information. There's a downloader.rate setting, but that's number of bytes a second, not a limit on query throughput...

In addition to that, the same post is being printed multiple times in the output (sometimes). Not sure why that's happening either.

mikf commented 6 years ago

Tumblr's API has a global, application-wide limit of 1000 API requests per hour / 5000 per day (There was already some discussion about this on another issue). Image downloads don't count towards the limit, only how often gallery-dl accesses the API endpoints to get its information. This happens before any filter can get applied, so filtering out some posts sadly doesn't accomplish anything in that regard.

Should I apply for my own OAuth key?

Yes, you should. Especially in your case, when you want to access a couple thousand single posts, each requiring 1 request. There are instructions here, if you need any.

And how do I see how long I have to wait anyway?

I just took another look at the response headers sent by Tumblr and there is actually some information there:

X-Ratelimit-Perday-Limit 5000
X-Ratelimit-Perday-Remaining 4997
X-Ratelimit-Perday-Reset 86256
X-Ratelimit-Perhour-Limit 1000
X-Ratelimit-Perhour-Remaining 997
X-Ratelimit-Perhour-Reset 3456

I guess I could add a sleep() call to wait for the hourly limit ...
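A rough sketch of what such a wait could look like, using only the header names listed above. It assumes the -Reset values are seconds until the limit resets (the sample values fit inside one hour and one day, but the thread does not confirm this), and `seconds_to_wait` is a hypothetical helper, not gallery-dl code.

```python
def seconds_to_wait(headers):
    """Return how many seconds to sleep before sending the next request."""
    # Check the daily limit first, since running out of it means a longer wait.
    for scope in ("Perday", "Perhour"):
        remaining = int(headers.get("X-Ratelimit-%s-Remaining" % scope, 1))
        if remaining <= 0:
            # Assumption: the Reset header counts seconds until the reset.
            return int(headers.get("X-Ratelimit-%s-Reset" % scope, 0))
    return 0


sample = {
    "X-Ratelimit-Perday-Limit": "5000",
    "X-Ratelimit-Perday-Remaining": "4997",
    "X-Ratelimit-Perday-Reset": "86256",
    "X-Ratelimit-Perhour-Limit": "1000",
    "X-Ratelimit-Perhour-Remaining": "997",
    "X-Ratelimit-Perhour-Reset": "3456",
}

print(seconds_to_wait(sample))
```

A caller would pass the result to time.sleep() before retrying.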

In addition to that, the same post is being printed multiple times in the output (sometimes). Not sure why that's happening either.

Some posts have multiple images and the filter expression is evaluated for each of them, resulting in the same post-url printed once for each image. You could pipe the output through uniq, but that (or anything similar) is probably not available to you.
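If no uniq-like tool is at hand, the duplicates can also be removed with a few lines of Python. A minimal sketch; `unique_in_order` is a made-up name and the URLs are placeholders.

```python
def unique_in_order(urls):
    """Drop repeated entries while keeping the first occurrence of each."""
    # dict preserves insertion order (Python 3.7+), so this keeps the order.
    return list(dict.fromkeys(urls))


printed = [
    "https://example.tumblr.com/post/1",
    "https://example.tumblr.com/post/1",
    "https://example.tumblr.com/post/2",
    "https://example.tumblr.com/post/1",
]

print(unique_in_order(printed))
```

Unlike uniq, which only collapses adjacent repeats, this also removes a duplicate that shows up again later in the list.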

reversebreak commented 6 years ago

Right, I didn't realise I was stepping on all other gallery-dl users' toes.

I've registered a key myself, but apparently I'm using up all of my own allocation on a single run of this too. A sleep call when you're going to hit the limit would be great for anyone running a batch job.
Maybe if you're running off the default shared API keys, it should sleep in increasing increments each time, like how ethernet works?
I'm not sure how the API ratelimit 'refills', whether all at once after an hour or dripping a minute's worth at a time. Some way of determining what your X-Ratelimit-BLAH current values are would be great too. Right now there's no way to tell. On top of that, maybe add a generic message if an extractor is using the default key values that come with gallery-dl and the user hits a limit, reminding the user to get the appropriate API key for that extractor.
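The "sleep in increasing increments, like ethernet" idea is usually called randomized exponential backoff. A minimal sketch of that idea follows; the function name, base delay and cap are arbitrary assumptions, and nothing here reflects what gallery-dl actually does.

```python
import random


def backoff_delay(attempt, base=60.0, cap=3600.0):
    """Pick a random delay in [0, min(cap, base * 2**attempt)] seconds."""
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)


# The upper bound doubles with each failed attempt until it reaches the cap.
for attempt in range(5):
    print(attempt, round(backoff_delay(attempt), 1))
```

The random part keeps several clients that share one key from all retrying at the same moment.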

There's an equivalent pipeable command to uniq in PowerShell, so I'm OK fixing up the output afterwards. I might just use gVIM anyway.

reversebreak commented 6 years ago

I see you've added a sleep call to the newest release - thank you.

The only thing I'd recommend is altering the message it produces when the limit is hit - put in a reference to the current time in the error log call. That way if someone walks away from their computer and sees [tumblr][info] Hourly API rate limit exceeded; waiting 1000 seconds for rate limit reset upon coming back they don't wonder "1000 seconds from when?".

mikf commented 6 years ago

Done (8b79eaa). I've changed it so it shows the time when the limit will reset. Seems more useful than manually having to add 1000 seconds to the time the message was printed at.
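Turning "wait N seconds" into a clock time is a small calculation. The sketch below is only an illustration; the message wording and the `reset_message` helper are hypothetical and not the output of the commit above.

```python
from datetime import datetime, timedelta


def reset_message(seconds, now=None):
    """Build a log message that states when the rate limit will reset."""
    now = now or datetime.now()
    reset_at = now + timedelta(seconds=seconds)
    return ("Hourly API rate limit exceeded; "
            "waiting until {} for rate limit reset".format(
                reset_at.strftime("%H:%M:%S")))


# 1000 seconds after 12:00:00 is 12:16:40.
print(reset_message(1000, now=datetime(2018, 1, 1, 12, 0, 0)))
```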

jeremiahfallin commented 6 years ago

Is there a way to download by search terms instead of by user on deviantart? When I try "gallery-dl https://www.deviantart.com/popular-all-time/?q=alm+fire+emblem" I get "[gallery-dl][error] No suitable extractor found for 'https://deviantart.com/popular-all-time/?q=alm+fire+emblem'"

mikf commented 6 years ago

Downloading by search terms is currently not supported. DeviantArt provides an API endpoint for that, so implementing this should be quite easy.

Edit: done (ec15877)

jeremiahfallin commented 6 years ago

Would it be possible to implement downloading by search terms on pixiv as well?

mikf / gallery-dl

Questions, Feedback and Suggestions #2 #74
