Tumblr, not all images are downloaded

michaelx commented 6 years ago

First of all, great project! The Tumblr extractor seems to download only a limited amount of images.

E.g.

gallery-dl "http://wrapmagazine.tumblr.com/tagged/illustration"

gallery-dl gives me 78 images, while DiSiqueira/TumblrDownloader gives me all 123 images.

Hrxn commented 6 years ago

Just as I was tinkering around with Tumblr and thinking about opening a new issue.. 😄

Testing with https://api.tumblr.com/console/calls/blog/posts gives me "total_posts": 150 for that tag (illustration). Not sure if 123 is really correct either.

The returned array for posts contains only 20 entries there, it seems. Not sure if that is just the Web API console or the response in general.

When I'm home I'll try it with this: https://github.com/tumblr/pytumblr / https://pypi.python.org/pypi/PyTumblr

Might depend on the post type..

PS: Another great project here on GitHub dealing with Tumblr: https://github.com/bbolli/tumblr-utils/blob/master/tumblr_backup.py

mikf commented 6 years ago

DiSiqueira/TumblrDownloader and gallery-dl are both using tumblr's old API, which only reports 123 posts (not images) in total. https://github.com/bbolli/tumblr-utils/blob/master/tumblr_backup.py actually downloaded more than 300 images for wrapmagazine/illustration.

It might be time to switch to the new API version ...

mikf commented 6 years ago

Tumblr API v2 is up and running and produces the same results as the old API, so the initial amount of 123 images appears to be correct.

It seems that using tags gives some pretty counter-intuitive results for "total_posts". Here are some numbers for http://wrapmagazine.tumblr.com - edit: for Type set to photo:

| Tags | `total_posts` | Actual Posts | Images |
| --- | --- | --- | --- |
| none | 123 | 123 | 169 |
| `illustration` | 150 | 85 | 123 |
| `print` | 5 | 2 | 3 |
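
For reference, a minimal sketch of how the "Actual Posts" and "Images" columns can be counted. It assumes `posts` is the combined list of post objects returned for a query and that each photo post carries a `photos` list with one entry per image; the function name and the sample data are made up for illustration.

```python
def count_posts_and_images(posts):
    """Return (number of posts, number of images) for a list of post dicts."""
    # Posts without a "photos" key (e.g. text posts) contribute zero images.
    n_images = sum(len(post.get("photos", [])) for post in posts)
    return len(posts), n_images


# Hypothetical sample shaped like two photo posts: one photo set, one single photo.
sample = [
    {"type": "photo", "photos": [{}, {}]},
    {"type": "photo", "photos": [{}]},
]
print(count_posts_and_images(sample))  # (2, 3)
```

This is why the post count and the image count differ: a single photo-set post can hold several images.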

Hrxn commented 6 years ago

How do you get to 123 posts in total without any tags set?

From https://api.tumblr.com/console/calls/blog/posts [1], I see .response.total_posts = 287

Additionally, .response.blog.posts = 287, which seems to be part of the basic blog information that is included in API responses in general, I assume.

Can also be obtained via: https://api.tumblr.com/console/calls/blog/info [2]

Besides, 150 posts for illustration, while 123 in total, does not really make much sense 😄

[1] https://www.tumblr.com/docs/en/api/v2#posts
[2] https://www.tumblr.com/docs/en/api/v2#blog-info

As I understand it..

mikf commented 6 years ago

The numbers above are for posts with "Type" set to photo. Sorry for not explicitly mentioning that.

There are indeed 287 posts in total (.response.blog.posts), of which 123 are of type photo (.response.total_posts) and only those contain a photos object with information about actual images.

Applying the illustration tag changes .response.total_posts to 150, which doesn't make any sense, I agree on that. This number stays the same regardless of Type selected, which indicates that Tumblr only applies the Tag filter and disregards Type to get to this number.

There is also other "weird" or unexpected behavior when using tags: requesting, for example, posts 51 to 100 sometimes only gets you, let's say, 32 posts instead of the expected 50, even though there are more posts after that.
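
A rough sketch of a pagination loop that tolerates such short pages. It rests on an assumption this thread does not confirm: that the offset indexes the tag-filtered list counted by `total_posts` and that the type filter is applied afterwards. `fetch_page` is a hypothetical callable returning the `response` object for one window; `fake_fetch` only imitates the observed behavior.

```python
def iter_posts(fetch_page, limit=20):
    """Yield all posts, advancing by the window size rather than the page length."""
    offset = 0
    total = None
    while total is None or offset < total:
        response = fetch_page(offset, limit)
        total = response["total_posts"]
        # A short (or even empty) page is not treated as the end of the data.
        yield from response["posts"]
        offset += limit


# Hypothetical stand-in for the API: 45 tagged posts, every third one is a text post.
ALL_POSTS = [{"id": i, "type": "text" if i % 3 == 0 else "photo"} for i in range(45)]


def fake_fetch(offset, limit):
    window = ALL_POSTS[offset:offset + limit]
    return {
        "total_posts": len(ALL_POSTS),
        "posts": [p for p in window if p["type"] == "photo"],
    }


photos = list(iter_posts(fake_fetch))
print(len(photos))  # 30
```

Stopping at the first page with fewer posts than requested would miss everything after it, which matches the symptom described above.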

Hrxn commented 6 years ago

> Applying the illustration tag changes .response.total_posts to 150, which doesn't make any sense [..]

Well, it does make sense if you don't account for type: there are 287 posts in total, of which 150 have the tag 'illustration'. If you use /posts with only a tag defined, I think total_posts and the actual number of posts returned from the API should match.

> This number stays the same regardless of Type selected, which indicates that Tumblr only applies the Tag filter and disregards Type to get to this number.

Indeed. I see what you mean. I think this basically means that you can't do something like this: SELECT * FROM posts WHERE type = 'x' AND tag = 'y'

Because the API simply does not support it (because of additional load?). It seems that only one property gets used, and apparently tag takes precedence.

Or to be more specific, maybe you actually can, but you have to ignore total_posts because it is no longer accurate then.

I just tried tag = print, type = text, and it actually gives me 3 posts. From your table above, there are 2 posts for type = photo, and this would land us at 5 of 5.
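
To illustrate the arithmetic, here is a small sketch that requests by tag only and applies the type filter on the client side. The helper names and the sample list are hypothetical; the list just mirrors the 3 text + 2 photo posts mentioned above.

```python
from collections import Counter


def count_by_type(posts):
    """Count posts per post type."""
    return Counter(post["type"] for post in posts)


def only_type(posts, wanted):
    """Client-side equivalent of WHERE type = wanted."""
    return [post for post in posts if post["type"] == wanted]


# Hypothetical result of a tag-only query for 'print'.
tagged_print = [{"type": "text"}] * 3 + [{"type": "photo"}] * 2

counts = count_by_type(tagged_print)
print(counts["text"], counts["photo"], sum(counts.values()))  # 3 2 5
print(len(only_type(tagged_print, "photo")))  # 2
```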

But okay, this is all not really the issue, I'd say, because relying on type = photo is kind of a red herring anyway. This was one of the primary reasons I've been thinking about opening a new issue for Tumblr enhancements lately, before @michaelx kinda beat me to it 😉

The crux is the way Tumblr works, which is a bit needlessly complicated (others might argue it's flexible), I'd say. So I'm really not surprised that this discussion thread here exists 😄

The 'Make a post' functionality on your Tumblr Dashboard lets you pick between the seven types, but not all Blogs on Tumblr make the sensible choice to only use Photo (which can be a single photo post or a photo set) for images. Some users have the habit to use the Link feature, which automatically creates embedded images if used in conjunction with certain sites (I can definitely say Instagram, and I think Flickr as well, probably more). And there's of course the Text post, which lets you insert photos and even videos with the click of a single button (and the obligatory GIF search, obviously) for more joyful inlined content. On top of that, you can do the same for Quote. So, basically, the post body can be full HTML.
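
Since such post bodies are plain HTML, inline images can be found by scanning for img tags. Below is a minimal sketch using Python's standard html.parser; it is not meant to reflect how any particular downloader does it, and the class name, function name, and example.org URLs are invented for illustration.

```python
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Collect the src attribute of every <img> tag that is fed in."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.urls.append(src)


def inline_images(body_html):
    """Return the image URLs found in an HTML post body, in document order."""
    collector = ImageCollector()
    collector.feed(body_html)
    collector.close()
    return collector.urls


body = (
    '<p>Some text</p>'
    '<figure><img src="https://example.org/a.png"/></figure>'
    '<img src="https://example.org/b.gif">'
)
print(inline_images(body))
```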

mikf commented 6 years ago

I think the last commit pretty much implements everything @Hrxn's last paragraph hints at (even if it took me far longer than it should have):

- You can select which post types should be scanned for photo/audio/video files to download (internally it requests information about all posts and filters the unwanted ones out).
- It can search post bodies for inline images.
- It can follow links to external sites (mainly useful for "Link" posts).
- Image and video URLs are transformed to their "raw" form (*):
  - https://78.media.tumblr.com/ee589c6345f29d2d5935cecb49b0a705/tumblr_oztu02dIHp1wgha4yo1_1280.png -->
  - http://data.tumblr.com/ee589c6345f29d2d5935cecb49b0a705/tumblr_oztu02dIHp1wgha4yo1_raw.png
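
The URL transformation from the example above could look roughly like the following sketch. The regular expression is inferred from that single example pair only and is not taken from the actual commit; the function name is hypothetical, and URLs that do not match are returned unchanged.

```python
import re

# Assumed pattern: <number>.media.tumblr.com/<hex hash>/tumblr_<id>_<size>.<ext>
_MEDIA_URL = re.compile(
    r"^https?://\d+\.media\.tumblr\.com/([0-9a-f]+)/(tumblr_[^/]+)_\d+\.(\w+)$"
)


def to_raw_url(url):
    """Rewrite a sized media URL to its data.tumblr.com "raw" form."""
    match = _MEDIA_URL.match(url)
    if not match:
        return url
    digest, name, ext = match.groups()
    # Plain HTTP on purpose: see the certificate note marked (*).
    return f"http://data.tumblr.com/{digest}/{name}_raw.{ext}"


sized = ("https://78.media.tumblr.com/ee589c6345f29d2d5935cecb49b0a705/"
         "tumblr_oztu02dIHp1wgha4yo1_1280.png")
print(to_raw_url(sized))
```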

By default everything should behave like it did before and only get images from "Photo" posts, but it is now possible to configure gallery-dl to get everything ... hopefully.

(*)

- The SSL certificate of data.tumblr.com is only valid for amazonaws.com and is therefore considered invalid, which means raw URLs can't use HTTPS.
- Roughly one third of all inline GIFs (and only those) yield a "403 Forbidden" when accessing them via their raw URL. Some work, some don't, and I don't know why. Try gallery-dl -o posts=text,chat,link,audio,video -o inline=true --filter "extension == 'gif'" http://setheverman.tumblr.com/ if you want to test this yourself.
- Even "raw" videos have been (post)processed by Tumblr and are not the original files that were uploaded.

> Some users have the habit to use the Link feature, which automatically creates embedded images if used in conjunction with certain sites (I can definitely say Instagram, and I think Flickr as well, probably more)

Tumblr even supports Danbooru and Pixiv, which is really not what I would have expected.

Hrxn commented 6 years ago

Hey, great news! Thanks a lot for this. And don't worry about the time it took, just do it however you feel about doing it, it's perfectly fine. 😄

> It can follow links to external sites (mainly useful for "Link" posts).

Probably best used together with --write-unsupported?

> Image and video URLs are transformed to their "raw" form (*)

Great idea!

> The SSL certificate of data.tumblr.com is only valid for amazonaws.com and is therefore considered invalid, which means raw URLs can't use HTTPS.

Expected, doesn't work in the browser either.

> Roughly one third of all inline GIFs (and only those) yield a "403 Forbidden" when accessing them via their raw URL.

Expected, I think. Probably caused by a set of "standard" GIFs on Tumblr, displayed in the editor interface etc. for quick access as "reaction GIFs", I presume. As far as I know, they still use an older URL address scheme. And I saw this is already fixed with https://github.com/mikf/gallery-dl/commit/b14de6ffc26bc329ecec54f0d24f8d0817376f1b, basically.

> Even "raw" videos have been (post)processed by Tumblr and are not the original files that were uploaded.

That is true. But this is usual behaviour, I'd say, not just for Tumblr. And it is still better than what youtube-dl does, for example, which doesn't use these 'raw' URLs and thus returns 720p at best.

mikf commented 6 years ago

> Probably best used together with --write-unsupported?

For the most part yes, that is what it is useful for, as most external links seem to point to YouTube, Instagram, Vine, etc., but I have also found a user linking to his Flickr images, which would then be downloaded using the flickr extractors.

> Probably caused by a set of "standard" GIFs on Tumblr, displayed in the editor interface etc. for quick access as "reaction GIFs", I presume. As far as I know, they still use an older URL address scheme. And I saw this is already fixed with b14de6f, basically.

I don't think this necessarily has something to do with "standard" GIFs, especially when looking at what kind of GIFs are affected by this, but then again I don't use Tumblr myself.

What I've figured out so far is that all GIF URLs end in either _raw.gif or _500.gif when using the data.tumblr.com variant (raw, 500) and I had hoped to find some way of determining which one it is other than sending a HEAD request and looking at the status code, but maybe that would be good enough. https://github.com/mikf/gallery-dl/commit/b14de6ffc26bc329ecec54f0d24f8d0817376f1b circumvents the problem, but it causes GIFs that exceed Tumblr's filesize limit to only consist of 1 frame: normal, raw
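
The HEAD-request idea could be sketched as follows. This is only an illustration of the approach mentioned above, not what the commit does; the function names are made up, the demo URLs are hypothetical, and the demo uses a lookup table instead of real network requests.

```python
import urllib.error
import urllib.request


def head_status(url, timeout=10):
    """Return the HTTP status code of a HEAD request to url."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # 403 Forbidden and similar error responses end up here.
        return exc.code


def pick_gif_url(base, status_of=head_status):
    """Try the _raw.gif variant first, then _500.gif; return the first that answers 200."""
    for suffix in ("_raw.gif", "_500.gif"):
        url = base + suffix
        if status_of(url) == 200:
            return url
    return None


# Offline demo with invented URLs and status codes.
fake_status = {
    "http://data.tumblr.com/abc/tumblr_x_raw.gif": 403,
    "http://data.tumblr.com/abc/tumblr_x_500.gif": 200,
}
print(pick_gif_url("http://data.tumblr.com/abc/tumblr_x", status_of=fake_status.get))
```

The cost is one extra request per GIF, which is the trade-off hinted at above.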

There are even some audio files which have similar problems (403 Forbidden, infinite redirect) and there is nothing that can be done about that.

mikf / gallery-dl

Tumblr, not all images are downloaded #48