hydrusnetwork / hydrus

A personal booru-style media tagger that can import files and tags from your hard drive and popular websites. Content can be shared with other users via user-run servers.
http://hydrusnetwork.github.io/hydrus/

Tumblr parser is borked (because of redirect trickery) #1187

Open blipdrifter opened 2 years ago

blipdrifter commented 2 years ago

not sure if this is the right place to ask about this, also i apologize if this has been asked about before in another issue (and i didn't find it before now)

i use tumblr quite a bit and i want to import the images i liked into my hydrus client. however, most of the time when i put a post url into any of my url downloader tabs, tumblr does a cheeky thing where the '.png' url redirects to a page that still has a 'png' extension but is really just an HTML page with the actual png embedded in it. i think there's some referrer shenanigans going on here, but i'm not sure. (here's an example of this, btw)

the built-in tumblr parser does not handle these redirects very well, showing 404s for images that use this trick. (not to mention some posts have images embedded in html tags within the json which don't even get parsed at all, but i think that's been brought up in another issue already.) i'm not sure if it would be possible to modify the parser to fix the redirect issue, but if it is then i would love to see an improved version of the parser in future hydrus versions if possible.
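To make the symptom concrete, here is a minimal Python sketch of how a downloader could tell a real PNG from an HTML page served under a `.png` URL. It is not taken from hydrus; the function name `classify_payload` and its return labels are hypothetical, and the only fixed fact it relies on is the standard 8-byte PNG signature.

```python
# Standard 8-byte signature that every PNG file starts with.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def classify_payload(body: bytes, content_type: str = "") -> str:
    """Return 'png', 'html', or 'unknown' for a downloaded payload.

    Hypothetical helper for illustration only: it trusts the file's
    leading bytes over the URL extension.
    """
    if body.startswith(PNG_SIGNATURE):
        return "png"
    # Look at the start of the body, ignoring leading whitespace and case.
    head = body[:512].lstrip().lower()
    if head.startswith((b"<!doctype html", b"<html")):
        return "html"
    # Fall back to the Content-Type header if the body was inconclusive.
    if "text/html" in content_type.lower():
        return "html"
    return "unknown"
```

A check like this only detects the wrapper page; it does not recover the real image URL.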

please let me know if you have any questions, and thanks in advance!

blipdrifter commented 2 years ago

if it helps, i've found a post by the gallery-dl dev explaining how they extract images from tumblr's new system. it involves going to the html page first to get the proper api key and then redirecting to the image url with the proper api key in the url. i could be wrong, but i don't think this is possible to do in hydrus yet: https://github.com/mikf/gallery-dl/issues/2957#issuecomment-1256385652
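As a rough illustration of the "fetch the HTML page first, then pull out the real image URL" idea, the sketch below scans a wrapper page for `<img>` tags and returns the first `src` hosted under a media domain. This is not gallery-dl's code and not something hydrus does. The helper name `find_wrapped_image_url`, the assumption that the wrapper page exposes the file in an `<img src>`, and the `media.tumblr.com` host suffix are all assumptions made for the example.

```python
from html.parser import HTMLParser
from typing import List, Optional
from urllib.parse import urlparse


class _ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag it sees."""

    def __init__(self) -> None:
        super().__init__()
        self.sources: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def find_wrapped_image_url(
    html: str, host_suffix: str = "media.tumblr.com"
) -> Optional[str]:
    """Return the first <img> src whose host matches host_suffix, else None.

    Hypothetical helper: the host suffix is an assumption about where
    the real files live, not something confirmed in this thread.
    """
    collector = _ImgSrcCollector()
    collector.feed(html)
    for src in collector.sources:
        host = urlparse(src).hostname or ""
        if host == host_suffix or host.endswith("." + host_suffix):
            return src
    return None
```

In a real downloader the returned URL would then be requested as the actual file; whether extra headers or tokens are needed is covered in the linked gallery-dl comment rather than here.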

also, someone has made a tumblr proxy but nobody's hosting it, it has a tendency to miss images (according to the README), and it's more or less abandoned: https://github.com/heyLu/numblr

leakspin commented 1 year ago

Hey! I just happened to bump into this. Something similar happened to me and I have tried to fix it. I haven't tested it much, but it has worked for my examples (single-image posts).

So, my problem is that now Tumblr has new Twitter-like URLs (https://www.tumblr.com/avakkins/696391628529827840/lucy-%E7%96%BE%E9%A2%A8-hayate%E3%83%8F%E3%83%A4%E3%83%86) and they are not compatible with traditional posts as the HTML is completely different.

I have created a new URL class to get this new URL style but I have converted it into the JSON API URL. Creating another URL class to process individual posts via JSON works as there already exists a JSON downloader for Tumblr posts.
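The conversion described above can be sketched in Python as follows. This is only an illustration of the idea, not the exported URL classes: the regex, the function names, and the target URL shapes (`https://{blog}.tumblr.com/post/{id}` and the legacy `/api/read/json?id=` endpoint) are assumptions, since the thread does not show the exact URLs the URL classes produce.

```python
import re
from typing import Optional, Tuple

# Matches new-style post URLs such as
# https://www.tumblr.com/<blog>/<post id>/<optional slug>
NEW_STYLE_POST = re.compile(
    r"^https?://(?:www\.)?tumblr\.com/"
    r"(?P<blog>[A-Za-z0-9-]+)/(?P<post_id>\d+)"
    r"(?:/[^?#]*)?(?:[?#].*)?$"
)


def split_new_style_url(url: str) -> Optional[Tuple[str, str]]:
    """Return (blog, post_id) for a new-style URL, or None if it doesn't match."""
    match = NEW_STYLE_POST.match(url)
    if match is None:
        return None
    return match.group("blog"), match.group("post_id")


def to_legacy_post_url(url: str) -> Optional[str]:
    """Build the subdomain-style post URL (assumed shape)."""
    parts = split_new_style_url(url)
    if parts is None:
        return None
    blog, post_id = parts
    return f"https://{blog}.tumblr.com/post/{post_id}"


def to_json_api_url(url: str) -> Optional[str]:
    """Build a per-post JSON URL (assumed legacy endpoint shape)."""
    parts = split_new_style_url(url)
    if parts is None:
        return None
    blog, post_id = parts
    return f"https://{blog}.tumblr.com/api/read/json?id={post_id}"
```

The slug after the post ID is ignored, so the percent-encoded title in the example URL does not affect the result.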

And here is the export of the URL classes and parser! tumblr - downloader

Hopefully, I have explained myself correctly!

floogulinc commented 1 year ago

@leakspin I haven't looked at it yet, but your redirect downloader could probably just use the built-in API redirect function in URL classes.

blipdrifter commented 1 year ago

@leakspin Tried this, but it didn't seem to work, even when using the newer subdomainless URL format. Again, the issue is that the image .png links redirect to .html pages, which hydrus doesn't know what to do with. I would love to have a solution for this, since a lot of the images I want to archive are on tumblr, and having the tumblr downloader kneecapped by this issue is a pain in the butt.