johanneszab / TumblThree

A Tumblr Blog Backup Application
https://www.jzab.de/content/tumblthree
MIT License

[Feature Request] Download theme/css, mirror website look, create browsable content #296

Open ZaCloud opened 5 years ago

ZaCloud commented 5 years ago

Hello. Is this how the program is supposed to function: It only downloads media (sound clips, images, etc), and not the posts themselves? In the directory I chose for it to download to, all there is is media (and it seems to cut off, not including anything under a "Read More"), and no way to view the actual posts (.html or any such files). Unless there's supposed to be a way to open .tumblr files? Am I doing things wrong or is this by design?

If this IS by design, then consider the ability to open the blog itself (even without stylesheets), and the ability to download content under a "Read More" link, as a feature suggestion. But if not, then sorry I'm a noob, lol. Thanks.

johanneszab commented 5 years ago

Turn on the Download *** meta and/or Download *** post-options in the details pane. You might want to change the metadata format to json if you want to parse it further.

If that's still not enough, enable the dump crawler data-option.

Everything under the Details pane (on the right side of the application, after selecting a blog).

MrEldritch commented 5 years ago

Hmm... given just how much data is in that "dump crawler data" dump, I'm wondering how difficult it would be to put together a bare-bones viewable-as-blog skin like tumblr-utils does. The JSON from the crawler dump contains the html for each individual post, so it should "just"* be a matter of stringing them together, swapping out the Tumblr image URLs for the locally-downloaded image files, and applying some default backup CSS.
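A minimal sketch of that idea in Python (the language tumblr-utils uses), not a tested converter. It makes several assumptions that are not confirmed anywhere in this thread: that each dumped post is a JSON object, that the post HTML sits under a key named `body` (`HTML_KEY` is a hypothetical name; the real dump may use different keys), and that downloaded images keep the file name from the end of their Tumblr URL. The function names and the default CSS are made up for illustration.

```python
import html
import json
import re
from pathlib import Path

# Assumption: the post HTML lives under this key. Check a real dump file first.
HTML_KEY = "body"

# Matches the src attribute of <img> tags (double-quoted values only).
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=")([^"]+)(")', re.IGNORECASE)

# Hypothetical bare-bones stylesheet standing in for "default backup CSS".
DEFAULT_CSS = "body{max-width:40em;margin:auto;font-family:sans-serif}img{max-width:100%}"


def localize_images(post_html: str, media_dir: Path) -> str:
    """Point <img src> at a local file when one with the same file name exists."""
    def swap(match: re.Match) -> str:
        # Take the last URL path segment and drop any query string.
        name = match.group(2).rsplit("/", 1)[-1].split("?", 1)[0]
        if name and (media_dir / name).is_file():
            return f"{match.group(1)}{html.escape(name, quote=True)}{match.group(3)}"
        return match.group(0)  # no local copy: keep the remote URL

    return IMG_SRC.sub(swap, post_html)


def build_page(dump_dir: Path, media_dir: Path) -> str:
    """String the posts together into one HTML page with a default stylesheet."""
    parts = []
    # Reverse file-name order as a rough stand-in for "newest first".
    for path in sorted(dump_dir.glob("*.json"), reverse=True):
        post = json.loads(path.read_text(encoding="utf-8"))
        body = post.get(HTML_KEY) if isinstance(post, dict) else None
        if isinstance(body, str):
            parts.append(f"<article>{localize_images(body, media_dir)}</article>")
    return (
        '<!DOCTYPE html><html><head><meta charset="utf-8">'
        f"<style>{DEFAULT_CSS}</style></head><body>"
        + "\n<hr>\n".join(parts)
        + "</body></html>"
    )
```

The rewritten `src` values are bare file names, so the page only resolves them if it is saved inside the media folder, for example `(media_dir / "index.html").write_text(build_page(dump_dir, media_dir), encoding="utf-8")`.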

ZaCloud commented 5 years ago

> Turn on the Download *** meta and/or Download *** post-options in the details pane. You might want to change the metadata format to json if you want to parse it further.
>
> If that's still not enough, enable the dump crawler data-option.
>
> Everything under the Details pane (on the right side of the application, after selecting a blog).

Thanks, but it still didn't work. The only change that adding the 'meta' options did, was adding .txt files containing copy/pastes of the text portions of posts, questions/answers, url text, etc each in their own respective .txt files. There's still no way to open the posts themselves including the images in context. No .html/.xml/.pdf or any such files that reconstruct the posts with the media. Just pictures and .txt files. Changing to JSON format did nothing new either.

[Image: tumbl3 not working right]

MrEldritch commented 5 years ago

ZaCloud, if you turn on "Dump Crawler Data", then each post will also be saved as its own .json, which carries a very large amount of metadata - including the HTML for the post!
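To see which fields of a dumped post actually carry HTML, a small Python sketch like the one below can walk one file and list the candidates. The helper names are hypothetical, and the "contains both `<` and `>`" check is only a rough heuristic; nothing here reflects the dump's documented layout.

```python
import json
from pathlib import Path


def find_html_fields(node, prefix=""):
    """Yield dotted paths of string values that look like they contain HTML tags."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from find_html_fields(value, f"{prefix}.{key}" if prefix else str(key))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_html_fields(value, f"{prefix}[{index}]")
    elif isinstance(node, str) and "<" in node and ">" in node:
        # Heuristic only: any string with angle brackets is reported.
        yield prefix


def html_fields_in_file(path: Path) -> list:
    """Load one dumped post file and list the paths that may hold post HTML."""
    return list(find_html_fields(json.loads(path.read_text(encoding="utf-8"))))
```

Running `html_fields_in_file` on a single dumped post should show where the post markup lives before any converter is written around it.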

ZaCloud commented 5 years ago

Well, I don't know what I'm supposed to do with a huge pile of individual .json files full of code. This still doesn't get me any closer to having a simplified replication of opening a tumblr blog in an easily readable format. :/

MrEldritch commented 5 years ago

Oh, sorry, I misunderstood. ZaCloud, currently TumblThree doesn't have that functionality. tumblr-utils, however, can do pretty much exactly what you're asking (although it's got its own shortcomings).

johanneszab commented 5 years ago

> Hello. Is this how the program is supposed to function: It only downloads media (sound clips, images, etc), and not the posts themselves?

First: It obviously does download the actual posts. Like you say yourself, in text or json format. It's just not in your wanted format.

I was never interested in mirroring the exact tumblr website structure, nor the theme. I simply didn't see the gain in opening the posts in this bloated, heavy JavaScript site.

There probably already is an open issue/request for mirroring the theme/css/website. Since no one was interested in implementing it, it's not here. But as @MrEldritch said, most of the necessary parts are already filtered and somewhere in the code. Someone (still) has to implement the theme/css grabbing and path redirecting parts.

ZaCloud commented 5 years ago

Ahh, I see. Well, thank you everyone. @johanneszab, as @MrEldritch pointed out, tumblr-utils does indeed make the blogs viewable, with a slim, simplified format that doesn't emulate the bloated themes, and that's honestly fine. While I was able to get tumblr-utils to do what I wanted, I'm sure many are a bit intimidated by the thought of using command lines, so it'd still be nice to see this utility have a similar capability.

The remaining problem now is that media under a "Read More" seems to still not be downloaded, and I'm sure that'd also be a feature of interest for most.

MrEldritch commented 5 years ago

Yeah; I concur - I would also be extremely interested in a way to download your blog in a blog format, but I do not care about theme/css mirroring - the minimal, simplified form that tumblr-utils mirrors blogs in would be entirely sufficient. (In fact, given how many blogs are actually quite painful to read in their own theme, css mirroring might actually be a downside)

MrEldritch commented 5 years ago

Honestly, I'd be happy to help with this myself, but I don't know .NET / C#. I already spent a few hours trying to do the opposite - figure out how TumblThree accessed hidden blogs and see if I could replace tumblr-utils' crawler with it, because tumblr-utils is Python and I do know that - but I just couldn't quite get the trick with the cookies to work. And it's clear that TumblThree is just a much more powerful crawler, in general, than tumblr-utils; the trick is just in the final reprocessing step.

Still .... tumblr-utils is only a thousand lines of Python, and much of that is for the scraping and json processing that TumblThree already does. The actual meat might not be that complicated, maybe simple enough that I could actually figure out how to do it and quickly learn enough .NET to integrate it.

santa-man commented 5 years ago

@MrEldritch @ZaCloud

I made a script that converts files downloaded by TumblThree into html files.

You can find the script here: tumblr_generate_html_files

It would be great to include this functionality in the existing program but for now this does the trick.

johanneszab commented 5 years ago

Thanks a lot, @santa-man!

I'll take a look at it after the holidays, and maybe together we can integrate this or something similar into TumblThree, in case you are interested and functionality like this is still needed. Well, maybe Tumblr will even recognize a few weeks from now that they made a mistake, or things aren't going to be as bad as people think.