johanneszab / TumblThree

A Tumblr Blog Backup Application
https://www.jzab.de/content/tumblthree
MIT License
922 stars 133 forks

[Question] Discussion and Questions #112

Open johanneszab opened 7 years ago

johanneszab commented 7 years ago

For non-issue related questions, please ask here instead of creating new issues.

Taranchuk commented 7 years ago

Thank you for the thread! I'm a bit confused by the several connection settings, and I would like to understand exactly what they do. The value of parallel blogs and parallel connections is the number of connections to Tumblr. If I set the value to 20 parallel connections and 2 parallel blogs, will there be 10 download streams to the Tumblr servers?

Further below there are settings for scan connections and the number of connections to the Tumblr api; their defaults are 4 scan connections and 60 connections to the Tumblr api. I guess that the parallel connections setting determines the number of streams for downloading the files behind the links, and a connection to the api is about getting these links from the Tumblr api. If that is all true, then what is a scan connection, whose value is set to 4? Does this setting somehow relate to the connections to the Tumblr api? I tried setting this value to 1000 and started downloading blogs (specifically the image metadata): I did not notice a "Limit Exceeded" error, but I also did not notice any apparent increase in download speed. Is it better to leave this value at 1000 or to return it to 4?

There are also two settings, "Timeout" and "Time interval". I understand that the upper one is the maximum duration of a file-download connection, and the lower one is the maximum duration of a connection to the Tumblr api, after which these connections are forcibly terminated by the program? Would it be better for performance if I increased the time interval for the Tumblr api? Sometimes I notice that the program does not download some part of the metadata without showing any "Limit Exceeded" error, perhaps because of the timeout of the connection to the api.

ghost commented 7 years ago

Hello.

I was wondering where can I get the .exe file of the latest release. Unfortunately, I don't have VS2015 or higher, but I wanted to test out the app.

johanneszab commented 7 years ago

@AnryCryman: Under releases, download the latest release. Currently that is v1.0.8.4, so the right file is TumblThree-v1.0.8.4.zip

ghost commented 7 years ago

@johanneszab Yeah, I downloaded it. But there are no executables there, only source code. Can you possibly email me the .exe file of the latest release to anrycryman@gmail.com?

johanneszab commented 7 years ago

..

I've uploaded that file myself and I'm pretty sure that there is a file called TumblThree/TumblThree.exe in that particular zip file. I cannot send you the .exe itself since it needs some more .dlls which are included in the zip file. Thus, I'd have to send you the exact same file I've linked above.

Why do you download the file called source code if you don't want the source code? Since I've already received five similar emails, there must be a reason. Should I rename the link to binary? Did you download the source code .zip file from the main page by pressing the green download or clone button?

ghost commented 7 years ago

@johanneszab Sorry, my mistake. Must have hit the wrong link. I downloaded TumblThree-v1.0.8.4.zip and found TumblThree.exe in there. Is there a way to explicitly specify the language of the app?

johanneszab commented 7 years ago

@Taranchuk:

The value of parallel blogs and parallel connections is the number of connections to Tumblr. If I set the value to 20 parallel connections and 2 parallel blogs, will there be 10 download streams to the Tumblr servers?

No, there will be 20 streams opened to the Tumblr servers. Actually, it was more hard coded in the beginning. Right now it checks the current number of active blogs and gives each active blog its slice of downloads. Thus, if you have the parallel connections setting set to 20 but only one blog in the queue is active, it will consume all 20 connections. If you have 2 active blogs, they both will get 10 streams. It's probably a bit wonky but should work most of the time.
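The slicing described above can be sketched roughly like this; a minimal hypothetical illustration (the class and method names are mine, not TumblThree's actual code):

```csharp
using System;

// Hypothetical sketch: a global connection budget divided among active blogs.
class ConnectionSlicer
{
    private readonly int totalConnections;

    public ConnectionSlicer(int totalConnections)
    {
        this.totalConnections = totalConnections;
    }

    // Each active blog gets an equal slice of the global budget;
    // a single active blog consumes the whole budget.
    public int SlicePerBlog(int activeBlogs)
    {
        if (activeBlogs <= 0) return 0;
        return totalConnections / activeBlogs;
    }
}

class Program
{
    static void Main()
    {
        var slicer = new ConnectionSlicer(20);
        Console.WriteLine(slicer.SlicePerBlog(1)); // 20: one active blog gets all connections
        Console.WriteLine(slicer.SlicePerBlog(2)); // 10: two active blogs get 10 streams each
    }
}
```

With integer division, 3 active blogs would get 6 connections each, which matches the "a bit wonky" caveat above.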

Further below there are settings for scan connections and the number of connections to the Tumblr api; their defaults are 4 scan connections and 60 connections to the Tumblr api.

I've decoupled the scan/crawler connections from the settings above at some point. The Tumblr api/svc service/the parsing of the website is usually quite quick since it's only a few KB of text. In the beginning of TumblThree, the crawler was started first, and the downloader started after it finished. Thus it made sense to allow more connections for parsing the website and grabbing the urls than for downloading the heavy binary data.

Right now the values are superfluous for three reasons: 1) The downloader starts immediately after the crawler has dropped the first image/video/metadata url in the queue, so the waiting time until the first actual download starts is mostly negligible now. 2) The Tumblr api is rate limited now. This means they only allow a specified number of connections to the api per time period. Thus, even if you increase the scan connections but have the "Limit the scan connections to the Tumblr api" checkbox ticked, the connections are queued until a free slot is available. It basically makes no difference, since the rate limiter is the limiting factor. 3) If you use the SVC release or the parsing release, however, you can increase or turn off the "Limit the scan connections to the SVC Service". I discovered the svc service during my implementation of the private blog downloader. It outputs even more data about the posts of a blog than the Tumblr api, but does not seem to be rate limited. They possibly cannot even limit it, since their webpage depends on it. I've implemented most features in that branch already. You'll have to try; I don't know if they'll eventually limit it (if abused).
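The queuing behavior described in point 2 can be approximated with a small rate limiter: callers wait for a free slot, and each slot is returned only after the time window has elapsed. This is an illustrative sketch under assumed names, not TumblThree's actual implementation:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch of a sliding-window rate limiter: at most maxCalls
// api requests may start per time window; excess callers queue on the
// semaphore until a slot is released.
class ApiRateLimiter
{
    private readonly SemaphoreSlim slots;
    private readonly TimeSpan window;

    public ApiRateLimiter(int maxCalls, TimeSpan window)
    {
        slots = new SemaphoreSlim(maxCalls, maxCalls);
        this.window = window;
    }

    public async Task<T> RunAsync<T>(Func<Task<T>> apiCall)
    {
        await slots.WaitAsync();   // queue here until a slot is free
        ReleaseSlotAfterWindow();  // give the slot back once the window elapses
        return await apiCall();
    }

    private void ReleaseSlotAfterWindow()
    {
        _ = Task.Delay(window).ContinueWith(t => slots.Release());
    }
}
```

A limiter created as `new ApiRateLimiter(60, TimeSpan.FromMinutes(1))` would mirror a "60 connections to the Tumblr api" style setting: the 61st call simply waits for a slot, which is why raising the scan connections alone shows no speedup.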

johanneszab commented 7 years ago

There are also two settings, "Timeout" and "Time interval". I understand that the upper one is the maximum duration of a file-download connection, and the lower one is the maximum duration of a connection to the Tumblr api, after which these connections are forcibly terminated by the program?

Exactly.

johanneszab commented 7 years ago

Also take a look into #107 for some more program details.

shakeyourbunny commented 7 years ago

Please redesign the whole UI to a sane level where it conforms to common expectations:

Kvothe1970 commented 7 years ago

I wonder: would it be possible to auto-upload files to queue them? As in: point the app to a folder or folders and provide a text file with a tag, or make it configurable. Have the app then process the folder, upload and queue the images as per the setting (one at a time, two, three, four), add the tag, etc. This would make TumblThree even more than an amazing backup tool.

johanneszab commented 7 years ago

@Kvothe1970: Nice idea. It should be possible, yes. There already is a file system monitoring api in C#, thus implementing this should be more or less straightforward.

Maybe it's a good idea to also implement a GUI-less TumblThree at the same time and let it be started from the command line. That might reduce resource usage.
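The file system monitoring api mentioned above is `FileSystemWatcher`; here is a minimal sketch of how the folder-watching part could look (the folder path and file filter are placeholders, and the queueing logic is only hinted at):

```csharp
using System;
using System.IO;

class Program
{
    static void Main()
    {
        // Placeholder path; in the feature described above this would come
        // from the user's configuration.
        var watcher = new FileSystemWatcher(@"C:\Uploads", "*.jpg")
        {
            EnableRaisingEvents = true
        };

        // Fires when a new file appears in the watched folder; here the
        // file could be added to an upload queue together with a tag.
        watcher.Created += (sender, e) =>
            Console.WriteLine($"New file detected: {e.FullPath}");

        Console.WriteLine("Watching. Press Enter to exit.");
        Console.ReadLine();
    }
}
```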

Kvothe1970 commented 7 years ago

@johanneszab Considering I am a big fan of GUIs I would support this being optional ;)

Emphasia commented 7 years ago

Can I download my own liked photos and videos? I tried "liked/by/myaccount" but it just shows: "Request denied. You do not have permission to access this page."

Taranchuk commented 7 years ago

1. What is this *_files.tumblrtagsearch file that lies in the folders created by the tag downloader? I looked inside and it turns out that filenames are stored there, but there are a few hundred more filenames than there really are in the folder. Why can this be, and what does it affect? Could it be that there are skipped files that were not downloaded the first time, and the program cannot download them again because these files are already on the list, just like with the index files?
2. Are the search and tag downloaders tied to the Tumblr api? Is it possible to disable the api limit in the settings and run several instances for downloading by tag and search keywords without the risk of some files being skipped during the download? I have already tried this and have not yet encountered a limit error, but I would not like to think that the limit detection is simply not built into these functions and I just do not notice the missing files.
3. Is it possible to get some metadata from the search and tag pages? It seems impossible to get all of it, but is it possible to get a full list of the blogs the downloaded images came from? I would very much like to see a function for downloading a list of blogs from search and tag pages, so that I can select the blogs that appear most frequently and add them to the program, since they probably contain good content if they offer a lot of content that interests me on the search and tag pages.

johanneszab commented 7 years ago

  1. Only the regular Tumblr blog downloader in the 1.0.8.X releases uses the Tumblr api. The search downloader parses the regular website, so you should be able to run multiple instances without any problems. The SVC release (1.0.7.X) and the downloader for private blogs in the normal release (1.0.8.X) use a web service that is required by the browser to display the website itself. So it might eventually be rate limited too, but I don't think so.
douww2000 commented 7 years ago

Thank you for the great tool! One question: when I run the application for the first time, I can see the textboxes for the download time span (from ~ to) in the Details panel, but when I choose a blog, these textboxes disappear. Is this a function in progress, or am I using it in the wrong way?

Taranchuk commented 7 years ago

douww2000, this is for tag pages only. If you need to partially download blogs, use the page download function. For example, if you only need the last 1000 posts and you have the default of 50 posts per page in the details view, then set the interval to 1-20 in the "Download pages:" field. Or to 1-1000, if you set 1 post per page.

johanneszab commented 7 years ago

Well, not entirely right. I've included it in the release notes (v1.0.8.18) because downloading posts in a defined time span is possible for (private) blog downloads too.

So, I guess you'll have to update to the latest version.

PonyGirl6763 commented 7 years ago

I can't get into any private blogs. I went to settings and successfully authenticated with my Tumblr login credentials, but none of the private blogs I want to back up will download. I've attempted it both on a friend's private blog and my own private blog, and neither will work. Am I missing a step?

johanneszab commented 7 years ago

Am I missing a step?

Yes: describing exactly what happens when you try to download a private blog. What do you see in the queue progress? What happens with the blog, does it just finish or hang? What did you select in the Details window for the private blog? Any tags? And maybe post the url here, so that someone can check if it actually works.

Since you aren't the first person reporting this (see #118 for more), there might be something missing, but the blog posted there actually worked for me. Thus, I cannot do anything, since I cannot reproduce the error.

Of course, you could also debug the code/error yourself, if there is one after all ..

johanneszab commented 7 years ago

Ok, that won't work right now since you need a password to view your blog.

What I meant with a private blog is a blog like this: https://privtumbl.tumblr.com/ where you need to be logged in in order to see them.

It's probably possible to implement something so that it will work with password protected blogs too, but it's not possible right now. What are these things called? It's weird though, since they are named differently all over the place. At least the last time I've looked.

johanneszab commented 7 years ago

Hmm, it's way easier than I thought. You just have to do an additional POST request with the password in the body before browsing the blog, that's it. All the other code can be reused, I guess.

Looks easy to do, but I don't have any time for this right now.
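A rough sketch of that idea with `HttpClient`: POST the password first so the auth cookie lands in a shared `CookieContainer`, then browse the blog with the same handler. The blog url and the form field name (`password`) are assumptions for illustration, not verified against Tumblr's actual password form:

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

class Program
{
    static async Task Main()
    {
        var cookies = new CookieContainer();
        using var handler = new HttpClientHandler { CookieContainer = cookies };
        using var client = new HttpClient(handler);

        // Assumed form field name; the real request body would have to be
        // taken from the blog's password page.
        var form = new FormUrlEncodedContent(new Dictionary<string, string>
        {
            ["password"] = "the-blog-password"
        });

        // The auth cookie set by this response stays in the CookieContainer
        // and is sent automatically with every later request.
        await client.PostAsync("https://example.tumblr.com/", form);

        // From here on, the existing crawling code could be reused unchanged.
        string page = await client.GetStringAsync("https://example.tumblr.com/page/1");
        Console.WriteLine(page.Length);
    }
}
```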


johanneszab commented 7 years ago

Okay, one thing that works in the meantime is to update the cookie with the Internet Explorer. TumblThree uses the Internet Explorer to log in to tumblr.com; the Internet Explorer is just opened in a different window. Thus, they share the same cookie.

To download your password-protected blog, try the following:

johanneszab commented 7 years ago

@PonyGirl6763: I think this will work for you (it downloads password protected blogs). You'll have to supply the blog's password in the Details tab:

TumblThree-v1.0.8.41-Application.zip

johanneszab commented 7 years ago

Doesn't something like a hidden (login required) and password protected blog exist? I can set both options on my second Tumblr blog, but then it's impossible to access it from another account?

After I log in with a second account, I always get a 404 page ("this tumblr does not exist") without seeing any password request page at all.

AriinPHD commented 6 years ago

Hi @johanneszab, thanks for this questions thread (and for a really amazing app)! :) I have a question about "Downloaded Files" vs. "Number of Downloads". Most of the time these numbers don't match up; why is that? What do the numbers represent? I first assumed that "Number of Downloads" was the total number of downloads available after filtering (settings) and that "Downloaded Files" was a way to confirm that you had scraped 100% of the available gallery, but seeing the inconsistency makes me realize they are not working like that at all. Can you please explain? :)

Screenshot (both blog crawls are complete): tumblthree_downloaded-items_2018-02-26_13-56-21

johanneszab commented 6 years ago

Previously it was implemented differently, but the Number of Downloads is the number of downloads (posts, videos, images, external images/videos) TumblThree detected during the current crawl with your given settings, yes. Thus, it's neither the total number of possible downloads nor the total number of posts of the blog. Previously I tried to calculate the total number, but it was never really consistent.

As I've just mentioned, the number of posts can be lower than the number of downloads if the blog contains a picture set, since TumblThree will download all pictures from that set, or if there is an embedded picture within a post. If someone deletes things from the blog, then the Downloaded Files count will be higher than the Number of Downloads. It just was never really right, and people kept complaining, thus I've changed it to the current behavior.

It should be (almost) complete in your case if you download the whole blog at once, yes. But some urls TumblThree grabs aren't accessible on the Tumblr servers anymore. I've seen a few cases (pictures), and I'm sure those images are the reason for the lower count. They just return a 403 error code. I cannot give you an example right now though.

So, it's more or less a rough estimate.

AriinPHD commented 6 years ago

I see, great response @johanneszab thanks! :)

I'll keep that in mind and, I assume, I can safely ignore the numbers and use them only as an estimate of amounts/size. :)

Thanks again!

Hrxn commented 6 years ago

A short question: How does TumblThree determine its "duplicate found" elements?

I have tried the program with a single blog (with a rather big post count) and the number of found duplicates seems a bit high to me, although that's just a guess, I admit. But given that each post on a blog has its own unique post ID, it can't be the posts themselves, or am I mistaken?

johanneszab commented 6 years ago

How does TumblThree determine its "duplicate found" elements?

It simply counts the occurrences of a generic data type that shows up in the download queue. The queue is filled by the website/api crawler tasks and emptied by the downloader task. For photos/videos/audio files, duplicate detection is based on the url; for text posts, it's based on the post id.
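In other words, something along these lines; a hypothetical sketch that keys media by url and text posts by post id, not the actual code:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical duplicate counter: a key that was already seen counts as a
// duplicate and is not downloaded again.
class DuplicateCounter
{
    private readonly HashSet<string> seen = new HashSet<string>();

    public int Duplicates { get; private set; }

    // Returns true if the item is new and should be queued for download.
    public bool TryAdd(string key)
    {
        if (seen.Add(key)) return true;
        Duplicates++;
        return false;
    }
}

class Program
{
    static void Main()
    {
        var counter = new DuplicateCounter();
        counter.TryAdd("https://media.tumblr.com/a.jpg");  // photo post: key is the url
        counter.TryAdd("https://media.tumblr.com/a.jpg");  // same url again -> duplicate
        counter.TryAdd("123456789");                       // text post: key is the post id
        Console.WriteLine(counter.Duplicates); // 1
    }
}
```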

Hrxn commented 6 years ago

@johanneszab Ah, okay. But the url is unique for media files? As opposed to post IDs?

dsteiger commented 5 years ago

Is it possible to download the items in Drafts? This URL: https://www.tumblr.com/blog/yourblognamehere/drafts

tehgarra commented 5 years ago

Is it possible to get an output of the errors shown at the top of the window, since there's no way to scroll or wrap them around?

Apparently some of the _files.tumblr files disappeared when I updated, and I don't know which ones due to the limited frame where they are displayed. The tooltip displays "Serialization Error".

johanneszab commented 5 years ago

@dsteiger: Currently, no. Possible? I don't know, I've never looked into that.

johanneszab commented 5 years ago

@tehgarra: Does it help to move the mouse cursor over the error? If the information isn't in the tooltip, then no.

tehgarra commented 5 years ago

@johanneszab It doesn't. I'll just try matching everything and see which ones have the _files.tumblr files missing, and redownload from there. I think that's my best option. For context: I transferred everything from one hard drive to another and hadn't had any problems so far, but then I updated the version and came across that error, that's all.

I'll try deleting the child id for the directory and see if that works.

Imo the only real issue is the entry being deleted from the list of blogs.

johanneszab commented 5 years ago

You can do a "blog url" export in the settings somewhere once you've loaded all your blogs again.

With that file, if you open it in an editor, you can simply re-add all missing blogs by copying them all to the clipboard (ctrl-a, ctrl-c) and letting the clipboard monitor add the missing blogs.

dsteiger commented 5 years ago

@dsteiger: Currently, no. Possible? I don't know, I've never looked into that.

That's a shame, but understandable. And probably too complex to implement soon.
But it would've been useful to those migrating by Dec 17, 2018. Many have more drafts than they can post by that deadline.

Thanks for the quick reply!

britindc commented 5 years ago

Sorry for the newb question, but I haven't found an answer after reading the wiki or (briefly) looking through submissions:

Is there a way, after fully downloading a blog, to go back a week later and have TumblThree update it to include posts that have been posted to that blog since then?

Great program by the way; I was happy to "donate a beer" :-)

tehgarra commented 5 years ago

Sorry for the newb question, but I haven't found an answer after reading the wiki or (briefly) looking through submissions:

Is there a way, after fully downloading a blog, to go back a week later and have TumblThree update it to include posts that have been posted to that blog since then?

Great program by the way; I was happy to "donate a beer" :-)

I usually just put it back in the crawl queue and it starts downloading the new images. I haven't had any issues at all with it so far. Sometimes I'll do a rescan, but that isn't always necessary imo unless you had moved files and wanted to redownload them.

tehgarra commented 5 years ago

@johanneszab Does TumblThree have the ability to delete files? The reason I ask is that I downloaded files from a blog, ended up having to redownload the blog, and the file count in the folder decreased noticeably compared to before I redownloaded. Could that possibly be duplicate removal?

bepis commented 5 years ago

In what file are the blog URLs stored? Because I deleted a blog folder but it remains in the program's list of URLs. It gives an error when I shift-delete it.

tehgarra commented 5 years ago

The index folder with the .tumblr files @bepis


tommynomad commented 5 years ago

Hallo Johannes & thanks for the app & the opportunity to ask Qs.

I am using basic settings to d/l my own blog. I have managed to grab only a fraction of my posts, however. I hope you can see what I cannot, and advise me.

johanneszab commented 5 years ago

Hi @tommynomad:

Thanks a lot for posting your question in this particular thread! And sorry for the late reply, but Tumblr should have picked a different day for this crap they are pulling, and not just before Christmas ...

Any chance you didn't tick the "download reblogged posts" checkbox further down in your appended screenshot?

[image: tumblthree_nomadicpassions]

I know, I should have reversed that logic; instead it should be a "download only original posts" checkbox, so the default would be to grab everything (#332).

tommynomad commented 5 years ago

Dear Johannes,

Thank you for the reply.

I made sure to also check that box.


johanneszab commented 5 years ago

Well, for me it seems to work. Maybe you can try to re-add this blog. Close TumblThree for this, and remove the corresponding index files in the Blogs\Index\ folder (e.g. nomadicpassions.tumblr and nomadicpassions_files.tumblr).

tommynomad commented 5 years ago

That worked. Thank you Johannes.

peace and love, Tommy

MustangWestern commented 5 years ago

@johanneszab Probably too late for this, and I saw a similar question above, but I didn't fully understand it. So I set up the app to download only the liked posts (I think), but the numbers are a bit screwy. According to Tumblr, I have 2692 liked posts, but my numbers in the app currently look like the attached picture. Should it download properly? Regardless, great app, and thanks for helping everyone above!
