johanneszab / TumblThree

A Tumblr Blog Backup Application
https://www.jzab.de/content/tumblthree
MIT License
922 stars, 130 forks

Not an issue but rather a procedure and question #107

Closed tringate closed 7 years ago

tringate commented 7 years ago

Warning, very long post

I want to validate how I am using TumblThree.

My needs require running several instances of TumblThree, not at the same time, but individually, one at a time.

I have enabled "Activate portable mode" in all instances. When I update an instance from release to release, I first delete everything in the root directory of the instance except the "Blogs" folder. I then copy the proper language folder and the 12 program files from the new release into that folder. My next step is to launch TumblThree, go into settings, check "Activate portable mode", correct the file path for exporting blogs, and verify that all other settings are correct. I then do an "authenticate" followed by a "save". I close the instance of TumblThree and relaunch it, checking to make sure I have not made any mistakes or omissions. If everything is good, I select all blogs, add them to the queue, and do a crawl.

This sequence has been working very well for me but I have a couple of questions.

(I have looked in the Windows AppData folder for TumblThree and verified that those files have not changed since I switched to "portable" mode, so I presume all files TumblThree uses are now contained in the root instance folder and the index folder.

If the above is true, then I would think it should be technically possible to run multiple instances of TumblThree at the same time.)

In looking at the format of the files in the "index" folder, I see that the "ChildID" and "Location" fields both have a fully qualified path. To move a blog between instances, I have simply been copying the two files pertaining to that blog from one instance's index to another and moving the blog's photo folder to the other instance's blog directory. I had never looked at the files. Doing this appears to work just fine. I do not remember which blogs I have done this to recently, so I can't check whether the software updated these two fields or not. I am wondering if I have caused an issue that is going to bite me in the future by doing this.

I am currently running release 4.55 and moving to 4.57 in 10 instances, which handle well over 100 blogs each.

I tried moving to release 6.8, but after running through many of the blogs it freezes my entire computer with a solid HDD light on. The only way out is to press the reset button and reboot the computer.

I want to slowly move forward to release 6.14. It may solve an issue I have with just one of the 1,000 blogs I follow: it always stalls at file 73,352. The other crawl continues to process all the following blogs in that instance. I want to see if the "stall" fix in 6.14 takes care of this. I am thinking there is something wrong with the actual blog on Tumblr. I am able to remove the blog from the queue, but that crawl stream is gone until I restart the program.

Release 4.57 is being tested now and so far has been used on two of the 10 instances without any new issues.

I am reading that release 4.59 will change the file name to reflect whatever I have selected as the highest image size, regardless of what image size it actually finds. That is going to give me a terrible problem, because the software would name a file differently from the file it is downloading. I want to be sure I am reading that comment correctly. Is that behavior being carried forward in all releases from 4.59 onward? I would need to download the several million photos I have all over again, because I need a consistent file name structure for all of the offline processing programs I run. I'm not sure why a program would alter the true file name from what the source file name was. In my mind that is almost intentional file name corruption. I would think the real file name would always be the file name of choice.

I know this idea came about with the raw addition. I tested a raw download of well over 150,000 photos. I discovered that nearly every one of them was identical in size to an existing file with a smaller size in its name: I had two identical files, one named raw and another with either a _500 or a _1280 file name. If TumblThree found any that were actually larger, it sure wasn't evident. It certainly was not worth the time to do further downloads.

I presume that if I select 1280 as my desired size, the resulting file name will be that of the largest file found. I look at this option as selecting the largest file size available up to the file size selected. I would always expect the file name to match the name of the file it downloaded from Tumblr. Have I misunderstood what this option is really intended to do? I certainly do not want every downloaded file to be named after my preference if only a smaller size is available.

I know this is a bunch of questions, with lots of other information added in, but I do not believe I am using TumblThree the way most users would. Others have expressed an interest in running multiple instances, maybe at the same time. My reason is more to implement the "group" idea: being able to download and save groups of blogs that are associated with the same photo topic and place them in a specific folder grouping.

I use offline programs to identify duplicates and delete them based on file date, keeping only the very oldest posted version of a photo. Presumably it will be in the blog where it was originally posted, or in one of the early reblogs if I do not download the blog where it was first posted. When TumblThree implemented the change in file creation date, I was able to reduce my photo collection by over 50%. That was an immense space saving and a significant performance enhancement to my total process.

johanneszab commented 7 years ago

Would it actually be possible to run more than one instance at a time?

It is possible.

The limiting factor is the number of connections you open to the Tumblr API. The limit is entirely controlled by Tumblr.com, and 90 connections per 60 seconds seems to be a value at which they don't close any connections, so you'll have to split that number of connections among your instances. For example, if you start TumblThree twice, you'll have to set the connections to 45 in each instance, so that together they open a maximum of 90 connections during the crawl period.

It should still be quick enough depending on your connection since the download of the pictures/videos usually takes more time than the crawling for urls.
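The split described above is simple arithmetic; a minimal sketch follows. The function name `split_connections` is hypothetical and not part of TumblThree, and the 90-per-60-seconds budget is the empirical value mentioned above, not a documented limit.

```python
def split_connections(total_limit: int, instances: int) -> list[int]:
    """Divide a shared connection budget across instances without exceeding it."""
    if instances < 1:
        raise ValueError("need at least one instance")
    base, extra = divmod(total_limit, instances)
    # The first `extra` instances get one more connection, so the
    # per-instance values always sum to exactly total_limit.
    return [base + 1 if i < extra else base for i in range(instances)]


# Assumed budget: 90 connections per 60 seconds, shared by all instances.
print(split_connections(90, 2))  # [45, 45]
print(split_connections(90, 4))  # [23, 23, 22, 22]
```

Each value would be entered by hand as the connection limit in the settings of one instance.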

As an alternative, you could try out the SVC release (e.g. the v1.0.7.X releases). I actually discovered this service while implementing the private blog downloader. It outputs even more data about the posts of a blog than the Tumblr API, but it does not seem to be limited. They possibly cannot even limit it, since their own webpage depends on it. I've implemented most features there already, and you can untick "Limit connections to the SVC server" or keep the limit at a value that fits your needs. For example, you could leave it at 90 connections per 60 seconds in, say, 4 instances without being limited. You'll have to figure it out on your own.

Is it necessary to do the "Authentication" and does that persist for future runs of the same instance?

It's only necessary if you want to download private blogs, which use the SVC service. Without a valid cookie it doesn't respond. Thus, you'll have to authenticate if you test the SVC release, or if you use the normal/master release (v1.0.8.X) and want to download private blogs.

It uses a wrapper around Internet Explorer, and the cookie is stored in the Internet Explorer cache. So it should be persistent. If you want to delete the cookie, you can open Internet Explorer and delete your cache there.

johanneszab commented 7 years ago

In looking at the format of the files in the "index" folder I see that the "ChildID" and"Location" both have a fully qualified path.

The path in the Index files doesn't matter much. It's only used internally during the run time and depends on your download location setting. As long as the _files file is in the same folder as the main index file it should work properly now IIRC.

Thus, if you move the file to a different folder and change your download location it should still work.
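Moving a blog between instances, as described above, could be scripted roughly as follows. This is only a sketch under assumptions: the folder layout (an `Index` subfolder next to one download folder per blog) and the index file names passed in are hypothetical, and `move_blog` is not a TumblThree function. The one point taken from the thread is that the main index file and its `_files` file must end up in the same folder.

```python
import shutil
from pathlib import Path


def move_blog(blog_name, index_file_names, src_root, dst_root):
    """Move a blog's index files and its download folder to another instance.

    Assumed (hypothetical) layout: <root>/Index/<index files> and
    <root>/<blog_name>/ for the downloaded media.
    """
    src_root, dst_root = Path(src_root), Path(dst_root)
    src_index, dst_index = src_root / "Index", dst_root / "Index"

    # Check everything first, so the main index file and its _files file
    # are never split across two instances by a half-finished move.
    missing = [n for n in index_file_names if not (src_index / n).is_file()]
    if missing:
        raise FileNotFoundError(f"missing index files: {missing}")

    dst_index.mkdir(parents=True, exist_ok=True)
    for name in index_file_names:
        shutil.move(str(src_index / name), str(dst_index / name))

    # Move the downloaded media folder, if the blog has one.
    src_media = src_root / blog_name
    if src_media.is_dir():
        shutil.move(str(src_media), str(dst_root / blog_name))
```

The caller passes the exact names of the two index files belonging to the blog; the sketch deliberately avoids guessing them by pattern, since a prefix match could catch another blog with a similar name.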

johanneszab commented 7 years ago

I first delete everything in the root directory of the instance except the "Blogs" folder.

I think you'd also have to delete the main settings file in the AppData folder (AppData\Local\TumblThree\Settings). If TumblThree doesn't find a valid settings file in the folder where it is located (portable mode), it will automatically read the settings in AppData. Thus, if you want to completely reset the settings, you'd have to delete both settings files.
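The lookup order described above can be sketched like this. It is an illustration of the fallback logic only, not TumblThree's actual code: the file name `Settings.json` is a hypothetical placeholder, `find_settings` is an invented name, and the check for whether a settings file is valid is reduced here to whether it exists.

```python
from pathlib import Path
from typing import Optional

SETTINGS_FILE = "Settings.json"  # hypothetical file name, for illustration only


def find_settings(app_dir: Path, local_appdata: Path) -> Optional[Path]:
    """Return the settings file that would be read, or None if there is none.

    The portable copy next to the program wins; otherwise fall back to
    <local_appdata>/TumblThree/Settings.
    """
    portable = app_dir / SETTINGS_FILE
    if portable.is_file():
        return portable
    fallback = local_appdata / "TumblThree" / "Settings" / SETTINGS_FILE
    return fallback if fallback.is_file() else None
```

Under this reading, deleting only the portable copy leaves the AppData copy to be picked up on the next start, which is why a full reset needs both removed.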

In the next release I'll specify whether the settings need an update or not. But I'll probably not implement any update or backward compatibility mechanisms, since it doesn't make sense right now. Early in the development of an application, things might change quite rapidly. I've also implemented some features I hadn't foreseen, so adding backward compatibility code already would make the whole development harder. It heavily blows up the code, makes it more complicated, and then probably doesn't work at all.