[test] Help testing the upcoming release!

johanneszab commented 6 years ago

Hi all.

I've refactored the code a bit and since the current release is rather stable I though I'd release a pre-production release via the "Issues" with the hope that some brave people are willing to test it. All feedback is welcome.

I've fixed some issues:

TumblThree now includes a working detection between hidden and offline blogs that doesn't require queuing the blog.
Fixes a bug that released the video connection semaphore too often. That means the slider in the settings for limiting the video downloads didn't work at all. Now it should properly limit the connections to the vt.tumblr.com host and prevent incomplete video downloads.

New features are:

There is now an option for regular and hidden blog downloads to dump all the data gathered by the crawler to the disk. Right now for each post a separate file is created. I wasn't sure if I should write it all in one single file per blog (could get large rather quickly) or split it. It completely obsoletes the current metadata download options since it contains more information. The SVC branch (v1.0.7.x) contains way more crawler data compared to the master (v1.0.8.x) branch.
You can now specify in the Details panel for each blog where its files should be downloaded. If the Textbox control is empty, the files are downloaded as in previous releases in the global download location specified in the settings plus the blog name. I though of this feature as some kind of trade off for the feature suggestion to avoid downloading duplicates across different blogs (see #151). Now you can download everything into a single folder, but the blog databases/indexes stay separated. I'm not sure if that's a good approach. There is still a back-mapping missing if you want to "restore" the files for each blog. Since the .tumblr databases are text files, one could script that. This feature will certainly break the metadata files if you download multiple blogs at once as each downloader will compete for accessing the files, but most likely not the dumping of the crawler data in the bullet point above. The global database probably requires a lot of testing. I don't have any experience in how a disk based approach like sqlite will behave and perform if multiple concurrent downloads want to access it, nor how it will handle several million entries and/or how the large the memory usage will be. And a server-client based solution is certainly overkill. I'm not sure about this at all, thus I might remove it again if it has no use. This is completely untested and how it should work in theory. I cannot easily test it due to lack of data/time.

Release a:

Fixed some bugs introduced during the refactoring.
Added an option to load all databases into memory and compare each to-downloaded binary file to all databases across TumblThree before downloading. If it's already in any database, the file is skipped and will not be counted as downloaded. I'm not sure if that's the right approach, since it seems that TumblThree will miss downloads in the user interface, but if we count them, then ppl will complain that files are missing. If we reduce the overall number, then TumblThree will not grab everything .. You can enable it in the settings->global. You'll have to restart TumblThree afterwards.

Release b:

Fixed more bugs introduced during the refactoring. Should be working now even with the "load all databases into memory"-mode.

Release c:

Rewrite of the blog detection during blog addition. Should reduce latency and offline blogs aren't added anymore.
Allows to change the "Load all databases into memory"-setting without application restart.

Release e:

Notifies the user when a connection timeout has occurred. The message states if the timeout has occurred during downloading or crawling.

Release v1.0.8.34:

Loads the libraries and the _files databases in separated tasks. The crawler awaits until the databases are fully loaded into the memory if the option "global databases file check" is enabled. This should hopefully fix the issues mentioned here and below.
Runs the blog online checks at startup in a separate task to reduce the user interface lag.
Same release as the "official" release under the release tab.

Thanks!

apoapostolov commented 6 years ago

Hello,

Thank you very much for considering my request and implementing this.

I see a positive and several drawbacks with this implementation.

Pros:

Now it is possible to merge several thematic blogs into a single download folder, i.e. comics blogs, anime blogs, p*rn blogs, etc. This is very useful when few thematic directories is better than hundreds of directories per blog.

Cons:

Even blogs with different theme may still reblog same content.
Having folder with millions of images makes WinExplorer choke and it takes up to HOURS to index them. Photo management apps may take long time to index, thumbnail them. Makes these folders impossible to browse within acceptable timeframe,

IMHO the best solution would be to load all indexes in memory, and apply them in real time against scanned blog contents. I would prefer if every blog had a setting:

[ ] Skip other tumblog content Hint: Do not download content previously downloaded from other tumblogs.

However I also see the appeal of your implementation and would not suggest that you remove it - it has great potential!

I will be testing it over the next week, and let you know if I encounter issues.

johanneszab commented 6 years ago

IMHO the best solution would be to load all indexes in memory, and apply them in real time against scanned blog contents.

I can add this. It's not hard to do and is actually how it was before this release. It certainly consumes more memory, but might not be that much of an issue with our systems nowadays and of course somewhat scales with the usage. Download speedwise it should certainly be the best option. Since the old releases also leaked some memory I've no real comparison whatsoever.

If you are going to test this, I'll add it later this week. I'd like someone to test it who has downloaded way more content than I have.

apoapostolov commented 6 years ago

Thank you! I will definitely commit to testing this with a setup of 100+ blogs. I will test both the OP "merged directories" method and the latest proposal, when added.

I also started to like the "merged directories" functionality a LOT, so please really don't remove it.

Taranchuk commented 6 years ago

Downloading images works fine. But the program does not download image metadata (in format images.txt), does this mean that this function has disappeared completely from the program? There are xml files here, do they replace this function? I looked at xml files inside, they contain more data, although for me useful information among them is actually not much more than it was in the previous metadata files (I found useful among the new data only the note count), there is more unnecessary information (like avatar urls and links to smaller files) and just a lot of redundant text. Just as an example, here's downloaded xml file for only one file. In an unpacked size, the xml file occupies 15.6 kilobytes on the disk. In a metadata file, this would be only 6 to 8 lines approximately (and the useful information will be almost as much as there is in an xml file). tumblr_omnfiaNZrm1r8wg3no1_1280.zip For interest, I also compared the total size of xml files with the size of one metadata file from the same blog (using the old version of the program) and found that for a blog with 2500 images, xml files occupy 90 megabytes, and one images.txt file occupies 612 kilobytes. The difference is more than a hundred times. It would be nice to reduce the size of xml files somehow. Just personally, I download many metadata files, such a difference is very significant for me. Or if you would return the option to download metadata in the previous format, then it would be very good for me. Thank you very much for hard work and for regular updates!

johanneszab commented 6 years ago

Or if you would return the option to download metadata in the previous format, then it would be very good for me.

It actually was a regression. I've messed up the refactoring that broke downloading of all text posts. I've brought it back in the -a release that I've uploaded in the entry post of this issue.

I think the dump crawler data makes only sense if you use the svc branch. I've uploaded a v1.0.7.40a release that you might want to test. The .json files the svc crawler generate look similar to this (100 posts).

Taranchuk commented 6 years ago

Downloaded v1.0.7.40a release, image metadata are downloaded normally, thanks! I noticed that if enable the option to download files by specific size (1280), then the program for one file downloads json files in at least two versions (250/400/540 and 1280). And if there are files from the photosets, then for them the program creates files with different counters, but with the same content inside. And here there are also other files with names consisting only of digits, they also repeat the content from the files tumblr _ * .json. In general, if in the Total Comember run a search by same size and content, then the program finds many duplicate files. For example, in my folder among the 5340 json files only 1127 are unique, the rest are copies. Here is a screenshot of the search in Total Commander. https://i.imgur.com/kXRDWQn.jpg And here is a screenshot of the selection of all copies of files, except for the unique ones. https://i.imgur.com/cIg4Ldu.jpg In general, if count file sizes, then json files together take up 55.6 megabytes and 47.6 megabytes from them are actually copies. In v1.0.8.33 about the same thing, but it seems the program does not create for one image two versions of xml file when I enable the option to download files by specific size.

Taranchuk commented 6 years ago

Hello, Johannes I just downloaded the new version and found that I can not add more than 50-60 blogs from the clipboard at a time if I copy a lot more addresses. First blogs are added, and then this process stops completely and further blogs are not added. It seems to me that the reason is not in any particular blog, but simply the new version does not process than 50-60 blogs. Here is a list of 3500 blogs, you can see if there is a problem on other machines if you copy the entire list completely, and then from this program adds only the first 50-60 blogs and nothing more. 3500 blogs.txt

johanneszab commented 6 years ago

Ah, I see. I can already imagine what the problem is. I'll fix it right away.

Thanks for testing!

Edit: Should be fixed.

apoapostolov commented 6 years ago

Using 1.0.8.33, Windows 10

I have a suspicion that the new version crashes on attempt to add 18+ blogs. I add blogs directly via entry field in the app, and the application crashes. If I use the clipboard, then it doesn't catch the blog. Tried the same with kitten blogs, and not a single one caused such issue which makes me suspect that this may be related to NSFW tags.

Here is one fairly tame 18+ blog http://torikev.tumblr.com/ that crashes every time on attempt to add.

EDIT: Seems to be random thing, maybe not 18+ related. Many blogs work, others (such as the above) cause crash every time.

Taranchuk commented 6 years ago

I downloaded the new version 1.0.8.33d, the program already adds more blogs, but not all. Of 3500 blogs (most of them are exactly online) the program added only 747 blogs. When I cleared the list of 3500 blogs from the already added 747 blogs and copied the remaining part of the previously missed addresses, the program added only a part of them (288 blogs of 2751 blogs that the program previously missed). On the third time, when I cleared the list of addresses from successfully added blogs and copied it, the program again added only part of them. Thus, the program still does not cope completely with the addition of all addresses in the program. And probably the reason is not the big size of the list because when I tried to copy 50 blogs, the program of them added 28 blogs. And when I only tried to copy only 10 addresses, the program of them added only 5. For an example, try adding these 20 blogs to the program. I checked all of them in the browser and they are exactly online. But of them the program adds only 8 blogs. 20 blogs.txt

update. Tried to manually add the missing blogs to the program and the program was crashed exactly as described above. And the blog http://torikev.tumblr.com/ is also not added from the clipboard and manual addition also causes crash. It seems that this may be a related problem for us.

johanneszab commented 6 years ago

Ah, looks like there were two issues with the new detection. The ones that aren't added in your 20 blogs text file example are safe mode blogs.

Taranchuk commented 6 years ago

If you want to test, then here is a list of 2608 blogs, all the metadata from them were downloaded in the previous version a few days ago, so they are all exactly online. Do not use list of 3500 blogs, it's my old list and some blogs are probably already offline. blogs.txt

Taranchuk commented 6 years ago

With the new version, adding blogs has really improved! I see that all blogs from the list have been added. Very good! Thanks!

apoapostolov commented 6 years ago

Thanks for the quick response! Using 33e version, all blogs that crashed seem to resolve properly now and get catched by the clipboard service. No crashes so far.

apoapostolov commented 6 years ago

Has anyone encountered a very long pause after a blog seems to be downloaded completely and has spent a long time evaluating >100,000 tumblr posts, without proceeding to the next one?

keokitsune commented 6 years ago

ya, after a blog has finished downloading it stalls on that last download and wont move to the next blog.

johanneszab commented 6 years ago

It's too general. I've had it briefly tested and added these 3 small blogs, queued them all, enabled all options (with/without load all databases) and they all finished downloading:

http://wallpaperfx.tumblr.com/ http://mywallpapercollections.tumblr.com/ http://nature-e.tumblr.com/

So, what options did you enable? Can you specify a specific a blog?

I'll probably not code anything until next year. So if it always happens for you, I'll remove the latest releases and see if I can fix it next year.

keokitsune commented 6 years ago

i had global database enabled, images,linked and reblogged selected.

would it also be possible to have an option to load all blog database into one single file. i delete the folder after a blog goes offline because i use the folder names to generate a list to add blogs back the to programs blog list when they dissapear due to crashes ect.

apoapostolov commented 6 years ago

Hello. I had global database enabled, images + videos, included reblogs, set a folder (shared with 1 other blog) and minimum date 20171101 with no maximum date. Forced rescan is on.

After the minimum date was matched, the blog went to collect html of all blog posts without downloading them (probably because of forced rescan). Then it stalled at reaching 48500 of 48474 posts.

johanneszab commented 6 years ago

Well, I cannot reproduce any error, stall after complete crawl or hang after shutdown.

I've added 400 and more blogs, downloaded 15000 and more posts, turned on/off global database with otherwise default settings, shutdown TumblThree during the crawl, added blogs during the crawl -- it behaves exactly like version v1.0.8.32 for me. There was someone on my website who has apparently the same issue, but I am unable to fix it if I cannot reproduce it.

If someone here can provide a more detailed description, or is capable of programming and can debug the issue himself that would be wonderful. Do you get any error messages during normal operation about connection timeouts using v1.0.8.33?

keokitsune commented 6 years ago

I have image,video,reblog,linked and force rescan enabled, I also have the global option selected. The blogs will begin to scan like normal with no apparent issues but will seemingly stall at the last file, no matter how long I leave it the crawl is never completed, no error message or crash or anything, it just wont complete the blog crawl, if I press stop nothing seems to happen, the blog just continues to be unsuccessfully downloaded, and the thing is when you re-released this new update it was downloading fine from what I could tell, but after shutting the program down for the night and restarting it the next day is when it began giving me issues again, deleting app data and redownloading the program doesn’t seem to change anything. I didn’t change or do anything different between that time, it just randomly decided it would no longer work.

johanneszab commented 6 years ago

Could you upload the database files (_files.tumblr + .tumblr or similar) for a blog that stalls and your settings from C:\Users\AppData\Local\TumblThree\Settings\?

keokitsune commented 6 years ago

iv also noticed that on top of the previous symptoms if i whipe the blog list and add blogs and directly crawl them without closing the program everything downloads normal, it will reference and block re downloading from the global database. but if i close the program and reopen and then crawl the blog whether it was an old blog or a newly added one that i haven't done anything with the crawler will download all of the content even if its repeated files, so its ignoring the global database and then hangs on the final file in the blog and remain incomplete.

altho i can remove the blog from the index but keep the _files files and readd the blog to the crawl list and crawl without closing the program and it will work appropriately and remember the files that were downloaded from the failed crawl. so if i just keep deleting the blog list and leaving the files and reloading them into9t he blog list i am able to download things normally, but only if i constantly reload the blog list every time i open the program and sit though it loading the blogs back into the program.

New folder.zip

johanneszab commented 6 years ago

Thanks for the description! Based on it, I've an idea what the problem might be: Maybe the _files files aren't completely loaded and restored in memory when you start crawling. As most of the code is using async-code, and the database loading code too, it actually runs concurrently and might not be finished when the GUI is up. Thus, if you have several hundreds of blogs with a lot of data, and you start the crawl after TumblThree is up, the method for loading the blogs _files databases might still be loading, but the crawl already tries to access them and that fails something I'm catching somewhere else in the code.

Maybe if I find some time in the next days I can quickly add a notification event that fires when everything is loaded. Otherwise you might want to check the Task Manager and see for the Disk I/O of TumblThree after starting it, and then maybe start the crawl a bit delayed. If that fixes the problem, then that was the issue.

I've tested your settings and the single blog file with 2-3 additionally added blogs, closed TumblThree, re-opened it, and it worked.

Edit: That also means there should be no issue if you disable the "global database"-option. Edit2: Okay, using your settings file, I'm rather sure now that this is the issue. Also the databases aren't loaded in the current code until all blogs have been checked for their online status. Since the online status checking is rate limited for regular blogs, it might take some minutes depending on your amount of blogs, until the _files databases are loaded. Edit3: The blogs online check at startup runs on its own task now to reduce the ui lag.

johanneszab / TumblThree

[test] Help testing the upcoming release! #179