johanneszab / TumblThree

A Tumblr Blog Backup Application
https://www.jzab.de/content/tumblthree
MIT License

Possible stalling/long timeouts on interrupted network connections. #116

Closed: Taranchuk closed this issue 6 years ago

Taranchuk commented 7 years ago

Recently I have often been losing my Internet connection because of a bad Internet provider, and this led me to discover some bugs while using the program.

1) When the Internet connection drops while images are being downloaded, the crawling does not stop; instead the program keeps creating zero-byte (0 byte) files until the crawl finishes or the connection comes back. The same happens with image metadata: if the connection is lost while I am downloading metadata, nothing more is added to the metadata files, but the process still runs until the end of the crawl, so I end up with unfinished metadata files. It would be very nice if crawling stopped as soon as the Internet connection disappears; it is a bit inconvenient to delete the zero-byte files every time the connection drops and to put those blogs back onto the queue panel.

2) I found that the function of checking directories for files does not work for me when I need to download files that were never downloaded as well as files I have already deleted (for example, those same zero-byte files). I put blogs on the queue panel with force scanning, checking directories for files, and downloading files of a specific size (1280) enabled. At first the program downloads the new posts that were added to the Tumblr blogs since the last download, but after that it does not download the deleted files or the remaining files that should have been downloaded. As soon as the process reaches the end, the blogs simply stop: they do not leave the queue panel and do not free up space for the following blogs (something similar happened to me two weeks ago with downloading image metadata, before the fix in https://github.com/johanneszab/TumblThree/issues/113). Since redownloading deleted and not-yet-downloaded files currently does not work properly for me, I have to use the list of links, remove all already existing files from that list, and download the remaining files with another downloader program.

If it helps to clarify the situation, I will add that I use the latest version of the program on Windows 7 with .NET Framework 4.0.3 (64 bit), I download only images of a specific size and image metadata from NSFW blogs, and all files are downloaded to a separate folder on another disk, not into the TumblThree folder, which is on the C:\ drive. Here is a screenshot of the program settings: http://i.imgur.com/BtDUqjs.jpg

johanneszab commented 7 years ago
  1. Force rescanning does not re-download anything. It just scans the whole blog instead of stopping at the last saved post from a successful previous download. Thus, this option basically only updates the statistics (number of downloads, duplicates, ...) accordingly, but it might consume several minutes on large blogs. E.g. on a 200'000 post blog where you've already downloaded 190'000 posts, it would take several minutes to scan the whole blog due to the rate limiting, yet not yield any new downloads. That's why the default scan only scans the newest 10,000 posts, but it cannot update the statistics right now. Even so, I think there should be enough data in the databases to reconstruct the number of newly added duplicates even in the "scan only new posts until the last date" mode, so it would be possible to update the statistics there as well.

    I don't know why people keep asking this. For me it made no sense to implement a "re-download" option for deleted files. Why would I delete them in the first place? With the directory scanning, all you have to do to re-download deleted files is delete the blog (index), re-add it, and run the whole download again. Since it scans the blog folder for already available files, it will only download the missing files (see the sketch below).
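For illustration, here is a minimal sketch of the directory-scanning idea described above. It is not TumblThree's actual code; the class and method names are my own, and the assumption is simply that a file's presence in the blog's download folder is what decides whether it is skipped.

```csharp
// Minimal sketch, not the real implementation: skip a download when the
// target file already exists in the blog's download folder, so only
// missing (e.g. deleted) files are fetched again.
using System.IO;

static class DirectoryCheck
{
    // Returns true when the file still needs to be downloaded.
    public static bool ShouldDownload(string blogFolder, string fileName)
    {
        string path = Path.Combine(blogFolder, fileName);
        return !File.Exists(path);
    }
}
```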

johanneszab commented 7 years ago
  1. I've tested it by disconnecting from my network for a few minutes. For me, simply re-adding the blog to the queue works just fine. There is no need to delete any of the 0-byte files, but doing so doesn't hurt either. None of the incomplete files were added to the database in my test, so crawling the blog again completed the 0-byte files and also added the metadata.

    The grabber/downloader does stall for some reason, however, if I disconnect from the network and then try to stop the crawl while the connection is down.

johanneszab commented 7 years ago

I think I've found the cause of the stalling.

From the MSDN docs:

The Timeout property has no effect on asynchronous requests made with the BeginGetResponse or BeginGetRequestStream method.

I wasn't aware that GetResponseAsync uses BeginGetResponse internally, or that GetResponse and GetResponseAsync would behave differently at all.

So basically there is no timeout, and that's why the crawler/downloader ignores the stop request and/or broken connections and just sits there waiting.
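To make the consequence concrete, here is a minimal sketch of one possible workaround, not the fix that was actually applied: since request.Timeout is ignored on the async path, the response task can be raced against a delay and the request aborted if the delay wins. The helper name is an assumption of mine.

```csharp
// Sketch of a possible workaround, not TumblThree's actual fix:
// HttpWebRequest.Timeout has no effect on the async path, so enforce
// a timeout manually by racing the response against Task.Delay.
using System;
using System.Net;
using System.Threading.Tasks;

static class WebRequestTimeout
{
    public static async Task<WebResponse> GetResponseWithTimeoutAsync(
        HttpWebRequest request, TimeSpan timeout)
    {
        Task<WebResponse> responseTask = request.GetResponseAsync();

        // If the delay completes first, the connection is hanging: abort the
        // request so the pending task faults instead of waiting forever.
        Task completed = await Task.WhenAny(responseTask, Task.Delay(timeout));
        if (completed != responseTask)
        {
            request.Abort();
            throw new TimeoutException("The request did not complete in time.");
        }

        return await responseTask;
    }
}
```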

Edit: The only useful alternative is the HttpClient class, but it seems to work differently than WebRequest, as mentioned here. They say it should only be instantiated once, and HTTP error responses no longer throw exceptions. Thus, I won't update anything soon, since it requires larger code changes, everything depends on it, and there is no time for real testing.
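For comparison, a minimal sketch of the HttpClient approach mentioned above (illustrative only; the class and method names are mine, not TumblThree's): one shared instance with an explicit Timeout, plus an explicit status-code check, because error responses do not throw on their own.

```csharp
// Illustrative sketch of the HttpClient approach, not TumblThree code:
// one shared client, an explicit timeout, and a manual status check,
// since HttpClient does not throw on 4xx/5xx by default.
using System;
using System.Net.Http;
using System.Threading.Tasks;

static class HttpClientDownloader
{
    // HttpClient is intended to be created once and reused.
    private static readonly HttpClient Client = new HttpClient
    {
        Timeout = TimeSpan.FromSeconds(30) // also applies to async requests
    };

    public static async Task<byte[]> DownloadAsync(string url)
    {
        using (HttpResponseMessage response = await Client.GetAsync(url))
        {
            response.EnsureSuccessStatusCode(); // throw explicitly on error status codes
            return await response.Content.ReadAsByteArrayAsync();
        }
    }
}
```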

Taranchuk commented 7 years ago

Hello, Johannes. I'm glad you found the cause of the stalling. I'm sorry for the late reply, I was busy these past days.

The thing is, the directory checking function works, but only on small blogs (up to 10,000 images). On large blogs it does not work, even after removing the index files, restarting the program, and redownloading with only the directory checking function enabled, without force scanning. I tested this on large folders: I moved thousands of files from these folders to another folder and started the program to redownload these blogs with the directory checking function (without force scanning), and it only downloaded a few dozen files for the first two blogs. After that the download stops (on the download panel I only see the words "downloading tumblr_something_1280.jpg" or "Calculation unique downloads, removing duplicates"), and these blogs remain on the queue panel without freeing up space for the following blogs. I deleted these blogs from the queue panel, restarted the program, added other blogs to the queue panel, and it all happened again. I guess the directory checking function simply does not work on large folders; that must be the reason.

Did you test on large blogs (10,000+ images), i.e. move thousands of files from the blog folder to another one, then run the crawl with the directory checking function and compare the number of downloaded files with the number of files moved to the other folder? For me, the directory checking function does not work on the following blogs (I tested them): http://big-boobs-guru.tumblr.com http://eymard19.tumblr.com http://sintillator.tumblr.com http://blanche-gandon.tumblr.com All of these blogs contain from 14,000 to 30,000 images; if it's not too difficult, please test on some of them, in case you do not have large blogs for testing.

Previously, I ran into this bug when I downloaded these blogs: http://magicbeauty.tumblr.com http://partsninja.tumblr.com When the Internet connection dropped during the download, a lot of files had already been downloaded and I also saw many zero-byte files. I thought those files were corrupted, so I deleted them and then tried to redownload these blogs using the directory checking function, but without success, I guess because they are also large blogs (over 150 thousand images).

Taranchuk commented 6 years ago

I guess I should close this issue, since everything described here has become irrelevant with the release of the new versions. Also, I'm not a native English speaker, so I did not quite understand the meaning of the word "stalling"; online dictionaries offer many meanings for it. If it means the download process stopping in the middle of a download, then the last update fixed this too, at least in my case; I wrote about it here: https://github.com/johanneszab/TumblThree/issues/132#issuecomment-331896552. Since this bug has disappeared and nothing else written in my first comment is really a bug, I am closing the issue so as not to clutter up the issues section. Thanks again very much for your work, I really enjoy the new version of the program; downloads go very smoothly and without problems.