johanneszab / TumblThree

A Tumblr Blog Backup Application
https://www.jzab.de/content/tumblthree
MIT License

[Feature Request] Re-Download a Blog function #121

Open falbalus opened 7 years ago

falbalus commented 7 years ago

Would it be possible to add a re-download feature in addition to the rescan feature?

I have deleted some files physically and started a rescan. In my opinion, a rescan should detect that some files are missing, but apparently they are still "stored" somewhere in the index files.

johanneszab commented 7 years ago

I think it's superfluous.

Removing the blog index and re-adding the blog does the exact same thing and doesn't need any additional code. All files still in the download location will be re-added to the blog index when the downloader starts to download them as the downloader checks the file sizes before downloading in order to resume any canceled downloads. If they are equal, it doesn't do anything but adds them to the index as finished files.

Almost the same would happen if I added some rescan logic.
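The size check described above can be sketched roughly as follows. This is an illustration in Python, not TumblThree's actual code: the function name `plan_download`, the action strings, and the use of an HTTP `Range` header for resuming are all assumptions made for the example.

```python
from typing import Optional, Tuple


def plan_download(local_size: Optional[int], remote_size: int) -> Tuple[str, Optional[str]]:
    """Decide what to do with one file, based only on its size on disk.

    local_size is None when no file with that name exists yet.
    Returns (action, range_header); range_header is only set for a resume.
    All names here are hypothetical and only illustrate the idea.
    """
    if local_size is None or local_size == 0:
        # Nothing usable on disk: fetch the whole file.
        return ("download", None)
    if local_size >= remote_size:
        # File is already complete: no transfer, just record it as finished.
        return ("mark_finished", None)
    # Partial file: request only the missing tail of the file.
    return ("resume", f"bytes={local_size}-")
```

With this logic, files that survived in the download location are re-registered without any transfer, while deleted or truncated files are fetched again.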

johanneszab commented 7 years ago

Ok, I'll eventually add it. Comparing the directory with the content of index file shouldn't be too much code. But for now, I think you can do the remove/readd workaround. Only works for binary files though (pictures, videos, audio).
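Comparing the directory with the index content could look roughly like the sketch below. It is written in Python for illustration only; the function names are hypothetical, and it assumes the index can be reduced to a plain list of file names, which may not match the real index format.

```python
from pathlib import Path
from typing import Iterable, List


def missing_from_disk(indexed_names: Iterable[str], names_on_disk: Iterable[str]) -> List[str]:
    """Return the indexed file names that are not present on disk."""
    on_disk = set(names_on_disk)
    return sorted(name for name in indexed_names if name not in on_disk)


def find_missing_files(download_dir: Path, indexed_names: Iterable[str]) -> List[str]:
    """Compare a download directory with the names recorded in the index."""
    names_on_disk = (p.name for p in download_dir.iterdir() if p.is_file())
    return missing_from_disk(indexed_names, names_on_disk)
```

A rescan could then remove the returned names from the index so that the next crawl downloads them again.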

falbalus commented 7 years ago

... because the workaround never worked, a built-in option would be nice.

johanneszab commented 7 years ago

Sure it works. What exactly should not work there?

falbalus commented 7 years ago

I tried it several times.

a) Open TumblThree
b) Add a blog
c) Download the blog completely
d) Close TumblThree
e) Delete some files physically
f) Open TumblThree
g) Download with the Re-Scan

Deleted files will not be downloaded again (at least on my machine).

a) Open TumblThree
b) Add a blog
c) Download the blog completely
d) Close TumblThree
e) Delete some files physically
f) Open TumblThree
g) Delete the blog
h) Re-add the blog

Deleted files will not be downloaded again (at least on my machine).

This has been my experience all along, and that is why I asked for this re-download feature ;-)

johanneszab commented 7 years ago

> a) Open TumblThree
> b) Add a blog
> c) Download the blog completely
> d) Close TumblThree
> e) Delete some files physically
> f) Open TumblThree
> g) Delete the blog
> h) Re-add the blog
>
> Deleted files will not be downloaded again (at least on my machine).

Except that they will be downloaded again. I've just tested it. Downloaded 67 images from http://mywallpapercollections.tumblr.com/ , removed 20 images. Deleted the blog (index only) from within TumblThree, re-added it and crawled it again, and 67 images are back.

It wouldn't make any sense otherwise, since all previous information of the first download is gone if you delete the index files.

TehBotolSosro commented 7 years ago

Would it be possible to include a file hash function or a file size comparison? I notice there are a lot of 0 KB jpg/mp4 files in the blog download, and some videos are corrupted (only half the duration was downloaded). I would love that function so I can rescan the blog and fix the corrupted/missing files.

thanks
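Until such a feature exists, a check like the one requested can be approximated outside the application. The following Python sketch is only an illustration: `find_suspect_files` and the `expected_sizes` mapping (file name to expected size in bytes) are hypothetical, and TumblThree does not provide such a mapping itself.

```python
from pathlib import Path
from typing import Dict, List


def find_suspect_files(download_dir: Path, expected_sizes: Dict[str, int]) -> List[str]:
    """List files that are empty or shorter than their expected size.

    expected_sizes is a hypothetical mapping of file name -> size in bytes.
    Files without an entry are only flagged when they are 0 bytes long.
    """
    suspects = []
    for path in sorted(download_dir.iterdir()):
        if not path.is_file():
            continue
        size = path.stat().st_size
        expected = expected_sizes.get(path.name)
        if size == 0 or (expected is not None and size < expected):
            suspects.append(path.name)
    return suspects
```

A size comparison only catches empty and truncated files; detecting corruption inside a full-length file would need a hash from the server, which is not assumed here.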

johanneszab commented 7 years ago

I'm probably not adding anything anytime soon.

Lower your parallel connections to <= 8 (?). The Tumblr servers drop all open connections every xx seconds if there are too many parallel open connections. I don't know why, but I've noticed that all connections get closed at the same time, regardless of when they were opened.

Say you download 20 videos in parallel: most of them are probably not finished at that time. TumblThree reopens each forcefully closed connection a few times, as specified by "MaxNumberOfRetries" in the settings.json, but I think it's better not to let that happen in the first place.

I haven't played around more to figure out the exact number of possible connections, but if you only download videos, I personally would set it to 8, since I remember that number working without the issue. I've written more about this in this issue. You can check it for yourself: open the code in Visual Studio, download several videos in parallel, and you'll see that all connections get forcefully closed at the same time.

I wasn't sure if I should change the default settings though, since most people probably download images which seem to be okay (since they are smaller).
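The combination of a capped number of parallel connections and a bounded number of retries can be sketched as follows. This is a Python illustration, not TumblThree's code: `download_with_retries`, `download_all`, and the `fetch` callback are hypothetical names, and only the idea of a retry limit such as "MaxNumberOfRetries" comes from the discussion above.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def download_with_retries(fetch: Callable[[str], None], url: str, max_retries: int) -> bool:
    """Call fetch(url); if the connection is dropped, retry up to max_retries times."""
    for _attempt in range(max_retries + 1):
        try:
            fetch(url)
            return True
        except ConnectionError:
            # Connection was closed by the server: try again if attempts remain.
            continue
    return False


def download_all(fetch: Callable[[str], None], urls: Iterable[str],
                 parallel: int = 8, max_retries: int = 2) -> Dict[str, bool]:
    """Download all URLs with at most `parallel` connections open at once."""
    url_list = list(urls)
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        results = pool.map(lambda u: download_with_retries(fetch, u, max_retries), url_list)
        return dict(zip(url_list, results))
```

URLs that map to `False` have used up their retries; per the discussion, such files should stay out of the blog's database so that a later crawl picks them up again.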

@TehBotolSosro: What version are you using? Do you have connection issues in general? I basically never have 0 KB images. What are your parallel connection settings set to?

TehBotolSosro commented 7 years ago

v1.0.8.18, 16 connections and 2 parallel.

Also, I have tried deleting the corrupted videos and the 0 KB files, deleting the index, removing the blog, adding it again, and crawling it again.

It downloaded the video again, but it is still corrupted. When I download it directly using IDM, the video is not corrupted (50 MB mp4 video).

I've tried two times already; I will try again with fewer connections.

Usually, in 9k files (1 blog), there are 20-40 corrupted videos and a dozen 0 KB files.

Also, why do the connection settings get changed back to the defaults when I reopen the program, even though I already saved them?

johanneszab commented 7 years ago

Okay. If it works for you with a lower number of connections, I'll update the default settings and add a tooltip to the connections slider with some explanatory text.

> Also, why do the connection settings get changed back to the defaults when I reopen the program, even though I already saved them?

Doesn't happen here. Do you actually press the save button, or do you just close the settings window with the close button?

TehBotolSosro commented 7 years ago

Okay, lowering the parallel connections to 4 and increasing the timeout worked fine; the video downloaded without any corruption.

I did press the save button, but I will try more later, since it is still downloading and I don't want to mess it up.

Two questions: if I stop the download while it is still downloading a file and later crawl the blog again, will the file get resumed or not?

Also, do I need to delete the corrupted videos before I delete the blog and re-add it? I mean, will it check the hash/size? Otherwise I would need to manually check them before I can delete them.

Edit: after changing the connections to 4, increasing the timeout, clicking save, exiting and reopening, the settings get changed back to 16 and timeout 120. Is it because I usually overwrite the old files when I update TumblThree? The token and download folder settings are still there when I reopen the program (not reverted to default).

johanneszab commented 7 years ago

> If I stop the download while it is still downloading a file and later crawl the blog again, will the file get resumed or not?
>
> Also, do I need to delete the corrupted videos before I delete the blog and re-add it? I mean, will it check the hash/size? Otherwise I would need to manually check them before I can delete them.

If the file size is smaller than it should be, it will be resumed. Thus, you shouldn't have to delete anything, and re-adding the blog to the queue should finish all unfinished files. Files that had too many retries, like your video files, shouldn't be added to the blog's database, and the next crawl will continue on them.

> After changing the connections to 4, increasing the timeout, clicking save, exiting and reopening, the settings get changed back to 16 and timeout 120. Is it because I usually overwrite the old files when I update TumblThree? The token and download folder settings are still there when I reopen the program (not reverted to default).

First time I've heard of this, and it has never happened to me. It doesn't make any sense, to be honest. The timeout and parallel connection settings get saved at the same time, and with the same method, as the download folder setting. Thus, the download folder should be back at its default too if the settings weren't successfully saved.

Do you run multiple instances at once? Do you use the portable mode?

TehBotolSosro commented 7 years ago

Thanks for the confirmation.

My portable setting was enabled, and no, I didn't use multiple instances at once. I also get the error that says maybe one referenced blog doesn't exist anymore.

I did a test though: I moved the settings.json away and reconfigured everything, and now every setting is saved and the "referenced blog doesn't exist anymore" error is gone. Strangely, when I restore the old settings.json, it works as it should. I also checked the old settings.json's file permissions: it's not read-only, and they are the same as for the new settings.json.

So I don't know what caused that, but it can be solved by deleting or moving the settings.json, opening the program so it creates another settings.json, and then either configuring it again or just restoring the old settings.json.

thanks

johanneszab commented 7 years ago

> So I don't know what caused that, but it can be solved by deleting or moving the settings.json, opening the program so it creates another settings.json, and then either configuring it again or just restoring the old settings.json.

Weird. The whole settings file is replaced, not just parts of it. So again, it doesn't make any sense to me. What's your Windows version?

kingbode commented 7 years ago

Where is the settings file? I cannot find it. I removed all blogs and added them again, and the same problem happened, plus the preview stopped working! My Windows version is Windows 10 Enterprise; the problem occurs with TumblThree-v1.0.8.18.

I checked an earlier version, 1.0.8.6: it works fine and loads blogs into the queue with no error.

All tests were done with the release app, not the source code.

johanneszab commented 7 years ago

> Where is the settings file? I cannot find it.

C'mon. Why am I writing all this? On the main page (README.md), under Application Usage -> Saved Settings.

kingbode commented 7 years ago

Great, deleting the settings fixed the problem, but the preview is still not working.

johanneszab commented 7 years ago

Not a bug in the application itself. It works here, I haven't touched that code for months.

Did you delete all settings, including the settings.json where the "disable/enable preview" option is stored, and add a new blog, and it still doesn't work?

TehBotolSosro commented 7 years ago

> Weird. The whole settings file is replaced, not just parts of it. So again, it doesn't make any sense to me. What's your Windows version?

Yes, it's weird indeed, as the token and download folder are already set up and are not reverted to default when I save; it just won't accept new settings (as if the file were read-only). Windows 10 Enterprise.

johanneszab commented 7 years ago

> Yes, it's weird indeed, as the token and download folder are already set up and are not reverted to default when I save; it just won't accept new settings (as if the file were read-only). Windows 10 Enterprise.

Hmm... maybe the settings files get mixed up?

If you use the portable mode, could you double check that the checkbox for portable mode is set in the Settings? Otherwise it will save your changes in the AppData settings, but might restore from the portable settings? If there is a settings.json next to the TumblThree.exe it will read the settings from there, but that might not be what you've just set/saved.
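The suspected mix-up can be illustrated with a small sketch. This is Python written purely for illustration, following the hypothesis in the paragraph above; the function names and the directory parameters are hypothetical and do not reflect TumblThree's real code or paths.

```python
from pathlib import Path


def settings_path_for_load(exe_dir: Path, appdata_dir: Path) -> Path:
    """Prefer a settings.json next to the executable, else fall back to AppData."""
    portable = exe_dir / "settings.json"
    return portable if portable.is_file() else appdata_dir / "settings.json"


def settings_path_for_save(exe_dir: Path, appdata_dir: Path, portable_mode: bool) -> Path:
    """Save next to the executable only when the portable-mode checkbox is set."""
    return (exe_dir if portable_mode else appdata_dir) / "settings.json"
```

If a settings.json sits next to the executable while the portable-mode checkbox is off, the two functions return different paths: changes are written to one file, but the next start reads the other, so the new values appear to be lost.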

I'm sure it's not an application issue, but something on your side. You can also edit the settings manually before opening TumblThree. That will work for sure, and you'll then know whether the files were successfully read and/or modified.

johanneszab commented 7 years ago

@falbalus: For the re-download: if you check the "Check directory for files" checkbox, already downloaded files won't be added to the blog's database, and no download/resume/replace occurs if there is already a file with the same file name. If you don't check that checkbox, already downloaded files will be added to the blog's database, but they are also not re-downloaded if their file size is equal to or larger than that of the file that would be downloaded.

Thus, the checkbox is actually redundant. I'm not sure right now why I implemented it that way, but earlier there was no resume in the TumblThree downloader and all files were replaced once they were handed to the downloader. Maybe that's why. So, basically, I can remove the checkbox, since the resume in the downloader has to check for already existing files anyway; otherwise it cannot continue at the right place.

TehBotolSosro commented 7 years ago

@johanneszab I did check the portable mode. Anyway, that problem was solved earlier and I can't reproduce the issue, but I will try to edit the file manually if the problem occurs again.

Regarding the checkbox: for me, with many 0-byte files and corrupted videos, is it better to uncheck "Check directory for files"? I always had that checkbox checked, thinking it was the other way around (that checking the checkbox would check the file size).

Also, is it possible for the program to know if a video/image no longer exists (deleted by Tumblr)? I have a blog where the progress bar only reaches half. I tried several times to delete and re-add it, until I manually checked the links: it turns out the images couldn't be loaded in the browser either, and other files were deleted for copyright reasons.

johanneszab commented 7 years ago

@TehBotolSosro:

> I did check the portable mode. Anyway, that problem was solved earlier and I can't reproduce the issue, but I will try to edit the file manually if the problem occurs again.

While trying to reproduce the error, I noticed that the settings don't update if the queue list cannot be saved. Thus, it probably came from the upgrade to the v1.0.8.18+ version. I actually didn't check that before releasing the newest version. I couldn't even remember that I had deleted my settings, since I do that so frequently during development...

> Regarding the checkbox: for me, with many 0-byte files and corrupted videos, is it better to uncheck "Check directory for files"? I always had that checkbox checked, thinking it was the other way around (that checking the checkbox would check the file size).

Yes, turn it off. If you enable the checkbox, it only checks whether there is already a file with the same name. If there is one, it skips the download.

Whereas if you don't check the checkbox, the downloader starts to download the file, checks whether there is an existing file, and resumes it if its file size is below that of the file to be downloaded. If it's the same or above, it won't download anything but marks that file as downloaded.

Thus, in the next release I'll remove that checkbox since it has no use.
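The difference between the two checkbox states described above can be summarized in a short sketch. It is Python for illustration only; `handle_file` and its return values are hypothetical names, not part of TumblThree.

```python
from typing import Optional


def handle_file(check_directory: bool, local_size: Optional[int], remote_size: int) -> str:
    """Model the two behaviours; local_size is None if no file with that name exists."""
    if check_directory:
        # Checkbox on: name-only check, so even a 0-byte file is skipped.
        return "skip" if local_size is not None else "download"
    # Checkbox off: the downloader compares file sizes.
    if local_size is None:
        return "download"
    if local_size < remote_size:
        return "resume"
    return "mark_downloaded"
```

For a 0-byte leftover of a 100-byte file, `handle_file(True, 0, 100)` gives `"skip"` while `handle_file(False, 0, 100)` gives `"resume"`, which is why unchecking the box helps with truncated downloads.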

TehBotolSosro commented 7 years ago

Thank you for confirming it; I will uncheck "Check directory for files".

And yes, I always update by overwriting the files. Glad you found the issue; I will delete the Queuelist.json next time I update.

thanks again

keokitsune commented 7 years ago

While I would like an option to reset blogs (for specific file types anyway, like just video, so I can redownload corrupt files without having to redownload all gif/jpg etc.), the fact that the app keeps track of what has and hasn’t already been downloaded based on its index file and not on the contents of the folder is very important to me; it's a feature I wish ripme had. I use this program as an archiver for my favorite blogs, but there’s a lot of overlap in the content. The way the program works now allows me to download entire blogs and have a duplicate-finding program scan and remove all redundant content, which would otherwise bloat my HDD. After scanning, I don’t have to worry about it redownloading the just-removed content each time I crawl for new content… like ripme does.