johanneszab / TumblThree

A Tumblr Blog Backup Application
https://www.jzab.de/content/tumblthree
MIT License

[Feature Request] Avoiding Duplicates with Global Database #151

Open apoapostolov opened 6 years ago

apoapostolov commented 6 years ago

Hello,

I am currently using another GitHub downloader project (I won't name it to avoid a conflict of interest; it is also almost abandoned at this point) that has a very important feature: it first spends time indexing all downloaded content, then downloads only unique files, and when it finds a duplicate it creates a hardlink instead.

I don't know whether this app already has this feature (please excuse me if the answer is yes), but if not, it would be a huge boon for people who download tons of Tumblr sites that share similar content.

I am currently caching ~1000 Tumblr blogs on a 4 TB drive. I would consider switching to your app if this feature were available, and would pay for it if needed.

Thanks.

EC-O-DE commented 6 years ago

Good idea. +1.

I'd also like to see the app scan local content (images, videos) and then query Tumblr.com for the corresponding _raw and/or _1280 files.

That is, the feature would scan local content and then try to locate raw or HD versions on Tumblr.com.

apoapostolov commented 6 years ago

I would donate to the team if your idea and mine were considered for development.

I have a huge amount (2.5 TB) of content downloaded with another Tumblr downloader. If I could import it into TumblThree, re-download the files from 1280 to raw, and wipe duplicates across all blogs, then I could move everything from one drive to a bigger 8 TB drive without re-downloading the whole insane structure again. That would totally get my money.

Just saying.

EC-O-DE commented 6 years ago

That would be so good!

Hmm, I wonder if the Google Reverse Image Search and Yandex Reverse Image Search APIs could be featured... Hmm. Perhaps another project :)

apoapostolov commented 6 years ago

So to summarize:

johanneszab commented 6 years ago

The way things work right now is that each blog has a simple database (just a class stored in JSON format) that contains all the image/video/audio filenames, plus the post IDs for text posts, once they have been downloaded.

When TumblThree downloads a post for a blog, it first checks this database. If the post is already there, it skips the download; otherwise it requests the size of the file to download from the webserver. It then checks whether a file with that name already exists and whether it can be continued. If an image/video/audio file is in the blog's download folder but not in the database and nothing can be resumed, TumblThree simply adds the file to its database as completed. If the file can be resumed because the download was stopped, it is resumed, then acknowledged as completed and stored in the database. If there is no file at all, it is downloaded in full. This allows resuming blogs from different downloaders or previous versions without re-downloading already completed files (as long as they have not been renamed), and it also saves some directory lookups, which can be costly if a blog contains several hundred thousand files. It also means that after downloading a file you can safely move it to a different location and it won't be downloaded again, since it is registered in the database.
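
A minimal sketch of that per-blog check-and-resume flow might look like the following; the class and helper names (BlogDownloader, ProcessFileAsync, etc.) are placeholders for illustration, not TumblThree's actual code:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;

// Rough sketch of the per-blog check/resume flow described above.
// All names here are illustrative placeholders, not TumblThree's actual classes.
public class BlogDownloader
{
    private readonly HashSet<string> _blogDatabase =            // per-blog "database":
        new HashSet<string>(StringComparer.OrdinalIgnoreCase);  // filenames already completed
    private readonly HttpClient _http = new HttpClient();

    public async Task ProcessFileAsync(string url, string fileName, string downloadFolder)
    {
        // 1. Already registered in the blog's database? Skip the download entirely.
        if (_blogDatabase.Contains(fileName))
            return;

        string localPath = Path.Combine(downloadFolder, fileName);
        long remoteSize = await GetRemoteFileSizeAsync(url);

        if (File.Exists(localPath))
        {
            long localSize = new FileInfo(localPath).Length;
            if (localSize >= remoteSize)
            {
                // File exists (e.g. from another downloader) but was never registered:
                // acknowledge it as completed without touching the network again.
                _blogDatabase.Add(fileName);
                return;
            }
            // Partial file: resume from the current offset.
            await DownloadAsync(url, localPath, offset: localSize);
        }
        else
        {
            // No file at all: download it completely.
            await DownloadAsync(url, localPath, offset: 0);
        }

        _blogDatabase.Add(fileName);   // register as completed
    }

    private async Task<long> GetRemoteFileSizeAsync(string url)
    {
        using var request = new HttpRequestMessage(HttpMethod.Head, url);
        using var response = await _http.SendAsync(request);
        return response.Content.Headers.ContentLength ?? 0;
    }

    private async Task DownloadAsync(string url, string localPath, long offset)
    {
        using var request = new HttpRequestMessage(HttpMethod.Get, url);
        if (offset > 0)
            request.Headers.Range = new System.Net.Http.Headers.RangeHeaderValue(offset, null);

        using var response = await _http.SendAsync(request, HttpCompletionOption.ResponseHeadersRead);
        using var remote = await response.Content.ReadAsStreamAsync();
        using var local = new FileStream(localPath, offset > 0 ? FileMode.Append : FileMode.Create);
        await remote.CopyToAsync(local);
    }
}
```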

In principle we could either check all blogs' databases, or add a global database. The global database has the advantage that you could remove blogs from TumblThree and their files would still not be downloaded again. Maybe we should use a real database instead of a simple class held in memory, I don't know; some performance testing would be necessary, since that database is likely to contain several million entries. And if we start the global database, we could also store MD5/SHA hashes for each file and compare those in addition to the filenames. Comparing hashes requires downloading at least a portion of the file, however, so it is heavier on the bandwidth. When I started TumblThree I actually wanted to save bandwidth, never disk space, since disk space is cheap. But after using tumblr more myself, I see how easily several thousand duplicate files can accumulate.
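
For illustration only, a global database along those lines could start as something like the following; the type and property names are assumptions, and a real database engine could replace the in-memory dictionary if performance requires it:

```csharp
using System;
using System.Collections.Generic;

// Purely illustrative data model for the global database idea; the type and
// property names are assumptions, not TumblThree's actual code.
public class GlobalFileEntry
{
    public string FileName { get; set; }
    public string BlogName { get; set; }
    public string Sha256 { get; set; }   // optional content hash, null until computed
}

public class GlobalDatabase
{
    // Keyed by filename for a cheap pre-download check. A second dictionary keyed
    // by hash could be added later to catch files tumblr serves under different names.
    private readonly Dictionary<string, GlobalFileEntry> _byFileName =
        new Dictionary<string, GlobalFileEntry>(StringComparer.OrdinalIgnoreCase);

    public bool Contains(string fileName) => _byFileName.ContainsKey(fileName);

    public void Add(GlobalFileEntry entry) => _byFileName[entry.FileName] = entry;
}
```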

johanneszab commented 6 years ago

Oh, I forgot to mention: I'm not a huge fan of a (hard) link approach. I'm a Linux user myself, and you encounter links more frequently on a POSIX OS; once you move anything out of your file structure, you end up with a mess. Since we're talking about several hundred thousand files (and possibly links), I honestly don't think this is the right thing to do. People move things when space runs out, hard disk capacities keep getting larger, and so on, so this will surely result in problems. I'm almost certain about it.

johanneszab commented 6 years ago

Overall it's a nice idea and has been requested several times already. It should be more or less straightforward to add, since the code is rather clean and I can already imagine where it could easily be extended to add this feature.

apoapostolov commented 6 years ago

Hello,

First, I apologize for cross-posting a week ago; I was just asking whether the database you were talking about was the same one that could facilitate my request. I was out of bounds and may have come across as pushy.

As for your explanation, it makes sense to add a global database and make "download duplicates" an option, on or off by default depending on how you feel about it. Alternatively, you could build the global database on each run by collecting the local databases.

Each approach has its pros. A global database is easier to read and write, I guess, but it is monolithic, and folders can't be moved in and out easily. A runtime-generated database takes a while to build (e.g. reading 1000+ text files, as in my case) but allows folders and indexes to be moved around, added, and removed from storage easily. I am in favor of the second approach.
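
As a rough sketch of that second approach (the "*.json" file pattern and the "DownloadedFiles" property are invented here, since the actual per-blog index format may differ), merging the local databases into one in-memory set at startup could look like this:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.Json;

// Illustrative sketch of building a global index at startup by merging the
// per-blog JSON databases. The file pattern and JSON shape are assumptions,
// not TumblThree's actual on-disk format.
public static class RuntimeGlobalIndex
{
    public static HashSet<string> Build(string indexFolder)
    {
        var allFileNames = new HashSet<string>(StringComparer.OrdinalIgnoreCase);

        foreach (string indexFile in Directory.EnumerateFiles(indexFolder, "*.json"))
        {
            using JsonDocument doc = JsonDocument.Parse(File.ReadAllText(indexFile));

            // Assume each per-blog index exposes an array of downloaded filenames.
            if (doc.RootElement.TryGetProperty("DownloadedFiles", out JsonElement files))
            {
                foreach (JsonElement file in files.EnumerateArray())
                {
                    // The HashSet silently drops filenames that several blogs share,
                    // so the merged index contains each name only once.
                    allFileNames.Add(file.GetString() ?? string.Empty);
                }
            }
        }

        return allFileNames;
    }
}
```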

Hardlinks are what the "other" Tumblr downloader I currently use does, and they have positives and negatives. A positive is that when the user opens a folder, he sees all the images there even though they don't take up extra space. The negatives are plenty, from bloating the archive when moving files from one storage location to another (hardlinks, I think, get converted to regular files) to slowing Explorer down when reading a folder with 10,000+ files.

If you don't feel like adding hardlinks, that's totally OK, as long as we can avoid duplicates.

Taranchuk commented 6 years ago

I like the idea of a global database; it would be very good if the program did not download files with the same filename when they have already been downloaded from other blogs.

If the global database is implemented in the program, I would also like to see the following two options:

1) An option to convert the existing index files into the database. It is better to start from a ready database than to start all over again. Also, since index files from different blogs can contain two or more files with the same filename, the database should keep only one of them to reduce its own size.

2) If it becomes possible to check files by hash sum and skip duplicates even when they have different filenames, it would be desirable to have an option to disable this function for users who do not need it. In my case, I would just like to enable filename-based duplicate checking and nothing more. Why do I need this? Firstly, for better performance. Secondly, some files have the post captions and tags that I add to their filenames and some do not, and in that case I would like to decide myself whether to keep them on disk or delete them, using a duplicate image search program with the necessary filters (for example, a filter for deleting files that have no space in their filenames, or a filter for deleting the files with the shortest filenames among the duplicates).
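
A hypothetical sketch of what such toggles could look like as settings; the property names are invented for illustration, not TumblThree's actual configuration:

```csharp
// Hypothetical settings sketch for the two options described above.
public class DuplicateDetectionSettings
{
    // Option 1: build/extend the global database from the existing per-blog index files.
    public bool ImportExistingIndexFiles { get; set; } = true;

    // Filename-based duplicate check across blogs (cheap, works before download).
    public bool CheckFileNamesAcrossBlogs { get; set; } = true;

    // Option 2: additionally compare content hashes. Catches renamed duplicates,
    // but the file has to be downloaded before it can be hashed, so it can be disabled.
    public bool CheckFileHashes { get; set; } = false;
}
```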

keokitsune commented 6 years ago

This would make things on my end a billion times better. I have over 2k blogs and an ungodly amount of duplicates; my duplicate file finder is running almost constantly just to remove them all.

johanneszab commented 6 years ago

So, after a few weeks: does comparing the filenames across all loaded Tumblr blogs actually help, are there still plenty of duplicates with different names, or does it work at all?

keokitsune commented 6 years ago

There are still a few duplicates that slip by, but it's a lot better. I do still have to delete the blog info in the index and re-add the blogs every time I close the program, though. I've tried leaving the program on overnight to load everything, but it still locks up when I download a blog.


johanneszab commented 6 years ago

Maybe it just seems to lock up, but it's actually doing comparisons. Have you enabled the "force recheck" option?

With the global database, you lose some parallelism, since I'm using a global lock for the file checking. Otherwise two different tasks (blogs) could happen to download the same file at the same moment. Additionally, if you crawl a blog with a lot of posts that have already been downloaded, TumblThree will appear to sit there doing nothing most of the time, because the progress updates in the queue only display actual downloads, not skipped files.
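
For illustration, the check-then-register step behind such a global lock could be as small as this sketch; the class and member names are invented, not the actual implementation:

```csharp
using System;
using System.Collections.Generic;

// Sketch of the serialized check-then-register step behind a global lock.
public static class GlobalFileRegistry
{
    private static readonly object GlobalDbLock = new object();
    private static readonly HashSet<string> KnownFileNames =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase);

    // Returns true if the caller may download the file; false if another blog
    // task has already downloaded (or claimed) a file with the same name.
    public static bool TryClaimFile(string fileName)
    {
        lock (GlobalDbLock)
        {
            return KnownFileNames.Add(fileName);
        }
    }
}
```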

The crawler always adds every post from the whole blog; it doesn't do any preliminary existence checking. So if the blog has 250,000 posts and you've already downloaded 240,000, you'll have to wait until the downloader has compared those 240,000 already-downloaded posts against everything else in TumblThree. If you disable the "force recheck" option, the crawler stops at the first post (based on the post ID) that was already downloaded. Since it starts with the newest posts, only newly published posts end up in the comparison queue. That's what you want in this case.

Either way, adding a "skipping already downloaded file .." progress update is only a few lines of code. I'll add that. Thanks!

johanneszab commented 6 years ago

OK, I've now used TumblThree myself for two weeks, and there are still duplicates with different tumblr hashes. That's unfortunate, because the current implementation can only filter duplicates before the actual download; since the tumblr hash can be different for the same file, we need to hash the files ourselves if we want to detect these duplicates. That's not a big issue: hashing is a common task and already implemented in .NET, so it's probably just a couple of lines. We can then add the hashes to the per-blog database and/or generate a global database with just the hashes. The downside, as already mentioned, is that we have to download the duplicate file anyway, generate the hash, and then delete the copy.
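
As a sketch of that hashing step, using the SHA256 class that ships with .NET (the surrounding class and parameter names are invented for illustration):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

// Sketch of the post-download hash check described above; not TumblThree's actual code.
public static class DuplicateChecker
{
    // Returns true (and deletes the file) if its content hash was already seen.
    public static bool RemoveIfDuplicate(string filePath, HashSet<string> knownHashes)
    {
        string hash;
        using (var sha = SHA256.Create())
        using (var stream = File.OpenRead(filePath))
        {
            hash = BitConverter.ToString(sha.ComputeHash(stream)).Replace("-", "");
        }

        if (!knownHashes.Add(hash))
        {
            // Same content already exists under a different tumblr hash/filename:
            // drop the freshly downloaded copy.
            File.Delete(filePath);
            return true;
        }

        return false;
    }
}
```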

I'll try to implement this once I have some spare time, probably sometime in March.