theophanemayaud / video-simili-duplicate-cleaner

https://theophanemayaud.github.io/video-simili-duplicate-cleaner/
GNU General Public License v3.0

Option to compare only cached videos #89

Closed: theophanemayaud closed this issue 1 year ago

theophanemayaud commented 1 year ago
  > So if I understand correctly - I need to rescan anyway? The way I've imagined it was that I can somehow resume reviewing what has been already added to db

The cache holds video thumbnails and metadata. When scanning again, the app checks which videos are on disk, and when it finds a match in the cache it skips thumbnail and metadata retrieval, which speeds up the scan considerably. But it still scans. I'm writing down the idea you're implying: using only cached data. That would indeed be great for a very large number of videos. There are a few hurdles because of the current implementation: if only cached data is used, the thumbnail type could no longer be changed, only the cached thumbnails would be kept. Otherwise it could simply be an "Only use cached data" switch that skips the scan phase and goes directly to the comparison phase. Thanks for the great idea! It'll also help my own workflows a lot: I don't have 1 TB, but still more than 200 GB, which takes ~10 minutes to rescan even when cached. Without the cache it's more like 30 minutes to an hour, and your approach could reduce that even further.

Originally posted by @theophanemayaud in https://github.com/theophanemayaud/video-simili-duplicate-cleaner/discussions/80#discussioncomment-3750178
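To make the idea concrete, here is a rough control-flow sketch of what such a switch could look like. All identifiers (`findVideos`, `loadCachedVideos`, `scanFolder`, `startComparison`, `Video`, `useCachedOnly`) are illustrative assumptions, not the project's actual code, and the helpers are only declared, not implemented.

```cpp
// Sketch of the proposed "Only use cached data" switch.
// All names below are illustrative, not the project's actual identifiers.
#include <QString>
#include <QVector>

class Video;                                                  // app's video item
QVector<Video *> loadCachedVideos(const QString &folder);     // cache-only load
QVector<Video *> scanFolder(const QString &folder);           // current scan path
void startComparison(const QVector<Video *> &videos);         // existing phase

void findVideos(const QString &folder, bool useCachedOnly)
{
    QVector<Video *> videos;

    if (useCachedOnly) {
        // Skip the filesystem scan entirely: take every cached entry whose
        // path lies under the selected folder, thumbnails included. The
        // thumbnail type cannot be changed here; only the cached one is kept.
        videos = loadCachedVideos(folder);
    } else {
        // Current behaviour: walk the folder; cache hits skip thumbnail and
        // metadata extraction, but every file on disk is still visited.
        videos = scanFolder(folder);
    }

    startComparison(videos);    // jump straight to the comparison phase
}
```

The key point is that the cached-only branch never touches the filesystem, so its cost depends only on the number of cached entries, not on the size of the folder.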

theophanemayaud commented 1 year ago

Ideas:

Conclusion: There's a performance question: which is faster, having the OS enumerate the videos and matching them against the cache, or selecting entries from the database by path and loading them directly? I think the latter is faster; a DB query with the LIKE operator seems feasible and fast. Also, the cache is mostly used within the confines of a specific folder, not for storing data over a very long time. People should delete the cache from time to time and mainly use it to resume comparisons of a large set of videos in a given folder. In that situation there would be many videos in the folder, with only a subset successfully cached, so the second approach should be faster in general.
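As a sketch of the DB-selection approach, assuming the cache lives in an SQLite database reachable through Qt's default connection, a single LIKE query keyed on the folder prefix could return every cached entry under that folder in one pass. The `metadata` table and its `path`, `duration` and `thumbnail` columns are assumptions for illustration, not the project's actual schema; `CachedVideo` is a minimal stand-in for the app's Video class.

```cpp
// Illustrative sketch only: the "metadata" table and its columns are
// assumptions, not the project's actual cache schema.
#include <QByteArray>
#include <QSqlDatabase>
#include <QSqlQuery>
#include <QString>
#include <QVector>

struct CachedVideo {            // minimal stand-in for the app's Video class
    QString    path;
    qint64     duration = 0;
    QByteArray thumbnail;
};

QVector<CachedVideo> loadCachedVideos(const QString &folder)
{
    QVector<CachedVideo> videos;

    // Select every cached entry whose path starts with the chosen folder.
    QSqlQuery query(QSqlDatabase::database());
    query.prepare(QStringLiteral(
        "SELECT path, duration, thumbnail FROM metadata WHERE path LIKE :prefix"));
    query.bindValue(QStringLiteral(":prefix"), folder + QLatin1Char('%'));
    if (!query.exec())
        return videos;          // empty result on DB error

    while (query.next()) {
        CachedVideo v;
        v.path      = query.value(0).toString();
        v.duration  = query.value(1).toLongLong();
        v.thumbnail = query.value(2).toByteArray();
        videos.append(v);       // only successfully cached videos end up here
    }
    return videos;
}
```

In SQLite a prefix LIKE like this can use an index on `path` under the right collation settings, so it should stay fast even for a large cache.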

Steps:

theophanemayaud commented 1 year ago

closed by d922af2db584195f8e4a05295b72e3361c61e186