theophanemayaud / video-simili-duplicate-cleaner

https://theophanemayaud.github.io/video-simili-duplicate-cleaner/
GNU General Public License v3.0

Option to compare only cached videos #89

Closed: theophanemayaud closed this issue 1 year ago

theophanemayaud commented 1 year ago
  > So if I understand correctly - I need to rescan anyway? The way I've imagined it was that I can somehow resume reviewing what has been already added to db

The cache holds video thumbnails and metadata. When scanning again, the app checks which videos are on disk, and when it finds a match in the cache it skips thumbnail and metadata retrieval, which speeds up the scan considerably. But it still scans. I'm writing down the idea you're implying: using only cached data. That would indeed be great for a very large number of videos. There are a few hurdles because of the current implementation: if only cached data is used, the thumbnail type could no longer be changed, only the cached thumbnails would be kept. Otherwise it could simply be an "Only use cached data" switch that skips the scan phase and goes directly to the comparison phase. Thanks for the great idea! It'll also help my own workflows a lot: I don't have 1 TB, but still more than 200 GB, which takes ~10 minutes to rescan even when cached. Without the cache it's more like 30 minutes to an hour, and your approach could reduce that even further.

Originally posted by @theophanemayaud in https://github.com/theophanemayaud/video-simili-duplicate-cleaner/discussions/80#discussioncomment-3750178
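To make the idea concrete, here is a rough control-flow sketch of what such a switch could look like. All identifiers (`findVideos`, `loadCachedVideos`, `scanFolder`, `startComparison`, `Video`, `useCachedOnly`) are illustrative assumptions, not the project's actual code, and the helpers are only declared, not implemented.

```cpp
// Sketch of the proposed "Only use cached data" switch.
// All names below are illustrative, not the project's actual identifiers.
#include <QString>
#include <QVector>

class Video;                                                  // app's video item
QVector<Video *> loadCachedVideos(const QString &folder);     // cache-only load
QVector<Video *> scanFolder(const QString &folder);           // current scan path
void startComparison(const QVector<Video *> &videos);         // existing phase

void findVideos(const QString &folder, bool useCachedOnly)
{
    QVector<Video *> videos;

    if (useCachedOnly) {
        // Skip the filesystem scan entirely: take every cached entry whose
        // path lies under the selected folder, thumbnails included. The
        // thumbnail type cannot be changed here; only the cached one is kept.
        videos = loadCachedVideos(folder);
    } else {
        // Current behaviour: walk the folder; cache hits skip thumbnail and
        // metadata extraction, but every file on disk is still visited.
        videos = scanFolder(folder);
    }

    startComparison(videos);    // jump straight to the comparison phase
}
```

The key point is that the cached-only branch never touches the filesystem, so its cost depends only on the number of cached entries, not on the size of the folder.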

theophanemayaud commented 1 year ago

Ideas:

Conclusion: There's a performance question: which is faster, having the OS enumerate the videos and matching them against the cache, or selecting entries from the database by path and loading them directly? I think the latter is faster; a DB query with the LIKE operator seems feasible and fast. Also, the cache is mostly used within the confines of a specific folder, not for storing data over a very long time. People should delete the cache from time to time and mainly use it to resume comparisons of a large set of videos in a given folder. In that situation there would be many videos in the folder, with only a subset successfully cached, so the second approach should be faster in general.
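As a sketch of the DB-selection approach, assuming the cache lives in an SQLite database reachable through Qt's default connection, a single LIKE query keyed on the folder prefix could return every cached entry under that folder in one pass. The `metadata` table and its `path`, `duration` and `thumbnail` columns are assumptions for illustration, not the project's actual schema; `CachedVideo` is a minimal stand-in for the app's Video class.

```cpp
// Illustrative sketch only: the "metadata" table and its columns are
// assumptions, not the project's actual cache schema.
#include <QByteArray>
#include <QSqlDatabase>
#include <QSqlQuery>
#include <QString>
#include <QVector>

struct CachedVideo {            // minimal stand-in for the app's Video class
    QString    path;
    qint64     duration = 0;
    QByteArray thumbnail;
};

QVector<CachedVideo> loadCachedVideos(const QString &folder)
{
    QVector<CachedVideo> videos;

    // Select every cached entry whose path starts with the chosen folder.
    QSqlQuery query(QSqlDatabase::database());
    query.prepare(QStringLiteral(
        "SELECT path, duration, thumbnail FROM metadata WHERE path LIKE :prefix"));
    query.bindValue(QStringLiteral(":prefix"), folder + QLatin1Char('%'));
    if (!query.exec())
        return videos;          // empty result on DB error

    while (query.next()) {
        CachedVideo v;
        v.path      = query.value(0).toString();
        v.duration  = query.value(1).toLongLong();
        v.thumbnail = query.value(2).toByteArray();
        videos.append(v);       // only successfully cached videos end up here
    }
    return videos;
}
```

In SQLite a prefix LIKE like this can use an index on `path` under the right collation settings, so it should stay fast even for a large cache.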

Steps:

theophanemayaud commented 1 year ago

closed by d922af2db584195f8e4a05295b72e3361c61e186