hydrusvideodeduplicator / hydrus-video-deduplicator

Video Deduplicator for the Hydrus Network
https://hydrusvideodeduplicator.github.io/hydrus-video-deduplicator/
MIT License
42 stars 7 forks source link

Use a more efficient datastructure for searching for duplicates #28

Open ianwal opened 1 year ago

ianwal commented 1 year ago

Current searching requires comparing every video at least once, it takes n*(n-1)/2 comparisons.

There exists more appropriate data structures for searching for similar objects.


Vantage-point tree

A vantage-point tree would require rebuilding the tree when videos are added, O(n * log(n)), but searching would be greatly improved.

The distance function for the VP tree would be the similarity of a video. Currently, that is a number [0,100].

It could also be possible to store the individual PDQHashes in a VP tree, but the tree would become extremely large with large videos, and it would require storing a table for pdqhash:video. It would likely reduce search time though because it would avoid calculating the similarity of videos while traversing the tree.


Another potential optimization is bucketing videos by their video duration. There's not really a point in comparing videos that are significantly different lengths since they are not likely to be duplicates. Of course, if a video is 1 second vs 2 second that is much different than 1 hour vs 2 hours, so they can't be compared just by their ratios.

KennethSamael commented 1 month ago

There's not really a point in comparing videos that are significantly different lengths since they are not likely to be duplicates.

I feel it's worth mentioning that I have gotten useful duplicates from videos of significantly different durations; such as when the shorter video is a single scene clipped out of a longer video. Might not be relevant for everyone, so it could still be useful to have the option to only look for dupes among videos with similar duration, but it's definitely not pointless to compare videos of different durations.