Use a more efficient datastructure for searching for duplicates

Current searching requires comparing every video at least once, it takes n*(n-1)/2 comparisons.

There exists more appropriate data structures for searching for similar objects.

Vantage-point tree

A vantage-point tree would require rebuilding the tree when videos are added, O(n * log(n)), but searching would be greatly improved.

The distance function for the VP tree would be the similarity of a video. Currently, that is a number [0,100].

It could also be possible to store the individual PDQHashes in a VP tree, but the tree would become extremely large with large videos, and it would require storing a table for pdqhash:video. It would likely reduce search time though because it would avoid calculating the similarity of videos while traversing the tree.

Another potential optimization is bucketing videos by their video duration. There's not really a point in comparing videos that are significantly different lengths since they are not likely to be duplicates. Of course, if a video is 1 second vs 2 second that is much different than 1 hour vs 2 hours, so they can't be compared just by their ratios.

hydrusvideodeduplicator / hydrus-video-deduplicator

Use a more efficient datastructure for searching for duplicates #28

Vantage-point tree