hydrusvideodeduplicator / hydrus-video-deduplicator

Video Deduplicator for the Hydrus Network
https://hydrusvideodeduplicator.github.io/hydrus-video-deduplicator/

Bug: Not All Videos Always Checked For Duplicates #56

Closed prof-m closed 2 months ago

prof-m commented 2 months ago

Bug Description

HVD (at least sometimes) does not compare all files in its database against one another.

Steps to reproduce

  1. Check out commit 7e5e68feaeb73ef8d7086e6cf067b86dbc66a80d locally (it's based off the latest main branch)
  2. Set up some collection of files for HVD to process - minimum 3 files. (I did this by setting up a new local file domain on my hydrus client, adding a few files to it, and passing the file domain service key to HVD, but there are lots of options)
  3. Ensure a fresh SQLite DB by either pointing HVD at a new folder or deleting the videohashes.sqlite file it's currently pointing at.
  4. Run HVD with at least the following options: python3 -m hydrusvideodeduplicator --clear-search-cache --verbose --job-count=1
  5. Observe the log output (I like to copy it into a .txt file)

Expected Results:

Every file in the database gets compared against every other file, and every genuine duplicate pair ends up flagged as a potential duplicate in hydrus.

Actual Results:

Some pairs of files are never compared against one another at all, so genuine duplicates between them go undetected (see the full write-up below for a concrete 15-file example).

Full Story + More Explaining + Theorizing + A Janky Fix

Hey hey hey, guess who's back again? answer: me, with another fun bug 💯

Fair warning, I'm throwing this one over the fence early in the morning, so it's not gonna be up to the reporting standards I usually like to keep. Apologies in advance, but happy to answer any questions/come back and clean it up later if you need.

Okay, so - basically, I still run the deduper semi-regularly, but there were times where I got the nagging feeling that it wasn't always spotting dupes that it should be. It was hard to pin down, though, until recently, when I had a set of 15 files, 7 of which were "originals" and the other 8 of which were the exact same - just either optimized gifs or mp4s made from gifs. Each pair (and the one triplet) had the same number of frames, same resolution, etc etc - just different file formats. So, I ran the deduper, hoping it would give me 9 potential duplicates in hydrus (6 for the 6 pairs, and another 3 for the triplet, one for each of its possible pairings).

Spoilers: it did not give me 9 potential duplicates - instead, it gave me 5.

Initially, I thought it had something to do with the pairs that weren't showing up as potential duplicates. Was there something weird about those files? So I deleted my database and started fresh with only one pair of files (let's call them Pair A) getting fetched, hashed, and compared, so that I could see what was going wonky there. Except, without changing anything in the code, suddenly HVD was reporting Pair A as dupes to Hydrus! Basically: with all 15 files in the database, Pair A never got flagged, but with only Pair A in a fresh database, it did.

I ran this experiment a few times to be sure, but the results were always the same. So, what gives?

I started looking at the logs more carefully. I had the --debug flag turned on, so I was getting a nice log message for every file comparison that looked like:

23:59:59 - hvd: Similar 0.0%: "{hashA}" and "{hashB}"

Which was great! So I searched the logs for the two hashes associated with my Pair A files. Both hashes showed up in various comparisons in the log... but never in a comparison against one another. Somehow, my Pair A files were never actually getting compared.

I found this hard to believe, so I did two things:

  1. I set the --job-count parameter to 1, which effectively turns off parallelization. Just to make things easier to follow.
  2. I added more logs. A lot more logs.

In particular, I wanted to see just what set of hashes was being compared against each hash on its journey through comparison land. In this snippet:

    parallel(
        delayed(self.compare_videos)(
            video1_hash,
            video2_hash,
            row["perceptual_hash"],
            hashdb[video2_hash]["perceptual_hash"],
        )
        for video2_hash in islice(hashdb_simple_copy, row["farthest_search_index"], None)
    )

I wanted to know just what results islice(hashdb_simple_copy, row["farthest_search_index"], None) was returning.
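For reference, islice(iterable, start, None) simply skips the first start items of the iterable and yields everything after them, so what comes back depends entirely on the iteration order of hashdb_simple_copy at the moment the call is made. A tiny self-contained example (made-up hashes, nothing HVD-specific):

    from itertools import islice

    hashes = ["04f86", "a4452", "bb792", "b86e3"]

    # Skip the first 2 entries, yield the rest.
    print(list(islice(hashes, 2, None)))   # -> ['bb792', 'b86e3']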

The answer was... pretty interesting!

Here's a log snippet from the first video hash, 6c251..., to be compared against all the others:

00:20:14 - hvd: *** Processing video1_hash 6c251...: original fsi is not set, current fsi is 1, while total hashdb length is 15
 00:20:14 - hvd: *** Processing video1_hash 6c251...: list of hashes to be searched should be:
    1: 04f86...
    2: a4452...
    3: bb792...
    4: b86e3...
    5: e9649...
    6: f2767...
    7: d20e1...
    8: 46834...
    9: 53efa...
    10: d5715...
    11: e8d3a...
    12: f9917...
    13: 0ac47...
    14: 5d1d6...

Overall, it checks out - first file of 15, it's got 14 files to compare against, and all 14 files are there as expected. Hunky dory, everything's cool.

Then we move onto the second video hash to be processed, 04f86..., and its log snippet looks like this:

00:20:14 - hvd: *** Processing video1_hash 04f86...: original fsi is not set, current fsi is 2, while total hashdb length is 15
 00:20:14 - hvd: *** Processing video1_hash 04f86...: list of hashes to be searched should be:
    1: bb792...
    2: b86e3...
    3: e9649...
    4: f2767...
    5: d20e1...
    6: 46834...
    7: 53efa...
    8: d5715...
    9: e8d3a...
    10: f9917...
    11: 0ac47...
    12: 5d1d6...
    13: 6c251...

Okay, 13 files, sure, checks out, seems good, but hold on - what's 6c251..., the file we previously just processed, doing on that list? And on the bottom of that list, no less? And while we're at it, a4452... was number 2 on the first list, so why is it nowhere to be seen on the second list, when it should be number 1 now?

TBH, I don't 100% know the answer to those questions, but I do have a theory.

I think that a SqliteDict, like the one we use in HVD, is constantly re-arranging itself whenever data gets updated. Specifically, I think that, if you call something like:

row = hashdb[video1_hash]
# ...
row["farthest_search_index"] = total
hashdb[video1_hash] = row

That, within hashdb, the original entry for video1_hash is going to get popped, and then the new entry for video1_hash is going to get appended to the "bottom" of its internal array. So, basically, after every video finishes its comparison loop, it gets popped from the top of the "stack", everything below it "moves up" one index, and then it gets appended to the bottom of the "stack".
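I haven't dug into sqlitedict's internals to confirm this, but the behaviour is easy to probe directly. Here's a throwaway script (assuming the sqlitedict package and a scratch DB file, nothing HVD-specific) that just prints the key order before and after rewriting one entry:

    from sqlitedict import SqliteDict

    # Scratch DB purely for observing iteration order; delete the file afterwards.
    with SqliteDict("./reorder_probe.sqlite", autocommit=True) as db:
        for key in ("alpha", "beta", "gamma"):
            db[key] = {"farthest_search_index": None}

        print(list(db.keys()))      # presumably ['alpha', 'beta', 'gamma']

        row = db["alpha"]
        row["farthest_search_index"] = 3
        db["alpha"] = row           # rewrite the first entry in place

        print(list(db.keys()))      # does 'alpha' now come last?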

That means that, when HVD grabs the islice starting at effectively "index+1", what it actually gets back skips the video now sitting at the top of the "stack" (which it should still be comparing against), while accidentally re-including the file it already finished comparing.
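You can reproduce the effect without sqlite at all - a plain list standing in for the (suspected) reordering behaviour shows exactly the pattern from the logs above:

    from itertools import islice

    # Plain-list simulation of the suspected behaviour; the real code iterates
    # a SqliteDict, and the hashes here are just the truncated ones from the logs.
    keys = ["6c251", "04f86", "a4452", "bb792"]

    # After 6c251 finishes its comparison loop, its row gets rewritten, which
    # (per the theory) moves it to the end of the iteration order.
    keys.remove("6c251")
    keys.append("6c251")        # keys is now ['04f86', 'a4452', 'bb792', '6c251']

    # 04f86 gets processed next with farthest_search_index = 2 ...
    print(list(islice(keys, 2, None)))   # -> ['bb792', '6c251']
    # ... so a4452 gets skipped, and the already-finished 6c251 sneaks back in.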

To test this theory, I implemented a really simple, naive fix - make a simple ordered copy of all the video hashes in hashdb before we ever start modifying anything, and use that when doing the islice() operations (rough sketch below). And lo and behold, it works! When I ran with that fix, I got all 9 potential duplicates I expected out of my set of 15 files, and the log files looked like you'd expect them to.
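For the record, the shape of the fix is roughly this (a minimal sketch with a plain dict standing in for the SqliteDict; the names and row layout here are illustrative, not the exact ones in HVD):

    from itertools import islice

    # Toy stand-in for the SqliteDict of perceptual hashes.
    hashdb = {
        "6c251": {"perceptual_hash": "p0", "farthest_search_index": None},
        "04f86": {"perceptual_hash": "p1", "farthest_search_index": None},
        "a4452": {"perceptual_hash": "p2", "farthest_search_index": None},
    }

    def compare(hash_a, hash_b):
        print(f"comparing {hash_a} and {hash_b}")

    # Take one ordered snapshot of the keys before any row is rewritten,
    # and only ever slice that snapshot.
    all_hashes = list(hashdb)

    for index, video1_hash in enumerate(all_hashes):
        for video2_hash in islice(all_hashes, index + 1, None):
            compare(video1_hash, video2_hash)

        # Rewriting the row can no longer disturb the order we iterate over.
        row = hashdb[video1_hash]
        row["farthest_search_index"] = len(all_hashes)
        hashdb[video1_hash] = row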

Now, I doubt I have to tell you this, but just to be clear, this PR is not meant to be merged as is! It's more of a demonstration PR.

First commit, 7e5e68feaeb73ef8d7086e6cf067b86dbc66a80d, contains only the additional logging statements. Check it out and test using the instructions above.

Second commit, 01e71b68e85195df75ca0a35f86249a4a0ddd559, contains just the fix itself. There are a number of ways this fix can probably be implemented, but that's the quick and dirty fix I tested and was confirmed working with my set of files. It got late, so I haven't tested more extensively, but it looks like you've got a handy dandy test DB you can use for regression testing now, so hopefully that'll help!

Lemme know what you think/hit me up on discord if you want to chat about it. Loving the new documentation and pipelines, btw, it's feeling real legit around here!