hydrusvideodeduplicator / hydrus-video-deduplicator

Video Deduplicator for the Hydrus Network
https://hydrusvideodeduplicator.github.io/hydrus-video-deduplicator/
MIT License
41 stars 7 forks source link

Bug: ZeroDivisionError for gifs with only 2 frames and a framerate >= 2fps #21

Closed prof-m closed 1 year ago

prof-m commented 1 year ago

Testing it out on my local library, which has a nice mix of gifs and videos, and I noticed one error coming up consistently:

Failed to calculate a perceptual hash.
 17:44:28 - hydlog: division by zero
Traceback (most recent call last):
  File "/home/profmisdumb/.local/lib/python3.10/site-packages/hydrusvideodeduplicator/dedup.py", line 186, in _add_perceptual_hashes_to_db
    perceptual_hash = self._calculate_perceptual_hash(video_response.content)
  File "/home/profmisdumb/.local/lib/python3.10/site-packages/hydrusvideodeduplicator/dedup.py", line 126, in _calculate_perceptual_hash
    perceptual_hash = VPDQSignal.hash_from_file(tmp_vid_file.name)
  File "/home/profmisdumb/.local/lib/python3.10/site-packages/hydrusvideodeduplicator/vpdq.py", line 60, in hash_from_file
    return vpdq_to_json(hash_file_compact(str(path), seconds_per_hash))
  File "/home/profmisdumb/.local/lib/python3.10/site-packages/hydrusvideodeduplicator/vpdq_util.py", line 70, in hash_file_compact
    vpdq_hashes = vpdq.computeHash(str(filepath), seconds_per_hash=seconds_per_hash)
  File "vpdq/python/vpdq.pyx", line 161, in vpdq.computeHash
  File "<string>", line 1, in <module>

I took a look at all the files that error got thrown for, and they had the following things in common:

  1. They all had only 2 frames
  2. They all had a framerate of 2fps or higher
  3. They were all gifs (though, tbh, that might just be because people wouldn't bother with using a video for something with only two frames)

Based on the stracktrace, my first thought was that the seconds_per_hash calculation was getting tripped up by files that have either 2 frames, or that have a higher framerate than number of frames. But, after taking a look at the code, I'm not so sure? Because as far as I can tell, seconds_per_hash never even gets passed into hash_file_compact() at all - the only calls to hash_file_compact() that I can see only pass in the filepath, and nothing else, so hash_file_compact() always gets called with it's seconds_per_hash param set to 1.

At which point we get to the inner workings of the vdqp library itself, but I haven't looked into the code there yet to see what's making it try to divide by zero. Will update the issue with more info if I figure it out before someone else has a chance to take a look

appleappleapplenanner commented 1 year ago

I figured this would happen, but I haven't taken the time to fix it or get a good test suite of GIFs or extremely short videos.

The problem is going to be in the main hashing function here.

They tried to fix it with by checking if int frameMod = secondsPerHash * framesPerSec was 0 but didn't bother checking for the divide by zero here :

pdqHashes.push_back({pdqHash, fno, quality, (double)fno / framesPerSec})

I'm pretty sure that timestamp isn't even used so it could probably be removed, but I'll probably fix it by just checking if the seconds_per_hash < total frames / FPS and hashing every frame if it is.

If you have time let me know if running FFprobe on the file gives you an actual FPS for the video > 0 by doing: ffprobe -v quiet -show_streams -select_streams v:0 badgif.gif

You can see what file is tripping it up with --debug and then you look up the hash in Hydrus with system:hash.

Just thinking about it, GIFs should probably be hashed every frame either way. For now if this problem is annoying run with the query system:duration > 1s. If the duration is greater than 1s there shouldn't be an issue.

prof-m commented 1 year ago

@appleappleapplenanner Ah, the joys of taking something built for one thing and using it for an entirely different thing 😂

Here's the output I got from one of the files that had the error. Specific line I think you're looking for: avg_frame_rate=0/0

Full output here just in case: ``` ffprobe -v quiet -show_streams -select_streams v:0 "C:\Users\profm\Hydrus\client_files\f05\05420e1d5904be4868aa7d15b8ea339c2a807f36e22f08161d65e06bc50cd62a.gif" [STREAM] index=0 codec_name=gif codec_long_name=CompuServe GIF (Graphics Interchange Format) profile=unknown codec_type=video codec_tag_string=[0][0][0][0] codec_tag=0x0000 width=1096 height=728 coded_width=1096 coded_height=728 closed_captions=0 film_grain=0 has_b_frames=0 sample_aspect_ratio=N/A display_aspect_ratio=N/A pix_fmt=bgra level=-99 color_range=unknown color_space=unknown color_transfer=unknown color_primaries=unknown chroma_location=unspecified field_order=unknown refs=1 id=N/A r_frame_rate=100/1 avg_frame_rate=0/0 time_base=1/100 start_pts=0 start_time=0.000000 duration_ts=40 duration=0.400000 bit_rate=N/A max_bit_rate=N/A bits_per_raw_sample=N/A nb_frames=2 nb_read_frames=N/A nb_read_packets=N/A DISPOSITION:default=0 DISPOSITION:dub=0 DISPOSITION:original=0 DISPOSITION:comment=0 DISPOSITION:lyrics=0 DISPOSITION:karaoke=0 DISPOSITION:forced=0 DISPOSITION:hearing_impaired=0 DISPOSITION:visual_impaired=0 DISPOSITION:clean_effects=0 DISPOSITION:attached_pic=0 DISPOSITION:timed_thumbnails=0 DISPOSITION:captions=0 DISPOSITION:descriptions=0 DISPOSITION:metadata=0 DISPOSITION:dependent=0 DISPOSITION:still_image=0 [/STREAM] ```
appleappleapplenanner commented 1 year ago

Well, it's not so much taking something built for one thing and using it for another than it is how absolutely shit the implementation of vpdq is. It's SO bad.

Anyway, it seems like I should actually be using r_frame_rate rather than avg_frame_rate from FFprobe in vpdq.pyx according to this.

prof-m commented 1 year ago

Oh yeesh, yeah, I just looked at the code you linked in your first reply, and I see what you mean - they've got no coverage for frames per second being zero, despite covering for frameMod. Though I guess, to be fair, framesPerSecond should never be 0 for any kind of image, but it's still bad error handling on their part. Passing in the r_frame_rate as the framesPerSecond argument makes sense to me, if avg_frame_rate is going to give results like 0/0

appleappleapplenanner commented 1 year ago

Should be fixed in hvdvpdq 0.0.12. Run pip3 install hvdvpdq -U to update and test please.

prof-m commented 1 year ago

@appleappleapplenanner That seemed to work! I'll need to let it run on some more files before I'm positive, but I'm no longer seeing the ZeroDivision error pop up when running with --verbose. Thanks for the quick fix!

This has cleared the way for me to notice that I'm seeing a lot of instances of UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 1934: invalid continuation byte now, but I'll open a separate issue with a proper write-up for that tomorrow.

prof-m commented 1 year ago

Issue already closed, but just confirming that I ran a search on all of my files and none of them failed, so I think we're most likely good here!