hydrusvideodeduplicator / hydrus-video-deduplicator

Video Deduplicator for the Hydrus Network
https://hydrusvideodeduplicator.github.io/hydrus-video-deduplicator/
MIT License

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 1785: invalid continuation byte #23

Closed appleappleapplenanner closed 1 year ago

appleappleapplenanner commented 1 year ago

From profm

Here's another bug I ran into when testing the app against my library. It's not terribly common - I had 18 instances of it in almost 10k files - but it does happen consistently and repeatably for certain files.

For those 18 files, when the program tries and fails to phash them, this stack trace (or one nearly identical to it) is given (in verbose mode):

 Failed to calculate a perceptual hash.
 01:06:37 - hydlog: 'utf-8' codec can't decode byte 0xe9 in position 1785: invalid continuation byte
Traceback (most recent call last):
  File "/home/profm/.local/lib/python3.10/site-packages/hydrusvideodeduplicator/dedup.py", line 186, in _add_perceptual_hashes_to_db
    perceptual_hash = self._calculate_perceptual_hash(video_response.content)
  File "/home/profm/.local/lib/python3.10/site-packages/hydrusvideodeduplicator/dedup.py", line 126, in _calculate_perceptual_hash
    perceptual_hash = VPDQSignal.hash_from_file(tmp_vid_file.name)
  File "/home/profm/.local/lib/python3.10/site-packages/hydrusvideodeduplicator/vpdq.py", line 60, in hash_from_file
    return vpdq_to_json(hash_file_compact(str(path), seconds_per_hash))
  File "/home/profm/.local/lib/python3.10/site-packages/hydrusvideodeduplicator/vpdq_util.py", line 70, in hash_file_compact
    vpdq_hashes = vpdq.computeHash(str(filepath), seconds_per_hash=seconds_per_hash)
  File "vpdq/python/vpdq.pyx", line 159, in vpdq.computeHash
  File "vpdq/python/vpdq.pyx", line 121, in vpdq.get_vid_info
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 1785: invalid continuation byte
 01:06:37 - hydlog: Errored file hash: 1fba576ef070dd41cef68b96a173cd0dcfc6400f870eebb6ff6866180f6bd355

Where they differ is in the exact text of the UnicodeDecodeError message - the "byte" is some hexadecimal number, and the position changes as well.
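
For anyone trying to reproduce this outside the deduplicator: this is the message Python raises when bytes that are not valid UTF-8 are decoded strictly. The sketch below is not vpdq's code. `decode_tolerant` is a hypothetical helper and the sample bytes are made up. It assumes the failing call decodes some text (for example, file metadata) with strict UTF-8 handling, which the traceback suggests but does not prove.

```python
# Hypothetical reproduction: Latin-1 text decoded as strict UTF-8.
# Nothing here comes from vpdq itself; the sample bytes are invented.

def decode_tolerant(raw: bytes) -> str:
    """Decode as UTF-8, substituting U+FFFD for undecodable bytes."""
    return raw.decode("utf-8", errors="replace")


# "é" is the single byte 0xe9 in Latin-1. In UTF-8, 0xe9 announces a
# three-byte sequence, so the newline that follows it is rejected.
latin1_bytes = "café\n".encode("latin-1")

try:
    latin1_bytes.decode("utf-8")
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc

# The tolerant variant never raises; it loses the original character instead.
tolerant_text = decode_tolerant(latin1_bytes)
```

Here `strict_error.reason` is `invalid continuation byte`, matching the log above, while `tolerant_text` keeps the readable part of the string. Whether replacing bad bytes is acceptable depends on what the decoded text is used for.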

Unlike with the ZeroDivisionError, there are few similarities between the files this message was thrown for. Here's what I can say:

My first thought, when I noticed that the majority were made by the same content creator, was that said creator was using a weird encoding that vpdq didn't like. But I don't think that can be it (or at least, not all of it), because there are gifs mixed in with the mp4s, and gifs don't have different encodings like mp4s do.

I went ahead and ran ffprobe -v quiet -show_streams -select_streams v:0 "C:\filepath" on a few of them, and I've included the output below in case it's helpful.

ffprobe run on one of the mp4 files:

```
PS C:\ffmpeg\bin> .\ffprobe -v quiet -show_streams -select_streams v:0 "C:\Users\profm\client_files\f1c\1c091dac87222fe1d1a3b904d37125a50b7b5cfcaaaf108c6d11f6c24bce8bc6.mp4"
[STREAM]
index=0
codec_name=h264
codec_long_name=H.264 / AVC / MPEG-4 AVC / MPEG-4 part 10
profile=High
codec_type=video
codec_tag_string=avc1
codec_tag=0x31637661
width=428
height=280
coded_width=428
coded_height=280
closed_captions=0
film_grain=0
has_b_frames=2
sample_aspect_ratio=N/A
display_aspect_ratio=N/A
pix_fmt=yuv420p
level=21
color_range=unknown
color_space=unknown
color_transfer=unknown
color_primaries=unknown
chroma_location=left
field_order=progressive
refs=1
is_avc=true
nal_length_size=4
id=0x1
r_frame_rate=20/1
avg_frame_rate=20/1
time_base=1/10240
start_pts=0
start_time=0.000000
duration_ts=36864
duration=3.600000
bit_rate=384513
max_bit_rate=N/A
bits_per_raw_sample=8
nb_frames=72
nb_read_frames=N/A
nb_read_packets=N/A
extradata_size=46
DISPOSITION:default=1
DISPOSITION:dub=0
DISPOSITION:original=0
DISPOSITION:comment=0
DISPOSITION:lyrics=0
DISPOSITION:karaoke=0
DISPOSITION:forced=0
DISPOSITION:hearing_impaired=0
DISPOSITION:visual_impaired=0
DISPOSITION:clean_effects=0
DISPOSITION:attached_pic=0
DISPOSITION:timed_thumbnails=0
DISPOSITION:captions=0
DISPOSITION:descriptions=0
DISPOSITION:metadata=0
DISPOSITION:dependent=0
DISPOSITION:still_image=0
TAG:language=und
TAG:handler_name=VideoHandler
TAG:vendor_id=[0][0][0][0]
[/STREAM]
```

ffprobe run on one of the GIF files:

```
PS C:\ffmpeg\bin> .\ffprobe -v quiet -show_streams -select_streams v:0 "C:\Users\profm\Hydrus\client_files\f44\446ba83242c6fe82754a7fdd539d477942ed9b1ce9a84a45ac19ae076337bc8d.gif"
[STREAM]
index=0
codec_name=gif
codec_long_name=CompuServe GIF (Graphics Interchange Format)
profile=unknown
codec_type=video
codec_tag_string=[0][0][0][0]
codec_tag=0x0000
width=540
height=304
coded_width=540
coded_height=304
closed_captions=0
film_grain=0
has_b_frames=0
sample_aspect_ratio=N/A
display_aspect_ratio=N/A
pix_fmt=bgra
level=-99
color_range=unknown
color_space=unknown
color_transfer=unknown
color_primaries=unknown
chroma_location=unspecified
field_order=unknown
refs=1
id=N/A
r_frame_rate=20/1
avg_frame_rate=20/1
time_base=1/100
start_pts=0
start_time=0.000000
duration_ts=145
duration=1.450000
bit_rate=N/A
max_bit_rate=N/A
bits_per_raw_sample=N/A
nb_frames=29
nb_read_frames=N/A
nb_read_packets=N/A
DISPOSITION:default=0
DISPOSITION:dub=0
DISPOSITION:original=0
DISPOSITION:comment=0
DISPOSITION:lyrics=0
DISPOSITION:karaoke=0
DISPOSITION:forced=0
DISPOSITION:hearing_impaired=0
DISPOSITION:visual_impaired=0
DISPOSITION:clean_effects=0
DISPOSITION:attached_pic=0
DISPOSITION:timed_thumbnails=0
DISPOSITION:captions=0
DISPOSITION:descriptions=0
DISPOSITION:metadata=0
DISPOSITION:dependent=0
DISPOSITION:still_image=0
[/STREAM]
```

prof-m commented 1 year ago

Thanks for the catch. 😂

Oh, and actually, I found one lone file that had a slightly different but seemingly related error - instead of saying 'invalid continuation byte', it says 'invalid start byte'.
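
The two messages describe different positions of a bad byte within a UTF-8 sequence: a byte like 0x88 can only continue a sequence, so it is an "invalid start byte" when it appears where a new character should begin, whereas 0xe9 opens a multi-byte sequence and fails when the next byte does not continue it. A minimal sketch for locating every such byte in a blob follows; `find_invalid_utf8` is a hypothetical helper, not part of vpdq or the deduplicator, and the sample input is invented.

```python
# Hypothetical diagnostic: list every byte that strict UTF-8 decoding rejects.

def find_invalid_utf8(data: bytes) -> list[tuple[int, int, str]]:
    """Return (offset, byte value, reason) for each undecodable spot in data."""
    problems = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the remainder is valid UTF-8
        except UnicodeDecodeError as exc:
            offset = pos + exc.start  # exc.start is relative to the slice
            problems.append((offset, data[offset], exc.reason))
            pos += exc.end  # resume just past the rejected bytes
    return problems


# Invented sample containing one byte of each kind.
sample = b"ok \x88 caf\xe9 end"
report = find_invalid_utf8(sample)
```

Running something like this over the raw text that fails to decode would show whether the bad bytes cluster in one field (a title or handler name, say) or are scattered.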

Stack trace:

```
 Failed to calculate a perceptual hash.
 01:07:24 - hydlog: 'utf-8' codec can't decode byte 0x88 in position 1959: invalid start byte
Traceback (most recent call last):
  File "/home/profmisdumb/.local/lib/python3.10/site-packages/hydrusvideodeduplicator/dedup.py", line 186, in _add_perceptual_hashes_to_db
    perceptual_hash = self._calculate_perceptual_hash(video_response.content)
  File "/home/profmisdumb/.local/lib/python3.10/site-packages/hydrusvideodeduplicator/dedup.py", line 126, in _calculate_perceptual_hash
    perceptual_hash = VPDQSignal.hash_from_file(tmp_vid_file.name)
  File "/home/profmisdumb/.local/lib/python3.10/site-packages/hydrusvideodeduplicator/vpdq.py", line 60, in hash_from_file
    return vpdq_to_json(hash_file_compact(str(path), seconds_per_hash))
  File "/home/profmisdumb/.local/lib/python3.10/site-packages/hydrusvideodeduplicator/vpdq_util.py", line 70, in hash_file_compact
    vpdq_hashes = vpdq.computeHash(str(filepath), seconds_per_hash=seconds_per_hash)
  File "vpdq/python/vpdq.pyx", line 159, in vpdq.computeHash
  File "vpdq/python/vpdq.pyx", line 121, in vpdq.get_vid_info
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x88 in position 1959: invalid start byte
 01:07:24 - hydlog: Errored file hash: 6395a0259cf4dc9b310d472d3919c4d349cdbe70a7a299a5b6722ceb3e467008
```

ffprobe output for the one file that produced this error:

```
ffprobe -v quiet -show_streams -select_streams v:0 "C:\Users\profm\Hydrus\client_files\f63\6395a0259cf4dc9b310d472d3919c4d349cdbe70a7a299a5b6722ceb3e467008.mp4"
[STREAM]
index=0
codec_name=h264
codec_long_name=H.264 / AVC / MPEG-4 AVC / MPEG-4 part 10
profile=High
codec_type=video
codec_tag_string=avc1
codec_tag=0x31637661
width=356
height=608
coded_width=356
coded_height=608
closed_captions=0
film_grain=0
has_b_frames=2
sample_aspect_ratio=N/A
display_aspect_ratio=N/A
pix_fmt=yuv420p
level=22
color_range=unknown
color_space=unknown
color_transfer=unknown
color_primaries=unknown
chroma_location=left
field_order=progressive
refs=1
is_avc=true
nal_length_size=4
id=0x1
r_frame_rate=10/1
avg_frame_rate=10/1
time_base=1/10240
start_pts=0
start_time=0.000000
duration_ts=532480
duration=52.000000
bit_rate=882611
max_bit_rate=N/A
bits_per_raw_sample=8
nb_frames=520
nb_read_frames=N/A
nb_read_packets=N/A
extradata_size=46
DISPOSITION:default=1
DISPOSITION:dub=0
DISPOSITION:original=0
DISPOSITION:comment=0
DISPOSITION:lyrics=0
DISPOSITION:karaoke=0
DISPOSITION:forced=0
DISPOSITION:hearing_impaired=0
DISPOSITION:visual_impaired=0
DISPOSITION:clean_effects=0
DISPOSITION:attached_pic=0
DISPOSITION:timed_thumbnails=0
DISPOSITION:captions=0
DISPOSITION:descriptions=0
DISPOSITION:metadata=0
DISPOSITION:dependent=0
DISPOSITION:still_image=0
TAG:language=und
TAG:handler_name=VideoHandler
TAG:vendor_id=[0][0][0][0]
[/STREAM]
```

appleappleapplenanner commented 1 year ago

Fixed by hvdvpdq 0.0.13