tw4l / brunnhilde

Siegfried-based characterization tool for directories and disk images
MIT License

UnicodeDecodeError #36

Closed tw4l closed 2 years ago

tw4l commented 6 years ago

[Screenshot: brunnhilde_error]

Need to add error handling around Unicode decoding.

bdietz commented 2 years ago

Hi Tessa, I believe I ran into a related (or the same) error today. Here's the output:

Traceback (most recent call last):
  File "/Users/[user]/anaconda3/bin/brunnhilde.py", line 1387, in <module>
    main()
  File "/Users/[user]/anaconda3/bin/brunnhilde.py", line 1362, in main
    args, source, cursor, conn, html, siegfried_version, use_hash, ssn_mode,
  File "/Users/[user]/anaconda3/bin/brunnhilde.py", line 949, in process_content
    use_hash = import_csv(cursor, conn, use_hash)
  File "/Users/[user]/anaconda3/bin/brunnhilde.py", line 295, in import_csv
    for row in reader:
  File "/Users/[user]/anaconda3/bin/brunnhilde.py", line 285, in <genexpr>
    x.replace("\0", "") for x in f
  File "/Users/[user]/anaconda3/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x83 in position 6398: invalid start byte

I ran the error past Dianne D., who said:

"My hunch is that it's failing again in the except block."

She pointed to these lines:

https://github.com/tw4l/brunnhilde/blob/98882e842320fc77c58b67724f46e154b2917381/brunnhilde.py#L287
https://github.com/tw4l/brunnhilde/blob/98882e842320fc77c58b67724f46e154b2917381/brunnhilde.py#L290

But she also said she wasn't entirely sure.

We did test brunnhilde against another directory of files and it worked as expected.

We're using OSX 12.1 and managing brunnhilde via pip.

Thanks! Brian

tw4l commented 2 years ago

Thanks @bdietz ! I think Dianne's right that the except block routine there isn't working as intended. Will take a look shortly! If you could share the offending CSV file by email that would be helpful!
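
For anyone following along, here is one way a try/except around a CSV read can fail to catch this kind of error. It is a minimal sketch and not a copy of brunnhilde.py: the sample bytes and the `open_text` helper are made up for illustration, and only the generator expression is taken from the traceback above. The assumption is that the reader is built inside the try block while the rows are consumed later, outside it.

```python
import csv
import io

# Hypothetical sample data: 0x83 cannot start a UTF-8 sequence.
raw = b"filename,format\nbad\x83name.txt,Plain Text\n"


def open_text():
    """Return a text stream that decodes the sample bytes strictly as UTF-8."""
    return io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8")


caught_at = None
reason = None
f = open_text()

try:
    # Building the reader does not read the stream yet: the generator
    # expression is lazy, so no decoding happens inside this try block.
    reader = csv.reader(x.replace("\0", "") for x in f)
except UnicodeDecodeError:
    caught_at = "construction"

try:
    # Decoding happens here, when rows are actually pulled from the stream.
    rows = list(reader)
except UnicodeDecodeError as exc:
    caught_at = "iteration"
    reason = exc.reason

print(caught_at, "-", reason)
```

If that assumption holds, the error surfaces in the loop over the reader, which would match the `for row in reader:` frame in the traceback.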

bdietz commented 2 years ago

Yeah, for sure. Thank you @tw4l. Sorry to ask, but do you want the siegfried.csv file? IIRC, that's the point at which brunnhilde was failing, after siegfried wrapped up.

tw4l commented 2 years ago

@bdietz yes exactly, and thanks for clarifying!

tw4l commented 2 years ago

Hey Brian, I have a fix in the development branch that uses Python's errors argument with the "ignore" setting, which will silently skip characters that would otherwise throw a UnicodeDecodeError when reading the Siegfried CSV. I've tested that this resolves the issue with the Siegfried CSV file you emailed me. I'm just running the tests now and then will cut a 1.9.4 patch release with the fix and push it to PyPI this week :)
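
A rough sketch of the approach, for reference. This is an illustration under stated assumptions and not the code from the development branch: `read_rows` and the sample file contents are hypothetical, while `errors="ignore"` is the standard argument to Python's built-in `open()`.

```python
import csv
import os
import tempfile

# Hypothetical sample file: contains a byte (0x83) that is invalid as UTF-8
# and a null byte, both of which can trip up a strict read.
path = os.path.join(tempfile.mkdtemp(), "siegfried.csv")
with open(path, "wb") as out:
    out.write(b"filename,format\nbad\x83name.txt,Plain\x00 Text\n")


def read_rows(csv_path):
    """Read a CSV as UTF-8, dropping undecodable bytes and null bytes."""
    # errors="ignore" tells the decoder to drop bytes it cannot decode
    # instead of raising UnicodeDecodeError.
    with open(csv_path, "r", encoding="utf-8", errors="ignore", newline="") as f:
        # Strip null bytes line by line before the csv module sees them.
        reader = csv.reader(x.replace("\0", "") for x in f)
        return list(reader)


rows = read_rows(path)
print(rows)
```

The trade-off is that the dropped bytes are lost without any warning, so a filename in the report may differ slightly from the one on disk. Using `errors="replace"` instead would substitute U+FFFD for each bad byte and keep the damage visible.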

tw4l commented 2 years ago

Thanks for the detailed bug report and the nudge to get this fixed!

bdietz commented 2 years ago

Awesome. Thank you!

tw4l commented 2 years ago

Fixed in commit https://github.com/tw4l/brunnhilde/commit/09defef8270d8647f5f49acdc27a0625d951ab15. The fix is now released to PyPI so sudo pip3 install --upgrade brunnhilde should fix it for you!