reubano / meza

A Python toolkit for processing tabular data
MIT License

UnicodeDecodeError when reading MDB file #46

Open bv-do opened 1 year ago

bv-do commented 1 year ago

When loading an MDB file, meza raises a UnicodeDecodeError as soon as it reads the first record. The error can be reproduced with the "PetStore2000.mdb" or "PetStore2002.mdb" sample databases found at this link: https://jerrypost.com/dbbook/mainbook/Databases/Access/Access.html (I've been able to read the other sample databases at that link with meza successfully.)

Example code:

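import meza.io
# Ask meza to guess the file's encoding (it reports None for this file)
print("Detected encoding:", meza.io.get_encoding('PetStore2002.mdb'))
records = meza.io.read_mdb('PetStore2002.mdb')
# read_mdb is lazy: the error only surfaces when the first record is requested
print(next(records))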

Output:

Detected encoding: None

Traceback (most recent call last):
  File "/Users/me/Library/Application Support/JetBrains/PyCharm2022.3/scratches/scratch_84.py", line 6, in <module>
    print(next(records))
  File "/Users/me/projects/leaks-tools/splitters/venv/lib/python3.8/site-packages/meza/io.py", line 664, in read_mdb
    first_line = StringIO(str(pipe.readline()))
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 268: invalid start byte

It's possible this is the same problem described in https://github.com/reubano/meza/issues/39
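
For reference, the failure in the traceback can be imitated without meza or an MDB file. The sketch below is only an illustration of the error class, not meza's actual code: `raw` is a made-up byte string standing in for the pipe output, and `decode_first_line` is a hypothetical helper. It shows that a text wrapper decoding strictly as UTF-8 fails on a 0xff byte, while a lenient error handler or a single-byte encoding such as latin-1 (an assumption about what the data might be, not something confirmed for these databases) gets through.

```python
import io

# Made-up sample data: ASCII text followed by 0xff, a byte that can never
# start a UTF-8 sequence (the same byte reported in the traceback).
raw = b"1,Caf\xff\n"


def decode_first_line(data, encoding="utf-8", errors="strict"):
    """Read the first line of `data` through a text wrapper.

    Hypothetical helper for illustration only; it mimics reading a line
    from a pipe that was opened in text mode.
    """
    wrapper = io.TextIOWrapper(io.BytesIO(data), encoding=encoding, errors=errors)
    return wrapper.readline()


# Strict UTF-8 decoding raises, as in the report above.
try:
    decode_first_line(raw)
except UnicodeDecodeError as exc:
    print("strict utf-8 failed:", exc)

# Replacing undecodable bytes keeps the read going, at the cost of losing data.
print(repr(decode_first_line(raw, errors="replace")))

# A single-byte encoding maps every byte to a character, so it cannot fail,
# but it only gives the right text if the data really is in that encoding.
print(repr(decode_first_line(raw, encoding="latin-1")))
```

If the pipe in read_mdb behaves like this wrapper, letting the caller choose the encoding or the error handler might be enough to get past the first record, though I have not verified that against the meza source.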