reubano / meza

A Python toolkit for processing tabular data
MIT License

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position... while reading Microsoft Access .mdb file #39

Open ryszard314159 opened 4 years ago

ryszard314159 commented 4 years ago

I am trying to read a Microsoft Access (.mdb) file created by ChemFinder on Windows, but I get a UnicodeDecodeError: 'utf-8' codec... error even though I pass the encoding reported by meza.io.get_encoding(), which is TIS-620.

I would appreciate any suggestions...

Details below:

import meza
fn = 'test.mdb'
enc = meza.io.get_encoding(fn)
print(enc) # TIS-620
records = meza.io.read_mdb(fn, encoding=enc)
z = list(records)
~/anaconda3/lib/python3.8/site-packages/meza/io.py in read_mdb(filepath, table, **kwargs)
    636     # https://stackoverflow.com/a/17698359/408556
    637     with Popen(['mdb-export', filepath, table], **pkwargs).stdout as pipe:
--> 638         first_line = StringIO(str(pipe.readline()))
    639         names = next(csv.reader(first_line, **kwargs))
    640         uscored = ft.underscorify(names) if sanitize else names

~/anaconda3/lib/python3.8/codecs.py in decode(self, input, final)
    320         # decode input (taking the buffer into account)
    321         data = self.buffer + input
--> 322         (result, consumed) = self._buffer_decode(data, self.errors, final)
    323         # keep undecoded input until the next call
    324         self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 70: invalid start byte
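Note that the message names 'utf-8', not TIS-620, which suggests the pipe is decoded with the default text encoding rather than the one passed to read_mdb. Byte 0x80 is a UTF-8 continuation byte and can never begin a character, hence "invalid start byte". A minimal sketch that reproduces the same class of error on a hypothetical byte string (not the actual mdb-export output):

```python
# Hypothetical sample: a CSV-like header and row with a stray 0x80 byte.
sample = b"id,name\n1,\x80abc\n"

try:
    sample.decode("utf-8")
except UnicodeDecodeError as exc:
    # 0x80 is a continuation byte, so UTF-8 rejects it as a start byte.
    message = f"{exc.reason} at position {exc.start}"
    print(message)
```

If this matches what happens inside read_mdb, the encoding argument is not reaching the layer that decodes the pipe.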

I tried passing the encoding through to the pkwargs used by Popen (in meza/io.py):

    pkwargs = {'stdout': PIPE, 'bufsize': 1, 'universal_newlines': True}
--> pkwargs['encoding'] = kwargs.get('encoding', None)

    # https://stackoverflow.com/a/2813530/408556
    # https://stackoverflow.com/a/17698359/408556
    with Popen(['mdb-export', filepath, table], **pkwargs).stdout as pipe:

but it does not resolve the issue. With this modification I am getting:

UnicodeDecodeError: 'charmap' codec can't decode byte 0xff in position 327: character maps to <undefined>
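The 'charmap' message indicates that byte 0xff has no mapping in the requested encoding, so either the detected encoding does not match the data or the output contains bytes that are not text at all. One way to narrow this down is to capture the child process output as raw bytes and decode it explicitly with an error handler. A minimal sketch, assuming a stand-in child process in place of mdb-export and a hypothetical helper named read_decoded; this is not meza's implementation:

```python
import subprocess
import sys


def read_decoded(cmd, encoding, errors="replace"):
    """Run cmd, capture stdout as raw bytes, and decode it explicitly.

    Hypothetical helper for illustration; not part of meza.
    """
    raw = subprocess.run(cmd, stdout=subprocess.PIPE, check=True).stdout
    # errors="replace" swaps undecodable bytes for U+FFFD instead of raising.
    return raw.decode(encoding, errors=errors)


# Stand-in for the mdb-export call: emits a byte (0xff) that is
# undefined in TIS-620.
child = [
    sys.executable,
    "-c",
    r"import sys; sys.stdout.buffer.write(b'id,name\n1,\xff\n')",
]

text = read_decoded(child, "tis-620")
print(repr(text))
```

Calling the helper with errors="strict" instead raises the exception again, which helps confirm which bytes are affected. subprocess.Popen also accepts an errors argument alongside encoding (Python 3.6+), so the same handler could be passed through pkwargs.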

reubano commented 2 years ago

Can you send me a file to test with?