olemb / dbfread

Read DBF Files with Python
MIT License

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 #61

Open perttikellomaki opened 2 years ago

perttikellomaki commented 2 years ago

I'm trying to read a database produced by an ancient version of the ACDsee photo manager program (don't ask). When I try to read it simply as:

from dbfread import DBF

table = DBF('asset.dbf')
for record in table:
    print(record)

I get ValueError: Unknown field type: '7'. I followed the advice in another issue and created a field parser as:

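from dbfread import DBF, FieldParser

class TestFieldParser(FieldParser):
    def parse7(self, field, data):
        # Return the raw bytes of type '7' fields without parsing them.
        return data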

table = DBF('asset.dbf', parserclass=TestFieldParser)
for record in table:
    print(record)

This produces the stack trace below. Googling for the error suggests that maybe the file is being read with the wrong encoding. Is there an easy way to try reading e.g. with UTF-8?

Traceback (most recent call last):
  File "/mnt/acdsee/ACDsee/./dumpdb.py", line 10, in <module>
    for record in table:
  File "/mnt/acdsee/ACDsee/venv/lib/python3.9/site-packages/dbfread/dbf.py", line 314, in _iter_records
    items = [(field.name,
  File "/mnt/acdsee/ACDsee/venv/lib/python3.9/site-packages/dbfread/dbf.py", line 315, in <listcomp>
    parse(field, read(field.length))) \
  File "/mnt/acdsee/ACDsee/venv/lib/python3.9/site-packages/dbfread/field_parser.py", line 79, in parse
    return func(field, data)
  File "/mnt/acdsee/ACDsee/venv/lib/python3.9/site-packages/dbfread/field_parser.py", line 87, in parseC
    return self.decode_text(data.rstrip(b'\0 '))
  File "/mnt/acdsee/ACDsee/venv/lib/python3.9/site-packages/dbfread/field_parser.py", line 45, in decode_text
    return decode_text(text, self.encoding, errors=self.char_decode_errors)
  File "/home/linuxbrew/.linuxbrew/opt/python@3.9/lib/python3.9/encodings/cp1252.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 35: character maps to <undefined>

shawnbrown commented 2 years ago

I'm not sure I have an answer for you, but have you tried specifying the encoding?

If you know what encoding you have, you might be able to get it to work this way. The DBF() constructor's optional second argument is for an encoding:

# Try UTF-8:
table = DBF('asset.dbf', 'UTF-8')
for record in table:
    print(record)

# Or try Latin-1:
table = DBF('asset.dbf', 'Latin-1')
for record in table:
    print(record)
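
To get a feel for which encodings can cope with the byte from your traceback, here is a quick standard-library check. It is only a sketch: the sample bytes are made up (not taken from your file) and the helper name try_decodings is hypothetical.

```python
def try_decodings(data, encodings):
    """Return {encoding: decoded text, or None if strict decoding fails}."""
    results = {}
    for name in encodings:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


# Made-up sample bytes containing the 0x81 byte from the traceback.
sample = b'photo\x81'

results = try_decodings(sample, ['cp1252', 'utf-8', 'latin-1'])
for name, text in results.items():
    print(name, '->', 'failed' if text is None else repr(text))
```

Latin-1 assigns a character to every byte value, so it never raises. That does not guarantee the decoded text is what the program originally wrote.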

Another option is to set the char_decode_errors handler. The argument defaults to 'strict' when unspecified.

So this...

table = DBF('asset.dbf')

Is the same as...

table = DBF('asset.dbf', char_decode_errors='strict')

But you could relax this requirement by specifying a more forgiving error handler (see Python's Error Handlers docs for more options):

table = DBF('asset.dbf', char_decode_errors='replace')
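
Judging by the traceback, char_decode_errors is handed to the decoder as its errors argument, so you can preview what each handler does with plain bytes.decode(). A minimal sketch with made-up sample bytes:

```python
# Made-up sample bytes; 0x81 has no character assigned in cp1252.
raw = b'caf\x81'

# 'replace' substitutes U+FFFD for the undecodable byte.
replaced = raw.decode('cp1252', errors='replace')
# 'ignore' silently drops the undecodable byte.
ignored = raw.decode('cp1252', errors='ignore')
# 'backslashreplace' keeps the byte visible as an escape sequence.
escaped = raw.decode('cp1252', errors='backslashreplace')

print(repr(replaced), repr(ignored), repr(escaped))
```

'backslashreplace' is handy while investigating, since the original byte value stays visible in the output.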

You might settle on some combination of the two... defining an expected encoding and loosening the error handling behavior:

table = DBF('asset.dbf', 'Latin-1', char_decode_errors='replace')
for record in table:
    print(record)
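
When weighing that combination, it may help to know which bytes each encoding rejects in the first place. The sketch below lists them for single-byte encodings using only the standard library; the helper name undecodable_bytes is hypothetical.

```python
def undecodable_bytes(encoding):
    """List the single-byte values that strict decoding rejects."""
    bad = []
    for value in range(256):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            bad.append(value)
    return bad


cp1252_gaps = undecodable_bytes('cp1252')
latin1_gaps = undecodable_bytes('latin-1')

print([hex(b) for b in cp1252_gaps])
print(latin1_gaps)
```

Latin-1 has no gaps, so with 'Latin-1' the error handler never has anything to do. With cp1252 it only comes into play for the handful of unassigned bytes, 0x81 among them.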