jseutter / ofxparse

Ofx file format parser for Python
http://sites.google.com/site/ofxparse/
MIT License

Cannot process UTF-8 files with characters outside the 256 range #169

Open lynxlynxlynx opened 2 years ago

lynxlynxlynx commented 2 years ago

This file can't be parsed, whether it is opened as text or as binary. Notice the "č" character in NAME. I'm on Linux, the default locale is UTF-8, and the file was stored as UTF-8:

<!-- 
OFXHEADER:100
DATA:OFXSGML
VERSION:102
SECURITY:NONE
ENCODING:UTF-8
CHARSET:NONE
COMPRESSION:NONE
OLDFILEUID:NONE
NEWFILEUID:NONE
-->

<OFX><SIGNONMSGSRSV1><SONRS><STATUS><CODE>0</CODE><SEVERITY>INFO</SEVERITY></STATUS>
<DTSERVER>20220531164134</DTSERVER><LANGUAGE>ENG</LANGUAGE></SONRS></SIGNONMSGSRSV1>
<BANKMSGSRSV1><STMTTRNRS><TRNUID>0</TRNUID>
<STATUS><CODE>0</CODE><SEVERITY>INFO</SEVERITY></STATUS>
<STMTRS><CURDEF>EUR</CURDEF><BANKACCTFROM><BANKID>-1</BANKID>
<ACCTID>SI56020100355860373</ACCTID><ACCTTYPE>CHECKING</ACCTTYPE>
</BANKACCTFROM><BANKTRANLIST><DTSTART>20220506</DTSTART><DTEND>20220510</DTEND><STMTTRN>
<TRNTYPE>CHECK</TRNTYPE><DTPOSTED>20220510</DTPOSTED><DTUSER>20220510</DTUSER>
<TRNAMT>-70.49</TRNAMT><FITID>-1</FITID><NAME>Finančna uprava RS</NAME><MEMO>SI1930741929-80004</MEMO>
<REFNUM>16NAFNEB2FKU42TQ</REFNUM></STMTTRN></BANKTRANLIST><LEDGERBAL>
<BALAMT>48554.59</BALAMT><DTASOF>20220510000000</DTASOF></LEDGERBAL></STMTRS></STMTTRNRS>
</BANKMSGSRSV1></OFX>

Default reading, as suggested by the docs:

Python 3.10.4 (main, Apr  2 2022, 09:04:19) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import ofxparse
>>> file = open('/tmp/moj.ofx') # passing encoding="utf-8" doesn't change anything, as expected
>>> ofx = ofxparse.OfxParser.parse(file)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3/dist-packages/ofxparse/ofxparse.py", line 396, in parse
    ofx_file = OfxPreprocessedFile(file_handle)
  File "/usr/lib/python3/dist-packages/ofxparse/ofxparse.py", line 155, in __init__
    super(OfxPreprocessedFile, self).__init__(fh)
  File "/usr/lib/python3/dist-packages/ofxparse/ofxparse.py", line 79, in __init__
    self.fh = six.BytesIO(six.b(self.fh.read()))
  File "/usr/lib/python3/dist-packages/six.py", line 644, in b
    return s.encode("latin-1")
UnicodeEncodeError: 'latin-1' codec can't encode character '\u010d' in position 751: ordinal not in range(256)
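
The failure in the traceback above can be reproduced with the standard library alone. This is a minimal sketch that calls `str.encode("latin-1")` directly, the same call `six.b()` makes in the last frame; the variable names are only for illustration.

```python
# "č" is U+010D, which lies outside the 0-255 range that Latin-1 can represent.
name = "Finančna uprava RS"

try:
    name.encode("latin-1")  # same call as the last frame of the traceback
except UnicodeEncodeError as exc:
    latin1_error = exc
    print(exc)

# UTF-8 has no such limit; "č" becomes the two bytes 0xC4 0x8D.
utf8_bytes = name.encode("utf-8")
print(utf8_bytes)
```

So a text-mode handle that contains any character above U+00FF cannot survive the conversion to bytes via Latin-1.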

Opening the file in binary mode avoids that error but fails with a different one:

>>> file = open('/tmp/moj.ofx', mode="rb")
>>> import ofxparse
>>> ofx = ofxparse.OfxParser.parse(file)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3/dist-packages/ofxparse/ofxparse.py", line 396, in parse
    ofx_file = OfxPreprocessedFile(file_handle)
  File "/usr/lib/python3/dist-packages/ofxparse/ofxparse.py", line 160, in __init__
    ofx_string = self.fh.read()
  File "/usr/lib/python3.10/codecs.py", line 504, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 751: ordinal not in range(128)
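
The binary-mode failure is the mirror image: byte 0xc4 is the first byte of the UTF-8 encoding of "č", and the ASCII codec rejects anything above 0x7f. A minimal standard-library sketch (variable names are illustrative only):

```python
# The UTF-8 bytes as they would appear in the file's NAME field.
raw = "Finančna uprava RS".encode("utf-8")

try:
    raw.decode("ascii")  # the codec named in the traceback above
except UnicodeDecodeError as exc:
    ascii_error = exc
    print(exc)

# Decoding with the codec the file was actually written in works.
decoded = raw.decode("utf-8")
print(decoded)
```

Both tracebacks report position 751 because everything before "č" in the file is plain ASCII, so the character offset and the byte offset coincide.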

The parser insists on encoding as ASCII or Latin-1, even though the header declares ENCODING:UTF-8. From a quick glance, I don't see any tests that use non-ASCII Unicode input, so this has likely been broken from the start.
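
For illustration, here is a rough sketch of how the codec could be chosen from the header instead of being assumed. `detect_ofx_codec` and the `_CHARSET_CODECS` mapping are hypothetical and not part of ofxparse; the mapping of header values to Python codec names is an assumption, not taken from the library or verified against the OFX specification.

```python
import re

# Hypothetical mapping from CHARSET header values to Python codec names.
_CHARSET_CODECS = {"1252": "cp1252", "ISO-8859-1": "latin-1", "NONE": "ascii"}


def detect_ofx_codec(raw: bytes) -> str:
    """Guess a Python codec from the ENCODING/CHARSET header lines."""
    # Header keys and values are ASCII, so a lossy ASCII decode is enough here.
    head = raw[:1024].decode("ascii", errors="replace")
    fields = dict(re.findall(r"^([A-Z]+):(\S+)\s*$", head, flags=re.MULTILINE))
    if fields.get("ENCODING", "USASCII").upper() == "UTF-8":
        return "utf-8"
    # Otherwise fall back to CHARSET, defaulting to plain ASCII.
    return _CHARSET_CODECS.get(fields.get("CHARSET", "NONE").upper(), "ascii")


# A shortened stand-in for the file in this report.
sample = (
    "OFXHEADER:100\nDATA:OFXSGML\nVERSION:102\nSECURITY:NONE\n"
    "ENCODING:UTF-8\nCHARSET:NONE\n\n"
    "<OFX><NAME>Finančna uprava RS</NAME></OFX>"
).encode("utf-8")

codec = detect_ofx_codec(sample)
text = sample.decode(codec)
print(codec, text.splitlines()[-1])
```

Reading the file as bytes and decoding once with the declared codec would avoid both the Latin-1 round trip and the ASCII default shown in the tracebacks.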

dev590t commented 1 year ago

This is related to #133.