jseutter / ofxparse

Ofx file format parser for Python
http://sites.google.com/site/ofxparse/
MIT License
204 stars 121 forks source link

UTF-8 Encoding #133

Open nim-odoo opened 6 years ago

nim-odoo commented 6 years ago

Hello,

I am a bit confused by these lines: https://github.com/jseutter/ofxparse/blob/master/ofxparse/ofxparse.py#L85-L95 These are used to determine the headers of the file, and more particularly the encoding.

However, a typical header is:

<?xml version="1.0" encoding="UTF-8"?>
<?OFX OFXHEADER="200" VERSION="202" SECURITY="NONE" OLDFILEUID="NONE" NEWFILEUID="NONE"?>

This method won't parse the encoding="UTF-8" header. Therefore, the encoding will (always?) be considered as ASCII in https://github.com/jseutter/ofxparse/blob/master/ofxparse/ofxparse.py#L111-L117

You can imagine that if the OFX file contains non-ASCII characters, the reading will fail. As a workaround, I added encoding:UTF-8 on top of the file. But that seems odd...

Thanks cc @qdp-odoo

jseutter commented 6 years ago

I think you are correct. I don’t see any test cases that test this behavior.

Back when I started this library pretty much no producers were generating Ofx 2.2 and the code shows this. I wonder if nowadays it would be better to try parsing as xml first and if that errors out, switch to the sgml parsing. This would give us proper encoding detection for free for 2.2 and newer files.

At any rate, thanks for bringing this up. :)

-Jerry

alexis-via commented 3 years ago

I confirm that the bug is still present today in ofxparse 0.20 and in master.

My OFX files starts with

<?xml version="1.0" encoding="utf-8"?>
<?OFX OFXHEADER="200" VERSION="220" SECURITY="NONE" OLDFILEUID="NONE" NEWFILEUID="NONE"?>
<OFX>
  <SIGNONMSGSRSV1>

When I read the OFX 2.2.0 spec here https://www.ofx.net/downloads/OFX%202.2.pdf it says in section 2.2 page 34-35:

"The standard XML declaration must come first. This Processing Instruction (PI) includes an option to specify the version of XML being used, and may include options to show such things as the encoding declaration, and the standalone status of the document."

"The OFX declaration must come next in the file. This PI identifies the contents as an Open Financial Exchange file and provides the version number of the Open Financial Exchange declaration itself (not the version number of the contents). The Open Financial Exchange PI contains the following attributes: OFXHEADER VERSION SECURITY OLDFILEUID NEWFILEUID"

So it's clear that, with OFX 2.2, the encoding of the file must be read from the first line that should look like

<?xml version="1.0" encoding="UTF-8"?>

and not from the second line "<?OFX OFXHEADER=" that doesn't have an ENCODING attribute any more.

With the current version of ofxparse, my OFX file fails because it contains

<NAME>Direction Générale des Finances </NAME>

but, as the file is encoded in utf-8 but read as ascii by ofxparse, I get:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 3636: ordinal not in range(128)

I'm ready to contribute to fix this bug. Any advice for the best way to fix it would be welcomed.

alexis-via commented 3 years ago

When looking at the test suite under the directory "tests/fixtures", I can find 4 files in OFX 2.x that start with

<?xml version="1.0" encoding="UTF-8" standalone="no"?>

(anzcc.ofx, error_message.ofx, multiple_accounts2.ofx and multiple_accounts.ofx) but they only contain ASCII caracters, there aren't any accentuated character in them. That's why the tests work.

The first problem is to fix the method read_headers() https://github.com/jseutter/ofxparse/blob/master/ofxparse/ofxparse.py#L86 On an OFX 2.x file, this second line of code of this method will set head_data as an empty byte ; so, at the end of this method, self.headers is an empty OrderedDict. Then, the method handle_encoding() uses self.headers to read the encoding and fallsback on ascii.

My ideas: the current methods read_headers() and handle_encoding() work well for OFX 1.x but not for OFX 2.x. Should we try to patch them to work on both, or should we have separate methods for OFX 2.x and use some code to detect if the file is OFX 1.x or 2.x ?

alexis-via commented 3 years ago

I made a first prototype here: https://people.via.ecp.fr/~alexis/ofxparse-proto-bug-133.diff

This patch only changes the method read_headers(): if the file starts with '<?xml ', I use lxml to parse the ofx file and get the encoding. @jseutter What do you think about this approach ?

jseutter commented 3 years ago

I'm okay with this approach. Although I try to avoid adding dependencies, I think it is warranted in this case. I haven't run your code but the way I read it, it should still work for 1.x files. Is this the case?

If you have time to add a test case that highlights the broken (and now fixed) behavior I would be grateful. If not, I'll try to add one myself.

alexis-via commented 3 years ago

@jseutter Thanks for your feedback. It should work with ofx v1 and v2. I'll prepare a PR with a test with a v2 file.

dev590t commented 1 year ago

the PR is merged to master? I have the same problem

alexis-via commented 1 year ago

You can try my patch: https://people.via.ecp.fr/~alexis/ofxparse-proto-bug-133.diff

dev590t commented 1 year ago

The header of my ofx file is:

OFXHEADER:100
DATA:OFXSGML
VERSION:102
SECURITY:NONE
ENCODING:UTF-8
CHARSET:NONE
COMPRESSION:NONE
OLDFILEUID:NONE
NEWFILEUID:NONE
<OFX>...

and ofxparser fails on é char.

Thank you @alexis-via, After patching your patch, that give me other errors. I'm not sure if it is from my ofx file or no.

After modify the header to:

OFXHEADER:100
DATA:OFXSGML
VERSION:102
SECURITY:NONE
ENCODING:USASCII
CHARSET:1252
COMPRESSION:NONE
OLDFILEUID:NONE
NEWFILEUID:NONE
<OFX>
...

The ofxparser runs without the patch.

Why don't submit your PR if the patch solve the issue for some case? The project seems unmaintained since 2021.