jseutter / ofxparse

Ofx file format parser for Python
http://sites.google.com/site/ofxparse/
MIT License
204 stars 121 forks source link

Change BeautifulSoup parser from xml to html.parser #108

Closed sreeshas closed 8 years ago

sreeshas commented 8 years ago

This pull request is related to issue #107 and #76

Ran nosetests and all tests pass.

nathangrigg commented 8 years ago

I spent a little bit of time looking into various issues that depend on the parser (issues #107, #76, #92, #64). Here is the full story:

At one point, we were using BeautifulSoup with the lxml html parser ('lxml'). This is good enough to parse OFX but drops everything inside CDATA sections. So #64 began using the lxml xml parser ('xml' a.k.a. 'lxml-xml'), which handles CDATA correctly but is more strict and when it sees malformed xml it causes ofxparse to think that the document is empty.

My first instinct was #113, which first tries to use 'lxml-xml' and if that fails falls back to 'lxml', because I know that the built in python html parser ('html.parser', this PR) has been problematic in the past.

But the current BeautifulSoup documentation says that for Python >2.7.3 or >3.2.2, 'html.parser' is a good, lenient parser albeit slower than lxml. I have confirmed that it handles CDATA correctly.

My recommendation is to scrap #113 and go with this PR.

Pros:

Cons:

If no one objects within a few days, I'll merge this PR.

nathangrigg commented 8 years ago

The tests continue to pass even after I revert #64, which is a further vote for simplicity of this approach.

StefanRijnhart commented 7 years ago

Looking at all the related bugs, I think this warrants a new release?