Change BeautifulSoup parser from xml to html.parser

sreeshas commented 8 years ago

This pull request is related to issue #107 and #76

Ran nosetests and all tests pass.

nathangrigg commented 8 years ago

I spent a little bit of time looking into various issues that depend on the parser (issues #107, #76, #92, #64). Here is the full story:

At one point, we were using BeautifulSoup with the lxml html parser ('lxml'). This is good enough to parse OFX but drops everything inside CDATA sections. So #64 began using the lxml xml parser ('xml' a.k.a. 'lxml-xml'), which handles CDATA correctly but is more strict and when it sees malformed xml it causes ofxparse to think that the document is empty.

My first instinct was #113, which first tries to use 'lxml-xml' and if that fails falls back to 'lxml', because I know that the built in python html parser ('html.parser', this PR) has been problematic in the past.

But the current BeautifulSoup documentation says that for Python >2.7.3 or >3.2.2, 'html.parser' is a good, lenient parser albeit slower than lxml. I have confirmed that it handles CDATA correctly.

My recommendation is to scrap #113 and go with this PR.

Pros:

Simpler
We can drop lxml as a dependency, which is a significant upside as getting lxml installed is not always straightforward.

Cons:

Slower, but I don't think that is an issue since OFX documents tend to be relatively bounded in length.
There may be other minor changes in parsing behavior that introduces new issues, but the tests that we have all pass.

If no one objects within a few days, I'll merge this PR.

nathangrigg commented 8 years ago

The tests continue to pass even after I revert #64, which is a further vote for simplicity of this approach.

StefanRijnhart commented 7 years ago

Looking at all the related bugs, I think this warrants a new release?

jseutter / ofxparse

Change BeautifulSoup parser from xml to html.parser #108