I spent a little bit of time looking into various issues that depend on the parser (issues #107, #76, #92, #64). Here is the full story:
At one point, we were using BeautifulSoup with the lxml HTML parser ('lxml'). That is good enough to parse OFX, but it drops everything inside CDATA sections. So #64 switched to the lxml XML parser ('xml', a.k.a. 'lxml-xml'), which handles CDATA correctly but is stricter: when it sees malformed XML, it causes ofxparse to think the document is empty.
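For reference, these are the three parsers under discussion, selected by their BeautifulSoup feature strings (illustrative snippet; `statement.ofx` is a hypothetical file name):

```python
from bs4 import BeautifulSoup

with open("statement.ofx") as f:      # hypothetical input file
    markup = f.read()

BeautifulSoup(markup, "lxml")         # lxml's lenient HTML parser
BeautifulSoup(markup, "lxml-xml")     # lxml's strict XML parser, a.k.a. "xml"
BeautifulSoup(markup, "html.parser")  # Python's built-in parser (this PR)
```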
My first instinct was #113, which tries 'lxml-xml' first and falls back to 'lxml' if that fails, because I know that the built-in Python HTML parser ('html.parser', this PR) has been problematic in the past.
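For concreteness, the #113 approach amounts to something like the sketch below. The helper name and the "no tags at all" check are my own shorthand, not the actual patch:

```python
from bs4 import BeautifulSoup

def parse_with_fallback(markup):
    # Try the strict lxml XML parser first.
    soup = BeautifulSoup(markup, "lxml-xml")
    if soup.find() is None:
        # Strict parsing yielded no tags at all (the "empty document"
        # symptom described above), so retry with the lenient HTML parser.
        soup = BeautifulSoup(markup, "lxml")
    return soup
```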
But the current BeautifulSoup documentation says that for Python 2.7.3+ or 3.2.2+, 'html.parser' is a good, lenient parser, albeit slower than lxml. I have confirmed that it handles CDATA correctly.
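A minimal check along these lines (a made-up fragment, not a real OFX document) shows 'html.parser' keeping the CDATA payload:

```python
from bs4 import BeautifulSoup

# Made-up OFX-style fragment; real statements are more involved.
snippet = "<MEMO><![CDATA[Payment to Example Corp]]></MEMO>"

soup = BeautifulSoup(snippet, "html.parser")
print(soup.find("memo").string)   # the CDATA payload survives the parse
```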
My recommendation is to scrap #113 and go with this PR.
Pros:
- 'html.parser' ships with Python, so there is no extra dependency to install.
- It is lenient about malformed markup and handles CDATA correctly.

Cons:
- It is slower than lxml.
- It is only a good parser on Python 2.7.3+ / 3.2.2+.
If no one objects within a few days, I'll merge this PR.
The tests continue to pass even after I revert #64, which is a further vote for the simplicity of this approach.
Looking at all the related bugs, I think this warrants a new release?
This pull request is related to issues #107 and #76.
Ran nosetests and all tests pass.