jseutter / ofxparse

Ofx file format parser for Python
http://sites.google.com/site/ofxparse/
MIT License

XMLParsedAsHTMLWarning #170

Open kantskernel opened 2 years ago

kantskernel commented 2 years ago

I see the following warning when parsing an XML OFX file:

XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument features="xml" into the BeautifulSoup constructor. warnings.warn(

Going through previous decisions and code behavior, it seems intentional that the HTML parser is used for XML (e.g. here)

I think the warning shouldn't happen at all, rather than me going in and specifying XML in the constructor, but I might be misguided. Here is one more issue I saw related to this: https://github.com/EnergieID/entsoe-py/issues/180

My issue is not the same; I am actually using ofxparse in the context of beancount-reds-importers (FWIW)

Here's how my input file starts:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?OFX OFXHEADER="200" VERSION="220"

redstreet commented 1 year ago

Ditto. Curious if other users aren't hitting this?

thehilll commented 1 year ago

I see this too.

jseutter commented 1 year ago

So yes, it was intentional to parse the XML this way. I don't recall this warning message appearing in the past, so one of the dependencies (BeautifulSoup?) must have added it. I can take a look at silencing the warning, or if someone else happens to look at it first, I'd be happy to review the change.

When I wrote this library, parsing as XML was too strict and would fail, because SGML is a superset of XML. The HTML parser is more forgiving and just ignores the bits it doesn't understand.
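
To illustrate the strict-versus-forgiving point, here is a rough sketch using only the standard library. It uses `xml.etree.ElementTree` and `html.parser` as stand-ins, not BeautifulSoup or ofxparse's actual code, and the OFX fragment is made up for the example. It assumes SGML-style input where leaf elements have no closing tag.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Made-up SGML-style fragment: leaf elements such as <CODE> are never closed.
SGML_OFX = "<OFX><STATUS><CODE>0<SEVERITY>INFO</STATUS></OFX>"


class TagCollector(HTMLParser):
    """Hypothetical helper: records start tags and text in document order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names before calling this hook.
        self.events.append(("start", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def strict_xml_ok(text):
    """Return True if a strict XML parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# The forgiving parser walks the fragment without raising.
collector = TagCollector()
collector.feed(SGML_OFX)
collector.close()

print(strict_xml_ok(SGML_OFX))  # the strict parser rejects the unclosed tags
print(collector.events)
```

The strict parser stops at the first mismatched tag, while the HTML-style parser still yields every tag and value, which is the behavior the library relies on.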

redstreet commented 1 year ago

Thank you, @jseutter! I haven't looked at ofxparse, but this commit does exactly what is needed. I imagine you simply need to put it in the right file in ofxparse.
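
The linked commit isn't reproduced here, but one way to silence the warning might look like the sketch below. It assumes a BeautifulSoup 4 release that exports `XMLParsedAsHTMLWarning`; the `quiet_xml_as_html` helper name is made up for illustration, and the fallback class exists only so the snippet still imports where bs4 is missing or too old.

```python
import warnings
from contextlib import contextmanager

try:
    # Exported by recent BeautifulSoup 4 releases.
    from bs4 import XMLParsedAsHTMLWarning
except ImportError:
    # Assumption for this sketch only: a stand-in category so the helper
    # can still be defined when bs4 does not provide the real class.
    class XMLParsedAsHTMLWarning(UserWarning):
        """Stand-in used when bs4 does not export the real warning."""


@contextmanager
def quiet_xml_as_html():
    """Hypothetical helper: ignore XMLParsedAsHTMLWarning inside the block."""
    with warnings.catch_warnings():
        # The filter is scoped to this block; other warnings still surface,
        # and the previous filters are restored on exit.
        warnings.simplefilter("ignore", category=XMLParsedAsHTMLWarning)
        yield
```

As a caller-side workaround, the parse call can be wrapped, e.g. `with quiet_xml_as_html(): ofx = OfxParser.parse(fileobj)`. A fix inside ofxparse itself would presumably apply the same filter around its BeautifulSoup call instead.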