jseutter / ofxparse

Ofx file format parser for Python
http://sites.google.com/site/ofxparse/
MIT License

Can't handle ofx 2.11 documents with <?xml declarations #92

Closed: talwrii closed this issue 8 years ago

talwrii commented 8 years ago

The parser complains about the document being empty.

This was true as of commit 9d7b66e41ffb7857f94c4d691c67bf38ae03538f.

I suspect that this is BeautifulSoup being rubbish. My experiences with BeautifulSoup have in general not been very positive. Is there an equivalent of lxml.etree.HTML for buggy XML? (Sorry about the FUD :/ - I'm too lazy to back up my statements with facts.)

For reference, my BeautifulSoup version is 3.2.1.

Anyway, the following patch seemed to fix the issue, but I don't fully understand what's going on, and I've reached my quota of yak shaving for today...

--- a/ofxparse/ofxparse.py
+++ b/ofxparse/ofxparse.py
@@ -191,8 +191,12 @@ class OfxPreprocessedFile(OfxFile):
                 tag_name = re.findall(r'(?i)<([a-z0-9_\.]+)>', token)[0]
                 if tag_name.upper() not in closing_tags:
                     last_open_tag = tag_name
-            new_fh.write(token)
+
+            if not is_processing_tag:
+                new_fh.write(token)
+
         new_fh.seek(0)
+        # Without the is_processing_tag, this shows
+        # that the *full* document is fed into BeautifulSoup
+        print new_fh.getvalue()
         self.fh = new_fh
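
The idea behind the patch (keep `<?...?>` processing instructions out of what reaches BeautifulSoup) can be sketched in isolation. This is a minimal standalone illustration for Python 3, not ofxparse's actual preprocessing code: the helper name `strip_processing_instructions` and the shortened sample string are made up for the example, and it assumes a processing instruction never contains `?>` inside its body.

```python
import re

# Matches XML processing instructions such as <?xml ...?> and <?OFX ...?>,
# plus any whitespace that directly follows them.
# Hypothetical helper for illustration; not part of ofxparse.
_PI_RE = re.compile(r"<\?.*?\?>\s*", re.DOTALL)


def strip_processing_instructions(text):
    """Return text with every <?...?> processing instruction removed."""
    return _PI_RE.sub("", text)


# Shortened stand-in for an OFX 2.x document header and body.
sample = (
    '<?xml version="1.0" encoding="US-ASCII"?>\n'
    '<?OFX OFXHEADER="200" VERSION="200"?>\n'
    "<OFX><CODE>0</CODE></OFX>\n"
)

cleaned = strip_processing_instructions(sample)
print(cleaned)
```

Unlike the patch, this removes only the processing instructions themselves and leaves comments and element content untouched; whether that is sufficient for ofxparse is not something this sketch establishes.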

Here's a sanitized document that consistently exhibits the bug:

<?xml version="1.0" encoding="US-ASCII"?>
<?OFX OFXHEADER="200" VERSION="200" SECURITY="NONE" OLDFILEUID="NONE" NEWFILEUID="NONE"?>
<!-- Converted from: QIF -->
<!-- Date format was: DD/MM/YY -->
<OFX>
  <SIGNONMSGSRSV1>
    <SONRS>
      <STATUS>
        <CODE>0</CODE>
        <SEVERITY>INFO</SEVERITY>
        <MESSAGE>SUCCESS</MESSAGE>
      </STATUS>
      <DTSERVER>20151230</DTSERVER>
      <LANGUAGE>ENG</LANGUAGE>
      <FI>
        <ORG>UNKNOWN</ORG>
        <FID>UNKNOWN</FID>
      </FI>
    </SONRS>
  </SIGNONMSGSRSV1>
  <CREDITCARDMSGSRSV1>
    <CCSTMTTRNRS>
      <TRNUID>0</TRNUID>
      <STATUS>
        <CODE>0</CODE>
        <SEVERITY>INFO</SEVERITY>
        <MESSAGE>SUCCESS</MESSAGE>
      </STATUS>
      <CCSTMTRS>
        <CURDEF>USD</CURDEF>
        <CCACCTFROM>
          <ACCTID>UNKNOWN</ACCTID>
        </CCACCTFROM>
        <BANKTRANLIST>
          <DTSTART>20151203</DTSTART>
          <DTEND>20151230</DTEND>
          <STMTTRN>
            <TRNTYPE>DEBIT</TRNTYPE>
            <DTPOSTED>20151230</DTPOSTED>
            <TRNAMT>-3.45</TRNAMT>
            <FITID>UNKNOWN-CREDITCARD-20151230-3--3.45</FITID>
            <NAME>TESCO-STORES 2610</NAME>
          </STMTTRN>
        </BANKTRANLIST>
        <LEDGERBAL>
          <BALAMT>UNKNOWN</BALAMT>
          <DTASOF>20151230</DTASOF>
        </LEDGERBAL>
        <AVAILBAL>
          <BALAMT>UNKNOWN</BALAMT>
          <DTASOF>20151230</DTASOF>
        </AVAILBAL>
      </CCSTMTRS>
    </CCSTMTTRNRS>
  </CREDITCARDMSGSRSV1>
</OFX>
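
As a sanity check that a document of this shape is well-formed XML, a trimmed copy can be run through the standard library's parser. This is only a sketch under assumptions: it uses Python 3 and `xml.etree.ElementTree` rather than ofxparse or BeautifulSoup, and the shortened `doc` below stands in for the full file.

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for the sample document (same two header lines).
doc = b"""<?xml version="1.0" encoding="US-ASCII"?>
<?OFX OFXHEADER="200" VERSION="200" SECURITY="NONE" OLDFILEUID="NONE" NEWFILEUID="NONE"?>
<!-- Converted from: QIF -->
<OFX>
  <CREDITCARDMSGSRSV1>
    <CCSTMTTRNRS>
      <CCSTMTRS>
        <BANKTRANLIST>
          <STMTTRN>
            <TRNTYPE>DEBIT</TRNTYPE>
            <DTPOSTED>20151230</DTPOSTED>
            <TRNAMT>-3.45</TRNAMT>
            <NAME>TESCO-STORES 2610</NAME>
          </STMTTRN>
        </BANKTRANLIST>
      </CCSTMTRS>
    </CCSTMTTRNRS>
  </CREDITCARDMSGSRSV1>
</OFX>
"""

# The default parser skips the declaration, the <?OFX ...?> instruction,
# and the comment, and returns the <OFX> root element.
root = ET.fromstring(doc)

# Collect each transaction's child elements as a tag -> text mapping.
transactions = [
    {child.tag: child.text for child in stmttrn}
    for stmttrn in root.iter("STMTTRN")
]
print(root.tag, transactions)
```

If a check like this succeeds while ofxparse still reports an empty document, that would point at the preprocessing or BeautifulSoup stage rather than at the file itself.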

jaraco commented 8 years ago

I encountered a similar problem on BeautifulSoup 4.4.1 and reported it here, although as you can see, that issue seems to have resolved itself. There are two major differences, though: you're using Python 2 and BeautifulSoup 3, so the two issues are likely unrelated. Still, I thought it was worth mentioning.

jaraco commented 8 years ago

When I encountered the error, it was with TestSuncorpBankStatement in the test suite.

nathangrigg commented 8 years ago

Fixed by #108