uktrade / stream-read-xbrl

Python package to parse Companies House accounts data in a streaming way
https://stream-read-xbrl.docs.trade.gov.uk/
MIT License

Broken XML files result in empty record #168

Closed by dbrojas 2 months ago

dbrojas commented 5 months ago

From the records that stream_read_xbrl_sync() returns, I've found some records that seem to be empty. Except for the columns run_code, company_id, date, file_type, taxonomy and zip_url, all values are NULL.
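For anyone who wants to spot these records in their own run, here is a minimal sketch of a check. It assumes that `columns` is a sequence of column names aligned with each row tuple, and that the six columns named above are the only ones still populated on an affected record. The names `METADATA_COLUMNS` and `is_empty_record` are made up for illustration and are not part of stream_read_xbrl.

```python
# Columns reported in this issue as still populated on otherwise-empty records.
METADATA_COLUMNS = frozenset(
    {"run_code", "company_id", "date", "file_type", "taxonomy", "zip_url"}
)


def is_empty_record(columns, row):
    """Return True if every value outside the metadata columns is None.

    Hypothetical helper: assumes `columns` and `row` have the same order.
    """
    return all(
        value is None
        for name, value in zip(columns, row)
        if name not in METADATA_COLUMNS
    )


# Usage with made-up data; "some_value" stands in for any accounts column.
example_columns = ("run_code", "company_id", "file_type", "some_value", "zip_url")
empty_row = ("Prod224_8998", "00728497", "xml", None, "https://example.invalid/a.zip")
full_row = ("Prod224_8998", "00728572", "xml", 123, "https://example.invalid/a.zip")

print(is_empty_record(example_columns, empty_row))
print(is_empty_record(example_columns, full_row))
```

Applied to the real stream, the same function could be used to filter or count affected rows, for example `sum(is_empty_record(columns, row) for row in rows)`.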

After investigating I've found that the source files pertaining to these records have broken XML.

One example would be for company_id = 00728497 on date 2007-07-31. Below is the code to reproduce:

import itertools
import datetime
from stream_read_xbrl import stream_read_xbrl_sync

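if __name__ == '__main__':
    with stream_read_xbrl_sync(datetime.date(2007, 5, 31)) as (columns, date_range_and_rows):
        for ((start_date, end_date), rows) in date_range_and_rows:
            # Skip ahead to the row at index 1476 of the first date range: the affected record
            row_with_issue = next(itertools.islice(rows, 1476, None))
            break

    # Printed inside the guard so the name is always defined when this runs
    print(row_with_issue)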
# >>> ('Prod224_8998', '00728497', datetime.date(2007, 7, 31), 'xml', '', None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, 'https://download.companieshouse.gov.uk/archive/Accounts_Monthly_Data-JanuaryToDecember2008.zip')

The XML file from which this record is parsed does contain actual accounts information, but it looks like it has a lot of newlines and whitespace:

% head ~/Downloads/Accounts_Monthly_Data-JanuaryToDecember2008/Prod224_8998_00728497_20070731.xml
<?xml version="1.0" ?>
<?xml-stylesheet type="text/xsl" href="http://www.companieshouse.gov.uk/

ef/xbrl/uk/fr/gaap/ae/2008-04-06/stylesheet/CH-AE-stylesheet.xsl" ?>

<xbrl xmlns='http://www.xbrl.org/2003/instance' xmlns:xsi='http://www.w3

Compared to a healthy example:

% head -3 ~/Downloads/Accounts_Monthly_Data-JanuaryToDecember2008/Prod224_8998_00728572_20080331.xml
<?xml version="1.0"?>
<?xml-stylesheet href="http://www.companieshouse.gov.uk/ef/xbrl/uk/fr/gaap/ae/2008-04-06/stylesheet/CH-AE-dormant-stylesheet.xsl" type="text/xsl"?>
        <xbrl xmlns="http://www.xbrl.org/2003/instance" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:ae="http://www.companieshouse.gov.uk/ef/xbrl/uk/fr/gaap/ae/2008-04-06" xmlns:gc="http://www.xbrl.org/uk/fr/gcd/2004-12-01" xmlns:html="http://www.w3.org/1999/xhtml" xmlns:iso4217="http://www.xbrl.org/2003/iso4217" xmlns:link="http://www.xbrl.org/2003/linkbase" xmlns:pt="http://www.xbrl.org/uk/fr/gaap/pt/2004-12-01" xmlns:xbrli="http://www.xbrl.org/2003/instance" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xdp="http://ns.adobe.com/xdp/" xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/">

I suspect this causes the parser to yield an empty record.
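One rough way to test this suspicion on an extracted file is to check whether it parses at all. The sketch below uses the standard library's `xml.etree.ElementTree`, which is an assumption on my part and not necessarily the parser stream_read_xbrl uses, so a file that passes here could still produce an empty record. `is_well_formed` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_bytes):
    """Return True if the bytes parse as a well-formed XML document.

    This only checks well-formedness. It does not validate XBRL content,
    and it does not detect values corrupted by stray line breaks.
    """
    try:
        ET.fromstring(xml_bytes)
    except ET.ParseError:
        return False
    return True


# Usage with small made-up documents rather than the real files.
good = b'<?xml version="1.0"?><xbrl><value>1</value></xbrl>'
bad = b'<?xml version="1.0"?><xbrl><value>1</xbrl>'  # mismatched closing tag

print(is_well_formed(good))
print(is_well_formed(bad))
```

For a file on disk, the bytes can be read with `pathlib.Path(path).read_bytes()` and passed in directly.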

Is this issue known? Are there a lot of cases where the XML is broken?

michalc commented 4 months ago

> I suspect this causes the parser to yield an empty record.

That's right. (And I have to admit, I didn't know what it should do in this case. Suggestions are welcome.)

> Is this issue known? Are there a lot of cases where the XML is broken?

My very rough impression is somewhere around 10 in total. That really is just an impression based on fixing errors while processing all the historical Companies House data. I never strictly counted them.

michalc commented 3 months ago

Will close this issue soon... but if there is some action to take/further questions to answer, please say.

ygalanak commented 1 month ago

Not sure if it is known, but I have come across several broken XML files in the 2014 and 2015 historical accounts.