Closed dbrojas closed 2 months ago
I suspect this causes the parser to yield an empty record.
That's right. (And I have to admit, I didn't know what it should do in this case. Suggestions are welcome)
Is this issue known? Are there a lot of cases where the XML is broken?
My very rough impression is somewhere around 10 in total. That really is just an impression based on fixing errors while processing all the historical Companies House data. I never strictly counted it.
Will close this issue soon... but if there is some action to take/further questions to answer, please say.
Not sure if it is known, but I have come across several broken XML files in the 2014 and 2015 historical accounts.
From the records that `stream_read_xbrl_sync()` returns, I've found some records that seem to be empty. Except for the columns `run_code`, `company_id`, `date`, `file_type`, `taxonomy` and `zip_url`, all values are NULL. After investigating I've found that the source files pertaining to these records have broken XML.
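For anyone wanting to flag such records themselves, here is a minimal sketch. It assumes the column names and rows are already available as plain tuples; the helper `is_empty_record`, the two data columns and the sample rows are hypothetical and only for illustration, they are not part of the library.

```python
# Columns the report lists as still populated on otherwise-empty records.
METADATA_COLUMNS = {
    "run_code", "company_id", "date", "file_type", "taxonomy", "zip_url",
}


def is_empty_record(columns, row):
    """Return True when every non-metadata value in the row is None."""
    return all(
        value is None
        for name, value in zip(columns, row)
        if name not in METADATA_COLUMNS
    )


# Illustrative data only: the last two column names and all values are
# placeholders, not real output from the parser.
columns = (
    "run_code", "company_id", "date", "file_type", "taxonomy", "zip_url",
    "balance_sheet_date", "turnover",
)
rows = [
    ("run-a", "00728497", "2007-07-31", "xml", None, "zip-a", None, None),
    ("run-b", "00000001", "2014-01-31", "html", None, "zip-b", "2014-01-31", 1000),
]

empty_rows = [row for row in rows if is_empty_record(columns, row)]
empty_company_ids = [row[columns.index("company_id")] for row in empty_rows]
```

With the placeholder rows above, only the first row is flagged, because its two data columns are both `None`.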
One example would be for company_id = 00728497 on date 2007-07-31. Below is the code to reproduce:
The XML file from which this record is parsed does contain actual accounts information, but it looks like it has a lot of newlines and whitespace:
Compared to a healthy example:
I suspect this causes the parser to yield an empty record.
Is this issue known? Are there a lot of cases where the XML is broken?
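One possible way to get a rough count would be to test each source file for well-formedness. The sketch below uses Python's standard `xml.etree.ElementTree`; the helper name `is_well_formed` and the sample inputs are made up for illustration. It assumes "broken" means "not well-formed XML", which may not cover the whitespace-heavy file described above, since extra newlines and whitespace alone do not make XML ill-formed.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_bytes):
    """Return True if the bytes parse as XML, False on a parse error."""
    try:
        ET.fromstring(xml_bytes)
    except ET.ParseError:
        return False
    return True


# Made-up inputs: one whitespace-heavy but valid document, one with a
# mismatched closing tag, and one empty file.
whitespace_heavy = b"<a>\n\n\n   <b>1</b>\n\n\n</a>"
mismatched_tag = b"<a><b>1</a>"
empty_file = b""

results = {
    "whitespace_heavy": is_well_formed(whitespace_heavy),
    "mismatched_tag": is_well_formed(mismatched_tag),
    "empty_file": is_well_formed(empty_file),
}
```

A file that passes this check but still yields an empty record would point to a problem in how the values are extracted rather than in the XML syntax itself.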