manusimidt / py-xbrl

Python-based parser for parsing XBRL and iXBRL files
https://py-xbrl.readthedocs.io/en/latest/
GNU General Public License v3.0

xml parsing errors #38

Closed mrx23dot closed 3 years ago

mrx23dot commented 3 years ago

The library throws an exception when parsing some (new) iXBRL filings (list below). I'm not sure how the SEC tolerates these or what they store in their XML.

lxml==4.6.3 py-xbrl==2.0.2

inst = xbrlParser.parse_instance(url)

Traceback (most recent call last):
  File "parse_sec.py", line 391, in <module>
    resultDict = parse_xml(url, price)
  File "parse_sec.py", line 356, in parse_xml
    flatDict = _get_raw_data(url)
  File "parse_sec.py", line 61, in _get_raw_data
    inst = xbrlParser.parse_instance(url)
  File "C:\python36\lib\site-packages\xbrl\instance.py", line 626, in parse_instance
    return parse_ixbrl_url(url, self.cache)
  File "C:\python36\lib\site-packages\xbrl\instance.py", line 363, in parse_ixbrl_url
    return parse_ixbrl(instance_path, cache, instance_url)
  File "C:\python36\lib\site-packages\xbrl\instance.py", line 383, in parse_ixbrl
    root: ET = parse_file(instance_path)
  File "C:\python36\lib\site-packages\xbrl\helper\xml_parser.py", line 19, in parse_file
    for event, elem in ET.iterparse(path, events):
  File "C:\python36\lib\xml\etree\ElementTree.py", line 1221, in iterator
    yield from pullparser.read_events()
  File "C:\python36\lib\xml\etree\ElementTree.py", line 1296, in read_events
    raise event
  File "C:\python36\lib\xml\etree\ElementTree.py", line 1268, in feed
    self._parser.feed(data)
xml.etree.ElementTree.ParseError: mismatched tag: line 15, column 172

e.g. all of these:

https://www.sec.gov/Archives/edgar/data/0000017313/000001731321000075/cswc3312110-k.htm
https://www.sec.gov/Archives/edgar/data/0000027093/000149315221006145/form10-q.htm
https://www.sec.gov/Archives/edgar/data/0000704562/000168316821000810/avid_10q-013121.htm
https://www.sec.gov/Archives/edgar/data/0001009759/000155837021008246/cgrn-20210331x10k.htm
https://www.sec.gov/Archives/edgar/data/0001015383/000149315221003699/form10-q.htm
https://www.sec.gov/Archives/edgar/data/0001041368/000093905721000189/10k33121.htm
https://www.sec.gov/Archives/edgar/data/0001278752/000127875221000017/ainv2021q410-k.htm
https://www.sec.gov/Archives/edgar/data/0001304492/000130449221000018/atex-20210331x10k.htm
https://www.sec.gov/Archives/edgar/data/0001321741/000119312521157561/d409757d10k.htm
https://www.sec.gov/Archives/edgar/data/0001348911/000156459021012500/kalv-10q_20210131.htm
https://www.sec.gov/Archives/edgar/data/0001377936/000121390021024682/f10k2021_saratogainvest.htm
https://www.sec.gov/Archives/edgar/data/0001409375/000156459021031193/oesx-10k_20210331.htm
https://www.sec.gov/Archives/edgar/data/0001411685/000165495421001538/vtgn10q_dec312020.htm
https://www.sec.gov/Archives/edgar/data/0001491419/000121390021009670/f10q1220_livexlivemedia.htm
https://www.sec.gov/Archives/edgar/data/0001504678/000165495421006409/lp_10k.htm
https://www.sec.gov/Archives/edgar/data/0001532390/000106299321001520/form10q.htm
https://www.sec.gov/Archives/edgar/data/0001641631/000149315221014050/form10-k.htm
https://www.sec.gov/Archives/edgar/data/0001696558/000121390021033774/f10k2021_jerashhold.htm
https://www.sec.gov/Archives/edgar/data/0001721741/000149315221006447/form10-k.htm
https://www.sec.gov/Archives/edgar/data/0001756497/000119312521175345/d156422d10k.htm

manusimidt commented 3 years ago

Please note that py-xbrl only parses XBRL documents. The files you provided are just regular HTML files and do not follow the XBRL standard. You can check whether a file follows the iXBRL standard either manually, by looking at the index file of the submission, or programmatically, e.g. via the Structured Disclosure RSS Feeds.

The parser crashes because the SEC appends a file header to these documents, which the XML parsing library used by py-xbrl cannot process.

manusimidt commented 3 years ago

However, perhaps a better error message could be issued, indicating that you should check whether the given file is really an XBRL file 🤔.
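A minimal sketch of what such an error message could look like, written as a wrapper around any parse callable. `NotXbrlError` and `parse_with_hint` are hypothetical names for illustration and are not part of py-xbrl; the only assumption taken from the traceback above is that the failure surfaces as `xml.etree.ElementTree.ParseError`.

```python
import xml.etree.ElementTree as ET


class NotXbrlError(ValueError):
    """Hypothetical error: the input could not be parsed as XML/XBRL."""


def parse_with_hint(parse_fn, source):
    """Call parse_fn(source) and turn a raw ParseError into a clearer message.

    parse_fn is any callable that raises xml.etree.ElementTree.ParseError
    on malformed input (the traceback above shows this exception type).
    """
    try:
        return parse_fn(source)
    except ET.ParseError as exc:
        raise NotXbrlError(
            f"Could not parse {source!r} as XML ({exc}). "
            "Check whether the file is really an XBRL or iXBRL document."
        ) from exc


if __name__ == "__main__":
    # Stand-in parser for the demo: plain HTML with an unclosed <br> tag
    # is not well-formed XML, so ElementTree rejects it.
    try:
        parse_with_hint(ET.fromstring, "<html><body><p>text<br></p></body></html>")
    except NotXbrlError as err:
        print(err)
```

Chaining with `from exc` keeps the original line and column information available for debugging while the top-level message points the user at the likely cause.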

mrx23dot commented 3 years ago

The SEC told me that in the past it wasn't mandatory for small companies to include inline XBRL, but it will be from June 15, 2021. That means this lib won't cover every historical case, and an HTML grepper is also required.

Yeah, a simple error message saying that the file doesn't contain inline XBRL would be nice. Then we can fall back to an HTML grepper.
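One way to decide on the fallback before parsing is a cheap text check for the Inline XBRL 1.1 namespace URI. This is only a heuristic sketch: `has_inline_xbrl` and `choose_strategy` are hypothetical helpers, not py-xbrl functions, and a document using a different Inline XBRL namespace version would not be detected.

```python
# Namespace URI declared by Inline XBRL 1.1 documents.
IX_NAMESPACE = "http://www.xbrl.org/2013/inlineXBRL"


def has_inline_xbrl(markup: str) -> bool:
    """Heuristic: treat the document as iXBRL if it declares the ix namespace."""
    return IX_NAMESPACE in markup


def choose_strategy(markup: str) -> str:
    """Return a label for which (hypothetical) code path should handle the file."""
    return "ixbrl" if has_inline_xbrl(markup) else "html-fallback"


if __name__ == "__main__":
    plain = "<html><body><p>Annual report</p></body></html>"
    tagged = '<html xmlns:ix="http://www.xbrl.org/2013/inlineXBRL"><body/></html>'
    print(choose_strategy(plain))
    print(choose_strategy(tagged))
```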

manusimidt commented 3 years ago

Not exactly. Since 2009 it has been mandatory for every company (with assets over 10 million USD) to publish the 10-K and 10-Q in XBRL. In 2019 the SEC began to slowly transition from regular XBRL to inline XBRL (iXBRL).

So yes, there are currently some small companies that still don't file inline XBRL files (HTML), but they usually attach a separate XBRL instance document (XML) to their submission. This library can parse both XBRL and inline XBRL documents.

So instead of the original HTML filing document:
https://www.sec.gov/Archives/edgar/data/1641631/000149315221014050/form10-k.htm
use the XBRL instance document that was submitted with the filing:
https://www.sec.gov/Archives/edgar/data/1641631/000149315221014050/xair-20210331.xml
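To find which documents were submitted with a filing, it can help to jump from a document URL to the filing's index page. The sketch below derives that URL purely from the pattern visible in the links in this thread (an 18-digit accession folder, and an index file named with the dashed accession number); `filing_index_url` is a hypothetical helper, and the URL layout is an assumption rather than a documented guarantee.

```python
from urllib.parse import urlsplit, urlunsplit


def filing_index_url(document_url: str) -> str:
    """Derive the filing index URL from the URL of a document in that filing.

    Assumes a path of the form
    /Archives/edgar/data/<cik>/<18-digit accession number>/<file>.
    """
    parts = urlsplit(document_url)
    segments = parts.path.split("/")
    accession = segments[-2]
    if len(accession) != 18 or not accession.isdigit():
        raise ValueError(f"Unexpected accession folder in URL: {document_url!r}")
    # 000001731321000075 -> 0000017313-21-000075
    dashed = f"{accession[:10]}-{accession[10:12]}-{accession[12:]}"
    segments[-1] = f"{dashed}-index.htm"
    return urlunsplit((parts.scheme, parts.netloc, "/".join(segments), "", ""))


if __name__ == "__main__":
    print(filing_index_url(
        "https://www.sec.gov/Archives/edgar/data/0000017313/"
        "000001731321000075/cswc3312110-k.htm"
    ))
```

The index page lists every file in the submission, so a separate `.xml` instance document, if one exists, can be picked from there and passed to the parser instead of the HTML file.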

mrx23dot commented 3 years ago

What about this 10-K (2021-05-26)? https://www.sec.gov/Archives/edgar/data/0000017313/000001731321000075/0000017313-21-000075-index.htm

It doesn't have any XML among the filing's files, nor any inline XBRL, only plain HTML.

manusimidt commented 3 years ago

Correct, this submission does not contain any XBRL tagging. Maybe there is an exception rule for this company that they do not have to file in iXBRL. You can find the original legislation regarding which company has to file in XBRL in RIN 3235-AJ71 page 43 and following.