Parsing <TEXT> fails

mrx23dot commented 2 years ago

Parsing of https://www.sec.gov/Archives/edgar/data/0001634379/000156459020053234/mtcr-10q_20200930.htm

causes an exception in XbrlParser(cache).parse_instance(url), saying: not well-formed (invalid token): line 7, column 2. Most likely this also affects other filings from the same company.

SEC's response:

Please look at the contents of the link. You will see that like every other one of the millions of HTML documents on the EDGAR site, the first six lines are document metadata in SGML, that a browser ignores. They look like this:

<DOCUMENT>
<TYPE>10-Q
<SEQUENCE>1
<FILENAME>mtcr-10q_20200930.htm
<DESCRIPTION>10-Q
<TEXT>
Programs can start parsing after the <TEXT> line and also ignore the last two lines:
</TEXT>
</DOCUMENT>
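
A minimal sketch of the approach described in the SEC's response, assuming the raw file has already been downloaded into a string. The helper name strip_sgml_wrapper and the sample HTML body are hypothetical and not part of py-xbrl or any SEC tooling. Stripping the wrapper only removes the SGML envelope; it does not make the remaining content valid XML or iXBRL.

```python
import re

# Hypothetical helper: matches everything between the <TEXT> and </TEXT>
# lines of the EDGAR SGML wrapper, trimming the surrounding whitespace.
_TEXT_BLOCK = re.compile(r"<TEXT>\s*(.*?)\s*</TEXT>", re.DOTALL | re.IGNORECASE)


def strip_sgml_wrapper(raw: str) -> str:
    """Return the content between <TEXT> and </TEXT>.

    If no wrapper is found, the input is returned unchanged.
    """
    match = _TEXT_BLOCK.search(raw)
    return match.group(1) if match else raw


# Made-up sample that mirrors the metadata lines quoted above.
sample = """<DOCUMENT>
<TYPE>10-Q
<SEQUENCE>1
<FILENAME>mtcr-10q_20200930.htm
<DESCRIPTION>10-Q
<TEXT>
<html><body><p>Quarterly report</p></body></html>
</TEXT>
</DOCUMENT>
"""

body = strip_sgml_wrapper(sample)
print(body)
```

The result could then be written to a temporary file or passed to an HTML parser. Whether it can be parsed as XML depends entirely on the document itself.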

trace

  File "C:\python36\lib\site-packages\xbrl\instance.py", line 383, in parse_ixbrl
    root: ET = parse_file(instance_path)
  File "C:\python36\lib\site-packages\xbrl\helper\xml_parser.py", line 19, in parse_file
    for event, elem in ET.iterparse(path, events):
  File "C:\python36\lib\xml\etree\ElementTree.py", line 1221, in iterator
    yield from pullparser.read_events()
  File "C:\python36\lib\xml\etree\ElementTree.py", line 1296, in read_events
    raise event
  File "C:\python36\lib\xml\etree\ElementTree.py", line 1268, in feed
    self._parser.feed(data)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 7, column 2
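
The same kind of failure can be reproduced without py-xbrl, using only the standard library. This is a sketch with a made-up, shortened document; the helper is_well_formed_xml is hypothetical, and the exact error message and position depend on the actual file contents.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(text: str) -> bool:
    """Return True if the standard-library XML parser accepts the text."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# Shortened stand-in for an EDGAR file: the SGML metadata tags such as
# <TYPE> are never closed, so the document as a whole is not XML.
wrapped = (
    "<DOCUMENT>\n"
    "<TYPE>10-Q\n"
    "<TEXT>\n"
    "<html><body/></html>\n"
    "</TEXT>\n"
    "</DOCUMENT>\n"
)
plain = "<html><body/></html>"

print(is_well_formed_xml(wrapped))
print(is_well_formed_xml(plain))
```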

manusimidt commented 2 years ago

Hello, please make sure to only parse documents that follow the XBRL or iXBRL specification. The document https://www.sec.gov/Archives/edgar/data/0001634379/000156459020053234/mtcr-10q_20200930.htm is a normal HTML document without any XBRL stuff. Use the Instance Document of this submission for extracting data with py-xbrl.

mrx23dot commented 2 years ago

Ah, so this is just a different error indicating a non-iXBRL file.

I could add a pre-check in the lxml implementation that would filter this out. Can we say that every valid htm (non xml), must contain "ixbrl" lowercase text to be valid? Or it could print a warning to the console that the document is possibly invalid.
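
A rough sketch of what such a pre-check might look like. Instead of searching for the lowercase text "ixbrl", it looks for the Inline XBRL namespace URIs (versions 1.0 and 1.1), which an iXBRL document has to declare. The function name looks_like_ixbrl and the sample strings are hypothetical and not part of py-xbrl. This is only a heuristic: passing it does not mean the document complies with the iXBRL specification.

```python
import warnings

# Namespace URIs of Inline XBRL 1.0 and Inline XBRL 1.1.
IXBRL_NAMESPACES = (
    "http://www.xbrl.org/2008/inlineXBRL",
    "http://www.xbrl.org/2013/inlineXBRL",
)


def looks_like_ixbrl(text: str) -> bool:
    """Heuristic: True if the text mentions an Inline XBRL namespace URI."""
    return any(namespace in text for namespace in IXBRL_NAMESPACES)


# Made-up samples for illustration.
ixbrl_doc = (
    '<html xmlns="http://www.w3.org/1999/xhtml" '
    'xmlns:ix="http://www.xbrl.org/2013/inlineXBRL"><body/></html>'
)
plain_html = "<html><body><p>Quarterly report</p></body></html>"

for name, doc in (("ixbrl_doc", ixbrl_doc), ("plain_html", plain_html)):
    if not looks_like_ixbrl(doc):
        # Warn instead of failing, since the check is only a heuristic.
        warnings.warn(f"{name}: no Inline XBRL namespace found, possibly not an iXBRL file")
```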

manusimidt commented 2 years ago

I am not entirely sure what you mean by the following statement:

Can we say that every valid htm (non xml), must contain "ixbrl" lowercase text to be valid?

For an iXBRL instance document to be valid, it must comply with the iXBRL specification.

This includes many validation rules. See for example the validation rules for the ix:nonFraction elements: https://www.xbrl.org/specification/inlinexbrl-part1/rec-2013-11-18/inlinexbrl-part1-rec-2013-11-18.html#d1e5415

manusimidt commented 2 years ago

But yes, you are right that it would be nice if the parser could check whether a document contains valid XBRL tags.

manusimidt / py-xbrl

Parsing <TEXT> fails #62