Closed: mrx23dot closed this issue 2 years ago
Hello, please make sure to only parse documents that follow the XBRL or iXBRL specification.
The document https://www.sec.gov/Archives/edgar/data/0001634379/000156459020053234/mtcr-10q_20200930.htm is a normal HTML document without any XBRL stuff.
Use the Instance Document of this submission for extracting data with py-xbrl.
Ah, this is just a different error indicating a non-iXBRL file.
I could add a pre-check in the lxml implementation that would filter this out. Can we say that every valid htm (non xml), must contain "ixbrl" lowercase text to be valid? Or the parser could print a console warning that the file is possibly invalid.
I am not entirely sure what you mean by the following statement:
Can we say that every valid htm (non xml), must contain "ixbrl" lowercase text to be valid?
For an iXBRL instance document to be valid, it must comply with the iXBRL specification.
This includes many validation rules.
See, for example, the validation rules for the ix:nonFraction elements:
https://www.xbrl.org/specification/inlinexbrl-part1/rec-2013-11-18/inlinexbrl-part1-rec-2013-11-18.html#d1e5415
But yes, you are right that it would be nice if the parser could check whether a document contains valid XBRL tagging.
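A rough sketch of such a pre-check, in Python. It is only a heuristic and not a validation against the iXBRL specification: it looks for the namespace URIs that the Inline XBRL 1.0 and 1.1 specifications define. The function name looks_like_ixbrl is hypothetical and not part of py-xbrl. Note that these URIs spell the term "inlineXBRL", so a search for the lowercase text "ixbrl" would not necessarily match a valid document.

```python
# Namespace URIs defined by the Inline XBRL 1.0 and 1.1 specifications.
IXBRL_NAMESPACES = (
    "http://www.xbrl.org/2008/inlineXBRL",
    "http://www.xbrl.org/2013/inlineXBRL",
)


def looks_like_ixbrl(document_text: str) -> bool:
    """Hypothetical helper: True if the text mentions an Inline XBRL namespace.

    This is a cheap filter only. A document that passes may still violate
    the specification, so it cannot replace real validation.
    """
    return any(namespace in document_text for namespace in IXBRL_NAMESPACES)
```

A caller could log a warning or skip the file when this returns False, before handing the document to the XML parser.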
Parsing https://www.sec.gov/Archives/edgar/data/0001634379/000156459020053234/mtcr-10q_20200930.htm
raises an exception in XbrlParser(cache).parse_instance(url), saying: not well-formed (invalid token): line 7, column 2. Most likely this also affects other filings from the same company.
SEC's response:
Please look at the contents of the link. You will see that like every other one of the millions of HTML documents on the EDGAR site, the first six lines are document metadata in SGML, that a browser ignores. They look like this:
trace
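If leading metadata lines like these are what trips the XML parser, one possible workaround is to drop everything before the actual markup starts. The Python sketch below assumes the real document begins at the first XML declaration, DOCTYPE, or html start tag; the helper name strip_leading_metadata is hypothetical and not part of py-xbrl.

```python
import re

# Assumption: the real document starts at the first XML declaration,
# DOCTYPE declaration, or <html> start tag found in the text.
_MARKUP_START = re.compile(r"<\?xml\b|<!DOCTYPE\b|<html\b", re.IGNORECASE)


def strip_leading_metadata(raw_text: str) -> str:
    """Hypothetical helper: drop any text that precedes the markup start."""
    match = _MARKUP_START.search(raw_text)
    if match is None:
        # No recognizable start marker, so return the text unchanged.
        return raw_text
    return raw_text[match.start():]
```

This only handles text before the document. If metadata also follows the closing html tag, it would need separate handling.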