Closed gety9 closed 2 years ago
Update: apparently one can pass iXBRL either with the ix?doc= URL or with the clean URL (the same URL without that prefix). The problem is that neither of them works now. The clean URL was working previously: https://github.com/manusimidt/py-xbrl/issues/69#issuecomment-1035143840
This format is an interactive view: https://www.sec.gov/ix?doc=/Archives/edgar/data/1671933/000156459021006726/ttd-10k_20201231.htm
You have to go to Menu / Open as HTML: https://www.sec.gov/Archives/edgar/data/1671933/000156459021006726/ttd-10k_20201231.htm
Not all filings have iXBRL content, so not all of them are parsable.
@mrx23dot thanks for the reply. Yes, I am now trying to use the second link, but it is not working either.
According to issue 69, "clean" iXBRL links work for you. Could you please tell me what I am doing wrong?
I am running:
import logging
from xbrl.cache import HttpCache
from xbrl.instance import XbrlParser, XbrlInstance
logging.basicConfig(level=logging.INFO)
cache: HttpCache = HttpCache('./cache')
cache.set_headers({'From': 'myemail@mail.com', 'User-Agent': 'Tool/Version (Website)'})
xbrlParser = XbrlParser(cache)
ixbrl_url = 'https://www.sec.gov/Archives/edgar/data/1671933/000156459021006726/ttd-10k_20201231.htm'
inst: XbrlInstance = xbrlParser.parse_instance(ixbrl_url)
But all the "clean" iXBRL links I tested result in the same "not well-formed" error; only the line and column numbers differ.
Traceback (most recent call last):
File /data/data/com.termux/files/usr/lib/python3.10/site-packages/IPython/core/interactiveshell.py:3251 in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
Input In [2] in <module>
inst: XbrlInstance = xbrlParser.parse_instance(ixbrl_url)
File /data/data/com.termux/files/usr/lib/python3.10/site-packages/xbrl/instance.py:653 in parse_instance
return parse_ixbrl_url(url, self.cache)
File /data/data/com.termux/files/usr/lib/python3.10/site-packages/xbrl/instance.py:363 in parse_ixbrl_url
return parse_ixbrl(instance_path, cache, instance_url)
File /data/data/com.termux/files/usr/lib/python3.10/site-packages/xbrl/instance.py:383 in parse_ixbrl
root: ET = parse_file(instance_path)
File /data/data/com.termux/files/usr/lib/python3.10/site-packages/xbrl/helper/xml_parser.py:19 in parse_file
for event, elem in ET.iterparse(path, events):
File /data/data/com.termux/files/usr/lib/python3.10/xml/etree/ElementTree.py:1254 in iterator
yield from pullparser.read_events()
File /data/data/com.termux/files/usr/lib/python3.10/xml/etree/ElementTree.py:1329 in read_events
raise event
File /data/data/com.termux/files/usr/lib/python3.10/xml/etree/ElementTree.py:1301 in feed
self._parser.feed(data)
File <string>
ParseError: not well-formed (invalid token): line 13, column 131
I tested on Windows and Termux (Linux), same error. So it's either:
- I am doing something wrong
- a package update introduced this error (because it worked for you)
- the SEC made a change
Hey, regarding the first issue @mrx23dot is right. Be careful that you do not insert the path to the inline xbrl viewer application of the SEC. py-xbrl needs the direct path to the ixbrl file. In order to get this path programmatically, I would recommend looking at the monthly Structured Disclosure RSS Feeds.
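For links that were copied from the viewer, the relationship between the two URL forms quoted earlier in this thread can be sketched as follows. This is a minimal sketch, not part of py-xbrl: the helper name direct_ixbrl_url is hypothetical, and it assumes the viewer always carries the document path in the doc query parameter, as it does in the example above.

```python
from urllib.parse import urlsplit, parse_qs


def direct_ixbrl_url(url: str) -> str:
    """Return the direct document URL for an SEC inline-viewer link.

    Hypothetical helper: assumes viewer links have the form
    https://www.sec.gov/ix?doc=/Archives/... as seen in this thread.
    """
    parts = urlsplit(url)
    if parts.path.rstrip('/') == '/ix':
        doc = parse_qs(parts.query).get('doc')
        if doc:
            # Re-attach the document path to the original scheme and host.
            return f'{parts.scheme}://{parts.netloc}{doc[0]}'
    return url  # not a viewer link, return unchanged


viewer = ('https://www.sec.gov/ix?doc=/Archives/edgar/data/1671933/'
          '000156459021006726/ttd-10k_20201231.htm')
direct = direct_ixbrl_url(viewer)
```

Passing an already direct URL returns it unchanged, so the helper can be applied unconditionally before calling parse_instance.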
In the second code example, you used the library exactly right. The issue is that, for some reason, the iXBRL file does not contain valid HTML, and thus the ElementTree parser (the XML parsing library py-xbrl uses) cannot parse the file.
After looking at the file I noticed that there is indeed a really strange script tag that has an attribute without a value. I suspect that ElementTree fails to parse the file because of the word "defer".
After doing two minutes of research I noticed that the word defer is indeed valid HTML and is used to manipulate the downloading behavior of the external script [1].
However, I am not sure why Element tree is not able to parse the file then🤔.
Ok, I think it's now clear to me.
While the script tag is valid HTML, it is not compatible with the XML specification: XML requires every attribute to have a value, so a bare "defer" is rejected.
Since py-xbrl uses xml.etree.ElementTree, which is a pure XML parser, it fails when parsing this submission.
A quick and dirty fix would be to just eliminate all script tags before passing the raw file content to the XML parser. (This information is irrelevant for the XBRL parsing anyway...).
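The quick fix described above could look roughly like this. It is only a sketch under stated assumptions: the sample markup is invented for illustration (it is not the actual SEC file), strip_scripts is a hypothetical helper, and this is not necessarily how py-xbrl ended up solving the problem. A regex is used here on purpose because the input cannot be parsed as XML until the offending tag is gone.

```python
import re
import xml.etree.ElementTree as ET

# Matches <script ...>...</script> as well as self-closed <script ... />.
SCRIPT_RE = re.compile(r'<script\b[^>]*?(?:/>|>.*?</script\s*>)',
                       re.IGNORECASE | re.DOTALL)


def strip_scripts(markup: str) -> str:
    """Remove all script elements from the raw document text."""
    return SCRIPT_RE.sub('', markup)


# Invented sample: the value-less "defer" attribute is valid HTML
# but makes the document ill-formed XML.
sample = ('<html xmlns="http://www.w3.org/1999/xhtml"><head>'
          '<script defer src="viewer.js"></script></head>'
          '<body><p>42</p></body></html>')

# Parsing the raw sample raises ET.ParseError; the cleaned text parses.
root = ET.fromstring(strip_scripts(sample))
```

A regex will not cope with every possible script body (for example one that contains the literal text of a closing script tag inside a string), which is one reason a real HTML parser is the cleaner long-term option.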
A better solution would be to use a proper HTML parser for iXBRL. The reason why I wanted to use the same parsing library for XBRL and iXBRL was so that I could reuse a lot of code from both modules.
Thank you for your issue, I will think about that.
@gety9 Your better bet is to parse the XML instead of the HTML:
https://www.sec.gov/Archives/edgar/data/1671933/000156459021006726/ttd-10k_20201231_htm.xml
List the filings first: https://www.sec.gov/edgar/browse/?CIK=1671933&owner=exclude
Then open the filing: https://www.sec.gov/Archives/edgar/data/0001671933/000156459021006726/0001564590-21-006726-index.htm
Then parse the XML from here: https://www.sec.gov/Archives/edgar/data/1671933/000156459021006726/ttd-10k_20201231_htm.xml
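In the one filing linked in this thread, the extracted XML instance has the same name as the .htm document with ".htm" replaced by "_htm.xml". Assuming that pattern holds (it is an observation from this single example, not a documented guarantee, and the filing index page remains the reliable source), the XML URL could be derived like this. The helper name is hypothetical.

```python
def extracted_instance_url(htm_url: str) -> str:
    """Guess the extracted XBRL instance URL for an iXBRL .htm document.

    Assumption: the naming pattern 'name.htm' -> 'name_htm.xml' seen in
    this thread's example; verify against the filing index page.
    """
    suffix = '.htm'
    if not htm_url.endswith(suffix):
        raise ValueError(f'expected a URL ending in {suffix}: {htm_url}')
    return htm_url[:-len(suffix)] + '_htm.xml'


htm = ('https://www.sec.gov/Archives/edgar/data/1671933/'
       '000156459021006726/ttd-10k_20201231.htm')
xml_url = extracted_instance_url(htm)
```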
This issue applies to far more filings than I initially anticipated. The script tag can be found in many newer submissions. I will try to roll out a fix by the end of next week.
@manusimidt Manuel, that would be great, because right now none of the iXBRL links seem to work (even the one from the README). It seems the SEC rolled out this tag for all iXBRL files.
Yes, XBRL works for all the SEC filings I tested. Still, I agree with Manuel that iXBRL is the preferred source, because that is what the company actually files (the XBRL is generated by the SEC from the iXBRL).
I have found a solution to this problem. However, I am not completely satisfied with the solution. Originally I wanted to use lxml.html.clean.Cleaner. But for some reason, the cleaner also removed many XML tags.
Additionally, I noticed that some newer SEC submissions utilize the XII Transformation Registry 4. I developed the current transformation registry converter in py-xbrl only with the XII Transformation Registry 3 in mind, and the newer transformation functions break the code, unfortunately.
Since implementing all transformation functions defined in XII Transformation Registry 4 would take way too long (the newer registry also allows for foreign language conventions...) I will try to implement the most important transformations (#75) and just ignore all others. After that, I will create a new version and publish it to pypi.
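The "implement the important ones, ignore the rest" approach can be pictured as a lookup table with a pass-through fallback. The sketch below is illustrative only and is not py-xbrl's actual converter: the function and table names are hypothetical, and the two format names are meant to follow the Transformation Registry 4 naming style, so check them against the registry before relying on them.

```python
from typing import Callable, Dict, Optional


def _num_dot_decimal(raw: str) -> str:
    # "1,234.50" -> "1234.50": drop grouping characters, keep the decimal point.
    return raw.replace(',', '').replace(' ', '')


def _fixed_zero(raw: str) -> str:
    # Whatever placeholder text is displayed (e.g. "-"), the value is zero.
    return '0'


# Only the transformations considered important are registered.
TRANSFORMS: Dict[str, Callable[[str], str]] = {
    'num-dot-decimal': _num_dot_decimal,
    'fixed-zero': _fixed_zero,
}


def apply_transform(format_name: Optional[str], raw: str) -> str:
    """Apply a known transformation, or return the raw text unchanged."""
    if format_name is None:
        return raw
    local_name = format_name.split(':')[-1]  # strip a prefix such as "ixt:"
    transform = TRANSFORMS.get(local_name)
    if transform is None:
        return raw  # unsupported transformation: ignore instead of failing
    return transform(raw)
```

With this shape, an unknown format in a newer filing degrades to the untransformed text instead of raising, which is the trade-off described above.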
Guy, hi
Great project! Learnt a lot about xbrl from discussions and blog posts.
The remote XBRL example worked for me; the remote iXBRL one didn't (I think because the URL provided is a regular .htm page, not iXBRL). But even with an iXBRL document it does not work, because it can't create the cache file.
I think the issue is the ?= in the path name. I tested on Windows and Termux (Linux on Android). In both cases there are sufficient permissions, and the XBRL example works.
Details:
1. Command executed
2. Windows error: the path was created up to ix?doc= (basically cache/www.sec.gov)
3. Termux error: the path was created up to ttd (basically, file creation failed)
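The Windows symptom fits the fact that "?" is not allowed in Windows file names. The sketch below shows one way a URL could be mapped to a cache path that avoids such characters. It is not py-xbrl's HttpCache logic: cache_path_for is a hypothetical helper, and it only addresses the Windows case, not necessarily the Termux failure.

```python
import re
from pathlib import PurePosixPath
from urllib.parse import urlsplit

# Characters that Windows rejects in file and directory names.
_UNSAFE = re.compile(r'[<>:"|?*\\]')


def cache_path_for(url: str, cache_dir: str = './cache') -> str:
    """Map a URL to a relative cache path with no unsafe characters."""
    parts = urlsplit(url)
    remote = parts.path.lstrip('/')
    if parts.query:
        remote += '?' + parts.query
    # Sanitize each path segment separately so "/" keeps acting as separator.
    segments = [_UNSAFE.sub('_', seg) for seg in remote.split('/')]
    host = _UNSAFE.sub('_', parts.netloc)
    return str(PurePosixPath(cache_dir) / host / '/'.join(segments))


path = cache_path_for('https://www.sec.gov/ix?doc=/Archives/edgar/data/'
                      '1671933/000156459021006726/ttd-10k_20201231.htm')
```

In practice the simpler route is the one suggested in the replies: pass the direct document URL, which contains no query string at all.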