Not well-formed (invalid token) error for iXBRL.

gety9 commented 2 years ago

Hi guys,

Great project! I learnt a lot about XBRL from the discussions and blog posts.

The remote XBRL example worked for me, but the remote iXBRL one didn't (I think that's because the URL provided is a regular HTM page, not iXBRL). Even with an iXBRL document it does not work, because it can't create the cache file.

I think the issue is the ?doc= in the path name. I tested on Windows and Termux (Linux on Android). In both cases the permissions are sufficient, and the XBRL example works.

Details:

1. Command executed

# inline XBRL

import logging
from xbrl.cache import HttpCache
from xbrl.instance import XbrlInstance, XbrlParser

logging.basicConfig(level=logging.INFO)
cache: HttpCache = HttpCache('./cache')

cache.set_headers({'From': 'your.name@company.com', 'User-Agent': 'Tool/Version (Website)'})
xbrlParser = XbrlParser(cache)

ixbrl_url = 'https://www.sec.gov/ix?doc=/Archives/edgar/data/1671933/000156459021006726/ttd-10k_20201231.htm'
inst: XbrlInstance = xbrlParser.parse_instance(ixbrl_url)

2. Windows error

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\Jimmu\AppData\Local\Programs\Python\Python38\lib\site-packages\xbrl\instance.py", line 653, in parse_instance
    return parse_ixbrl_url(url, self.cache)
  File "C:\Users\Jimmu\AppData\Local\Programs\Python\Python38\lib\site-packages\xbrl\instance.py", line 362, in parse_ixbrl_url
    instance_path: str = cache.cache_file(instance_url)
  File "C:\Users\Jimmu\AppData\Local\Programs\Python\Python38\lib\site-packages\xbrl\cache.py", line 83, in cache_file
    os.makedirs(file_dir_path)
  File "C:\Users\Jimmu\AppData\Local\Programs\Python\Python38\lib\os.py", line 213, in makedirs
    makedirs(head, exist_ok=exist_ok)
  File "C:\Users\Jimmu\AppData\Local\Programs\Python\Python38\lib\os.py", line 213, in makedirs
    makedirs(head, exist_ok=exist_ok)
  File "C:\Users\Jimmu\AppData\Local\Programs\Python\Python38\lib\os.py", line 213, in makedirs
    makedirs(head, exist_ok=exist_ok)
  [Previous line repeated 2 more times]
  File "C:\Users\Jimmu\AppData\Local\Programs\Python\Python38\lib\os.py", line 223, in makedirs
    mkdir(name, mode)
OSError: [WinError 123] The filename, directory name, or volume label syntax is incorrect: './cache/www.sec.gov/ix?doc='

The path was created up to ix?doc= (that is, only cache/www.sec.gov exists).

3. Termux error

PermissionError                           Traceback (most recent call last)
Input In [29], in <module>
     11 xbrlParser = XbrlParser(cache)
     13 ixbrl_url = 'https://www.sec.gov/ix?doc=/Archives/edgar/data/1671933/000156459021006726/ttd-10k_20201231.htm'
---> 14 inst: XbrlInstance = xbrlParser.parse_instance(ixbrl_url)
File /data/data/com.termux/files/usr/lib/python3.10/site-packages/xbrl/instance.py:653, in XbrlParser.parse_instance(self, url)
    651 if url.split('.')[-1] == 'xml' or url.split('.')[-1] == 'xbrl':
    652     return parse_xbrl_url(url, self.cache)
--> 653 return parse_ixbrl_url(url, self.cache)
File /data/data/com.termux/files/usr/lib/python3.10/site-packages/xbrl/instance.py:362, in parse_ixbrl_url(instance_url, cache)
    351 def parse_ixbrl_url(instance_url: str, cache: HttpCache) -> XbrlInstance:
    352     """
    353     Parses a inline XBRL (iXBRL) instance file.
    354     :param cache: HttpCache instance
   (...)
    360     :return:
    361     """
--> 362     instance_path: str = cache.cache_file(instance_url)
    363     return parse_ixbrl(instance_path, cache, instance_url)
File /data/data/com.termux/files/usr/lib/python3.10/site-packages/xbrl/cache.py:94, in HttpCache.cache_file(self, file_url)
     90     else:
     91         raise Exception(
     92             "Could not download file from {}. Error code: {}".format(file_url, query_response.status_code))
---> 94 with open(file_path, "wb+") as file:
     95     file.write(query_response.content)
     96     file.close()
PermissionError: [Errno 1] Operation not permitted: './cache/www.sec.gov/ix?doc=/Archives/edgar/data/1671933/000156459021006726/ttd-10k_20201231.htm'

The directories were created up to the ttd file (that is, creating the file itself failed).
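Both tracebacks point at the ? from the query string ending up in a file-system path. Windows does not allow ? in file names; the Termux failure may have the same cause if the cache directory sits on Android shared storage, but that is a guess. A minimal sketch of mapping a URL to a path without reserved characters follows. The helper safe_cache_path is hypothetical and is not part of py-xbrl.

```python
import re
from pathlib import PurePosixPath
from urllib.parse import urlsplit

# Characters that Windows does not allow in file or directory names.
_UNSAFE = re.compile(r'[<>:"|?*\\]')


def safe_cache_path(cache_dir: str, url: str) -> str:
    """Hypothetical helper: map a URL to a cache path free of reserved characters."""
    parts = urlsplit(url)
    remote = parts.netloc + parts.path
    if parts.query:
        # Keep the query (it may identify the document) but drop the '?'.
        remote += '_' + parts.query
    segments = [_UNSAFE.sub('_', seg) for seg in remote.split('/') if seg]
    return str(PurePosixPath(cache_dir, *segments))


viewer_url = ('https://www.sec.gov/ix?doc=/Archives/edgar/data/1671933/'
              '000156459021006726/ttd-10k_20201231.htm')
print(safe_cache_path('./cache', viewer_url))
```

This only avoids the path error. The file downloaded from the ix?doc= URL would still be the viewer page rather than the iXBRL document.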

gety9 commented 2 years ago

Update: apparently one can pass an iXBRL document either with the ix?doc= URL or without it (the clean URL). The problem is that neither of them works now. The clean URL was working previously: https://github.com/manusimidt/py-xbrl/issues/69#issuecomment-1035143840

mrx23dot commented 2 years ago

This format is an interactive view: https://www.sec.gov/ix?doc=/Archives/edgar/data/1671933/000156459021006726/ttd-10k_20201231.htm

You have to go to Menu > Open as HTML: https://www.sec.gov/Archives/edgar/data/1671933/000156459021006726/ttd-10k_20201231.htm

Not all filings have iXBRL content, so not all of them are parsable.
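Judging from the two links above, the direct document path is simply the value of the doc query parameter of the viewer URL. A small sketch of doing that conversion in code, assuming the viewer always lives at /ix and always uses the doc parameter (unwrap_ix_viewer_url is a hypothetical helper, not a py-xbrl function):

```python
from urllib.parse import parse_qs, urlsplit, urlunsplit


def unwrap_ix_viewer_url(url: str) -> str:
    """Hypothetical helper: turn an SEC viewer URL into the direct document URL."""
    parts = urlsplit(url)
    if parts.path != '/ix':
        return url  # not a viewer URL, leave it untouched
    doc = parse_qs(parts.query).get('doc')
    if not doc:
        return url  # no doc parameter, nothing to unwrap
    # Rebuild the URL with the doc value as the path and no query string.
    return urlunsplit((parts.scheme, parts.netloc, doc[0], '', ''))


viewer = ('https://www.sec.gov/ix?doc=/Archives/edgar/data/1671933/'
          '000156459021006726/ttd-10k_20201231.htm')
print(unwrap_ix_viewer_url(viewer))
```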

gety9 commented 2 years ago

@mrx23dot thanks for the reply. Yes, I am now trying to use the second link, but it's not working either.

According to issue 69, "clean" iXBRL links work for you. Could you please tell me what I am doing wrong?

I am running:

import logging
from xbrl.cache import HttpCache
from xbrl.instance import XbrlParser, XbrlInstance

logging.basicConfig(level=logging.INFO)
cache: HttpCache = HttpCache('./cache')

cache.set_headers({'From': 'myemail@mail.com', 'User-Agent': 'Tool/Version (Website)'})
xbrlParser = XbrlParser(cache)

ixbrl_url = 'https://www.sec.gov/Archives/edgar/data/1671933/000156459021006726/ttd-10k_20201231.htm'
inst: XbrlInstance = xbrlParser.parse_instance(ixbrl_url)

But all the ("clean") iXBRL links I tested result in the same "not well-formed" error; only the line and column numbers differ.

Traceback (most recent call last):

  File /data/data/com.termux/files/usr/lib/python3.10/site-packages/IPython/core/interactiveshell.py:3251 in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)

  Input In [2] in <module>
    inst: XbrlInstance = xbrlParser.parse_instance(ixbrl_url)

  File /data/data/com.termux/files/usr/lib/python3.10/site-packages/xbrl/instance.py:653 in parse_instance
    return parse_ixbrl_url(url, self.cache)

  File /data/data/com.termux/files/usr/lib/python3.10/site-packages/xbrl/instance.py:363 in parse_ixbrl_url
    return parse_ixbrl(instance_path, cache, instance_url)

  File /data/data/com.termux/files/usr/lib/python3.10/site-packages/xbrl/instance.py:383 in parse_ixbrl
    root: ET = parse_file(instance_path)

  File /data/data/com.termux/files/usr/lib/python3.10/site-packages/xbrl/helper/xml_parser.py:19 in parse_file
    for event, elem in ET.iterparse(path, events):

  File /data/data/com.termux/files/usr/lib/python3.10/xml/etree/ElementTree.py:1254 in iterator
    yield from pullparser.read_events()

  File /data/data/com.termux/files/usr/lib/python3.10/xml/etree/ElementTree.py:1329 in read_events
    raise event

  File /data/data/com.termux/files/usr/lib/python3.10/xml/etree/ElementTree.py:1301 in feed
    self._parser.feed(data)

File <string>
ParseError: not well-formed (invalid token): line 13, column 131

I tested on Windows and Termux (Linux) and got the same error. So either:

- I am doing something wrong,
- a package update introduced this error (because it worked for you), or
- the SEC made a change.

manusimidt commented 2 years ago

Hey, regarding the first issue, @mrx23dot is right. Be careful not to pass the path to the SEC's Inline XBRL viewer application. py-xbrl needs the direct path to the iXBRL file. To get this path programmatically, I would recommend looking at the monthly Structured Disclosure RSS Feeds.

In the second code example, you used the library exactly right. The issue is that, for some reason, the iXBRL file does not contain valid HTML, and thus the ElementTree parser (the XML parsing library py-xbrl uses) cannot parse the file.

After looking at the file I noticed that there is indeed a really strange script tag that has an attribute without a value. I suspect that ElementTree fails to parse the file because of the word "defer".

manusimidt commented 2 years ago

After doing two minutes of research I noticed that the defer attribute is indeed valid HTML; it controls how the external script is downloaded and executed [1].

However, I am not sure why ElementTree is then unable to parse the file 🤔.

[1] https://www.w3schools.com/tags/att_script_defer.asp

manusimidt commented 2 years ago

OK, I think it's now clear to me. While the script tag is valid HTML, it is not compatible with the XML specification: XML requires every attribute to have a value, so a bare defer is not well-formed. Since py-xbrl uses xml.etree.ElementTree, which is a pure XML parser, it fails when parsing this submission.
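The difference can be reproduced with a few lines. This is only an illustration on made-up markup (the file name viewer.js is a placeholder), not the actual content of the SEC filing:

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if ElementTree can parse the markup as XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# HTML style: boolean attribute without a value.
html_style = '<head><script defer src="viewer.js"></script></head>'
# XML style: the same attribute with an explicit value.
xml_style = '<head><script defer="defer" src="viewer.js"></script></head>'

print(is_well_formed(html_style))
print(is_well_formed(xml_style))
```

The first snippet should be rejected by the XML parser and the second accepted, which matches the behaviour seen with the filing.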

A quick and dirty fix would be to just eliminate all script tags before passing the raw file content to the XML parser. (This information is irrelevant for the XBRL parsing anyway...).
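A rough sketch of that quick fix, assuming a regular expression is good enough to find the script elements. The function strip_script_tags and the sample markup are made up for illustration and are not necessarily what py-xbrl ended up shipping:

```python
import re
import xml.etree.ElementTree as ET

# Matches <script ...>...</script> as well as self-closing <script .../>.
_SCRIPT_RE = re.compile(
    r'<script\b[^>]*?(?:/>|>.*?</script\s*>)',
    re.IGNORECASE | re.DOTALL,
)


def strip_script_tags(markup: str) -> str:
    """Remove all script elements so a strict XML parser can read the rest."""
    return _SCRIPT_RE.sub('', markup)


IX_NS = 'http://www.xbrl.org/2013/inlineXBRL'
sample = (
    '<html xmlns="http://www.w3.org/1999/xhtml" '
    f'xmlns:ix="{IX_NS}">'
    '<head><script defer src="viewer.js"></script></head>'
    '<body><ix:nonFraction name="us-gaap:Revenues">100</ix:nonFraction></body>'
    '</html>'
)

root = ET.fromstring(strip_script_tags(sample))
facts = root.findall(f'.//{{{IX_NS}}}nonFraction')
print([(fact.get('name'), fact.text) for fact in facts])
```

A regular expression is crude: a script body that itself contains the text </script> inside a string would confuse it, so this is a stopgap rather than a robust solution.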

A better solution would be to use a proper HTML parser for iXBRL. The reason why I wanted to use the same parsing library for XBRL and iXBRL was so that I could reuse a lot of code from both modules.
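To illustrate the HTML-parser route, here is a sketch using html.parser from the Python standard library, which tolerates valueless attributes such as defer. The class IxFactCollector is hypothetical, handles only non-nested ix:nonFraction and ix:nonNumeric elements, and ignores contexts and units entirely. Note that this parser lower-cases tag names.

```python
from html.parser import HTMLParser


class IxFactCollector(HTMLParser):
    """Hypothetical collector for simple, non-nested ix:non* elements."""

    def __init__(self):
        super().__init__()
        self.facts = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased, e.g. 'ix:nonfraction'.
        if self._current is None and tag.startswith('ix:non'):
            self._current = {'tag': tag, 'name': dict(attrs).get('name'), 'text': ''}

    def handle_data(self, data):
        if self._current is not None:
            self._current['text'] += data

    def handle_endtag(self, tag):
        if self._current is not None and tag == self._current['tag']:
            self.facts.append(self._current)
            self._current = None


sample = (
    '<html><head><script defer src="viewer.js"></script></head>'
    '<body><ix:nonFraction name="us-gaap:Revenues">100</ix:nonFraction></body>'
    '</html>'
)

collector = IxFactCollector()
collector.feed(sample)
collector.close()
print(collector.facts)
```

The trade-off mentioned above is visible here: an event-based HTML parser yields no element tree, so code written against ElementTree could not be reused as-is.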

Thank you for your issue, I will think about that.

mrx23dot commented 2 years ago

@gety9 Your best bet is to parse the XML instead of the HTML:

https://www.sec.gov/Archives/edgar/data/1671933/000156459021006726/ttd-10k_20201231_htm.xml

List the filings first: https://www.sec.gov/edgar/browse/?CIK=1671933&owner=exclude

Then open the filing: https://www.sec.gov/Archives/edgar/data/0001671933/000156459021006726/0001564590-21-006726-index.htm

Then parse the XML from here: https://www.sec.gov/Archives/edgar/data/1671933/000156459021006726/ttd-10k_20201231_htm.xml
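In this example the XML file name is the HTML file name with .htm replaced by _htm.xml. A small sketch of deriving one from the other follows. The naming pattern is an assumption based on this single filing, so the filing index page remains the reliable source. The helper guess_extracted_instance_url is hypothetical.

```python
def guess_extracted_instance_url(ixbrl_url: str) -> str:
    """Hypothetical helper: guess the extracted XBRL instance URL for an iXBRL .htm URL."""
    suffix = '.htm'
    if not ixbrl_url.endswith(suffix):
        raise ValueError('expected a URL ending in .htm: ' + ixbrl_url)
    # Assumed pattern: name.htm -> name_htm.xml
    return ixbrl_url[:-len(suffix)] + '_htm.xml'


htm_url = ('https://www.sec.gov/Archives/edgar/data/1671933/'
           '000156459021006726/ttd-10k_20201231.htm')
print(guess_extracted_instance_url(htm_url))
```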

manusimidt commented 2 years ago

This issue applies to far more filings than I initially anticipated. The script tag can be found in many newer submissions. I will try to roll out a fix by the end of next week.

gety9 commented 2 years ago

@manusimidt Manuel, that would be great, because now none of the iXBRL links seem to work (even the one from the README). It seems the SEC rolled out this tag for all iXBRL files.

gety9 commented 2 years ago

@mrx23dot Yes, XBRL works for all the SEC filings I tested. Still, I agree with Manuel that iXBRL is the preferred source, because that's the one the company files (the XBRL is generated by the SEC from the iXBRL).

manusimidt commented 2 years ago

I have found a solution to this problem. However, I am not completely satisfied with the solution. Originally I wanted to use lxml.html.clean.Cleaner. But for some reason, the cleaner also removed many XML tags.

Additionally, I noticed that some newer SEC submissions utilize the XII Transformation Registry 4. I developed the current transformation registry converter in py-xbrl only with the XII Transformation Registry 3 in mind and the newer transformation functions break the code, unfortunately.

Since implementing all transformation functions defined in XII Transformation Registry 4 would take far too long (the newer registry also allows for foreign-language conventions...), I will try to implement the most important transformations (#75) and ignore all others. After that, I will create a new version and publish it to PyPI.
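The "implement the important ones, ignore the rest" approach could look roughly like the sketch below. It is not py-xbrl's actual converter. The format names (numdotdecimal and numcommadecimal for registry 3, num-dot-decimal and num-comma-decimal for registry 4) are given as I understand the registries and should be checked against the specifications. The function apply_transform is hypothetical.

```python
from typing import Callable, Dict, Optional


def _num_dot_decimal(text: str) -> str:
    # '1,234.50' -> '1234.50': drop grouping commas and spaces.
    return text.replace(',', '').replace(' ', '')


def _num_comma_decimal(text: str) -> str:
    # '1.234,50' -> '1234.50': drop grouping dots, turn the comma into a dot.
    return text.replace('.', '').replace(' ', '').replace(',', '.')


# Registry 3 and registry 4 spellings map to the same implementation.
TRANSFORMS: Dict[str, Callable[[str], str]] = {
    'numdotdecimal': _num_dot_decimal,
    'num-dot-decimal': _num_dot_decimal,
    'numcommadecimal': _num_comma_decimal,
    'num-comma-decimal': _num_comma_decimal,
}


def apply_transform(format_name: str, text: str) -> Optional[str]:
    """Return the normalized value, or None if the format is not supported."""
    local_name = format_name.split(':')[-1]  # strip a prefix such as 'ixt:'
    transform = TRANSFORMS.get(local_name)
    if transform is None:
        return None  # unknown transformation: ignore instead of failing
    return transform(text.strip())


print(apply_transform('ixt:num-dot-decimal', '1,234.50'))
print(apply_transform('ixt:some-unsupported-format', 'n/a'))
```

Returning None for unknown formats lets the caller skip a single fact instead of aborting the whole filing.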

manusimidt / py-xbrl

Not well-formed (invalid token) error for iXBRL. #78