This is my suggestion, parsing time 1.2s!
# -*- coding: utf-8 -*-
# requires utf-8 in file header
# pip install cchardet lxml beautifulsoup4 requests
# installing cchardet alone already speeds up BeautifulSoup
import requests
from bs4 import BeautifulSoup, SoupStrainer

requests.packages.urllib3.disable_warnings()

url = 'https://www.sec.gov/Archives/edgar/data/0000789019/000156459021002316/msft-10q_20201231.htm'
resp = requests.get(url, verify=False, headers={"User-Agent": "Opera browser"})
print(resp.text.count('<xbrli:startDate>'))
#print(resp.text)

# only parse specific parts for speed
target_tags = SoupStrainer('ix:header')
soup = BeautifulSoup(resp.text, 'lxml', parse_only=target_tags)
# lxml lowercases all tag names, hence .lower()
for i in soup.find_all('xbrli:startDate'.lower()):
    print(i.text)
Currently py-xbrl uses ElementTree for parsing XML. At the time, I deliberately decided against BeautifulSoup for two reasons.
However, I will take a look at how you achieved the speed-up in parsing time later this week.
Additionally, I do not really understand why you are only searching for the ix:header XML element. The entire document can contain facts (ix:nonFraction) at any level and on every line of the document. Your code snippet simply ignores all facts outside of the ix:header element.
From what I read, lxml is the fastest C-based parser, and BeautifulSoup is just an API around lxml (and other parsers).
There is also an ElementTree-style API in lxml itself: https://lxml.de/tutorial.html. Not sure if it allows restricting the parse to specific tags.
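For what it's worth, lxml's etree.iterparse does accept a tag filter, so parsing events can be limited to specific elements. A minimal sketch, assuming the standard xbrli namespace URI; the inline XML is just illustrative:

from io import BytesIO
from lxml import etree

xml = b'''<root xmlns:xbrli="http://www.xbrl.org/2003/instance">
  <xbrli:startDate>2020-10-01</xbrli:startDate>
</root>'''

# iterparse only fires events for the given tag (Clark notation: {namespace-uri}localname)
for _, elem in etree.iterparse(BytesIO(xml),
                               tag='{http://www.xbrl.org/2003/instance}startDate'):
    print(elem.text)  # 2020-10-01
    elem.clear()      # release parsed elements to keep memory usage flat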
Based on profiling, the current ElementTree implementation relies on Python-based regexps internally; that's why it is so slow.
I think the speed also comes from not parsing the HTML body (which is huge), since as far as I know there is no iXBRL in it.
> I think the speed also comes from not parsing the HTML body (which is huge), since as far as I know there is no iXBRL in it.
That's wrong. The majority of the XBRL facts are in the body of the HTML document. The ix:hidden element only contains facts that should not be displayed in the HTML report visible to the normal user. In the case of SEC submissions, the hidden facts are usually the ones tagged with the dei taxonomy; these facts contain meta information about the document itself.
All other financial XBRL facts (like those from the balance sheet) are scattered across the entire HTML document!
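To make that concrete, here is a minimal sketch that searches the whole document for inline facts instead of only the ix:header. It reuses the resp variable from the snippet at the top of the thread; ix:nonfraction and ix:nonnumeric are the lowercased tag names lxml produces:

from bs4 import BeautifulSoup

# no SoupStrainer here: facts can sit at any level of the document
soup = BeautifulSoup(resp.text, 'lxml')
facts = soup.find_all(['ix:nonfraction', 'ix:nonnumeric'])
print(len(facts))  # far more facts than inside ix:header alone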
Additionally, you have to consider that py-xbrl not only parses the instance document, but also all taxonomy schemas and linkbases that the report depends on.
An example: If you give the parser the following Instance Document: https://www.sec.gov/Archives/edgar/data/0000789019/000156459021002316/msft-10q_20201231.htm
The parser will download and parse the following files:
https://www.sec.gov/Archives/edgar/data/0000789019/000156459021002316/msft-20201231.xsd
http://www.xbrl.org/2003/xbrl-instance-2003-12-31.xsd
http://www.xbrl.org/2003/xbrl-linkbase-2003-12-31.xsd
http://www.xbrl.org/2003/xl-2003-12-31.xsd
http://www.xbrl.org/2003/xlink-2003-12-31.xsd
http://www.xbrl.org/2005/xbrldt-2005.xsd
https://xbrl.sec.gov/country/2020/country-2020-01-31.xsd
http://www.xbrl.org/dtr/type/nonNumeric-2009-12-16.xsd
https://xbrl.sec.gov/currency/2020/currency-2020-01-31.xsd
https://xbrl.sec.gov/dei/2019/dei-2019-01-31.xsd
http://www.xbrl.org/dtr/type/numeric-2009-12-16.xsd
https://xbrl.sec.gov/exch/2020/exch-2020-01-31.xsd
http://www.xbrl.org/lrr/arcrole/factExplanatory-2009-12-16.xsd
http://www.xbrl.org/lrr/role/negated-2009-12-16.xsd
http://www.xbrl.org/lrr/role/net-2009-12-16.xsd
https://xbrl.sec.gov/naics/2017/naics-2017-01-31.xsd
https://xbrl.sec.gov/sic/2020/sic-2020-01-31.xsd
http://xbrl.fasb.org/srt/2020/elts/srt-2020-01-31.xsd
http://www.xbrl.org/2006/ref-2006-02-27.xsd
http://xbrl.fasb.org/srt/2020/elts/srt-types-2020-01-31.xsd
http://xbrl.fasb.org/srt/2020/elts/srt-roles-2020-01-31.xsd
https://xbrl.sec.gov/stpr/2018/stpr-2018-01-31.xsd
http://xbrl.fasb.org/us-gaap/2020/elts/us-gaap-2020-01-31.xsd
http://xbrl.fasb.org/us-gaap/2020/elts/us-types-2020-01-31.xsd
http://xbrl.fasb.org/us-gaap/2020/elts/us-roles-2020-01-31.xsd
http://xbrl.fasb.org/us-gaap/2020/elts/us-gaap-eedm-def-2020-01-31.xml
http://xbrl.fasb.org/srt/2020/elts/srt-eedm1-def-2020-01-31.xml
https://www.sec.gov/Archives/edgar/data/0000789019/000156459021002316/msft-20201231_cal.xml
https://www.sec.gov/Archives/edgar/data/0000789019/000156459021002316/msft-20201231_def.xml
https://www.sec.gov/Archives/edgar/data/0000789019/000156459021002316/msft-20201231_lab.xml
https://www.sec.gov/Archives/edgar/data/0000789019/000156459021002316/msft-20201231_pre.xml
Your code example above does not touch these taxonomy schemas and linkbases. This is also one reason why your code executes much faster.
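A rough sketch of why a single instance document fans out into all of those files. This is not py-xbrl's actual code, and resolve is a made-up name, but the mechanism is standard XML Schema: every schema can pull in further schemas via xsd:import:

import requests
from lxml import etree
from urllib.parse import urljoin

XSD_NS = 'http://www.w3.org/2001/XMLSchema'

def resolve(url, seen=None):
    # hypothetical recursive resolver, for illustration only
    seen = set() if seen is None else seen
    if url in seen:
        return seen
    seen.add(url)
    root = etree.fromstring(requests.get(url).content)
    # every <xsd:import schemaLocation="..."> references yet another schema
    for imp in root.iter('{%s}import' % XSD_NS):
        loc = imp.get('schemaLocation')
        if loc:
            resolve(urljoin(url, loc), seen)
    return seen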
Here is a short explanation of taxonomies and linkbases: https://manusimidt.dev/2021-07/xbrl-explained
Still very fast, and it should be fully compatible with the current code base:
from lxml import etree
import requests

requests.packages.urllib3.disable_warnings()

# download
url = 'https://www.sec.gov/Archives/edgar/data/0000789019/000156459021002316/msft-10q_20201231.htm'
resp = requests.get(url, verify=False, headers={"User-Agent": "Opera browser"})
print(resp.text.count('<xbrli:startDate>'))

# etree.XML wants bytes when the document carries an encoding declaration
file_in_bytearray = bytes(resp.text, encoding='utf-8')

# parse
root = etree.XML(file_in_bytearray)
for i in root:  # direct children only; use root.iter() to walk all descendants
    print(i)
I made some initial progress with integrating lxml, see branch https://github.com/mrx23dot/py-xbrl/tree/lxml
I got the namespace map and the etree root, but it fails: root.find('.//{}schemaRef'.format(LINK_NS)) returns None.
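In case it helps with debugging: './/{}schemaRef'.format(LINK_NS) only yields valid Clark notation if LINK_NS itself already contains the surrounding braces. A minimal sketch of the two lookups that are known to work with lxml, using the linkbase namespace as an assumed value:

from lxml import etree

root = etree.fromstring(
    b'<r xmlns:link="http://www.xbrl.org/2003/linkbase"><link:schemaRef/></r>')

# Clark notation: the namespace URI wrapped in braces, then the local name
print(root.find('.//{http://www.xbrl.org/2003/linkbase}schemaRef'))

# equivalent, via a prefix-to-URI map
print(root.find('.//link:schemaRef',
                namespaces={'link': 'http://www.xbrl.org/2003/linkbase'}))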
Another thing I have noticed with the non-optimized ElementTree: RAM usage jumps up to 500-1000 MB while parsing.
I have done the integration of lxml. Turns out it isn't the bottleneck :( It's the simple compare_uri(uri1: str, uri2: str) function.
Is there any way we could eliminate or reduce the number of calls to it? It's called half a million times for a single filing, each call runs two regexps, and 99.9% of the time it returns False. It is only ever called with ~136 distinct values, by get_taxonomy(url), which is one big recursion.
Can't we replace the recursion with 136 flat calls?
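One low-risk option, given that only ~136 distinct argument pairs ever occur, is to memoize the function so the half-million calls collapse into ~136 real computations. A minimal sketch; the body below is a stand-in, not py-xbrl's actual comparison logic:

import re
from functools import lru_cache

@lru_cache(maxsize=None)
def compare_uri(uri1: str, uri2: str) -> bool:
    # stand-in normalization; the real function runs two regexps per call
    def strip(u: str) -> str:
        return re.sub(r'^https?://', '', u).rstrip('/')
    return strip(uri1) == strip(uri2)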
Done in https://github.com/manusimidt/py-xbrl/pull/70 and https://github.com/manusimidt/py-xbrl/pull/68
Final result: 11 seconds -> 0.907 seconds
MSFT filings parse very slowly, e.g. parsing just one of them takes 11 seconds at 100% CPU.
iXBRL inside HTML seems to be valid XML; can't we just cut it out, parse it, and never use regexps? There are 2,120,074 regexp calls; it looks like every tag is searched this way. Downloading the same file and parsing it with bs4 takes only 4 s (3 s if lxml mode is used):

from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser')
python3 -m cProfile -s tottime xbrl_small_test.py > prof.txt
Profiling result
The call stack to get to the bottleneck: