manusimidt / py-xbrl

Python-based parser for parsing XBRL and iXBRL files
https://py-xbrl.readthedocs.io/en/latest/
GNU General Public License v3.0

Slow parsing on some filings #56

Closed mrx23dot closed 2 years ago

mrx23dot commented 2 years ago

MSFT filings parse very slowly, e.g. parsing just one of them takes 11 seconds at 100% CPU.

The iXBRL embedded in the HTML seems to be valid XML, so couldn't we just cut it out, parse it directly, and never use regexps? There are 2,120,074 regexp calls; it looks like every tag is searched this way. Downloading the same file and parsing it with bs4 takes only 4 seconds (3 s in lxml mode):

from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser')

python3 -m cProfile -s tottime xbrl_small_test.py > prof.txt

from xbrl.cache import HttpCache
from xbrl.instance import XbrlInstance, XbrlParser

dir = 'cache'
cache = HttpCache(dir)
# !Replace the dummy header with your information! SEC EDGAR requires you to disclose information about your bot! (https://www.sec.gov/privacy.htm#security)
cache.set_headers({'From': 'test@gmail.com', 'User-Agent': 'revenue extractor v1.0'})
cache.set_connection_params(delay=1000/9.9, retries=5, backoff_factor=0.8, logs=True)

url = 'https://www.sec.gov/Archives/edgar/data/0000789019/000156459021002316/msft-10q_20201231.htm'
# same as zip:  https://www.sec.gov/Archives/edgar/data/0000789019/000156459021002316/0001564590-21-002316-xbrl.zip

inst = XbrlParser(cache).parse_instance(url)

Profiling result

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  2120074    5.464    0.000    5.464    0.000 {method 'findall' of '_sre.SRE_Pattern' objects}  <-- slowest part 5.5seconds
  1060027    1.244    0.000    8.874    0.000 uri_helper.py:58(compare_uri)
  2120054    0.861    0.000    7.029    0.000 re.py:214(findall)
531164/2886    0.810    0.000    9.684    0.003 taxonomy.py:170(get_taxonomy)
  2120160    0.703    0.000    0.728    0.000 re.py:286(_compile)
  2160290    0.622    0.000    0.622    0.000 {method 'split' of 'str' objects}
       31    0.193    0.006    0.193    0.006 {method '_parse_whole' of 'xml.etree.ElementTree.XMLParser' objects}
        1    0.139    0.139    0.323    0.323 xml_parser.py:9(parse_file)
      316    0.136    0.000    0.136    0.000 {method 'feed' of 'xml.etree.ElementTree.XMLParser' objects}
     25/1    0.127    0.005    2.553    2.553 taxonomy.py:219(parse_taxonomy)

The call stack to get to the bottleneck:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   10.646   10.646 xbrl_small_test.py:2(<module>)  <-- entry
        1    0.000    0.000   10.318   10.318 instance.py:644(parse_instance)
        1    0.024    0.024   10.318   10.318 instance.py:351(parse_ixbrl_url)
        1    0.016    0.016   10.293   10.293 instance.py:366(parse_ixbrl)
531164/2886    0.799    0.000    9.478    0.003 taxonomy.py:170(get_taxonomy)
  1060027    1.215    0.000    8.679    0.000 uri_helper.py:58(compare_uri)
  2120054    0.847    0.000    6.893    0.000 re.py:214(findall)
  2120074    5.345    0.000    5.345    0.000 {method 'findall' of '_sre.SRE_Pattern' objects}  <-- slow part
mrx23dot commented 2 years ago

This is my suggestion; parsing time is 1.2 seconds!

# -*- coding: utf-8 -*-

import requests
requests.packages.urllib3.disable_warnings()

# pip install cchardet lxml beautifulsoup4 requests
# installing cchardet alone already speeds up BeautifulSoup
from bs4 import BeautifulSoup, SoupStrainer

url = 'https://www.sec.gov/Archives/edgar/data/0000789019/000156459021002316/msft-10q_20201231.htm'
resp = requests.get(url, verify=False, headers={"User-Agent":"Opera browser"})
print(resp.text.count('<xbrli:startDate>'))
#print(resp.text)

# requires a utf-8 declaration in the file header
# !only parse the parts of the document we need, for speed
target_tags = SoupStrainer('ix:header')
soup = BeautifulSoup(resp.text, 'lxml', parse_only=target_tags)

for i in soup.find_all('xbrli:startDate'.lower()):  # the lxml parser lowercases tag names
  print(i.text)
manusimidt commented 2 years ago

Currently py-xbrl uses ElementTree for parsing XML. At that time I deliberately decided against BeautifulSoup for two reasons:

However, I will take a look later this week at how you achieved this speed-up in parsing time.

manusimidt commented 2 years ago

Additionally, I do not really understand why you are only searching within the ix:header XML element. The entire document can contain facts (ix:nonFraction) at any level and on every line. Your code snippet simply ignores all facts outside the ix:header element.
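
A minimal sketch (not part of py-xbrl) that illustrates this, counting ix:nonFraction facts inside ix:header versus in the whole filing. It assumes the 2013 inline XBRL namespace; the User-Agent value is a placeholder:

# Count ix:nonFraction facts inside ix:header vs. in the whole document.
import requests
from lxml import etree

IX_NS = '{http://www.xbrl.org/2013/inlineXBRL}'
url = 'https://www.sec.gov/Archives/edgar/data/0000789019/000156459021002316/msft-10q_20201231.htm'
resp = requests.get(url, headers={'User-Agent': 'example example@example.com'})

root = etree.XML(resp.content)
header = root.find('.//{}header'.format(IX_NS))
in_header = len(header.findall('.//{}nonFraction'.format(IX_NS))) if header is not None else 0
in_document = len(root.findall('.//{}nonFraction'.format(IX_NS)))
print(in_header, in_document)  # nearly all facts sit outside ix:header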

mrx23dot commented 2 years ago

As far as I've read, lxml is the fastest C-based parser. BeautifulSoup is just an API around lxml (and other parsers).

There is also an ElementTree-style API for lxml: https://lxml.de/tutorial.html. Not sure whether it allows parsing only specific tags.
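
For reference, lxml's iterparse does accept a tag filter, so a streaming parse can skip everything else. A rough sketch; 'report.xhtml' stands for a locally cached filing and the namespace is assumed to be the 2013 inline XBRL one:

# Stream-parse only ix:nonFraction elements with lxml's iterparse.
from lxml import etree

IX_NONFRACTION = '{http://www.xbrl.org/2013/inlineXBRL}nonFraction'

for _event, elem in etree.iterparse('report.xhtml', events=('end',), tag=IX_NONFRACTION):
    print(elem.get('name'), elem.text)
    elem.clear()  # drop the element once processed to keep memory low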

Based on the profiling, the current ElementTree implementation spends most of its time in Python-based regexp calls; that's why it's so slow.

I think the speed-up also comes from not parsing the HTML body (which is huge), since as far as I know there is no iXBRL in it.

manusimidt commented 2 years ago

> I think the speed-up also comes from not parsing the HTML body (which is huge), since as far as I know there is no iXBRL in it.

That's wrong. The majority of the XBRL facts are in the body of the HTML document. The ix:hidden element only contains facts that should not be displayed in the HTML report visible to the normal user. In the case of SEC submissions the hidden facts are usually the ones tagged with the dei taxonomy; these facts contain meta information about the document itself.

All other financial XBRL facts (like those from the balance sheet) are scattered throughout the entire HTML document!

Additionally, you have to consider that py-xbrl not only parses the instance document, but also all taxonomy schemas and linkbases that the report depends on.

An example: If you give the parser the following Instance Document: https://www.sec.gov/Archives/edgar/data/0000789019/000156459021002316/msft-10q_20201231.htm

The parser will download and parse the following files:

https://www.sec.gov/Archives/edgar/data/0000789019/000156459021002316/msft-20201231.xsd
http://www.xbrl.org/2003/xbrl-instance-2003-12-31.xsd
http://www.xbrl.org/2003/xbrl-linkbase-2003-12-31.xsd
http://www.xbrl.org/2003/xl-2003-12-31.xsd
http://www.xbrl.org/2003/xlink-2003-12-31.xsd
http://www.xbrl.org/2005/xbrldt-2005.xsd
https://xbrl.sec.gov/country/2020/country-2020-01-31.xsd
http://www.xbrl.org/dtr/type/nonNumeric-2009-12-16.xsd
https://xbrl.sec.gov/currency/2020/currency-2020-01-31.xsd
https://xbrl.sec.gov/dei/2019/dei-2019-01-31.xsd
http://www.xbrl.org/dtr/type/numeric-2009-12-16.xsd
https://xbrl.sec.gov/exch/2020/exch-2020-01-31.xsd
http://www.xbrl.org/lrr/arcrole/factExplanatory-2009-12-16.xsd
http://www.xbrl.org/lrr/role/negated-2009-12-16.xsd
http://www.xbrl.org/lrr/role/net-2009-12-16.xsd
https://xbrl.sec.gov/naics/2017/naics-2017-01-31.xsd
https://xbrl.sec.gov/sic/2020/sic-2020-01-31.xsd
http://xbrl.fasb.org/srt/2020/elts/srt-2020-01-31.xsd
http://www.xbrl.org/2006/ref-2006-02-27.xsd
http://xbrl.fasb.org/srt/2020/elts/srt-types-2020-01-31.xsd
http://xbrl.fasb.org/srt/2020/elts/srt-roles-2020-01-31.xsd
https://xbrl.sec.gov/stpr/2018/stpr-2018-01-31.xsd
http://xbrl.fasb.org/us-gaap/2020/elts/us-gaap-2020-01-31.xsd
http://xbrl.fasb.org/us-gaap/2020/elts/us-types-2020-01-31.xsd
http://xbrl.fasb.org/us-gaap/2020/elts/us-roles-2020-01-31.xsd
http://xbrl.fasb.org/us-gaap/2020/elts/us-gaap-eedm-def-2020-01-31.xml
http://xbrl.fasb.org/srt/2020/elts/srt-eedm1-def-2020-01-31.xml
https://www.sec.gov/Archives/edgar/data/0000789019/000156459021002316/msft-20201231_cal.xml
https://www.sec.gov/Archives/edgar/data/0000789019/000156459021002316/msft-20201231_def.xml
https://www.sec.gov/Archives/edgar/data/0000789019/000156459021002316/msft-20201231_lab.xml
https://www.sec.gov/Archives/edgar/data/0000789019/000156459021002316/msft-20201231_pre.xml

Your code example above does not touch these taxonomy schemas and linkbases. This is also one reason why your code executes much faster.
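
To make this concrete, a rough sketch (not py-xbrl's actual implementation) of why one entry point pulls in dozens of files: every schema can import further schemas and reference linkbases, so discovery has to walk the whole tree. The User-Agent value is a placeholder:

# Recursively follow xsd:import and link:linkbaseRef references.
import requests
from lxml import etree
from urllib.parse import urljoin

XSD_IMPORT = '{http://www.w3.org/2001/XMLSchema}import'
LINKBASE_REF = '{http://www.xbrl.org/2003/linkbase}linkbaseRef'
XLINK_HREF = '{http://www.w3.org/1999/xlink}href'
HEADERS = {'User-Agent': 'example example@example.com'}

def discover(url, seen):
    """Collect every schema/linkbase URL reachable from the given entry point."""
    if url in seen:
        return
    seen.add(url)
    root = etree.XML(requests.get(url, headers=HEADERS).content)
    for elem in root.iter(XSD_IMPORT, LINKBASE_REF):
        href = elem.get('schemaLocation') or elem.get(XLINK_HREF)
        if href:
            discover(urljoin(url, href), seen)

found = set()
discover('https://www.sec.gov/Archives/edgar/data/0000789019/000156459021002316/msft-20201231.xsd', found)
print(len(found))  # dozens of schemas and linkbases for a single filing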

manusimidt commented 2 years ago

Here is a short explanation of taxonomies and linkbases: https://manusimidt.dev/2021-07/xbrl-explained

mrx23dot commented 2 years ago

Still very fast, and it should be fully compatible with the current code base:

from lxml import etree
import requests
requests.packages.urllib3.disable_warnings()

# download
url = 'https://www.sec.gov/Archives/edgar/data/0000789019/000156459021002316/msft-10q_20201231.htm'
resp = requests.get(url, verify=False, headers={"User-Agent":"Opera browser"})
print(resp.text.count('<xbrli:startDate>'))
file_in_bytearray = bytes(resp.text, encoding='utf-8')

# parse (etree.XML needs bytes here because the document declares its own encoding)
root = etree.XML(file_in_bytearray)
for i in root:
  print(i)
mrx23dot commented 2 years ago

I made some initial progress with integrating lxml, see branch https://github.com/mrx23dot/py-xbrl/tree/lxml

I got the namespace map and the etree root, but it fails at root.find('.//{}schemaRef'.format(LINK_NS)), which returns None.
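
For debugging, a quick way to see which tag/namespace the schemaRef element actually carries in the parsed tree (a sketch; root is assumed to be the lxml root element from the snippet above):

# Print every element whose tag ends in 'schemaRef', with its full Clark-notation tag.
for elem in root.iter():
    if isinstance(elem.tag, str) and elem.tag.endswith('schemaRef'):
        print(elem.tag, dict(elem.attrib))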

Another thing I have noticed with the non-optimized ElementTree version: RAM usage jumps up to 500-1000 MB while parsing.

mrx23dot commented 2 years ago

I have done the lxml integration. It turns out the XML parsing isn't the bottleneck :( It's the simple compare_uri(uri1: str, uri2: str) function.

Is there any way we could eliminate or reduce the number of calls to it? It's called half a million times for a single filing, each call runs 2 regexps, and 99.9% of the time it returns False. Yet it is only called with ~136 distinct argument values. The caller is get_taxonomy(url), which is a deep recursion.

Can't we replace the recursion with 136 flat calls?
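
One hedged illustration of the idea (not the change that was actually merged): since only ~136 distinct argument pairs ever occur, memoizing compare_uri makes the regexps run once per distinct pair. The snippet below only shows where the decorator would go inside uri_helper.py; the existing function body stays unchanged:

# Sketch: decorate the existing compare_uri definition in uri_helper.py.
from functools import lru_cache

@lru_cache(maxsize=None)
def compare_uri(uri1: str, uri2: str) -> bool:
    ...  # existing regexp-based comparison, unchanged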

mrx23dot commented 2 years ago

Done in https://github.com/manusimidt/py-xbrl/pull/70 and https://github.com/manusimidt/py-xbrl/pull/68

Final result: 11 seconds -> 0.907 seconds