manusimidt / py-xbrl

Python-based parser for parsing XBRL and iXBRL files
https://py-xbrl.readthedocs.io/en/latest/
GNU General Public License v3.0

integrate lxml as parser for speed #68

Closed mrx23dot closed 3 years ago

mrx23dot commented 3 years ago

Produces the same parsed output, just a lot faster. Example runtime decrease: 3.004 seconds -> 1.295 seconds. Works on XML and iXBRL.

manusimidt commented 3 years ago

Could you provide the submissions you used for your test (3.004 seconds -> 1.295 seconds)?

It is still debatable whether the increased speed is really worth adding an external library for. As I mentioned before, I avoided lxml on purpose, because I wanted to build py-xbrl without third-party dependencies.

While lxml is prepackaged on some Linux systems, it can be a real pain to install on Windows. lxml requires libxml2 and libxslt. Personally, I had a lot of problems installing these libraries, because the installation process via PyPI always crashed. In the end I had to install lxml, libxml2 and libxslt manually via the unofficial binaries.

mrx23dot commented 3 years ago

I installed lxml from pip on Win7, Win10 and AWS Debian without any problem. I think they ship precompiled binaries for the x64/x86 architectures.

pip install lxml
Collecting lxml
  Using cached lxml-4.6.3-cp37-cp37m-win_amd64.whl (3.5 MB)
Installing collected packages: lxml
Successfully installed lxml-4.6.3
pip3 install lxml
Collecting lxml
  Downloading https://files.pythonhosted.org/packages/cf/4d/6537313bf58fe22b508f08cf3eb86b29b6f9edf68e00454224539421073b/lxml-4.6.3-cp37-cp37m-manylinux1_x86_64.whl (5.5MB)
Installing collected packages: lxml
Successfully installed lxml-4.6.3
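For anyone who wants the speed of lxml without making it a hard requirement, one common pattern is an optional import with a fallback to the standard library. The snippet below is only a minimal sketch of that idea, not code from py-xbrl or from my branch: the sample document, the `http://example.com/ns` namespace and the `parse_facts` helper are made up for illustration.

```python
# Prefer lxml when it is installed; otherwise fall back to the standard library.
try:
    from lxml import etree as ET  # optional third-party dependency
    PARSER_BACKEND = "lxml"
except ImportError:
    import xml.etree.ElementTree as ET  # always available
    PARSER_BACKEND = "xml.etree.ElementTree"

# Tiny made-up document, only for illustration (not a real XBRL instance).
SAMPLE = b"""<?xml version="1.0"?>
<root xmlns:ex="http://example.com/ns">
  <ex:fact name="Assets">100</ex:fact>
  <ex:fact name="Liabilities">40</ex:fact>
</root>"""


def parse_facts(xml_bytes):
    """Return {name: value} for every ex:fact element in the document."""
    root = ET.fromstring(xml_bytes)
    facts = {}
    # Clark notation ({namespace}tag) is understood by both backends.
    for elem in root.iter("{http://example.com/ns}fact"):
        facts[elem.get("name")] = int(elem.text)
    return facts


if __name__ == "__main__":
    print(PARSER_BACKEND, parse_facts(SAMPLE))
```

This only works as long as the code sticks to the API subset both libraries share (`fromstring`, `iter`, `findall`, `get`, `text`).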

On https://www.sec.gov/Archives/edgar/data/0001046025/000104602520000037/hfwa-2019123110k.htm

before

6542935 function calls (6288737 primitive calls) in 5.823 seconds

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   991220    2.028    0.000    2.028    0.000 {method 'findall' of 're.Pattern' objects}
        1    0.931    0.931    1.126    1.126 xml_parser.py:9(parse_file)  <- reduced

after

6080699 function calls (5830565 primitive calls) in 4.627 seconds

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   991200    2.013    0.000    2.013    0.000 {method 'findall' of 're.Pattern' objects}
   495600    0.467    0.000    3.419    0.000 uri_helper.py:58(compare_uri)
        1    0.065    0.065    0.065    0.065 xml_parser.py:9(parse_file)
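Numbers in this format come from Python's built-in `cProfile` module. In case someone wants to reproduce similar measurements, here is a rough sketch of how such a report can be generated. The `workload` function is just a stand-in made up for this example (many small regex calls); in practice you would profile the actual parsing call instead.

```python
import cProfile
import io
import pstats
import re

PATTERN = re.compile(r"\d+")


def workload(n=2000):
    """Stand-in for a parsing run: many small regex calls."""
    total = 0
    for i in range(n):
        # Each generated string contains two digit runs.
        total += len(PATTERN.findall("a{0}b{0}".format(i)))
    return total


def profile(func, *args, top=5):
    """Run func under cProfile; return (result, text report sorted by tottime)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("tottime").print_stats(top)
    return result, stream.getvalue()


if __name__ == "__main__":
    result, report = profile(workload)
    print(report)
```

Sorting by `tottime` puts the functions that burn the most time in their own body at the top, which is how hot spots like `findall` show up first.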

This change and the "taxonomy lookup table" together give a huge performance boost. I can keep using them from my branch; let me know if you want them as a PR.
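To give a rough idea of what a lookup table buys in general: instead of comparing a URI against every known taxonomy on each lookup, the URIs are normalized once and used as dictionary keys. The sketch below is purely illustrative and is not the code from my branch. `normalize_uri`, `find_linear`, `build_lookup` and `find_indexed` are hypothetical names, and the normalization rule (drop the scheme, a leading `www.` and a trailing slash) is an assumption made for the example.

```python
import re

# Precompiled once; matches an http/https scheme and an optional "www." prefix.
_SCHEME = re.compile(r"^https?://(www\.)?", re.IGNORECASE)


def normalize_uri(uri):
    """Hypothetical normalization: drop scheme, 'www.' and a trailing slash."""
    return _SCHEME.sub("", uri.strip()).rstrip("/")


def find_linear(taxonomies, uri):
    """Baseline: compare against every entry, O(n) per lookup."""
    target = normalize_uri(uri)
    for tax_uri, taxonomy in taxonomies:
        if normalize_uri(tax_uri) == target:
            return taxonomy
    return None


def build_lookup(taxonomies):
    """Normalize each URI once and index the taxonomies by the result."""
    return {normalize_uri(tax_uri): taxonomy for tax_uri, taxonomy in taxonomies}


def find_indexed(lookup, uri):
    """Single dictionary access instead of a scan over all taxonomies."""
    return lookup.get(normalize_uri(uri))
```

With many facts each triggering a lookup, the linear version repeats the regex work for every entry on every call, while the indexed version does it once per lookup.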

Cheers

manusimidt commented 2 years ago

@mrx23dot Thank you for all your contributions, suggestions and issues for the library.

I would really like to put more time into developing this library. Unfortunately, I just can't do it at the moment, because I relocated and started a new (very demanding) study program in October.
If you want, we can schedule a 30-minute Zoom/Gmeet/Teams meeting and exchange some knowledge. Maybe I can also help you with applying XBRL. If you are interested, just contact me :)

mrx23dot commented 2 years ago

No worries, it's working great on my branch, with a 10x speed improvement.

The next big thing would be connecting the structural data to the facts, so we could write something like "for facts in balance_sheet". That's too object-oriented for me to touch :D

mrx23dot commented 2 years ago

Just a friendly reminder regarding the latest check-ins: f-strings were only introduced in Python 3.6, so you are dropping compatibility with older versions.
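If keeping older interpreters working matters, the same output can be produced with `str.format` or `%` formatting. A small sketch with made-up values (`name` and `value` are only placeholders, not names from the library):

```python
name = "Assets"
value = 1234.5

# Python 3.6+ only:
#   label = f"{name}: {value:,.2f}"

# Equivalent that also runs on older Python 3 releases:
label_format = "{}: {:,.2f}".format(name, value)

# Old-style % formatting (has no thousands separator option):
label_percent = "%s: %.2f" % (name, value)

if __name__ == "__main__":
    print(label_format)
    print(label_percent)
```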