manusimidt / py-xbrl

Python-based parser for parsing XBRL and iXBRL files
https://py-xbrl.readthedocs.io/en/latest/
GNU General Public License v3.0

Standardised Financial Data #103

Open firmai opened 1 year ago

firmai commented 1 year ago

How can this be used to develop standardised financial data? The tool looks promising, but I am struggling to find a good example. Thanks so much for your work :)

manusimidt commented 1 year ago

Hey there,

the goal of this tool is certainly not to standardize financial data. This is basically the goal of the XBRL Standard itself. How well the data is standardized solely depends on the financial regulators and the creator of the XBRL document.

I guess your question is probably: "How can I use this tool to collect and compare data from different companies?"

With py-xbrl you can basically extract any information that is tagged in an XBRL or iXBRL document. If you are not familiar with XBRL, maybe have a look at this iXBRL viewer: https://www.sec.gov/ix?doc=/Archives/edgar/data/320193/000032019322000108/aapl-20220924.htm. All values that are "clickable" there are tagged with XBRL and can be read in with py-xbrl.

For example, the following code extracts "Earnings per share" for Apple and Microsoft:

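import logging

from xbrl.cache import HttpCache
from xbrl.instance import XbrlParser, XbrlInstance

# Log which files are downloaded into the local cache
logging.basicConfig(level=logging.INFO)

cache: HttpCache = HttpCache('./cache')
# The SEC expects identifying request headers; replace these with your own details
cache.set_headers({'From': 'YOUR@EMAIL.com', 'User-Agent': 'py-xbrl/2.1.0'})
parser = XbrlParser(cache)

# Filings to compare, keyed by ticker
subs = {
    "AAPL": "https://www.sec.gov/Archives/edgar/data/320193/000032019322000108/aapl-20220924.htm",
    "MSFT": "https://www.sec.gov/Archives/edgar/data/789019/000156459022035087/msft-10q_20220930.htm"
}

for ticker, url in subs.items():
    inst: XbrlInstance = parser.parse_instance(url)

    # Print every basic EPS fact tagged in the filing
    for fact in inst.facts:
        if fact.concept.name == 'EarningsPerShareBasic':
            print(f"On {fact.context.end_date} {ticker} had an EPS of {fact.value}")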

output:

On 2022-09-24 AAPL had an EPS of 6.15
On 2021-09-25 AAPL had an EPS of 5.67
On 2020-09-26 AAPL had an EPS of 3.31

On 2022-09-30 MSFT had an EPS of 2.35
On 2021-09-30 MSFT had an EPS of 2.73

With py-xbrl you can extract thousands of different facts from thousands of companies directly from the source (the actual financial report from the company) instead of going through an API.
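To compare companies, the facts printed by the snippet above can be collected into one table keyed by concept and period. The following is only a minimal sketch: `collect_facts` is a hypothetical helper (not part of py-xbrl) that assumes nothing beyond the attributes the snippet above already reads (`fact.concept.name`, `fact.context.end_date`, `fact.value`). Stand-in objects replace parsed instances so the example runs offline.

```python
from collections import defaultdict
from types import SimpleNamespace


def collect_facts(instances, concept_names):
    """Build {concept: {end_date: {ticker: value}}} from parsed instances.

    `instances` maps a ticker to any object with a `.facts` iterable whose
    items expose `.concept.name`, `.context.end_date` and `.value`.
    """
    table = defaultdict(lambda: defaultdict(dict))
    for ticker, inst in instances.items():
        for fact in inst.facts:
            if fact.concept.name not in concept_names:
                continue
            # Assumption: contexts without an end date are skipped here.
            end_date = getattr(fact.context, "end_date", None)
            if end_date is None:
                continue
            table[fact.concept.name][end_date][ticker] = fact.value
    return table


def _fake_fact(name, end_date, value):
    # Stand-in for a parsed fact (hypothetical, for offline illustration only).
    return SimpleNamespace(
        concept=SimpleNamespace(name=name),
        context=SimpleNamespace(end_date=end_date),
        value=value,
    )


# Values taken from the output shown in this thread.
instances = {
    "AAPL": SimpleNamespace(facts=[_fake_fact("EarningsPerShareBasic", "2022-09-24", 6.15)]),
    "MSFT": SimpleNamespace(facts=[_fake_fact("EarningsPerShareBasic", "2022-09-30", 2.35)]),
}
table = collect_facts(instances, {"EarningsPerShareBasic"})
for end_date, by_ticker in sorted(table["EarningsPerShareBasic"].items()):
    print(end_date, by_ticker)
```

With real instances, pass a dict of ticker to the object returned by `parse_instance`. Note that this sketch keeps only one value per concept, date and ticker, and that an annual figure (the Apple 10-K) and a quarterly figure (the Microsoft 10-Q) are not directly comparable.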

firmai commented 1 year ago

Pretty damn cool! What would be the difference between what you are doing and what Ties de Kok did with https://github.com/TiesdeKok/fast_xbrl_parser?

It seems that you are parsing the htm file https://www.sec.gov/Archives/edgar/data/320193/000032019322000108/aapl-20220924.htm

And that he is parsing the xml file: https://www.sec.gov/Archives/edgar/data/1652044/000165204423000016/goog-20221231_def.xml

Do you know if these datasets are meant to contain the same information (facts/concepts)? I wonder what the advantages and disadvantages of using one over the other would be.

rayniervanegmond commented 1 year ago

Hi All,

I think that in this context it would be good to point out the difference between the two commonly used forms of XBRL: XML-based XBRL and inline XBRL (iXBRL).

The former is a "pure XML-based standard" where both the instance "document" (the container for the facts) and the taxonomy are represented in XML. The latter, a newer standard, represents the instance "document" in the XHTML format, while the supporting taxonomy is still represented in the XML format.

The reason for this was to support a more "styled" representation of the facts document, making it human-readable and machine-readable at the same time.

Note that the required submission format across the globe is now pretty much inline XBRL (hence the Apple filing in the example shown). The second example shown is part of the taxonomy (the definition linkbase), which will always be in the XML format. The filings in Europe have always been in the inline XBRL format, while in the US the first 15 years or so were fully in the XBRL XML format before the switch to the inline XBRL standard. Many of the new regulatory filing requirements (such as the ESG filings in the EU and, soon, the USA) will also be done in inline XBRL. Remember that the inline XBRL standard is based on the XBRL 2.1 standard, so any effort spent on XBRL code and process development carries over to inline XBRL processing.

To process a filing/submission, the processor needs to read both the inline XHTML instance document, to extract the facts and their metadata, and the associated XML taxonomy files.
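The difference between the two instance formats can be illustrated with a small sketch using only Python's standard library. The two documents below are hand-written, heavily simplified samples (no contexts, units, or taxonomy references), and the `us-gaap` namespace URI is used for illustration; this is not how py-xbrl or any other processor is implemented.

```python
import xml.etree.ElementTree as ET

IX_NS = "http://www.xbrl.org/2013/inlineXBRL"

# XML-based XBRL: the fact is an element named after the concept.
XML_INSTANCE = """<xbrli:xbrl xmlns:xbrli="http://www.xbrl.org/2003/instance"
    xmlns:us-gaap="http://fasb.org/us-gaap/2022">
  <us-gaap:EarningsPerShareBasic contextRef="c1" unitRef="u1" decimals="2">6.15</us-gaap:EarningsPerShareBasic>
</xbrli:xbrl>"""

# Inline XBRL: the fact is wrapped in an ix: tag inside readable XHTML,
# and the concept is given by the `name` attribute.
INLINE_INSTANCE = """<html xmlns="http://www.w3.org/1999/xhtml"
    xmlns:ix="http://www.xbrl.org/2013/inlineXBRL"
    xmlns:us-gaap="http://fasb.org/us-gaap/2022">
  <body>
    <p>Basic earnings per share:
      <ix:nonFraction name="us-gaap:EarningsPerShareBasic" contextRef="c1" unitRef="u1" decimals="2">6.15</ix:nonFraction>
    </p>
  </body>
</html>"""


def facts_from_xml(text):
    """Return {concept local name: raw value} for top-level fact elements."""
    facts = {}
    for el in ET.fromstring(text):
        if "contextRef" in el.attrib:
            local_name = el.tag.split("}", 1)[1]  # drop the "{namespace}" prefix
            facts[local_name] = el.text.strip()
    return facts


def facts_from_inline(text):
    """Return {concept local name: raw value} for ix:nonFraction elements."""
    facts = {}
    for el in ET.fromstring(text).iter(f"{{{IX_NS}}}nonFraction"):
        local_name = el.attrib["name"].split(":", 1)[1]  # drop the "us-gaap:" prefix
        facts[local_name] = "".join(el.itertext()).strip()
    return facts


print(facts_from_xml(XML_INSTANCE))
print(facts_from_inline(INLINE_INSTANCE))
```

Both functions recover the same fact from these samples. Real inline XBRL facts also carry attributes such as `scale`, `sign` and `format` that a processor has to apply before the displayed text becomes the reported value, which this sketch ignores.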

Hope this clarifies things a little.

Please ping me if I should clarify more.

Cordially, Raynier van Egmond (https://www.linkedin.com/in/rayniervanegmond/)


manusimidt commented 1 year ago

@rayniervanegmond Thank you for the great explanation! I can only agree entirely with what @rayniervanegmond said!

It is true that the SEC also provides XBRL files for iXBRL submissions. However, these are converted from the original iXBRL filings; this is a service the SEC provides for compatibility reasons.

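[image: https://user-images.githubusercontent.com/29599104/217375044-ab2bf65d-38b2-48d0-b315-1fd49e9c50c9.png]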

But I would always prefer to parse iXBRL since it has several benefits.

Regarding your second question (@firmai): TBH, I did not try the "fast_xbrl_parser" from "TiesdeKok". It seems to be written in Rust, while 'py-xbrl' is purely Python-based. Another great open-source library for parsing XBRL is Arelle. It offers far more functionality than 'py-xbrl'. However, this vast range of functionality also increases complexity. The goal of 'py-xbrl' was always to parse filings and get all of the data as easily as possible, never XBRL validation, which is also a huge part of a proper XBRL processor.

rayniervanegmond commented 1 year ago

Hi All,

As you may read from my LinkedIn profile, I have a fair amount of experience with developing commercial XBRL reporting solutions. In my latest position (as an AI/ML engineer in the FinTech industry), the client had done a very extensive evaluation of the Arelle processor. As observed, it has a lot of functionality. IMO it is bloated and impossible to adapt to one's own requirements.

The outcome, however, was that the processor was very slow and consumed vast amounts of memory, basically making it unusable for the system to be developed, which required a very fast turnaround on any filing submitted to the SEC. Conclusion: the open-source version of the Arelle processor will not work for commercial server-side solutions.

The client asked me to help them implement their own XBRL processor, which we did. It processes filings and normalizes them onto a single consolidated taxonomy very fast. Because the code does one thing ("read, process and validate XBRL") and none of the ancillary XBRL specification stuff, it is easily maintainable and adaptable to new use cases.
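As a rough illustration of what "normalizing onto a single consolidated taxonomy" can mean, the sketch below maps different reported concept names to one canonical label. This is not the processor described above: the mapping table, the canonical labels, the `normalize` helper and the sample values are all assumptions made up for this example.

```python
# Hypothetical mapping from reported concept names to one canonical label.
# Which concepts to merge, and what to call the result, is an assumption
# made for illustration only.
CANONICAL = {
    "Revenues": "Revenue",
    "RevenueFromContractWithCustomerExcludingAssessedTax": "Revenue",
    "SalesRevenueNet": "Revenue",
    "EarningsPerShareBasic": "EPS (basic)",
}


def normalize(facts):
    """Map (concept, value) pairs to canonical labels.

    Returns the normalized dict plus the list of concepts that have no
    mapping yet, so they can be reviewed instead of silently dropped.
    """
    normalized, unmapped = {}, []
    for concept, value in facts:
        label = CANONICAL.get(concept)
        if label is None:
            unmapped.append(concept)
        else:
            normalized[label] = value
    return normalized, unmapped


# Made-up sample values for two hypothetical companies.
company_a = [("Revenues", 100.0), ("EarningsPerShareBasic", 1.5)]
company_b = [
    ("RevenueFromContractWithCustomerExcludingAssessedTax", 80.0),
    ("CustomExtensionConcept", 3.0),
]
print(normalize(company_a))
print(normalize(company_b))
```

Keeping the unmapped concepts visible matters in practice, because companies can add their own extension concepts that no fixed mapping will cover.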

My advice is to go with a "clean-single-purpose" processor and build from there.

Again - take care - René https://www.linkedin.com/in/rayniervanegmond/
