manusimidt / py-xbrl

Python-based parser for parsing XBRL and iXBRL files
https://py-xbrl.readthedocs.io/en/latest/
GNU General Public License v3.0

Missing fact from ixbrl #69

Closed mrx23dot closed 2 years ago

mrx23dot commented 3 years ago

For https://www.sec.gov/ix?doc=/Archives/edgar/data/0001365135/000155837021005716/wu-20210331x10q.htm

The lib doesn't extract DocumentPeriodEndDate at all, even though it is there in the web view. This is the first time I have seen this problem; even for nested facts I have always gotten something back. I don't think I'm filtering it out.

It only finds these:

 2021-01-01 to 2021-03-31 0 dimension | DocumentFiscalYearFocus: 2021 2021
 2021-01-01 to 2021-03-31 0 dimension | DocumentFiscalPeriodFocus: Q1 Q1
 2021-01-01 to 2021-03-31 0 dimension | DocumentFiscalYearFocus: 2021 2021
 2021-01-01 to 2021-03-31 0 dimension | DocumentFiscalPeriodFocus: Q1 Q1

In source:

name="dei:DocumentPeriodEndDate" id="Narr_VydiiUz0MUOCCrLq-p-Mpw"><b style="font-weight:bold;">March 31, 2021</b>

Does it work for you?

jonkatz6 commented 3 years ago

Using the URL above, the library did not work. Using the URL https://www.sec.gov/Archives/edgar/data/0001365135/000155837021005716/wu-20210331x10q.htm worked, but DocumentPeriodEndDate was not found.

Using the URL for the XBRL itself, https://www.sec.gov/Archives/edgar/data/1365135/000155837021005716/wu-20210331x10q_htm.xml, did find the DocumentPeriodEndDate:

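from xbrl.cache import HttpCache
from xbrl.instance import XbrlParser, XbrlInstance

# Parser setup as in the py-xbrl README; the cache directory is arbitrary.
xbrlParser = XbrlParser(HttpCache('./cache'))
inst: XbrlInstance = xbrlParser.parse_instance('https://www.sec.gov/Archives/edgar/data/1365135/000155837021005716/wu-20210331x10q_htm.xml')
for i in inst.facts:
    if i.concept.name == 'DocumentPeriodEndDate':
        print(i.concept)
        print(i.value)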
out:
DocumentPeriodEndDate
2021-03-31

Sorry if you were looking for any information as to why, but I hope this helps.

mrx23dot commented 3 years ago

Interesting, the XML version extracts the DocumentPeriodEndDate: https://www.sec.gov/Archives/edgar/data/1365135/000155837021005716/wu-20210331x10q_htm.xml

But the original iXBRL htm doesn't: https://www.sec.gov/Archives/edgar/data/1365135/000155837021005716/wu-20210331x10q.htm

Sounds like a parsing error. manusimidt told me we should prefer the iXBRL htm (the original filing) over the SEC-extracted XML.

manusimidt commented 3 years ago

Yes, looks like a parsing error. The iXBRL Instance Document certainly contains the DocumentPeriodEndDate fact.

<ix:nonNumeric format="ixt:datemonthdayyearen" 
  contextRef="Duration_1_1_2021_To_3_31_2021_Pi9QpSqr-0e0RF1F9GgqSg" 
  name="dei:DocumentPeriodEndDate" 
  id="Narr_VydiiUz0MUOCCrLq-p-Mpw">
  <b style="font-weight:bold;">
    March 31, 2021
  </b>
</ix:nonNumeric>

I think the fact is not parsed because it contains additional HTML elements (the bold tag).
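A minimal sketch of the suspected cause, using the standard library's ElementTree on a simplified copy of the fact above (the snippet and the namespace declaration are stand-ins for illustration, not the library's actual parsing code): when the text sits inside a child element, the parent's `.text` attribute does not contain it.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the fact above; attributes trimmed for brevity.
SNIPPET = (
    '<ix:nonNumeric xmlns:ix="http://www.xbrl.org/2013/inlineXBRL" '
    'name="dei:DocumentPeriodEndDate">'
    '<b style="font-weight:bold;">March 31, 2021</b>'
    '</ix:nonNumeric>'
)

fact = ET.fromstring(SNIPPET)

# .text only covers the text before the first child element.
print(repr(fact.text))     # None
# The date is stored on the nested <b> element instead.
print(repr(fact[0].text))  # 'March 31, 2021'
```

Any code that reads only `.text` on the fact element would therefore see an empty value for this filing.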

manusimidt commented 3 years ago

Yes, this is the issue: https://github.com/manusimidt/py-xbrl/blob/a2aca03bd2cef1853c75ba8af25a04d0d250edc3/xbrl/instance.py#L416-L417

I will implement a fix.

mrx23dot commented 3 years ago

In bs4 we have these two: .text is recursive (what we want), while .string only covers the one given item (it wouldn't go into the bold tag). There should be an equivalent of .text in etree.

manusimidt commented 3 years ago

Yes, that's correct. With bs4 it is really easy to extract the text recursively for the given element. I could not find any equivalent for it in etree. Please let me know if you find a solution.

I am currently implementing a function that extracts the text recursively, but I don't know if that is the best way of doing it.
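For reference, the standard library's ElementTree does offer `Element.itertext()`, which walks an element and all of its descendants and yields their text in document order. The sketch below is one possible way to use it; `element_text` is a hypothetical helper name, and this is not necessarily how the fix was implemented in py-xbrl.

```python
import xml.etree.ElementTree as ET


def element_text(elem: ET.Element) -> str:
    """Hypothetical helper: join all text inside elem, including descendants."""
    return "".join(elem.itertext()).strip()


# Simplified stand-in for the nested fact (namespace prefix dropped).
fact = ET.fromstring(
    '<nonNumeric name="dei:DocumentPeriodEndDate">\n'
    '  <b style="font-weight:bold;">\n'
    '    March 31, 2021\n'
    '  </b>\n'
    '</nonNumeric>'
)

print(element_text(fact))  # March 31, 2021
```

The `strip()` call removes the indentation whitespace around the nested tag; whitespace between words inside the value is left untouched.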

mrx23dot commented 3 years ago

Doc says: xml.etree.ElementTree.tostring(element, encoding="us-ascii", method="xml", *, short_empty_elements=True)

"Generates a string representation of an XML element, including all subelements."
-> so it should be recursive too, might need playing with parameters.

*short_empty_elements is from v3.4
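As a rough illustration of "playing with parameters": with the default `method="xml"` the result still contains the markup, while `method="text"` returns only the text content of the element and its subelements. The element below is a simplified stand-in, and `encoding="unicode"` is chosen here so that a `str` comes back instead of `bytes`.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the nested fact.
fact = ET.fromstring('<nonNumeric><b>March 31, 2021</b></nonNumeric>')

# Default serialization keeps the tags.
as_xml = ET.tostring(fact, encoding="unicode")
# method="text" drops the tags and keeps only the text, recursively.
as_text = ET.tostring(fact, encoding="unicode", method="text")

print(as_xml)   # <nonNumeric><b>March 31, 2021</b></nonNumeric>
print(as_text)  # March 31, 2021
```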

gety9 commented 2 years ago

@mrx23dot @jonkatz6

Guys, do you still have the parser working for iXBRL URLs?

I am using py-2.0.7, and https://www.sec.gov/Archives/edgar/data/1365135/000155837021005716/wu-20210331x10q_htm.xml (XBRL) works.

https://www.sec.gov/ix?doc=/Archives/edgar/data/1365135/000155837021005716/wu-20210331x10q.htm (iXBRL URL with ? and = characters) doesn't work at all:

PermissionError: [Errno 1] Operation not permitted

And https://www.sec.gov/Archives/edgar/data/1365135/000155837021005716/wu-20210331x10q.htm (iXBRL clean URL) does not work either:

ParseError: not well-formed (invalid token): line 9, column 1106

It worked for you at least partially (without DocumentPeriodEndDate), so I wonder what the reason is.