Closed mrx23dot closed 2 years ago
Using the URL above, the library did not work.
Using the URL https://www.sec.gov/Archives/edgar/data/0001365135/000155837021005716/wu-20210331x10q.htm
worked, but no DocumentPeriodEndDate was found.
Using the URL of the XBRL instance itself, https://www.sec.gov/Archives/edgar/data/1365135/000155837021005716/wu-20210331x10q_htm.xml
did find the DocumentPeriodEndDate:
inst: XbrlInstance = xbrlParser.parse_instance('https://www.sec.gov/Archives/edgar/data/1365135/000155837021005716/wu-20210331x10q_htm.xml')
for i in inst.facts:
    if i.concept.name == 'DocumentPeriodEndDate':
        print(i.concept)
        print(i.value)
out:
DocumentPeriodEndDate
2021-03-31
Sorry if you were looking for any information as to why, but I hope this helps.
Interesting, the XML version extracts the DocumentPeriodEndDate: https://www.sec.gov/Archives/edgar/data/1365135/000155837021005716/wu-20210331x10q_htm.xml
But the original iXBRL htm doesn't: https://www.sec.gov/Archives/edgar/data/1365135/000155837021005716/wu-20210331x10q.htm
Sounds like a parsing error. manusimidt told me we should prefer the iXBRL htm (original filing) over the SEC-extracted XML.
Yes, looks like a parsing error. The iXBRL Instance Document certainly contains the DocumentPeriodEndDate fact.
<ix:nonNumeric format="ixt:datemonthdayyearen"
               contextRef="Duration_1_1_2021_To_3_31_2021_Pi9QpSqr-0e0RF1F9GgqSg"
               name="dei:DocumentPeriodEndDate"
               id="Narr_VydiiUz0MUOCCrLq-p-Mpw">
    <b style="font-weight:bold;">
        March 31, 2021
    </b>
</ix:nonNumeric>
I think the fact is not parsed because it contains additional HTML elements (the bold tag).
Yes, this is the issue: https://github.com/manusimidt/py-xbrl/blob/a2aca03bd2cef1853c75ba8af25a04d0d250edc3/xbrl/instance.py#L416-L417
I will implement a fix.
In bs4 we have these: .text is recursive (what we want), and there should be an equivalent for it in etree. .string is only for one given item (it wouldn't go into the bold tag).
Yes, that's correct. With bs4 it is really easy to extract the text recursively for a given element. I could not find any equivalent for it in etree. Please let me know if you find a solution.
I am currently implementing a function that extracts the text recursively, but I don't know if that is the best way of doing it.
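For reference, the standard library's `Element.itertext()` walks an element and all of its descendants in document order, which is roughly what bs4's `.text` does. The sketch below is only an illustration, not the fix that went into py-xbrl: the fragment is a shortened stand-in for the fact quoted above, and `recursive_text` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the iXBRL fact quoted above (attributes trimmed
# for illustration; this is not the full original element).
SNIPPET = (
    '<ix:nonNumeric xmlns:ix="http://www.xbrl.org/2013/inlineXBRL" '
    'name="dei:DocumentPeriodEndDate">\n'
    '  <b style="font-weight:bold;">March 31, 2021</b>\n'
    '</ix:nonNumeric>'
)


def recursive_text(element):
    """Hypothetical helper: join the text of an element and all descendants."""
    return "".join(element.itertext()).strip()


elem = ET.fromstring(SNIPPET)

# .text only holds the text before the first child, here just whitespace.
direct = elem.text
# itertext() also descends into the <b> tag.
full = recursive_text(elem)

print(repr(direct))
print(repr(full))
```

With this fragment, `direct` contains only the whitespace in front of the `<b>` tag, while `full` contains the date string, which matches the behaviour described in the comments above.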
Doc says: xml.etree.ElementTree.tostring(element, encoding="us-ascii", method="xml", *, short_empty_elements=True)
"Generates a string representation of an XML element, including all subelements."
-> so it should be recursive too, might need playing with parameters.
*short_empty_elements is from v3.4
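To make the "playing with parameters" part concrete: `tostring` also accepts `method="text"`, which drops the markup and keeps only the text of the element and its subelements. A small sketch on a made-up fragment (the `<p>` element below is illustrative, not taken from the filing):

```python
import xml.etree.ElementTree as ET

# Made-up fragment with text before, inside and after a nested tag.
elem = ET.fromstring("<p>Period ended <b>March 31, 2021</b> (unaudited)</p>")

# Default method="xml": the markup of the subelements is kept.
as_xml = ET.tostring(elem, encoding="unicode")

# method="text": only the text content is emitted, recursively.
as_text = ET.tostring(elem, encoding="unicode", method="text")

print(as_xml)
print(as_text)
```

One caveat: `tostring` also serializes the tail text of the element it is given, so when it is called on an element in the middle of a document, any text that follows the closing tag may end up in the result as well.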
@mrx23dot @jonkatz6
Guys, do you still have the parser working for iXBRL URLs?
I am using py-2.0.7, and https://www.sec.gov/Archives/edgar/data/1365135/000155837021005716/wu-20210331x10q_htm.xml (XBRL) works.
https://www.sec.gov/ix?doc=/Archives/edgar/data/1365135/000155837021005716/wu-20210331x10q.htm (iXBRL URL with the ? and = characters) doesn't work at all:
PermissionError: [Errno 1] Operation not permitted
And https://www.sec.gov/Archives/edgar/data/1365135/000155837021005716/wu-20210331x10q.htm (clean iXBRL URL) does not work either:
ParseError: not well-formed (invalid token): line 9, column 1106
It worked for you at least partially (without DocumentPeriodEndDate), so I wonder what the reason is.
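A note on the two URL forms: the `ix?doc=` address looks like the link to the SEC's inline viewer page, with the actual document path passed in the `doc` query parameter, so it is probably not something a parser can consume directly. A minimal sketch for turning such a link into the plain Archives URL, assuming that URL layout; `unwrap_ix_viewer_url` is a hypothetical helper, not part of py-xbrl, and it does not address the ParseError seen with the clean URL.

```python
from urllib.parse import parse_qs, urlsplit


def unwrap_ix_viewer_url(url):
    """Hypothetical helper: turn an 'ix?doc=' viewer link into the document URL.

    URLs that do not match the assumed viewer layout are returned unchanged.
    """
    parts = urlsplit(url)
    doc = parse_qs(parts.query).get("doc")
    if parts.path == "/ix" and doc:
        # The 'doc' parameter carries the path of the filing on the same host.
        return f"{parts.scheme}://{parts.netloc}{doc[0]}"
    return url


viewer_url = (
    "https://www.sec.gov/ix?doc=/Archives/edgar/data/1365135/"
    "000155837021005716/wu-20210331x10q.htm"
)
print(unwrap_ix_viewer_url(viewer_url))
```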
For https://www.sec.gov/ix?doc=/Archives/edgar/data/0001365135/000155837021005716/wu-20210331x10q.htm
The lib doesn't extract DocumentPeriodEndDate at all, although it's there in the web view. This is the first time I have seen this problem. Even when a fact is nested I still got something back. I don't think I'm filtering it out.
It only finds these:
In source:
Does it work for you?