add support for file-like object parsing

rAndrewNichol commented 2 years ago

In my application it doesn't make sense to store the data locally, since my disk is ephemeral. I also don't pull directly from the web, for concurrency and other reasons. Instead, I pull the data from cloud object storage (S3) and parse the results.

For that reason I wanted to be able to use StringIO to parse the text content directly. The package did not provide any such support.

usage:

from io import StringIO

from xbrl.instance import XbrlParser
from xbrl.cache import HttpCache

PATH = "/Users/andrewnichol/Downloads/0000320193-18-000145-xbrl/aapl-20180929.xml"

cache: HttpCache = HttpCache("./cache")
cache.set_headers({"From": "webmaster@domain.com", "User-Agent": "domain.com"})
xbrlParser = XbrlParser(cache)

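# read the file into memory; the context manager closes the handle afterwards
with open(PATH, "r") as x:
    string_file = StringIO(x.read())
inst = xbrlParser.parse_file_obj(string_file)  # could also operate directly on `x` inside the `with` block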

Since there is no simple way to infer the file format from a StringIO object (from a file object you could simply use file_obj.name), I decided it would be best as a separate method with an is_xbrl parameter, so the user explicitly specifies whether the content is XBRL or something else (iXBRL).
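The reasoning above can be sketched roughly as follows. This is only an illustration, not the code in the pull request: the helper `resolve_is_xbrl` is hypothetical, and treating a `.xml` extension as "plain XBRL" is an assumption made for the example.

```python
from io import StringIO


def resolve_is_xbrl(file_obj, is_xbrl=None):
    """Decide whether file_obj holds plain XBRL (True) or something else (False).

    Hypothetical helper: an explicit flag always wins; otherwise fall back to
    the file name, which an in-memory StringIO does not have.
    """
    if is_xbrl is not None:
        return is_xbrl
    name = getattr(file_obj, "name", None)  # StringIO has no `name` attribute
    if not isinstance(name, str):
        raise ValueError("cannot infer the format without a file name; pass is_xbrl explicitly")
    # Assumption for this sketch: a .xml extension means plain XBRL.
    return name.lower().endswith(".xml")


# An in-memory buffer needs the explicit flag.
print(resolve_is_xbrl(StringIO("<xbrl/>"), is_xbrl=True))
```

A real file object opened with `open(...)` carries a `name`, so the flag could be optional there, while a StringIO always needs it.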

This had a lot more places to change than I had originally expected, but in the end it works pretty seamlessly.

manusimidt commented 2 years ago

Hey @rAndrewNichol,

thank you for your pull request! I will review and merge it this weekend at the latest :)

Greetings, Manu

manusimidt commented 2 years ago

There are some issues with this pull request:

1. is_xbrl flag

I tried your code with the same submission, aapl-20180929.xml. In line 698 of instance.py you set the flag is_ixbrl=True. However, aapl-20180929.xml is clearly not an inline XBRL file, so is_ixbrl=False would be the correct flag for this submission. If I set is_ixbrl=False, the code crashes.

2. Fundamental problem with SEC submissions parsed via StringIO

TL;DR: An SEC submission consists of more than just the instance document. You must provide ALL XBRL files that belong to the submission. For the SEC, this means the following files are needed to parse the submission properly: instance document, taxonomy extension schema, and additional linkbases (label, calculation, definition, presentation).

In XBRL reporting, a distinction is made between two different disclosure strategies - the open reporting cycle and the closed reporting cycle:

Closed Reporting Cycle: In a closed reporting cycle the company can only use the prescribed taxonomy. An example would be the UK. There the regulators have decided that entities are only allowed to use the IFRS taxonomy for tagging.

Open Reporting Cycle: In an open reporting cycle the filing entity is allowed to create a taxonomy extension that can introduce new concepts and/or override concepts and links from the base taxonomy. An example would be SEC submissions. Here each filing (10-K, 10-Q) consists of an instance file and, additionally, the extension taxonomy.

Here comes the problem: The taxonomy schema file is imported in the instance document via a relative link. For example in the submission you provided the extension taxonomy schema is imported via the following code:

<link:schemaRef xlink:href="aapl-20180929.xsd" xlink:type="simple"/>
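To make the relative import concrete, here is a minimal standard-library sketch (not py-xbrl's own code) that reads the `schemaRef` href from an instance and resolves it against a base location. The instance string is a stripped-down stand-in and the base URL is a made-up example.

```python
import xml.etree.ElementTree as ET
from io import StringIO
from urllib.parse import urljoin

LINK_NS = "http://www.xbrl.org/2003/linkbase"
XLINK_NS = "http://www.w3.org/1999/xlink"

# Stripped-down stand-in for an instance document.
INSTANCE = """<xbrli:xbrl xmlns:xbrli="http://www.xbrl.org/2003/instance"
    xmlns:link="http://www.xbrl.org/2003/linkbase"
    xmlns:xlink="http://www.w3.org/1999/xlink">
  <link:schemaRef xlink:href="aapl-20180929.xsd" xlink:type="simple"/>
</xbrli:xbrl>"""


def schema_ref_href(file_obj):
    """Return the href of the first link:schemaRef element in the instance."""
    root = ET.parse(file_obj).getroot()
    ref = root.find(f"{{{LINK_NS}}}schemaRef")
    return ref.get(f"{{{XLINK_NS}}}href")


href = schema_ref_href(StringIO(INSTANCE))

# With a known base location, the relative href resolves to a sibling file.
base = "https://example.com/filings/aapl-20180929.xml"  # hypothetical location
resolved = urljoin(base, href)

# Without a base (the StringIO case), the href stays relative.
unresolved = urljoin("", href)

print(resolved)
print(unresolved)
```

The second result is still just a bare file name, which is all a parser has to work with when it receives an in-memory buffer and no location.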

But if you only give the instance document as StringIO to the parser, it won't be able to find this taxonomy. (Where should it search?). In your pull request you added the following in line 423 of instance.py.

    elif isinstance(instance_path, IOBase):
        taxonomy: TaxonomySchema = parse_taxonomy(instance_path, cache)

So you are passing the instance_path to the parse_taxonomy function, which makes no sense because instance_path contains the StringIO version of the instance file and has nothing to do with the taxonomy. Therefore no concepts can be extracted. Further, in line 339 you bypass the problem that the taxonomy could not be parsed correctly by adding the following if statement in line 342:

if concept_name in tax.name_id_map:

This means that the concepts of the extension taxonomy are never loaded, so all facts that were tagged with these concepts are silently ignored instead of being parsed.
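The effect of such a membership check can be shown with a small stand-alone sketch. Everything here is hypothetical (the function, the concept names, the values); only the name `name_id_map` is taken from the snippet above, and it is modelled as a plain dict.

```python
# Hypothetical stand-in: only base-taxonomy concepts were loaded,
# because the extension schema could not be located.
name_id_map = {
    "Assets": "us-gaap_Assets",
    "Liabilities": "us-gaap_Liabilities",
}

# Made-up facts; the last one is tagged with an extension concept.
facts_in_instance = [
    ("Assets", "100"),
    ("Liabilities", "60"),
    ("CustomExtensionConcept", "7"),
]


def parse_facts(facts, name_id_map, skip_unknown):
    """Map each fact to its concept id; unknown concepts are skipped or raise."""
    parsed, skipped = [], []
    for concept_name, value in facts:
        if concept_name not in name_id_map:
            if skip_unknown:
                skipped.append(concept_name)  # the fact is dropped without an error
                continue
            raise KeyError(f"unknown concept: {concept_name}")
        parsed.append((name_id_map[concept_name], value))
    return parsed, skipped


parsed, skipped = parse_facts(facts_in_instance, name_id_map, skip_unknown=True)
print(len(parsed), skipped)
```

With `skip_unknown=True` the call succeeds but returns fewer facts than the instance contains; with `skip_unknown=False` the missing taxonomy surfaces as an error instead.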

You can also see this by comparing the output of the current existing function parse_instance_locally() with the new proposed function parse_file_obj():

Possible solutions

1. Mount the S3 storage as a drive and then use the parse_instance_locally() function.
2. It would be possible to achieve similar behaviour if you store the zip enclosure of the SEC submission (for your example this would be this file) on the S3 storage and then provide the StringIO of the zip file to the library. A function could then be implemented that takes this StringIO, extracts the files, works out which is the instance file, which is the schema file and which are the linkbase files, and then uses their StringIO representations to fully parse the document.
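A rough sketch of the first half of the second option, using only the standard library. The function `classify_submission` is hypothetical and not part of py-xbrl. Sorting files by the `_lab`/`_cal`/`_def`/`_pre` suffixes and the `.xsd` extension is an assumption based on common SEC file naming, and since a zip archive is binary, the sketch uses BytesIO rather than StringIO.

```python
import io
import zipfile

# Assumption: linkbase files follow the common SEC naming pattern.
LINKBASE_SUFFIXES = ("_lab.xml", "_cal.xml", "_def.xml", "_pre.xml")


def classify_submission(zip_file_obj):
    """Group the members of a zipped submission into instance, schema and linkbases."""
    roles = {"instance": [], "schema": [], "linkbase": []}
    with zipfile.ZipFile(zip_file_obj) as archive:
        for name in archive.namelist():
            lower = name.lower()
            if lower.endswith(".xsd"):
                roles["schema"].append(name)
            elif lower.endswith(LINKBASE_SUFFIXES):
                roles["linkbase"].append(name)
            elif lower.endswith((".xml", ".htm", ".html")):
                roles["instance"].append(name)
    return roles


# Build a tiny in-memory archive with placeholder contents for demonstration.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    for member in (
        "aapl-20180929.xml",
        "aapl-20180929.xsd",
        "aapl-20180929_lab.xml",
        "aapl-20180929_cal.xml",
        "aapl-20180929_def.xml",
        "aapl-20180929_pre.xml",
    ):
        archive.writestr(member, "<placeholder/>")
buffer.seek(0)

roles = classify_submission(buffer)
print(roles)
```

Each member could then be read with `archive.read(name)` and handed to the parser as an in-memory buffer, with relative hrefs resolved against the archive's file list instead of the file system.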

@rAndrewNichol Please let me know if you need further clarification or some tips on how the latter can be implemented :).

Best regards, Manuel

manusimidt / py-xbrl