manusimidt / py-xbrl

Python-based parser for parsing XBRL and iXBRL files
https://py-xbrl.readthedocs.io/en/latest/
GNU General Public License v3.0

zip download fails #45

Closed: mrx23dot closed this issue 3 years ago

mrx23dot commented 3 years ago

I'm trying to speed up the download and read that the library supports zip downloads, but it looks like that support is not complete yet. The zip file at the URL below is a valid one. Version: py-xbrl==2.0.4

from xbrl.cache import HttpCache
from xbrl.instance import XbrlParser, XbrlInstance

cache: HttpCache = HttpCache('./cache')
cache.set_headers({'From': 'test@gmail.com', 'User-Agent': 'zipper 1'})
xbrlParser = XbrlParser(cache)

url = 'http://www.sec.gov/Archives/edgar/data/0000320193/000032019321000056/aapl-20210327.htm' # ok
url = 'https://www.sec.gov/Archives/edgar/data/0000320193/000032019321000056/0000320193-21-000056-xbrl.zip' #nok
inst = XbrlParser(cache).parse_instance(url) 

for fact in inst.facts:
  print(fact.concept.name)

gives

Traceback (most recent call last):
  File "C:\Users\Downloads\4\zip_test.py", line 10, in <module>
    inst = XbrlParser(cache).parse_instance(url) # here to be able free up
  File "C:\Python37\lib\site-packages\xbrl\instance.py", line 626, in parse_instance
    return parse_ixbrl_url(url, self.cache)
  File "C:\Python37\lib\site-packages\xbrl\instance.py", line 363, in parse_ixbrl_url
    return parse_ixbrl(instance_path, cache, instance_url)
  File "C:\Python37\lib\site-packages\xbrl\instance.py", line 383, in parse_ixbrl
    root: ET = parse_file(instance_path)
  File "C:\Python37\lib\site-packages\xbrl\helper\xml_parser.py", line 19, in parse_file
    for event, elem in ET.iterparse(path, events):
  File "C:\Python37\lib\xml\etree\ElementTree.py", line 1222, in iterator
    yield from pullparser.read_events()
  File "C:\Python37\lib\xml\etree\ElementTree.py", line 1297, in read_events
    raise event
  File "C:\Python37\lib\xml\etree\ElementTree.py", line 1269, in feed
    self._parser.feed(data)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 2
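
The traceback shows the zip archive itself being handed to the XML parser. A zip file starts with the bytes `PK` followed by control characters, which would be consistent with an invalid token near the start of line 1. As a minimal sketch (the helper name `looks_like_zip` is hypothetical and not part of py-xbrl), downloaded bytes can be checked with the standard library before they reach an XML parser:

```python
import io
import zipfile

def looks_like_zip(data: bytes) -> bool:
    """Return True if the raw bytes form a zip archive (hypothetical helper)."""
    return zipfile.is_zipfile(io.BytesIO(data))

# Build a tiny in-memory archive to demonstrate the check.
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as zf:
    zf.writestr('instance.htm', '<html></html>')

print(looks_like_zip(buf.getvalue()))                    # True
print(looks_like_zip(b'<?xml version="1.0"?><root/>'))   # False
```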
mrx23dot commented 3 years ago

Figured it out, works very fast.

cache = HttpCache('./cache')
url = 'https://www.sec.gov/Archives/edgar/data/0000320193/000032019321000056/0000320193-21-000056-xbrl.zip'
# download zip and extract
cache.cache_edgar_enclosure(url)

# todo find entry file

# process entry file
inst = XbrlParser(cache).parse_instance_locally(entryFile)

Is it possible to automatically find the entry file? One zip can have many xml files, with random names.

manusimidt commented 3 years ago

In the case of SEC submissions, you could use indices such as the Structured Disclosure RSS Feeds to figure out which file is the instance file.


A reliable general way to determine the name of the instance document will probably be difficult to find. There is no general specification that defines the file names in the zip folder or which files are allowed in the zip enclosure.

manusimidt commented 3 years ago

The monthly RSS feeds from the SEC also contain the link to the zip enclosure for each submission. Note that not every submission has a zip enclosure.
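
As a rough sketch of reading enclosure links from a feed, the snippet below parses a made-up RSS 2.0 document with the standard library. The sample feed, its titles and URLs, and the function name `enclosure_urls` are all assumptions for illustration; the real SEC feeds carry additional namespaced elements that are not shown here. Items without an enclosure yield `None`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified feed; not real SEC data.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <item>
      <title>Example Corp (10-Q)</title>
      <enclosure url="https://example.com/filing-1-xbrl.zip" length="1234" type="application/zip"/>
    </item>
    <item>
      <title>Other Corp (10-K)</title>
    </item>
  </channel>
</rss>"""

def enclosure_urls(feed_xml):
    """Return (title, enclosure url or None) for every item in the feed."""
    root = ET.fromstring(feed_xml)
    result = []
    for item in root.iter('item'):
        enclosure = item.find('enclosure')
        url = enclosure.get('url') if enclosure is not None else None
        result.append((item.findtext('title'), url))
    return result

for title, url in enclosure_urls(SAMPLE_FEED):
    print(title, url)
```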

mrx23dot commented 3 years ago

Which SEC filings usually don't have a zip?

Here is one way to identify the entry file in a directory: check which file is not referenced by any of the others. It seems to be working well.

import os
from pathlib import Path

def find_entry_file(directory):
  """Find the most likely entry file(s) in a filing directory."""

  # collect relevant files, ordered by extension priority
  valid_files = []
  for ext in ('.htm', '.xml', '.xsd'):
    for name in os.listdir(directory):
      full_path = os.path.join(directory, name)
      if os.path.isfile(full_path) and name.lower().endswith(ext):
        valid_files.append(full_path)

  # read each file once; skip undecodable bytes so odd encodings don't raise
  contents = {f: Path(f).read_text(errors='ignore') for f in valid_files}

  # an entry candidate is a file whose name is not mentioned in any other file
  entry_candidates = []
  for f1 in valid_files:
    file_name = os.path.basename(f1)
    referenced = any(file_name in contents[f2] for f2 in valid_files if f1 != f2)
    if not referenced:
      entry_candidates.append(f1)

  # todo: if there are multiple candidates, choose the biggest

  return entry_candidates

entry_candidates = find_entry_file('ff')
print(entry_candidates)
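
The open todo in the function above (choosing the biggest file when several candidates remain) could be handled along these lines. This is only a sketch: the name `pick_largest` is hypothetical, and file size is just the heuristic suggested by the todo, not a guaranteed way to identify the instance document.

```python
import os

def pick_largest(candidates):
    """Return the candidate with the largest size on disk, or None if the list is empty."""
    if not candidates:
        return None
    return max(candidates, key=os.path.getsize)
```

It would be called on the list returned by `find_entry_file`, and the result passed on as the entry file.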
manusimidt commented 3 years ago

"Which SEC filings usually don't have zip?"

@mrx23dot Unfortunately, I do not know exactly. There are a few months where all zip enclosures are missing. Apart from that, it looks pretty random to me. With the newer submissions, however, just about every one has a zip enclosure.

Yes, that's how it should work. It's a bit cumbersome, but if you don't want to access external indices it may be a solution :)

mrx23dot commented 3 years ago

Please report any problems you discover to structureddata@sec.gov, which handles questions about structured data (e.g., XBRL, XML, FpML, FIX). They are happy to look into them.