Closed mrx23dot closed 3 years ago
Figured it out, works very fast.
cache = HttpCache(dir)
url = 'https://www.sec.gov/Archives/edgar/data/0000320193/000032019321000056/0000320193-21-000056-xbrl.zip'
# download zip and extract
cache.cache_edgar_enclosure(url)
# todo find entry file
# process entry file
inst = XbrlParser(cache).parse_instance_locally(entryFile)
Is it possible to automatically find the entry file? One zip can have many xml files, with random names.
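To see what a given enclosure actually contains before choosing an entry file, the archive members can be listed and filtered by extension. This is only a minimal sketch using the standard-library `zipfile` module. The helper name `list_candidate_members` and the file names in the demo archive are made up for illustration and are not part of py-xbrl.

```python
import io
import zipfile


def list_candidate_members(zip_bytes, extensions=(".htm", ".xml", ".xsd")):
    """Return the archive member names whose extension suggests an XBRL-related file."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return [
            name for name in archive.namelist()
            if name.lower().endswith(extensions)
        ]


# Build a tiny in-memory archive with made-up file names to show the call.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("abc-20210101.htm", "<html/>")
    archive.writestr("abc-20210101.xsd", "<schema/>")
    archive.writestr("notes.txt", "ignored")

candidates = list_candidate_members(buffer.getvalue())
print(candidates)
```

The listing only narrows the set of files. It does not say which of them is the instance document.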
In the case of SEC submissions, you could use indices such as the Structured Disclosure RSS Feeds to determine which file is the instance document.
Finding a reliable general way to find the name of the instance document will probably be difficult. There is no general specification that defines the file names in the zip folder and which files are allowed in the zip enclosure.
The monthly rss feeds from the SEC also contain the link to the zip folder enclosures for the submission. Note that not every submission has a zip enclosure.
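As a rough illustration of reading enclosure links from such a feed, the sketch below parses a made-up sample in plain RSS 2.0 shape with the standard library. The sample feed, the `example.com` URL, and the helper name `enclosure_urls` are assumptions for illustration only. The real SEC feed carries additional, namespaced fields that are not modelled here.

```python
import xml.etree.ElementTree as ET

# Made-up sample in plain RSS 2.0 shape; not copied from the real SEC feed.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <item>
      <title>Example Corp (10-K)</title>
      <enclosure url="https://example.com/a-xbrl.zip" type="application/zip" length="1"/>
    </item>
    <item>
      <title>Other Corp (10-Q)</title>
    </item>
  </channel>
</rss>"""


def enclosure_urls(feed_xml):
    """Map each item title to its enclosure URL, or None if the item has no enclosure."""
    root = ET.fromstring(feed_xml)
    result = {}
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        enclosure = item.find("enclosure")
        # Compare against None explicitly: an element without children is falsy.
        result[title] = enclosure.get("url") if enclosure is not None else None
    return result


urls = enclosure_urls(SAMPLE_FEED)
print(urls)
```

Items that map to `None` correspond to submissions without a zip enclosure, so those would have to be downloaded file by file.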
Which SEC filings usually don't have a zip?
Here is one way to identify the entry file in a directory: check which file is not referenced by any of the others. It seems to work well.
import os
from pathlib import Path

def find_entry_file(dir):
    """Find the most likely entry file(s) in a filing directory."""
    # Collect the relevant files, ordered by extension priority
    validFiles = []
    entryCandidates = []
    for ext in '.htm .xml .xsd'.split():
        for f in os.listdir(dir):
            fFull = os.path.join(dir, f)
            if os.path.isfile(fFull) and f.lower().endswith(ext):
                validFiles.append(fFull)
    # Keep every file whose name is not referenced by any other file
    for f1 in validFiles:
        fileNm = os.path.basename(f1)
        # Check all other files for a reference to this one
        foundInOther = False
        for f2 in validFiles:
            if f1 != f2:
                # errors='ignore' avoids failing on bytes that do not decode
                if fileNm in Path(f2).read_text(errors='ignore'):
                    foundInOther = True
                    break
        if not foundInOther:
            entryCandidates.append(f1)
    # TODO: if there are multiple candidates, choose the biggest
    return entryCandidates

entryFile = find_entry_file(dir='ff')
print(entryFile)
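The TODO in `find_entry_file` (choose the biggest file when several candidates remain) could be handled along these lines. This is only a sketch: the helper name `pick_largest` is hypothetical, and treating file size as the tie-breaker is the heuristic suggested in the TODO, not something the XBRL specification guarantees.

```python
import os
import tempfile


def pick_largest(paths):
    """Return the path with the largest file size, or None for an empty list."""
    if not paths:
        return None
    return max(paths, key=os.path.getsize)


# Demo with two temporary files of different sizes (made-up names).
with tempfile.TemporaryDirectory() as tmp:
    small = os.path.join(tmp, "small.xml")
    large = os.path.join(tmp, "large.htm")
    with open(small, "w") as fh:
        fh.write("x")
    with open(large, "w") as fh:
        fh.write("x" * 100)
    largest_name = os.path.basename(pick_largest([small, large]))

print(largest_name)
```

Applied to the function above, this would be called as `pick_largest(find_entry_file(dir='ff'))`.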
Which SEC filings usually don't have a zip? @mrx23dot Unfortunately, I do not know exactly. There are a few months where all zip enclosures are missing. Apart from that, it looks pretty random to me. With the newer ones, however, just about every submission has a zip enclosure.
Yes, that's how it should work. It's a bit cumbersome, but if you don't want to access external indices, it may be a solution :)
Please report any problems you discover to structureddata@sec.gov, which handles questions about structured data (e.g., XBRL, XML, FpML, FIX). They are happy to look into them.
I'm trying to speed up the download and read that the lib supports zip download, but it looks like it's not complete yet. The provided zip file is a valid one. py-xbrl==2.0.4
gives