lukerosiak / pysec

Parse XBRL filings from the SEC's EDGAR in Python

xbrl / xml documents embedded in .txt submission #6

Open jsfenfen opened 11 years ago

jsfenfen commented 11 years ago

In Models.xbrl, xbrl_localpath() assumes the xbrl filename has an .xml extension. But in some cases the xml/xbrl documents appear to have been included in a larger text submission. For example, see here: http://www.sec.gov/Archives/edgar/data/320193/0001193125-12-023398.txt (there are several distinct xbrl files included there). It looks like the text file also includes a binary zip file within a block. All inside a .txt file. Which is, uh, odd.

I came across this while looking into handling 10-Q filings--perhaps this isn't an issue for 10-Ks. What do you think is the best way to handle this? Should parsing XML from within the .txt file be part of the download() step? Or is there another file location that these should be pulled from?
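For reference, EDGAR full-submission .txt files wrap each component in SGML-style <DOCUMENT>...</DOCUMENT> blocks, with <TYPE>, <SEQUENCE>, and <FILENAME> header lines and the payload between <TEXT> and </TEXT>. A rough sketch of splitting one apart and pulling out just the .xml components (function names here are illustrative, not pysec's API) might look like:

```python
import re

# SGML-style wrappers EDGAR uses inside full-submission .txt files.
DOC_RE = re.compile(r'<DOCUMENT>(.*?)</DOCUMENT>', re.DOTALL)
HEADER_RE = re.compile(r'<(TYPE|SEQUENCE|FILENAME)>([^\r\n<]+)')

def split_submission(raw_text):
    """Yield (headers, body) for each document embedded in the .txt."""
    for block in DOC_RE.findall(raw_text):
        headers = dict(HEADER_RE.findall(block))
        # The payload sits between <TEXT> and </TEXT>.
        start = block.find('<TEXT>')
        end = block.find('</TEXT>')
        body = block[start + len('<TEXT>'):end] if start != -1 and end != -1 else ''
        yield headers, body

def xbrl_documents(raw_text):
    """Keep only the components whose FILENAME ends in .xml."""
    return [(h, b) for h, b in split_submission(raw_text)
            if h.get('FILENAME', '').endswith('.xml')]
```

This ignores the uuencoded binary chunks (images, the zip) entirely, which is probably fine if the goal is just to reach the XBRL instance documents.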

lukerosiak commented 11 years ago

The file listed in the SEC's index (the filename attribute on the Index model) points to that single file, which includes the HTML version, its supplements, the XBRL versions, and any images (in binary) all in one, separated by XML-like tags. It's the URL at the Index.html_link() method: http://www.sec.gov/Archives/edgar/data/38067/000119312512146114/0001193125-12-146114.txt

That string can be transformed into the human-readable index of all those components at index_link(): http://www.sec.gov/Archives/edgar/data/38067/000119312512146114/0001193125-12-146114-index.htm
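The transformation is just a suffix swap, using the sample URL above. As a standalone sketch (the helper mirrors what Index.index_link() is described as doing, but isn't pysec's actual code):

```python
def index_link(txt_url):
    # Swap the '.txt' suffix for '-index.htm' to get the
    # human-readable index page for the same accession number.
    assert txt_url.endswith('.txt')
    return txt_url[:-len('.txt')] + '-index.htm'
```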

But I don't know how, other than parsing that HTML, to automatically get the URL of the main submission. So I figured I'd access the version of the 10-K filed for human consumption by downloading the bigass file and extracting the HTML chunk.

But as you can see on that sample index page, there's the main HTML file, then a bunch more HTML files, some of which appear to be interesting tables and others which are more form-letter-type things. The index says there are 17 documents, but when I search for in the big .txt file, there are 107 occurrences. So I'm not sure what's going on, and the .html() method as currently written won't quite do what I had in mind, which was to capture all of the narrative portion of the 10-K laid out for humans. It will only capture part of it, which may or may not be the biggest portion.

In any case, all of that is to permit the possibility of text analysis of narratives. In terms of parsing structured financial data, the xbrl_link() should find the path to the zip file containing the XBRL--I have focused exclusively on 10-Ks on all this, though I figured it would work for all types that have XBRL associated with them... but maybe there is another pattern that is used to build links to quarterlies.
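As a side note, for filings with XBRL attached, EDGAR generally serves the zip of the instance and its schemas in the same archive directory under the name <accession>-xbrl.zip. A hedged sketch of building that link from the submission URL (this mirrors what xbrl_link() is described as doing; the function name is illustrative, and whether the same naming holds for quarterlies is exactly the open question above):

```python
def xbrl_zip_link(txt_url):
    # Assumed EDGAR convention: the XBRL bundle for an accession lives
    # next to the full submission, with '-xbrl.zip' in place of '.txt'.
    assert txt_url.endswith('.txt')
    return txt_url[:-len('.txt')] + '-xbrl.zip'
```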


jsfenfen commented 11 years ago

OK, at some point I'll look more closely at the SEC's spec for this kind of submission (I think it's there). If I'm following, there's a unique sequence_number for each individual piece. My sense is the way to handle this is with a separate model entirely--call it index_document. At the point when a filing is downloaded, I'd populate index_document. If it's a normal 10-K and the XBRL is found, then index_document is just the XBRL file; if it's a giant mishmash of assorted formats tagged as text, then some (all? only the xbrl?) are extracted, saved to the local file system, and entered into index_document. Not sure this is the best path, but I find it appealing because it gives the possibility of including other files for later analysis. Also, it may have some bearing on 8-Ks referenced here: https://github.com/lukerosiak/pysec/issues/3 -- but I'm not sure.
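To make the index_document idea concrete, here's a minimal plain-Python sketch (pysec itself uses Django models, and all names here are hypothetical): one record per extracted component, keyed by accession number and sequence number, with a routing function that decides which embedded documents to keep. This version keeps only the .xml components, i.e. the "only the xbrl?" option; keeping everything is the other obvious choice.

```python
from dataclasses import dataclass

@dataclass
class IndexDocument:
    accession: str      # e.g. '0001193125-12-023398'
    sequence: int       # unique sequence number within the submission
    doc_type: str       # e.g. '10-K', 'EX-101.INS'
    filename: str       # name the component was filed under
    local_path: str     # where the extracted chunk would be saved

def plan_extraction(accession, headers_list, save_dir='filings'):
    """Turn per-document header dicts into index_document records.

    headers_list holds one dict per embedded document, with the
    TYPE/SEQUENCE/FILENAME values pulled from the .txt wrapper tags.
    """
    records = []
    for h in headers_list:
        name = h.get('FILENAME', '')
        if not name.endswith('.xml'):
            continue  # skip non-XBRL components in this variant
        records.append(IndexDocument(
            accession=accession,
            sequence=int(h['SEQUENCE']),
            doc_type=h.get('TYPE', ''),
            filename=name,
            local_path='%s/%s/%s' % (save_dir, accession, name),
        ))
    return records
```

In a Django port, IndexDocument would become a model with a ForeignKey to Index and a unique_together on (accession, sequence), and plan_extraction would run inside download().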