mccgr / edgar

Code to manage data related to SEC EDGAR

Download Schedule 13D and 13G into 6TB #62

bdcallen closed 4 years ago

bdcallen commented 4 years ago

@iangow

Look at this Schedule 13D, for instance, which is fairly typical of these forms (there are exceptions, of course, which will be more difficult). It starts with the part associated with the header file. Then there is a title page, which includes the CUSIP number (it sits directly above the line reading "(Cusip Number)") and usually ends with a paragraph whose last words are "...but shall be subject to all other provisions of the Act (however, see the Notes).", though in this case there is a footnote. Then there is a set of cover pages (just one in this case, but there can be more, particularly when more than one CUSIP is involved), which in SC 13D contain questions 1 (Name of Reporting Person) through 14 (Type of Reporting Person); in SC 13G they run from 1 to 12. Then there is a section containing the "Items" of the filing, usually 1 through 10 (on amendments, items with no changes are usually omitted). Finally, Item 10 contains the certification statement, which is followed by the signatures and then the exhibits (whose indexes/titles are usually stated in Item 7).

I have actually been working on scraping the whole of these documents, first by separating out the different sections. Furthermore, I think the CUSIP numbers we get can be a lot cleaner if we scrape the whole form, as we can localize where the CUSIPs are usually found and then potentially guess what the CUSIPs are when they have fewer than 8 characters, using other information in the form (for instance, 'Common Stock' is almost always the first security assigned a CUSIP for a given issuer, and the 7th and 8th characters (the issue identifier) are normally '10' for the first security assigned a CUSIP).
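To make the CUSIP idea above concrete, here is a minimal Python sketch, not the code committed to this repo. It assumes the CUSIP sits on the nearest non-blank line above "(Cusip Number)", and that a bare 6-character issuer code can be completed with the issue identifier '10'. The function names (`find_cusip_candidates`, `cusip_check_digit`, `complete_cusip`) are hypothetical; the check digit follows the standard CUSIP "modulus 10 double add double" rule.

```python
import re

# Special characters permitted in a CUSIP and their numeric values.
_SPECIAL = {"*": 36, "@": 37, "#": 38}


def cusip_check_digit(first8):
    """Return the CUSIP check digit for the first 8 characters."""
    if len(first8) != 8:
        raise ValueError("expected exactly 8 characters")
    total = 0
    for pos, ch in enumerate(first8.upper(), start=1):
        if ch.isdigit():
            value = int(ch)
        elif "A" <= ch <= "Z":
            value = ord(ch) - ord("A") + 10
        elif ch in _SPECIAL:
            value = _SPECIAL[ch]
        else:
            raise ValueError(f"invalid CUSIP character: {ch!r}")
        if pos % 2 == 0:  # every second character is doubled
            value *= 2
        total += value // 10 + value % 10
    return (10 - total % 10) % 10


def find_cusip_candidates(text):
    """Collect the text found directly above each "(Cusip Number)" line.

    Assumption: the CUSIP is on the nearest non-blank line above the label.
    """
    lines = text.splitlines()
    candidates = []
    for i, line in enumerate(lines):
        if re.search(r"\(\s*cusip\s+number", line, flags=re.IGNORECASE):
            j = i - 1
            while j >= 0 and not lines[j].strip():
                j -= 1
            if j >= 0:
                # Drop spaces and punctuation, e.g. "594918 10 4" -> "594918104".
                cleaned = re.sub(r"[^0-9A-Za-z]", "", lines[j]).upper()
                candidates.append(cleaned)
    return candidates


def complete_cusip(candidate):
    """Return a 9-character CUSIP, or None if the candidate cannot be used.

    A 6-character issuer code is completed with issue identifier '10',
    which is only a guess that suits a first common-stock issue.
    """
    if len(candidate) == 6:
        candidate += "10"  # guessed issue identifier
    if len(candidate) == 8:
        return candidate + str(cusip_check_digit(candidate))
    if len(candidate) == 9:
        valid = str(cusip_check_digit(candidate[:8])) == candidate[8]
        return candidate if valid else None
    return None
```

A guess completed this way should probably be flagged as inferred rather than observed, so it can be checked against other information in the form before it is trusted.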

bdcallen commented 4 years ago

@iangow I have just committed the code that I have written over the last few weeks on this issue.

bdcallen commented 4 years ago

@iangow Just before the Christmas break I ran download_filing_docs.R on the SC 13D, SC 13D/A, SC 13G and SC 13G/A forms, subsetting on the documents of type txt. Just over 2 million documents were downloaded, according to filing_docs_processed. According to the file management system, the total size is around 100.7 GB.

[screenshot: num_sc_13dg]

Note that the item count I got for the folder seems to include the subdirectories as well as the actual files.
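One way to separate the two counts is to walk the download directory and tally files and subdirectories separately. The sketch below is illustrative only: `summarize_tree` is a hypothetical helper, and the directory you pass in is whatever root the downloads were written to.

```python
from pathlib import Path


def summarize_tree(root):
    """Return (number of files, number of subdirectories, total bytes) under root."""
    n_files = 0
    n_dirs = 0
    total_bytes = 0
    for path in Path(root).rglob("*"):
        if path.is_dir():
            n_dirs += 1
        elif path.is_file():
            n_files += 1
            total_bytes += path.stat().st_size
    return n_files, n_dirs, total_bytes
```

The file count from this walk can then be compared with the number of rows recorded in filing_docs_processed.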

Today, I've started looking at what failed to be downloaded and why.

iangow commented 4 years ago

OK. If this is pretty much done, it might make sense to make a separate issue for each of the ways that downloading failed, link those back to this issue, and close this one.

bdcallen commented 4 years ago

> OK. If this is pretty much done, it might make sense to make a separate issue for each of the ways that downloading failed, link those back to this issue, and close this one.

@iangow I initially meant this issue as a catch-all for scraping the forms into tables as well as downloading them, though it probably makes more sense to make smaller issues for the functions needed to do the latter step. So I will rename this issue as a download issue and then close it.

bdcallen commented 4 years ago

@iangow See my later comments here regarding the documents linked to filing_docs_alt and the cases which were not downloaded at all. I will now close this.