mccgr / edgar

Code to manage data related to SEC EDGAR

Make function which can scrape the title pages of SC 13D(/A) and 13G(/A) #71

Open bdcallen opened 4 years ago

bdcallen commented 4 years ago

@iangow

crsp=# SELECT COUNT(*) FROM edgar.sc13dg_indexes;
  count  
---------
 1368362
(1 row)

crsp=# SELECT COUNT(*) FROM edgar.sc13dg_indexes
crsp-# WHERE success;
  count  
---------
 1310116
(1 row)

crsp=# SELECT COUNT(*) FROM edgar.sc13dg_indexes 
WHERE cover_page_start > 0
AND NOT cover_page_q1_start > 0;
 count 
-------
     0
(1 row)

crsp=# SELECT COUNT(*) FROM edgar.sc13dg_indexes 
WHERE NOT cover_page_start > 0
AND cover_page_q1_start > 0;
 count 
-------
     0
(1 row)

crsp=# SELECT COUNT(*) FROM edgar.sc13dg_indexes 
WHERE cover_page_start > 0
AND cover_page_q1_start > 0;
  count  
---------
 1292126
(1 row)

I don't know if you have looked at my initial readme file for sc13dg_indexes yet, but one point I make there is that cover_page_start is among the most useful and most cleanly identified variables in the table. It marks the beginning of the cover page section: the pages listing the numbered questions, 1 to 14 for SC 13D(/A) and 1 to 12 for SC 13G(/A). The first question is typically stated with a string like 1. NAME OF REPORTING PERSON AND/OR S.S. OR I.R.S. IDENTIFICATION NO. OF ABOVE PERSON, and the last question is 14. TYPE OF REPORTING PERSON (SEE INSTRUCTIONS): (12 for SC 13G).

In the overwhelming majority of cases (at least 99.9 percent), the title page precedes the beginning of the cover page section. The title page typically contains important information, such as the CUSIP numbers, the names and classes of the securities, the date of the filing, the name of the issuer, and, if the filing is an SC 13D/A or SC 13G/A, the amendment number of the form. Some of this information can of course be extracted from other parts of the filing or from elsewhere. However, I believe scraping the title page would be the cleanest and most reliable way of extracting the CUSIPs associated with a filing accurately, including the correct number of CUSIPs (some forms have more than one). This information could be used later to help clean the scraping of information from other parts of the form, if we choose to do so.
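To make the proposal concrete, here is a rough Python sketch of what such a function might look like. It is only an illustration under several assumptions, not existing code in this repository: the names extract_title_page, find_cusips and cusip_check_digit are hypothetical, cover_page_start is assumed to be a character offset into the filing text (it may instead be a line number in the actual table), and the regular expression is a heuristic that only looks for 9-character identifiers written as 6-2-1 groups with optional spaces or hyphens. Candidates are kept only if they pass the standard CUSIP check-digit test.

```python
import re

# Heuristic (assumption): a CUSIP candidate is 6 + 2 alphanumerics + 1 digit,
# optionally separated by a single space or hyphen, e.g. "037833 10 0".
CUSIP_CANDIDATE = re.compile(
    r"(?<![A-Za-z0-9])([0-9A-Za-z]{6})[ -]?([0-9A-Za-z]{2})[ -]?([0-9])(?![A-Za-z0-9])"
)


def extract_title_page(filing_text, cover_page_start):
    """Return the text preceding the cover pages, or None if no offset is known.

    Assumes cover_page_start is a character offset (hypothetical; adjust if
    the table stores line numbers instead).
    """
    if cover_page_start is None or cover_page_start <= 0:
        return None
    return filing_text[:cover_page_start]


def cusip_check_digit(base8):
    """Compute the standard CUSIP check digit from the first 8 characters."""
    total = 0
    for i, ch in enumerate(base8.upper()):
        value = int(ch) if ch.isdigit() else ord(ch) - ord("A") + 10
        if i % 2 == 1:  # every second character is doubled
            value *= 2
        total += value // 10 + value % 10
    return (10 - total % 10) % 10


def find_cusips(title_page):
    """Return distinct candidates on the title page that pass the check digit."""
    found = []
    for match in CUSIP_CANDIDATE.finditer(title_page):
        cusip = "".join(match.groups()).upper()
        if cusip_check_digit(cusip[:8]) == int(cusip[8]) and cusip not in found:
            found.append(cusip)
    return found


if __name__ == "__main__":
    # Made-up filing text for illustration only.
    sample = (
        "SCHEDULE 13D\n"
        "Common Stock\n(Title of Class of Securities)\n"
        "037833 10 0\n(CUSIP Number)\n"
        "1. NAME OF REPORTING PERSON\n"
    )
    start = sample.index("1. NAME OF REPORTING PERSON")
    print(find_cusips(extract_title_page(sample, start)))
```

The check-digit filter is there to cut down on false positives such as other 9-character tokens on the title page. It will not catch everything, so the output would still need to be compared against a sample of filings before being relied on.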