mccgr / edgar

Code to manage data related to SEC EDGAR

Design function returning start/end indices for key sections in SC 13D and SC 13G #69

Open bdcallen opened 4 years ago

bdcallen commented 4 years ago

@iangow

> Look at this Schedule 13D for instance, which is pretty typical of most of these forms (there are some exceptions of course, which will be more difficult). It has the part at the start associated with the header file. Then there is a title page (which includes the cusip number underneath which is "(Cusip Number)" on the next line) which usually ends with a paragraph that reads at the end "...but shall be subject to all other provisions of the Act (however, see the Notes).", though in this case there is a footnote. Then there is a set of cover pages (in this case just one, but can be more than one, particularly when there is more than one cusip involved), which in SC 13D has questions 1 (Name of Reporting Person) through to 14 (Type of Reporting Person) (in SC 13G it is 1 to 12). Then there is a section which contains the "Items" of the filing, usually 1 through to 10 (on amendments, the items where there has been no amendment are usually omitted). Finally, Item 10 contains the certification statement, which is then followed by the signatures, and then the exhibits (the indexes/titles of which are usually stated in Item 7). I actually have been working to scrape the whole of these documents, first by separating out the different section. Furthermore, I think the cusip numbers we get can be a whole lot cleaner if we scrape the whole form, as we can localize where the cusips are usually found, and then potentially guess what the cusips are in the case that they have less than 8 characters using other information in the form (for instance 'Common Stock' is almost always the first security for which a cusip is assigned for a given issuer, and normally the 7th and 8th digits (the issue identifier) are '10' for the first security assigned a cusip).

Looking at this initial comment from issue #62, the vast majority of SC 13D and 13G forms seem to follow the same structure: the header, then a title page, then the cover pages with questions 1 to 14 (or 1 to 12 for 13G), then the items section, then the signatures, then the exhibits. For the vast majority of forms (something like 90%), the starts and ends of these sections can be found with a small number of key regular expressions, just like the simpler case with the cusip numbers.
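A minimal sketch of how regex-based section detection could look, in Python. The function name `find_section_bounds`, the section names, and the three marker patterns are all assumptions for illustration, not code from this repo; real filings would need more alternatives per marker, plus patterns for the title page and exhibits.

```python
import re

# Hypothetical marker patterns, listed in the order the sections normally
# appear. These are illustrative guesses, not a vetted set.
SECTION_MARKERS = [
    ("cover_pages", re.compile(r"names?\s+of\s+reporting\s+persons?", re.I)),
    ("items", re.compile(r"^[ \t]*item\s+1\b", re.I | re.M)),
    ("signatures", re.compile(r"^[ \t]*signatures?\b", re.I | re.M)),
]


def find_section_bounds(text):
    """Return {section: (start, end)}, with None for sections not found.

    Each section starts at the first match of its marker (searching only
    after the previous marker) and ends where the next found section
    starts, or at the end of the text for the last one.
    """
    starts = []
    pos = 0
    for name, pattern in SECTION_MARKERS:
        match = pattern.search(text, pos)
        if match:
            starts.append((name, match.start()))
            pos = match.end()

    bounds = {name: None for name, _ in SECTION_MARKERS}
    for i, (name, start) in enumerate(starts):
        end = starts[i + 1][1] if i + 1 < len(starts) else len(text)
        bounds[name] = (start, end)
    return bounds
```

A section whose marker is missing comes back as `None`, which gives a simple flag for forms that do not follow the usual layout.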

I think we need a function which finds the starting and ending indices of these sections, as well as other information such as whether the form is of an alternate style, upper/lower bounds, and so on. I also think it would be helpful to have a program which makes this function write to a table in the database, so that we can get key information on cases which do not follow the normal pattern.
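As a rough sketch of the table-writing step, the snippet below stores one row per filing and section and lists the filings with a missing section. It uses Python's built-in `sqlite3` purely as a stand-in for the project database, and the table name `section_bounds`, its columns, and both function names are hypothetical.

```python
import sqlite3

# Hypothetical list of sections every regular filing is expected to have.
EXPECTED_SECTIONS = ("cover_pages", "items", "signatures")


def write_section_bounds(conn, file_name, bounds, expected=EXPECTED_SECTIONS):
    """Store one row per (filing, section); found = 0 marks a missing section."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS section_bounds (
               file_name TEXT,
               section TEXT,
               start_index INTEGER,
               end_index INTEGER,
               found INTEGER,
               PRIMARY KEY (file_name, section))"""
    )
    rows = []
    for section in expected:
        span = bounds.get(section)
        start, end = span if span else (None, None)
        rows.append((file_name, section, start, end, int(span is not None)))
    # INSERT OR REPLACE lets a filing be re-processed without duplicate rows.
    conn.executemany(
        "INSERT OR REPLACE INTO section_bounds VALUES (?, ?, ?, ?, ?)", rows
    )
    conn.commit()


def irregular_filings(conn):
    """Return file names with at least one expected section not found."""
    cur = conn.execute(
        "SELECT DISTINCT file_name FROM section_bounds "
        "WHERE found = 0 ORDER BY file_name"
    )
    return [row[0] for row in cur]
```

Recording the misses as rows, instead of skipping them, means the cases that do not follow the normal pattern can be pulled out with a single query.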