mccgr / edgar

Code to manage data related to SEC EDGAR

Find comprehensive list of regular expressions for extracting cusips in making edgar.cusips #60

Closed bdcallen closed 4 years ago

bdcallen commented 4 years ago

@iangow This issue is for collecting the regular expressions we can use to extract CUSIP numbers, as I mentioned in a previous post in #11. So far we've got three:

# Raw strings throughout, so that \s, \( etc. reach the regex engine unchanged
cusip_hdr = r'CUSIP\s+(?:No\.|#|Number):?'
cusip_fmt = (r'([0-9A-Z]{1,3}[\s-]?[0-9A-Z]{3,5}[\s-]?[0-9A-Z]{2}[\s-]?[0-9A-Z]{1}|'
             r'[0-9A-Z]{4,6}[\s-]?[0-9A-Z]{2}[\s-]?[0-9A-Z]{1})')

# A: header followed by the number; B and C: number followed by a caption line
regex_dict = {'A': cusip_hdr + r'[\t\r\n\s]+' + cusip_fmt,
              'B': cusip_fmt + r'[\n]?[_-]?\s+(?:[_-]{9,})?[\s\r\t\n]*\(CUSIP Number\)',
              'C': cusip_fmt + r'[\s\t\r]*[\n]?' + r'[\s\t\r]*\(CUSIP Number of Class of Securities\)'
              }
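
As a quick sanity check, the three patterns can be run against short snippets. This is only a sketch: the sample strings are made up for illustration (they are not taken from real filings), and `12345A100` is a placeholder, not a real CUSIP.

```python
import re

cusip_hdr = r'CUSIP\s+(?:No\.|#|Number):?'
cusip_fmt = (r'([0-9A-Z]{1,3}[\s-]?[0-9A-Z]{3,5}[\s-]?[0-9A-Z]{2}[\s-]?[0-9A-Z]{1}|'
             r'[0-9A-Z]{4,6}[\s-]?[0-9A-Z]{2}[\s-]?[0-9A-Z]{1})')

regex_dict = {
    'A': cusip_hdr + r'[\t\r\n\s]+' + cusip_fmt,
    'B': cusip_fmt + r'[\n]?[_-]?\s+(?:[_-]{9,})?[\s\r\t\n]*\(CUSIP Number\)',
    'C': cusip_fmt + r'[\s\t\r]*[\n]?' + r'[\s\t\r]*\(CUSIP Number of Class of Securities\)',
}

# Hypothetical snippets, one per pattern (not from real filings).
samples = {
    'A': 'CUSIP No. 12345A100',
    'B': '12345A100\n' + '-' * 12 + '\n(CUSIP Number)',
    'C': '12345A-10-0\n(CUSIP Number of Class of Securities)',
}

# Record what each pattern captures from its own snippet.
results = {}
for key, text in samples.items():
    match = re.search(regex_dict[key], text)
    results[key] = match.group(1) if match else None

print(results)
```

One thing this kind of check makes visible: because `cusip_fmt` allows optional whitespace before its final character, pattern A on its own can absorb the first character of the following line when that character is a digit or capital letter, so trimming or validating the capture may be worthwhile.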
iangow commented 4 years ago

I think you could for now leave the second one as D if that's what it was in the original. This might make it easier to do the comparison with the current data as a "test table". You might put ['A', 'D', 'C'] in a list if you want to apply the regexes in a certain order (dictionaries only guarantee insertion order from Python 3.7 onward).

Also, it might be easiest to put the matched regexes in a list (e.g., ['A', 'C']) which is then stored as an array in PostgreSQL. Though it is easy to accomplish this from what you have anyway:

crsp=# SELECT string_to_array('AB', NULL);
 string_to_array 
-----------------
 {A,B}
(1 row)
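
A minimal sketch of the two suggestions above, assuming the second pattern keeps the key 'D': the regexes are applied in a fixed order taken from a list, and the keys that matched are collected in a list. The function name `match_cusips` and the sample text are hypothetical.

```python
import re

cusip_hdr = r'CUSIP\s+(?:No\.|#|Number):?'
cusip_fmt = (r'([0-9A-Z]{1,3}[\s-]?[0-9A-Z]{3,5}[\s-]?[0-9A-Z]{2}[\s-]?[0-9A-Z]{1}|'
             r'[0-9A-Z]{4,6}[\s-]?[0-9A-Z]{2}[\s-]?[0-9A-Z]{1})')

regex_dict = {
    'A': cusip_hdr + r'[\t\r\n\s]+' + cusip_fmt,
    'D': cusip_fmt + r'[\n]?[_-]?\s+(?:[_-]{9,})?[\s\r\t\n]*\(CUSIP Number\)',
    'C': cusip_fmt + r'[\s\t\r]*[\n]?' + r'[\s\t\r]*\(CUSIP Number of Class of Securities\)',
}

# Explicit order, independent of how the dictionary happens to be built.
regex_order = ['A', 'D', 'C']


def match_cusips(text, order=regex_order):
    """Return (matched_keys, cusips) for the patterns that match, in order."""
    matched_keys, cusips = [], []
    for key in order:
        m = re.search(regex_dict[key], text)
        if m:
            matched_keys.append(key)
            cusips.append(m.group(1))
    return matched_keys, cusips


# Hypothetical filing text; 12345A100 is a placeholder, not a real CUSIP.
sample = ('CUSIP No. 12345A100\n\n'
          '12345A100\n(CUSIP Number of Class of Securities)')
keys, cusips = match_cusips(sample)
print(keys, cusips)
```

A list such as `keys` maps naturally onto a PostgreSQL `text[]` column; psycopg2, for instance, adapts Python lists to arrays, though how the table is actually loaded is an assumption here.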
bdcallen commented 4 years ago

@iangow I think we should also use this issue to list filings that are not fully processed, at least in the sense that we don't extract every CUSIP number in the text, and open follow-up issues for them where necessary. I'll start with a couple. This one has two CUSIP numbers above the line with the (CUSIP Number) piece; my function get_cusip_cik currently gets only one of the numbers for the relevant pattern. This one also has two numbers above that line, but with the labels Common Stock, Class A: and Common Stock, Class B: in front of them.
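
One possible way to pick up several numbers above the caption, sketched under assumptions: find the (CUSIP Number) line, then scan the few lines above it for a CUSIP-like token at the end of each line. The helper `cusips_above_marker`, the `max_lines` window, and the sample text are all hypothetical, and this is not how get_cusip_cik currently works.

```python
import re

cusip_fmt = (r'([0-9A-Z]{1,3}[\s-]?[0-9A-Z]{3,5}[\s-]?[0-9A-Z]{2}[\s-]?[0-9A-Z]{1}|'
             r'[0-9A-Z]{4,6}[\s-]?[0-9A-Z]{2}[\s-]?[0-9A-Z]{1})')

# Assumption: each CUSIP sits at the end of its line, optionally after a label.
line_cusip = re.compile(cusip_fmt + r'\s*$')
marker = re.compile(r'\(CUSIP Number\)')


def cusips_above_marker(text, max_lines=4):
    """Collect CUSIP-like tokens from the few lines above '(CUSIP Number)'."""
    m = marker.search(text)
    if m is None:
        return []
    # Heuristic window: only the last `max_lines` lines before the caption.
    lines_above = text[:m.start()].splitlines()[-max_lines:]
    found = []
    for line in lines_above:
        hit = line_cusip.search(line)
        if hit:
            found.append(hit.group(1))
    return found


# Made-up excerpt with placeholder identifiers (not real CUSIPs).
sample = (
    'Common Stock, Class A: 12345A100\n'
    'Common Stock, Class B: 12345A209\n'
    '--------------\n'
    '(CUSIP Number)\n'
)
print(cusips_above_marker(sample))
```

Scanning line by line keeps one number from running into the next line, at the cost of missing layouts where two numbers share a line.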

bdcallen commented 4 years ago

@iangow This one should really have been caught by the pattern described as 'D' in your original Perl file, but the number of dashes in the dashed line falls below the minimum of 9 (it has 8 dashes).
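
To illustrate the effect of the threshold, here is a sketch in which the minimum number of dashes is a parameter. `build_pattern_d` is a hypothetical helper, the sample is made up, and the relaxed value of 5 is arbitrary; whether to relax the limit at all is a separate decision.

```python
import re

cusip_fmt = (r'([0-9A-Z]{1,3}[\s-]?[0-9A-Z]{3,5}[\s-]?[0-9A-Z]{2}[\s-]?[0-9A-Z]{1}|'
             r'[0-9A-Z]{4,6}[\s-]?[0-9A-Z]{2}[\s-]?[0-9A-Z]{1})')


def build_pattern_d(min_dashes=9):
    """Pattern 'D' with the dashed-line minimum made configurable."""
    tail = r'[\n]?[_-]?\s+(?:[_-]{%d,})?[\s\r\t\n]*\(CUSIP Number\)' % min_dashes
    return cusip_fmt + tail


# Made-up excerpt: a placeholder identifier above a line of 8 dashes.
sample = '12345A100\n' + '-' * 8 + '\n(CUSIP Number)'

strict = re.search(build_pattern_d(9), sample)   # original threshold
relaxed = re.search(build_pattern_d(5), sample)  # arbitrary lower threshold

print(strict)
print(relaxed.group(1))
```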

bdcallen commented 4 years ago

And this one has no dashes in the line between the number and the (CUSIP Number) line.

iangow commented 4 years ago

Again,

> @iangow This one should really have been caught by the pattern described as 'D' in your original Perl file, but the number of dashes in the dashed line falls below the minimum of 9 (it has 8 dashes).

Check whether it (or a similar case) was detected by the old Perl code. I think we want to replicate that first, then worry about improving it. The reality is that a "statistical" approach here is fine. We don't need to parse every filing, just enough of them to detect a given CUSIP-CIK combination and be confident that it's valid.
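
A rough sketch of what such a "statistical" filter could look like, assuming one (cusip, cik) observation per filing: count how often each pair occurs and keep only pairs seen often enough. The function names, the `min_filings` threshold of 3, and the data are all made up for illustration.

```python
import re
from collections import Counter


def normalize(cusip):
    """Drop the spaces and hyphens that the extraction regexes allow."""
    return re.sub(r'[\s-]', '', cusip)


def confident_pairs(observations, min_filings=3):
    """Keep (cusip, cik) pairs observed in at least `min_filings` filings."""
    counts = Counter((normalize(cusip), cik) for cusip, cik in observations)
    return {pair: n for pair, n in counts.items() if n >= min_filings}


# Made-up observations, one per filing; identifiers are placeholders.
observations = [
    ('12345A100', 1000001),
    ('12345A-10-0', 1000001),
    ('12345A 10 0', 1000001),
    ('12345A100', 1000002),   # seen once: possibly a mis-extraction
    ('99999Z109', 1000003),   # seen once: not enough evidence yet
]
print(confident_pairs(observations))
```

With a filter like this, a filing that the regexes miss only matters if it leaves a genuine pair below the threshold.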