mccgr / edgar

Code to manage data related to SEC EDGAR
31 stars 15 forks source link

Address cusips with 7 digits #87

Closed bdcallen closed 4 years ago

bdcallen commented 4 years ago

@iangow As discussed in #84, this issue is for analysing cusips with 7 characters.

iangow commented 4 years ago

@bdcallen Just FYI, changed my status for this repository to "not watching". So if you want me to see a comment here, just "mention" me. But this means you can structure comments on issues without worrying about it pinging me, etc. (eg, you can edit, delete comments).

bdcallen commented 4 years ago
> cusip_cik %>% filter(cusip_length == 7) %>% distinct(cik, cusip) %>% count()
# Source:   lazy query [?? x 1]
# Database: postgres 11.0.7 [bdcallen@/var/run/postgresql:5432/crsp]
      n
  <dbl>
1   613

> cusip_cik %>% filter(cusip_length == 7) %>% inner_join(cusip_cik, by = c('file_name', 'cik')) %>% filter(cusip_length.y == 9) %>% distinct(cik, cusip.x) %>% count()
# Source:   lazy query [?? x 1]
# Database: postgres 11.0.7 [bdcallen@/var/run/postgresql:5432/crsp]
      n
  <dbl>
1   106

> cusip_cik %>% 
> filter(cusip_length == 7) %>% 
> inner_join(cusip_cik, by = c('file_name', 'cik', 'cusip6')) %>% 
> filter(cusip_length.y == 9) %>% distinct(cik, cusip.x) %>% count()
# Source:   lazy query [?? x 1]
# Database: postgres 11.0.7 [bdcallen@/var/run/postgresql:5432/crsp]
      n
  <dbl>
1    19

> cusip_cik %>% filter(cusip_length == 7) %>% inner_join(cusip_cik, by = c('file_name', 'cik')) %>% filter(cusip_length.y == 8) %>% distinct(cik, cusip.x) %>% count()
# Source:   lazy query [?? x 1]
# Database: postgres 11.0.7 [bdcallen@/var/run/postgresql:5432/crsp]
      n
  <dbl>
1    47

> cusip_cik %>% filter(cusip_length == 7) %>% 
> inner_join(cusip_cik, by = c('file_name', 'cik', 'cusip6')) %>% 
> filter(cusip_length.y == 8) %>% 
> distinct(cik, cusip.x) %>% count()
# Source:   lazy query [?? x 1]
# Database: postgres 11.0.7 [bdcallen@/var/run/postgresql:5432/crsp]
      n
  <dbl>
1    14
bdcallen commented 4 years ago
> cusip7s <- cusip_cik %>% filter(cusip_length == 7)
> cusip7s
# A tibble: 3,833 x 20
     cik cusip6 n_cusip6 cusip multiplicity file_name check_digit company_name formats cusip_length num_cusip_uniqu… cusip8 cusip7_1 cusip6_2 cusip5_3 cusip4_4
   <int> <chr>     <int> <chr>        <int> <chr>           <int> <chr>        <chr>          <int>            <dbl> <chr>  <chr>    <chr>    <chr>    <chr>   
 1  2062 461620        2 4616…            2 edgar/da…           7 ACMAT CORP   ACD                7                6 46162… 4616207  461620   461627   461607  
 2  2062 461620        2 4616…            2 edgar/da…           7 ACMAT CORP   ACD                7                6 46162… 4616207  461620   461627   461607  
 3  2145 504110        2 5041…            2 edgar/da…           8 ACTION INDU… C                  7                5 50411… 5041108  504110   504118   504108  
 4  2145 504110        2 5041…            2 edgar/da…           8 ACTION INDU… C                  7                5 50411… 5041108  504110   504118   504108  
 5  2491 364541        2 3645…            2 edgar/da…           3 ALLIANCE GA… C                  7                6 36454… 3645410  364541   364540   364510  
 6  2491 364541        2 3645…            2 edgar/da…           3 ALLIANCE GA… C                  7                6 36454… 3645410  364541   364540   364510  
 7  2809 847410        6 8474…            6 edgar/da…           8 AGNICO EAGL… ACD                7                5 84741… 8474108  847410   847418   847408  
 8  2809 847410        6 8474…            6 edgar/da…           8 AGNICO EAGL… ACD                7                5 84741… 8474108  847410   847418   847408  
 9  2809 847410        6 8474…            6 edgar/da…           8 AGNICO EAGL… ACD                7                5 84741… 8474108  847410   847418   847408  
10  2809 847410        6 8474…            6 edgar/da…           8 AGNICO EAGL… ACD                7                5 84741… 8474108  847410   847418   847408  
# … with 3,823 more rows, and 4 more variables: cusip3_5 <chr>, cusip2_6 <chr>, cusip1_7 <chr>, cusip0_8 <chr>

> cusip7s %>% filter(multiplicity >= 10) %>% distinct(cik, cusip) %>% count()
# A tibble: 1 x 1
      n
  <int>
1   111

> seven_w_89 <- cusip7s %>% inner_join(cusip_cik, by = 'cik') %>% 
      filter(cusip_length.y >= 8) %>% distinct(cik, cusip.x, multiplicity.x) %>% 
      rename(cusip = cusip.x, multiplicity = multiplicity.x) %>% 
      filter(multiplicity >= 10) %>% distinct(cik, cusip) 

> cusip7s %>% anti_join(seven_w_89) %>% distinct(cik, cusip) %>% count()
Joining, by = c("cik", "cusip")
# A tibble: 1 x 1
      n
  <int>
1   508

> cusip7s %>% anti_join(seven_w_89) %>% filter(multiplicity >= 10) %>% distinct(cik, cusip) %>% count() 
Joining, by = c("cik", "cusip")
# A tibble: 1 x 1
      n
  <int>
1     6

cases_7 <- cusip7s %>% anti_join(seven_w_89) %>% 
                     filter(multiplicity >= 10) %>% 
                     distinct(cik, cusip)

cases_7 <- cases_7 %>% mutate(cusip6 = substr(cusip, 1, 6), cusip_1pad = paste0('0', cusip),
                                   check_digit_1pad = calculate_cusip_check_digit(cusip_1pad), 
                                   cusip_2pad = paste0('00', cusip), 
                                   check_digit_2pad = calculate_cusip_check_digit(cusip_2pad), 
                                   cusip6_1pad = substr(cusip_1pad, 1, 6), 
                                   cusip6_2pad = substr(cusip_2pad, 1, 6))

> cases_7 %>% mutate(ncusip = cusip_1pad) %>% inner_join(stocknames)
Joining, by = c("cusip", "ncusip")
# A tibble: 0 x 24
# … with 24 variables: cik <int>, cusip <chr>, cusip6 <chr>, cusip_1pad <chr>, check_digit_1pad <dbl>, cusip_2pad <chr>, check_digit_2pad <dbl>, cusip6_1pad <chr>,
#   cusip6_2pad <chr>, ncusip <chr>, permno <int>, permco <int>, namedt <date>, nameenddt <date>, ticker <chr>, comnam <chr>, hexcd <dbl>, exchcd <dbl>, siccd <dbl>,
#   shrcd <dbl>, shrcls <chr>, st_date <date>, end_date <date>, namedum <dbl>
> cases_7 %>% mutate(ncusip = substr(cusip_2pad, 1, 8)) %>% inner_join(stocknames)
Joining, by = c("cusip", "ncusip")
# A tibble: 0 x 24
# … with 24 variables: cik <int>, cusip <chr>, cusip6 <chr>, cusip_1pad <chr>, check_digit_1pad <dbl>, cusip_2pad <chr>, check_digit_2pad <dbl>, cusip6_1pad <chr>,
#   cusip6_2pad <chr>, ncusip <chr>, permno <int>, permco <int>, namedt <date>, nameenddt <date>, ticker <chr>, comnam <chr>, hexcd <dbl>, exchcd <dbl>, siccd <dbl>,
#   shrcd <dbl>, shrcls <chr>, st_date <date>, end_date <date>, namedum <dbl>

> cases_7 %>% inner_join(issuers) %>% left_join(ciks) %>% select(cik, cusip, cusip6, company_name, issuer_name_1, issuer_adl_1, issuer_status)
Joining, by = "cusip6"
Joining, by = "cik"
# A tibble: 5 x 7
      cik cusip   cusip6 company_name                            issuer_name_1          issuer_adl_1 issuer_status
    <int> <chr>   <chr>  <chr>                                   <chr>                  <chr>        <chr>        
1   74702 6838361 683836 OPTICAL RADIATION CORP                  OPTICAL RADIATION CORP NA           A            
2  896717 5949490 594949 COMMONWEALTH ASSOCIATES GROWTH FUND INC MICROCAP FD INC        NA           A            
3  896717 5949490 594949 MICROCAP FUND INC                       MICROCAP FD INC        NA           A            
4  896717 5949490 594949 MICROCAP FUND IND                       MICROCAP FD INC        NA           A            
5 1022259 74157E1 74157E PRIME SERVICE INC                       PRIME SVC INC          NA           A            

> cases_7 %>% rename(cusip6_orig = cusip6, cusip6 = cusip6_1pad) %>% inner_join(issuers) %>% left_join(ciks) %>% select(cik, cusip, cusip6, company_name, issuer_name_1, issuer_adl_1, issuer_status)
Joining, by = "cusip6"
Joining, by = "cik"
# A tibble: 6 x 7
     cik cusip   cusip6 company_name                            issuer_name_1               issuer_adl_1                  issuer_status
   <int> <chr>   <chr>  <chr>                                   <chr>                       <chr>                         <chr>        
1  74702 6838361 068383 OPTICAL RADIATION CORP                  BARRE TR CO VT              MERGED INTO MERCHANTS NATL BK D            
2 314590 6777450 067774 SASOL LTD/ADR/                          BARNES & NOBLE INC          NA                            A            
3 314590 6777450 067774 SASOL LTD                               BARNES & NOBLE INC          NA                            A            
4 896717 5949490 059494 COMMONWEALTH ASSOCIATES GROWTH FUND INC BANC AMER ALTERNATIVE LN TR NA                            A            
5 896717 5949490 059494 MICROCAP FUND INC                       BANC AMER ALTERNATIVE LN TR NA                            A            
6 896717 5949490 059494 MICROCAP FUND IND                       BANC AMER ALTERNATIVE LN TR NA                            A

> cases_7 %>% rename(cusip6_orig = cusip6, cusip6 = cusip6_2pad) %>% inner_join(issuers) %>% left_join(ciks) %>% select(cik, cusip, cusip6, company_name, issuer_name_1, issuer_adl_1, issuer_status)
Joining, by = "cusip6"
Joining, by = "cik"
# A tibble: 1 x 7
      cik cusip   cusip6 company_name      issuer_name_1                 issuer_adl_1 issuer_status
    <int> <chr>   <chr>  <chr>             <chr>                         <chr>        <chr>        
1 1022259 74157E1 007415 PRIME SERVICE INC ADVANCE COSMETIC & ANTI-AGING NA           A 
bdcallen commented 4 years ago

Here is a set of matches with two leading zeros added to the left of the 7 digit cusip.

> View(cusip7s %>% inner_join(cusip_cik, by = 'cik') %>% 
+         filter(cusip_length.y == 9 & 
                   cusip.x == substr(cusip.y, nchar(cusip.y) - 6, nchar(cusip.y))) %>% 
          distinct(cik, cusip.x, cusip.y) %>% 
+         filter(grepl('^[0]{2}', cusip.y)) %>% 
           mutate(check_digit = calculate_cusip_check_digit(cusip.y)) %>% 
           filter(substr(cusip.y, 9, 9) == as.character(check_digit)) %>% 
           mutate(ncusip = substr(cusip.y, 1, 8)) %>% 
           inner_join(stocknames) %>% left_join(ciks) %>% 
           distinct(cik, cusip.x, cusip.y, ncusip, company_name, comnam))
cik cusip.x cusip.y ncusip company_name comnam
1 2062 4616207 004616207 00461620 ACMAT CORP ACMAT CORP
2 2145 5041108 005041108 00504110 ACTION INDUSTRIES INC ACTION INDUSTRIES INC
3 2809 8474108 008474108 00847410 AGNICO EAGLE MINES LTD AGNICO EAGLE MINES LTD
4 60911 7548100 007548100 00754810 ADVANCED TECHNICAL PRODUCTS INC ADVANCED TECHNICAL PRODUCTS INC
5 60911 7548100 007548100 00754810 LUNN INDUSTRIES INC /DE/ ADVANCED TECHNICAL PRODUCTS INC
6 74783 4308102 004308102 00430810 ACCEPTANCE INSURANCE COMPANIES INC ACCEPTANCE INSURANCE COS INC
7 319126 1058106 001058106 00105810 AFP IMAGING CORP A F P IMAGING CORP
8 353184 9224304 009224304 00922430 AIR T INC AIR TRANSPORTATION HOLDING INC
9 353184 9224304 009224304 00922430 AIR TRANSPORTATION HOLDING CO INC AIR TRANSPORTATION HOLDING INC
10 718877 4930202 004930202 00493020 ACTIVISION BLIZZARD, INC. ACTIVISION INC NEW
11 718877 4930202 004930202 00493020 ACTIVISION INC /NY ACTIVISION INC NEW
12 733269 5125109 005125109 00512510 ACXIOM CORP ACXIOM CORP
13 733269 5125109 005125109 00512510 ACXIOM HOLDINGS, INC. ACXIOM CORP
14 733269 5125109 005125109 00512510 LIVERAMP HOLDINGS, INC. ACXIOM CORP
15 733269 5125109 005125109 00512510 ACXIOM CORP ACXIOM HOLDINGS INC
16 733269 5125109 005125109 00512510 ACXIOM HOLDINGS, INC. ACXIOM HOLDINGS INC
17 733269 5125109 005125109 00512510 LIVERAMP HOLDINGS, INC. ACXIOM HOLDINGS INC
18 804212 9363102 009363102 00936310 AIRGAS INC AIRGAS INC
19 849116 4403101 004403101 00440310 ACE CASH EXPRESS INC/TX ACE CASH EXPRESS INC
20 879796 1930205 001930205 00193020 ARI NETWORK SERVICES INC /WI A R I NETWORK SERVICES INC
21 899629 4239109 004239109 00423910 ACADIA REALTY TRUST ACADIA REALTY TRUST
22 899629 4239109 004239109 00423910 MARK CENTERS TRUST ACADIA REALTY TRUST
23 907687 4934105 004934105 00493410 ACTEL CORP ACTEL CORP
24 908598 817R103 00817R103 00817R10 AETRIUM INC AETRIUM INC
25 908598 817R103 00817R103 00817R10 ATRM HOLDINGS, INC. AETRIUM INC
26 909276 504A108 00504A108 00504A10 ACTIONPOINT INC ACTIONPOINT INC
27 909276 504A108 00504A108 00504A10 CAPTIVA SOFTWARE CORP ACTIONPOINT INC
28 909276 504A108 00504A108 00504A10 CORNERSTONE IMAGING INC ACTIONPOINT INC
29 909276 504A108 00504A108 00504A10 INPUT SOFTWARE INC ACTIONPOINT INC
30 912601 4936100 004936100 00493610 ACRES GAMING INC ACRES GAMING INC
31 931784 2420107 002420107 00242010 APPLIED VOICE TECHNOLOGY INC /WA/ AVT CORPORATION
32 931784 2420107 002420107 00242010 AVT CORP AVT CORPORATION
33 931784 2420107 002420107 00242010 CAPTARIS INC AVT CORPORATION
34 934549 3881307 003881307 00388130 ACACIA RESEARCH CORP ACACIA RESEARCH CORP
35 935503 8363103 008363103 00836310 AG-CHEM EQUIPMENT CO INC AG CHEM EQUIPMENT INC
36 943003 8916108 008916108 00891610 AGRIUM INC AGRIUM INC
37 1002225 7974108 007974108 00797410 ADVENT SOFTWARE INC /DE/ ADVENT SOFTWARE INC
38 1021080 2553105 002553105 00255310 ABACUS DIRECT CORP ABACUS DIRECT CORP
39 1086844 9367103 009367103 00936710 AIRGATE PCS INC /DE/ AIRGATE P C S INC
40 1086844 9367103 009367103 00936710 AIRGATE WIRELESS INC AIRGATE P C S INC
41 1168213 7865108 007865108 00786510 AEROPOSTALE INC AEROPOSTALE INC
42 1168213 7865108 007865108 00786510 ARO LIQUIDATION, INC. AEROPOSTALE INC

All of the distinct cik-cusip combinations seem to be legitimate matches here. Furthermore, defining dataframes with the matches to cusipm.issuer and to crsp.stocknames respectively as m7_issuers and m7_stocknames

m7_issuers <- cusip7s %>% inner_join(cusip_cik, by = 'cik') %>% 
             filter(cusip_length.y == 9 & 
                     cusip.x == substr(cusip.y, nchar(cusip.y) - 6, nchar(cusip.y))) %>% 
             distinct(cik, cusip.x, cusip.y) %>% 
             filter(grepl('^[0]{2}', cusip.y)) %>% 
             mutate(check_digit = calculate_cusip_check_digit(cusip.y)) %>% 
             filter(substr(cusip.y, 9, 9) == as.character(check_digit)) %>% 
             mutate(cusip6 = substr(cusip.y, 1, 6)) %>% 
             inner_join(issuers) %>% left_join(ciks) %>% 
            distinct(cik, cusip.x, cusip.y, cusip6, company_name, issuer_name_1)

m7_stocknames <- cusip7s %>% inner_join(cusip_cik, by = 'cik') %>% 
                               filter(cusip_length.y == 9 & 
                               cusip.x == substr(cusip.y, nchar(cusip.y) - 6, nchar(cusip.y))) %>% 
                          distinct(cik, cusip.x, cusip.y) %>% 
                          filter(grepl('^[0]{2}', cusip.y)) %>% 
                          mutate(check_digit = calculate_cusip_check_digit(cusip.y)) %>% 
                          filter(substr(cusip.y, 9, 9) == as.character(check_digit)) %>% 
                          mutate(ncusip = substr(cusip.y, 1, 8)) %>% inner_join(stocknames) %>% 
                          left_join(ciks) %>% 
                          distinct(cik, cusip.x, cusip.y, ncusip, company_name, comnam)

We see that there are 25 distinct matches

> m7_issuers %>% 
  distinct(cik, cusip.x) %>% count()
# A tibble: 1 x 1
      n
  <int>
1    25

and further that every match in m7_issuers is in m7_stocknames and vice versa.

> m7_stocknames %>% distinct(cik, cusip.x) %>% 
   rename(cusip = cusip.x) %>% anti_join(m7_issuers %>% 
   distinct(cik, cusip.x) %>% rename(cusip = cusip.x))
Joining, by = c("cik", "cusip")
# A tibble: 0 x 2
# … with 2 variables: cik <int>, cusip <chr>
> m7_issuers %>% distinct(cik, cusip.x) %>% 
                               rename(cusip = cusip.x) %>% 
           anti_join(m7_stocknames %>% distinct(cik, cusip.x) %>% rename(cusip = cusip.x))
Joining, by = c("cik", "cusip")
# A tibble: 0 x 2
# … with 2 variables: cik <int>, cusip <chr>
bdcallen commented 4 years ago

@iangow Like I did with my previous posts in #89 and #86, I'm going to discuss some detail of what I did, and then just state where you can find my result. Further down in my notebook handle_cusip_cik_exceptions.ipynb, you can find some cells defining dataframes added_value_7s_above_10_w_issuers, added_value_7s_above_10_stocknames, added_value_7s_above_10_1pad_w_issuers, added_value_7s_above_10_1pad_w_stocknames, added_value_7s_above_10_2pad_w_issuers and added_value_7s_above_10_2pad_w_stocknames. These are defined analogously to similar dataframes mentioned in #89 and #86, with the two dataframes containing "1pad" being rows where we consider cusip to be cusip_raw padded with a single zero on the left (ie. '5627733' becomes '05627733'), and those containing "2pad" being cases where we consider cusip to be cusip_raw padded with two zeros on the left (ie. '5627733' becomes '005627733').

Note also, that in the first two dataframes corresponding to the unpadded cases, I take cusip to be the first six characters of cusip_raw. The reason I did this was because the relation of the seventh digit to the cusip6 is not consistent: sometimes seventh digit is a check digit (like those found in the issuer_check field in cusipm.issuer), sometimes it is shorthand for some digit with two zeros (ie. if the digit is '1', the 2-digit issuer number (7th and 8th digits of a 9-digit cusip) is '10', with the check digit being '0', so short for '100'). So I took the cusip6 to be the cusip for the sake of simplicity and consistency. As the number of cases was relatively low, I also ignored approximate string matching.

You can find the result for the unpadded cusips using

SELECT * FROM edgar.cusip_cik_exceptions
WHERE LENGTH(cusip_raw) = 7 AND LENGTH(cusip) = 6

the result for the 1-padded cusips using

SELECT * FROM edgar.cusip_cik_exceptions
WHERE LENGTH(cusip_raw) = 7 AND LENGTH(cusip) = 8

and the result for the 2-padded cusips using

SELECT * FROM edgar.cusip_cik_exceptions
WHERE LENGTH(cusip_raw) = 7 AND LENGTH(cusip) = 9

I believe we can close this issue now.