mccgr / edgar

Code to manage data related to SEC EDGAR
31 stars 15 forks source link

Address cusips with 6 digits #88

Closed bdcallen closed 4 years ago

bdcallen commented 4 years ago

@iangow As discussed in #84, this issue is for analysing cusips with 6 characters.

bdcallen commented 4 years ago
> cusip_cik %>% filter(cusip_length == 6) %>% distinct(cik, cusip) %>% count()
# Source:   lazy query [?? x 1]
# Database: postgres 11.0.7 [bdcallen@/var/run/postgresql:5432/crsp]
      n
  <dbl>
1  2059
> cusip_cik %>% filter(cusip_length == 6) %>% inner_join(cusip_cik, by = c('file_name', 'cik'))  %>% 
> filter(cusip_length.y == 9) %>% distinct(cik, cusip.x) %>% count()
# Source:   lazy query [?? x 1]
# Database: postgres 11.0.7 [bdcallen@/var/run/postgresql:5432/crsp]
      n
  <dbl>
1   801

> cusip_cik %>% filter(cusip_length == 6) %>% 
> inner_join(cusip_cik, by = c('file_name', 'cik', 'cusip6')) %>% 
> filter(cusip_length.y == 9) %>% distinct(cik, cusip.x) %>% count()
# Source:   lazy query [?? x 1]
# Database: postgres 11.0.7 [bdcallen@/var/run/postgresql:5432/crsp]
      n
  <dbl>
1   653

> cusip_cik %>% filter(cusip_length == 6) %>% inner_join(cusip_cik, by = c('file_name', 'cik')) %>%
> filter(cusip_length.y == 8) %>% distinct(cik, cusip.x) %>% count()
# Source:   lazy query [?? x 1]
# Database: postgres 11.0.7 [bdcallen@/var/run/postgresql:5432/crsp]
      n
  <dbl>
1    69

> cusip_cik %>% filter(cusip_length == 6) %>% 
> inner_join(cusip_cik, by = c('file_name', 'cik', 'cusip6')) %>% 
> filter(cusip_length.y == 8) %>% distinct(cik, cusip.x) %>% count()
# Source:   lazy query [?? x 1]
# Database: postgres 11.0.7 [bdcallen@/var/run/postgresql:5432/crsp]
      n
  <dbl>
1    39

> cusip_cik %>% filter(cusip_length == 6) %>% 
> inner_join(cusip_cik, by = c('file_name', 'cik')) %>% 
> filter(cusip_length.y == 7) %>% distinct(cik, cusip.x) %>% count()
# Source:   lazy query [?? x 1]
# Database: postgres 11.0.7 [bdcallen@/var/run/postgresql:5432/crsp]
      n
  <dbl>
1    14

> cusip_cik %>% filter(cusip_length == 6) %>% 
> inner_join(cusip_cik, by = c('file_name', 'cik', 'cusip6')) %>% 
> filter(cusip_length.y == 7) %>% distinct(cik, cusip.x) %>% count()
# Source:   lazy query [?? x 1]
# Database: postgres 11.0.7 [bdcallen@/var/run/postgresql:5432/crsp]
      n
  <dbl>
1     5
bdcallen commented 4 years ago
> six_to_nine_df <- cusip_cik %>% filter(cusip_length == 6) %>% 
+                     inner_join(cusip_cik, by = c('file_name', 'cik', 'cusip6')) %>% 
+                     filter(cusip_length.y == 9) %>% distinct(cik, cusip.x) %>% rename(cusip = cusip.x)
> six_to_nine_df
# A tibble: 653 x 2
       cik cusip 
     <int> <chr> 
 1  870480 45767A
 2  315999 638221
 3 1045280 68338A
 4  793279 136907
 5 1063227 125270
 6  790406 590175
 7    5172 026522
 8  910523 53215R
 9 1002176 985837
10    8699 042167
# … with 643 more rows
> cusip_cik %>% filter(cusip_length == 6) %>% anti_join(six_to_nine_df)
Joining, by = c("cusip", "cik")
# A tibble: 8,082 x 18
   file_name cusip check_digit    cik company_name formats cusip_length cusip6 num_cusip_uniqu… cusip8 cusip7_1 cusip6_2 cusip5_3 cusip4_4 cusip3_5 cusip2_6 cusip1_7
   <chr>     <chr>       <int>  <int> <chr>        <chr>          <int> <chr>             <dbl> <chr>  <chr>    <chr>    <chr>    <chr>    <chr>    <chr>    <chr>   
 1 edgar/da… 3793…           1 8.32e5 GLOBALNET F… ABC                6 37937R                4 37937R 37937R   37937R   37937    3793R    3797R    3737R    3937R   
 2 edgar/da… 3957…           8 9.12e5 GST TELECOM… A                  6 39573Q                5 39573Q 39573Q   39573Q   39573    3957Q    3953Q    3973Q    3573Q   
 3 edgar/da… 4063…           6 3.19e5 HALLWOOD EN… AC                 6 40636X                5 40636X 40636X   40636X   40636    4063X    4066X    4036X    4636X   
 4 edgar/da… 4063…           6 3.19e5 HALLWOOD EN… AC                 6 40636X                5 40636X 40636X   40636X   40636    4063X    4066X    4036X    4636X   
 5 edgar/da… 4532…           2 8.19e5 IN HOME HEA… AC                 6 453222                4 453222 453222   453222   45322    45322    45322    45222    43222   
 6 edgar/da… 4273…           3 4.70e4 HERLEY INDU… AC                 6 427398                6 427398 427398   427398   42739    42738    42798    42398    47398   
 7 edgar/da… 0216…           6 1.02e6 HERTZ TECHN… AC                 6 021679                6 021679 021679   021679   02167    02169    02179    02679    01679   
 8 edgar/da… 5904…           2 1.07e6 MERRY LAND … AC                 6 590441                5 590441 590441   590441   59044    59041    59041    59441    50441   
 9 edgar/da… 4273…           3 4.70e4 HERLEY INDU… AC                 6 427398                6 427398 427398   427398   42739    42738    42798    42398    47398   
10 edgar/da… 4532…           2 8.19e5 IN HOME HEA… AC                 6 453222                4 453222 453222   453222   45322    45322    45322    45222    43222   
# … with 8,072 more rows, and 1 more variable: cusip0_8 <chr>
> cusip_cik %>% filter(cusip_length == 6) %>% anti_join(six_to_nine_df) %>% distinct(cik, cusip)
Joining, by = c("cusip", "cik")
# A tibble: 1,406 x 2
       cik cusip 
     <int> <chr> 
 1  832324 37937R
 2  911522 39573Q
 3  319019 40636X
 4  818645 453222
 5   47035 427398
 6 1020726 021679
 7 1069546 590441
 8  945372 454758
 9  922814 927649
10  922814 972649
# … with 1,396 more rows
> six_no_nine_df <- cusip_cik %>% filter(cusip_length == 6) %>% anti_join(six_to_nine_df) %>% distinct(cik, cusip)
Joining, by = c("cusip", "cik")
> six_no_nine_df
# A tibble: 1,406 x 2
       cik cusip 
     <int> <chr> 
 1  832324 37937R
 2  911522 39573Q
 3  319019 40636X
 4  818645 453222
 5   47035 427398
 6 1020726 021679
 7 1069546 590441
 8  945372 454758
 9  922814 927649
10  922814 972649
# … with 1,396 more rows
> six_no_nine_df %>% group_by(cik) %>% summarise(num = n())
# A tibble: 1,333 x 2
     cik   num
   <int> <int>
 1  1800     1
 2  2093     1
 3  2098     1
 4  3545     1
 5  4164     1
 6  5320     1
 7  6715     1
 8  6814     1
 9  8302     1
10  8504     2
# … with 1,323 more rows
> six_no_nine_df %>% group_by(cik) %>% summarise(num = n()) %>% filter(num > 1)
# A tibble: 71 x 2
      cik   num
    <int> <int>
 1   8504     2
 2  67383     2
 3  80737     2
 4 101265     2
 5 106413     2
 6 108516     2
 7 701856     3
 8 725876     2
 9 749660     2
10 760461     2
# … with 61 more rows
> six_no_nine_df
# A tibble: 1,406 x 2
       cik cusip 
     <int> <chr> 
 1  832324 37937R
 2  911522 39573Q
 3  319019 40636X
 4  818645 453222
 5   47035 427398
 6 1020726 021679
 7 1069546 590441
 8  945372 454758
 9  922814 927649
10  922814 972649
# … with 1,396 more rows
> six_no_nine_df %>% inner_join(cusip_cik, by = 'cik') %>% filter(cusip_length == 8)
# A tibble: 2,225 x 19
      cik cusip.x file_name cusip.y check_digit company_name formats cusip_length cusip6 num_cusip_uniqu… cusip8 cusip7_1 cusip6_2 cusip5_3 cusip4_4 cusip3_5 cusip2_6
    <int> <chr>   <chr>     <chr>         <int> <chr>        <chr>          <int> <chr>             <dbl> <chr>  <chr>    <chr>    <chr>    <chr>    <chr>    <chr>   
 1 911522 39573Q  edgar/da… 361942…           5 GST TELECOM… AB                 8 361942                7 36194… 3619421  3619420  3619410  3619210  3614210  3694210 
 2 911522 39573Q  edgar/da… 361942…           5 GST TELECOM… ABCD               8 361942                7 36194… 3619421  3619420  3619410  3619210  3614210  3694210 
 3 911522 39573Q  edgar/da… 361942…           5 GST TELECOM… ABCD               8 361942                7 36194… 3619421  3619420  3619410  3619210  3614210  3694210 
 4 911522 39573Q  edgar/da… 361942…           5 GST TELECOM… AB                 8 361942                7 36194… 3619421  3619420  3619410  3619210  3614210  3694210 
 5  47035 427398  edgar/da… SEECOV…           1 HERLEY INDU… D                  8 SEECOV                6 SEECO… SEECOVE  SEECOVR  SEECOER  SEECVER  SEEOVER  SECOVER 
 6  47035 427398  edgar/da… SEECOV…           1 HERLEY INDU… D                  8 SEECOV                6 SEECO… SEECOVE  SEECOVR  SEECOER  SEECVER  SEEOVER  SECOVER 
 7  47035 427398  edgar/da… 427398…           2 HERLEY INDU… ACD                8 427398                8 42739… 4273981  4273980  4273910  4273810  4279810  4239810 
 8  47035 427398  edgar/da… 427398…           2 HERLEY INDU… ACD                8 427398                8 42739… 4273981  4273980  4273910  4273810  4279810  4239810 
 9  47035 427398  edgar/da… 427398…           2 HERLEY INDU… ACD                8 427398                8 42739… 4273981  4273980  4273910  4273810  4279810  4239810 
10  47035 427398  edgar/da… 427398…           2 HERLEY INDU… ACD                8 427398                8 42739… 4273981  4273980  4273910  4273810  4279810  4239810 
# … with 2,215 more rows, and 2 more variables: cusip1_7 <chr>, cusip0_8 <chr>
> six_no_nine_df %>% inner_join(cusip_cik, by = 'cik') %>% filter(cusip_length == 8) %>% distinct(cik, cusip.x) %>% rename(cusip = cusip.x) 
# A tibble: 312 x 2
       cik cusip 
     <int> <chr> 
 1  911522 39573Q
 2   47035 427398
 3  726293 82845R
 4   20041 170268
 5 1060801 22674V
 6   25793 227478
 7  804312 683815
 8  840216 42550H
 9  808015 286133
10  842638 925313
# … with 302 more rows
> six_to_nine_df %>% inner_join(cusip_cik, by = 'cik') %>% filter(cusip_length == 8) %>% distinct(cik, cusip.x) %>% rename(cusip = cusip.x) 
# A tibble: 199 x 2
       cik cusip 
     <int> <chr> 
 1  870480 45767A
 2 1045280 68338A
 3  793279 136907
 4  790406 590175
 5  910523 53215R
 6  949536 422317
 7 1088147 00504E
 8  879933 462622
 9  720031 204682
10  724606 050103
# … with 189 more rows
> six_no_nine_df %>% inner_join(cusip_cik, by = 'cik') %>% filter(cusip_length == 8) %>% distinct(cik, cusip.x) %>% rename(cusip = cusip.x) 
# A tibble: 312 x 2
       cik cusip 
     <int> <chr> 
 1  911522 39573Q
 2   47035 427398
 3  726293 82845R
 4   20041 170268
 5 1060801 22674V
 6   25793 227478
 7  804312 683815
 8  840216 42550H
 9  808015 286133
10  842638 925313
# … with 302 more rows
> six_no_nine_df
# A tibble: 1,406 x 2
       cik cusip 
     <int> <chr> 
 1  832324 37937R
 2  911522 39573Q
 3  319019 40636X
 4  818645 453222
 5   47035 427398
 6 1020726 021679
 7 1069546 590441
 8  945372 454758
 9  922814 927649
10  922814 972649
# … with 1,396 more rows
> cusip_cik %>% filter(cusip_length == 6) %>% distinct(cik, cusip)
# A tibble: 2,059 x 2
       cik cusip 
     <int> <chr> 
 1  832324 37937R
 2  911522 39573Q
 3  319019 40636X
 4  818645 453222
 5   47035 427398
 6 1020726 021679
 7 1069546 590441
 8  945372 454758
 9  870480 45767A
10  922814 927649
# … with 2,049 more rows
> six_no_nine_df
# A tibble: 1,406 x 2
       cik cusip 
     <int> <chr> 
 1  832324 37937R
 2  911522 39573Q
 3  319019 40636X
 4  818645 453222
 5   47035 427398
 6 1020726 021679
 7 1069546 590441
 8  945372 454758
 9  922814 927649
10  922814 972649
# … with 1,396 more rows
> six_no_nine_df %>% inner_join(cusip_cik, by = 'cik') %>% filter(cusip_length == 8) %>% distinct(cik, cusip.x) %>% rename(cusip = cusip.x) 
# A tibble: 312 x 2
       cik cusip 
     <int> <chr> 
 1  911522 39573Q
 2   47035 427398
 3  726293 82845R
 4   20041 170268
 5 1060801 22674V
 6   25793 227478
 7  804312 683815
 8  840216 42550H
 9  808015 286133
10  842638 925313
# … with 302 more rows
> six_no_nine_df %>% inner_join(cusip_cik, by = 'cik') %>% filter(cusip_length == 8) %>% distinct(cik, cusip.x) %>% rename(cusip = cusip.x) 
# A tibble: 312 x 2
       cik cusip 
     <int> <chr> 
 1  911522 39573Q
 2   47035 427398
 3  726293 82845R
 4   20041 170268
 5 1060801 22674V
 6   25793 227478
 7  804312 683815
 8  840216 42550H
 9  808015 286133
10  842638 925313
# … with 302 more rows
> six_no_nine_but_eight_df <- six_no_nine_df %>% inner_join(cusip_cik, by = 'cik') %>% filter(cusip_length == 8) %>% distinct(cik, cusip.x) %>% rename(cusip = cusip.x) 
> six_no_89_df <- six_no_nine_df %>% anti_join(six_no_nine_but_eight_df)
Joining, by = c("cik", "cusip")
> six_no_89_df
# A tibble: 1,094 x 2
       cik cusip 
     <int> <chr> 
 1  832324 37937R
 2  319019 40636X
 3  818645 453222
 4 1020726 021679
 5 1069546 590441
 6  945372 454758
 7  922814 927649
 8  922814 972649
 9  717238 554806
10 1000168 83559I
# … with 1,084 more rows
> six_no_89_df inner_join(cusip_cik, by = 'cik') %>% filter(cusip_length == 7) %>% distinct(cik, cusip.x) %>% rename(cusip = cusip.x) 
Error: unexpected symbol in "six_no_89_df inner_join"
> six_no_89_df %>% inner_join(cusip_cik, by = 'cik') %>% filter(cusip_length == 7) %>% distinct(cik, cusip.x) %>% rename(cusip = cusip.x) 
# A tibble: 25 x 2
       cik cusip 
     <int> <chr> 
 1  884650 456910
 2  723125 595112
 3  852952 919610
 4  909727 586818
 5  943003 008916
 6 1078295 56781Q
 7 1047919 O23723
 8 1047919 023723
 9  865439 4063T5
10  913955 05155P
# … with 15 more rows
> six_no_89_but_7_df <- six_no_89_df %>% inner_join(cusip_cik, by = 'cik') %>% filter(cusip_length == 7) %>% distinct(cik, cusip.x) %>% rename(cusip = cusip.x) 
> just_six_df <- six_no_89_df %>% anti_join(six_no_89_but_7_df)
Joining, by = c("cik", "cusip")
> just_six_df
# A tibble: 1,069 x 2
       cik cusip 
     <int> <chr> 
 1  832324 37937R
 2  319019 40636X
 3  818645 453222
 4 1020726 021679
 5 1069546 590441
 6  945372 454758
 7  922814 927649
 8  922814 972649
 9  717238 554806
10 1000168 83559I
# … with 1,059 more rows
> just_six_df %>% inner_join(cusip_cik)
Joining, by = c("cik", "cusip")
# A tibble: 6,101 x 18
      cik cusip file_name check_digit company_name formats cusip_length cusip6 num_cusip_uniqu… cusip8 cusip7_1 cusip6_2 cusip5_3 cusip4_4 cusip3_5 cusip2_6 cusip1_7
    <int> <chr> <chr>           <int> <chr>        <chr>          <int> <chr>             <dbl> <chr>  <chr>    <chr>    <chr>    <chr>    <chr>    <chr>    <chr>   
 1 832324 3793… edgar/da…           1 GLOBALNET F… ABC                6 37937R                4 37937R 37937R   37937R   37937    3793R    3797R    3737R    3937R   
 2 832324 3793… edgar/da…           1 GLOBALNET F… ABC                6 37937R                4 37937R 37937R   37937R   37937    3793R    3797R    3737R    3937R   
 3 832324 3793… edgar/da…           1 GLOBALNET F… ABC                6 37937R                4 37937R 37937R   37937R   37937    3793R    3797R    3737R    3937R   
 4 832324 3793… edgar/da…           1 GLOBALNET F… ACD                6 37937R                4 37937R 37937R   37937R   37937    3793R    3797R    3737R    3937R   
 5 832324 3793… edgar/da…           1 GLOBALNET F… ACD                6 37937R                4 37937R 37937R   37937R   37937    3793R    3797R    3737R    3937R   
 6 832324 3793… edgar/da…           1 GLOBALNET F… ABC                6 37937R                4 37937R 37937R   37937R   37937    3793R    3797R    3737R    3937R   
 7 832324 3793… edgar/da…           1 GLOBALNET F… AC                 6 37937R                4 37937R 37937R   37937R   37937    3793R    3797R    3737R    3937R   
 8 832324 3793… edgar/da…           1 GLOBALNET F… AC                 6 37937R                4 37937R 37937R   37937R   37937    3793R    3797R    3737R    3937R   
 9 832324 3793… edgar/da…           1 GLOBALNET F… AC                 6 37937R                4 37937R 37937R   37937R   37937    3793R    3797R    3737R    3937R   
10 832324 3793… edgar/da…           1 GLOBALNET F… AC                 6 37937R                4 37937R 37937R   37937R   37937    3793R    3797R    3737R    3937R   
# … with 6,091 more rows, and 1 more variable: cusip0_8 <chr>
> just_six_df %>% inner_join(cusip_cik) %>% distinct(cik, cusip, check_digit, company_name)
Joining, by = c("cik", "cusip")
# A tibble: 1,111 x 4
       cik cusip  check_digit company_name               
     <int> <chr>        <int> <chr>                      
 1  832324 37937R           1 GLOBALNET FINANCIAL COM INC
 2  319019 40636X           6 HALLWOOD ENERGY CORP       
 3  818645 453222           2 IN HOME HEALTH INC /MN/    
 4 1020726 021679           6 HERTZ TECHNOLOGY GROUP INC 
 5 1069546 590441           2 MERRY LAND PROPERTIES INC  
 6  945372 454758           4 SIGCORP INC                
 7  922814 927649           4 VIRBAC CORP                
 8  922814 972649           8 VIRBAC CORP                
 9  717238 554806           0 MACNEAL SCHWENDLER CORP    
10 1000168 83559I           2 BIOMETRIC SECURITY CORP/BC 
# … with 1,101 more rows
> just_six_df %>% inner_join(cusip_cik) %>% distinct(cik, cusip, check_digit, company_name) %>% rename(ncusip = cusip)
Joining, by = c("cik", "cusip")
# A tibble: 1,111 x 4
       cik ncusip check_digit company_name               
     <int> <chr>        <int> <chr>                      
 1  832324 37937R           1 GLOBALNET FINANCIAL COM INC
 2  319019 40636X           6 HALLWOOD ENERGY CORP       
 3  818645 453222           2 IN HOME HEALTH INC /MN/    
 4 1020726 021679           6 HERTZ TECHNOLOGY GROUP INC 
 5 1069546 590441           2 MERRY LAND PROPERTIES INC  
 6  945372 454758           4 SIGCORP INC                
 7  922814 927649           4 VIRBAC CORP                
 8  922814 972649           8 VIRBAC CORP                
 9  717238 554806           0 MACNEAL SCHWENDLER CORP    
10 1000168 83559I           2 BIOMETRIC SECURITY CORP/BC 
# … with 1,101 more rows
> just_six_df %>% inner_join(cusip_cik) %>% distinct(cik, cusip, check_digit, company_name) %>% rename(ncusip = cusip) %>% inner_join(stocknames, by = 'ncusip')
Joining, by = c("cik", "cusip")
# A tibble: 0 x 19
# … with 19 variables: cik <int>, ncusip <chr>, check_digit <int>, company_name <chr>, permno <int>, permco <int>, namedt <date>, nameenddt <date>, cusip <chr>,
#   ticker <chr>, comnam <chr>, hexcd <dbl>, exchcd <dbl>, siccd <dbl>, shrcd <dbl>, shrcls <chr>, st_date <date>, end_date <date>, namedum <dbl>
bdcallen commented 4 years ago
> just_six_df %>% inner_join(cusip_cik) %>% distinct(cik, cusip, check_digit, company_name) %>% mutate(cusip6 = cusip) %>% inner_join(stocknames %>% mutate(cusip6 = substr(ncusip, 1, 6)), by = 'cusip6') %>% distinct(cik, cusip.x) %>% count()
Joining, by = c("cik", "cusip")
# A tibble: 1 x 1
      n
  <int>
1   620
> six_not_in_stocknames_df <- just_six_df %>% anti_join(just_six_df %>% inner_join(cusip_cik) %>% distinct(cik, cusip, check_digit, company_name) %>% mutate(cusip6 = cusip) %>% inner_join(stocknames %>% mutate(cusip6 = substr(ncusip, 1, 6)), by = 'cusip6') %>% distinct(cik, cusip.x) %>% rename(cusip = cusip.x))
Joining, by = c("cik", "cusip")
Joining, by = c("cik", "cusip")
> six_not_in_stocknames_df
# A tibble: 449 x 2
       cik cusip 
     <int> <chr> 
 1 1020726 021679
 2  922814 972649
 3 1000168 83559I
 4  929646 024848
 5 1059671 14147Q
 6  924648 PAGEOF
 7  869986 122898
 8 1066808 CLASSB
 9 1035039 46185N
10 1035451 84763K
# … with 439 more rows
> issuers
# A tibble: 244,062 x 23
   cusip6 issuer_check issuer_name_1 issuer_name_2 issuer_name_3 issuer_adl_1 issuer_adl_2 issuer_adl_3 issuer_adl_4 issuer_sort_key issuer_type issuer_status
 * <chr>  <chr>        <chr>         <chr>         <chr>         <chr>        <chr>        <chr>        <chr>        <chr>           <chr>       <chr>        
 1 000021 6            A & A INTL I… NA            NA            FORMERLY A … 08/05/1997   NA           NA           A & A FOODS LT… C           A            
 2 000024 0            A & B TRANSN… NA            NA            NA           NA           NA           NA           A & B TRANSPOR… C           A            
 3 000027 3            A & E CAP FD… NA            NA            NA           NA           NA           NA           A & E CAPITAL … C           A            
 4 00002T 5            A&B GEOSCIEN… NA            NA            NAME CHANGE… CORP 05/05/… NA           NA           A & B GEOSCIEN… C           D            
 5 000032 3            A & E PLASTI… NA            NA            CONVERTED T… 06/22/1978   NA           NA           A & E PLASTIK … C           D            
 6 000060 4            A& ECONOMY L… NA            NA            NA           NA           NA           NA           A & ECONOMY LO… C           A            
 7 00007T 0            A & F ENTMT … NA            NA            NA           NA           NA           NA           A & F ENTERTAI… C           A            
 8 000096 8            A & F INVEST… NA            NA            NA           NA           NA           NA           A & F INVESTIN… C           A            
 9 0000CG 2            CGA MINING L… CDS-          NA            NA           NA           NA           NA           CGA MINING LTD… C           A            
10 000101 6            A & J VENTUR… NA            NA            NAME CHANGE… 02/24/2011 … NA           NA           A & J VENTURE … C           D            
# … with 244,052 more rows, and 11 more variables: issuer_del_date <date>, issuer_transaction <chr>, issuer_state_code <chr>, issuer_update_date <date>,
#   cabre_id <chr>, cabre_status <chr>, lei_cici <chr>, legal_entity_name <chr>, previous_name <chr>, entry_date <date>, cp_institution_type <chr>
> six_not_in_stocknames_df %>% mutate(cusip6 = cusip) %>% inner_join(issuers, by = 'cusip6')
# A tibble: 290 x 25
      cik cusip cusip6 issuer_check issuer_name_1 issuer_name_2 issuer_name_3 issuer_adl_1 issuer_adl_2 issuer_adl_3 issuer_adl_4 issuer_sort_key issuer_type
    <int> <chr> <chr>  <chr>        <chr>         <chr>         <chr>         <chr>        <chr>        <chr>        <chr>        <chr>           <chr>      
 1 1.06e6 1414… 14147Q 8            CARDINAL AIR… NA            NA            NA           NA           NA           NA           CARDINAL AIRLI… C          
 2 1.04e6 4618… 46185N 0            INVU INC      NA            NA            NA           NA           NA           NA           INVU INC      … C          
 3 1.04e6 8476… 84763K 7            SPECTRUMEDIX… NA            NA            NA           NA           NA           NA           SPECTRUMEDIX C… C          
 4 1.08e6 2940… 29408K 3            ENVIRONMENTA… WORLDWIDE INC NA            NA           NA           NA           NA           ENVIRONMENTAL … C          
 5 1.08e6 8803… 88032R 8            TENDER LOVIN… SVCS INC NEW  NA            NA           NA           NA           NA           TENDER LOVING … C          
 6 9.00e5 7811… 781132 6            RUBIDELL REC… NA            NA            NAME CHANGE… RESOURCE CO… SEE 147603   NA           RUBIDELL REC I… C          
 7 1.06e6 1261… 12612F 2            CNB HLDGS IN… NA            NA            NAME CHANGE… BANCORP INC… SEE 319430   NA           C N B HOLDINGS… C          
 8 1.10e6 2566… 256655 2            DOLLAR BANCO… NA            NA            NA           NA           NA           NA           DOLLAR BANCORP… C          
 9 7.91e5 7005… 700505 1            PARK GROUP L… NA            NA            NAME CHANGE… COMMUNICATI… 04/07/1999 … NA           PARK GROUP LTD… C          
10 7.54e5 5017… 501792 6            LBO CAP CORP  NA            NA            NA           NA           NA           NA           L B O CAPITAL … C          
# … with 280 more rows, and 12 more variables: issuer_status <chr>, issuer_del_date <date>, issuer_transaction <chr>, issuer_state_code <chr>,
#   issuer_update_date <date>, cabre_id <chr>, cabre_status <chr>, lei_cici <chr>, legal_entity_name <chr>, previous_name <chr>, entry_date <date>,
#   cp_institution_type <chr>
> six_not_in_stocknames_df %>% mutate(cusip6 = cusip) %>% inner_join(issuers, by = 'cusip6') %>% count()
# A tibble: 1 x 1
      n
  <int>
1   290
bdcallen commented 4 years ago
cusip6s <- cusip_cik %>% filter(cusip_length == 6)

six_w_789 <- cusip6s %>% inner_join(cusip_cik, by = 'cik') %>% 
    filter(cusip_length.y >= 7) %>% distinct(cik, cusip.x, multiplicity.x) %>% 
    rename(cusip = cusip.x, multiplicity = multiplicity.x) %>% 
    filter(multiplicity >= 10) %>% distinct(cik, cusip) 

> cusip6s %>% filter(multiplicity >= 10) %>% anti_join(six_w_789) %>% distinct(cik, cusip) 
Joining, by = c("cik", "cusip")
# A tibble: 25 x 2
       cik cusip 
     <int> <chr> 
 1  351902 883622
 2  749750 37936T
 3  814427 719350
 4  832324 37937R
 5  914791 025164
 6  930884 351681
 7  934849 286128
 8 1007023 892917
 9 1022263 687496
10 1027599 25243T
# … with 15 more rows

View(cusip6s %>% filter(multiplicity >= 10) %>% 
         anti_join(six_w_789) %>% distinct(cik, cusip) %>% 
         mutate(cusip6 = cusip) %>% inner_join(issuers) %>% 
         left_join(ciks) %>% 
        distinct(cik, cusip, company_name, issuer_name_1, 
                     issuer_name_2, issuer_adl_1, issuer_adl_2))
cik cusip company_name issuer_name_1 issuer_name_2 issuer_adl_1 issuer_adl_2
1 351902 883622 THERMODYNETICS INC THERMODYNETICS INC NA NA NA
2 749750 37936T GLOBALTEX INDUSTRIES INC GLOBALTEX INDS INC NA NAME CHANGED TO PINE VY MNG CORP 5/14/2003      SEE 72301P
3 749750 37936T PINE VALLEY MINING CORP GLOBALTEX INDS INC NA NAME CHANGED TO PINE VY MNG CORP 5/14/2003      SEE 72301P
4 814427 719350 NATIONAL MANUFACTURING TECHNOLOGIES PHOTOMATRIX INC NA NAME CHANGED TO NATIONAL MFG TECHNOLOGIES INC  10/06/1999
5 814427 719350 PHOTOMATRIX INC/ CA PHOTOMATRIX INC NA NAME CHANGED TO NATIONAL MFG TECHNOLOGIES INC  10/06/1999
6 814427 719350 PHOTOMATRIX INC PHOTOMATRIX INC NA NAME CHANGED TO NATIONAL MFG TECHNOLOGIES INC  10/06/1999
7 814427 719350 XSCRIBE CORP /CA/ PHOTOMATRIX INC NA NAME CHANGED TO NATIONAL MFG TECHNOLOGIES INC  10/06/1999
8 832324 37937R CAPITAL GROWTH HOLDINGS LTD /DE/ GLOBALNET FINL COM INC NA NA NA
9 832324 37937R CAPTIAL GROWTH HOLDINGS LTD GLOBALNET FINL COM INC NA NA NA
10 832324 37937R GALT FINANCIAL CORP GLOBALNET FINL COM INC NA NA NA
11 832324 37937R GLOBALNET FINANCIAL COM INC GLOBALNET FINL COM INC NA NA NA
12 832324 37937R MICROCAP FINANCIAL SERVICES INC GLOBALNET FINL COM INC NA NA NA
13 914791 025164 LUCOR INC /FL/ AMERICAN COLD LITE TECHNOLOGIES INC NA NA
14 930884 351681 FRACTAL DESIGN CORP FRACTAL DESIGN CORP NA NA NA
15 934849 286128 ELECTROPHARMACOLOGY INC ELECTROPHARMACOLOGY INC NA NA NA
16 1007023 892917 TRANS ORIENT PETROLEUM LTD/ TRANS ORIENT PETE LTD NA NA NA
17 1007023 892917 TRANS-ORIENT PETROLEUM LTD. TRANS ORIENT PETE LTD NA NA NA
18 1022263 687496 CONQUISTADOR MINES LTD ORSA VENTURES CORP NA NA NA
19 1022263 687496 ORSA VENTURES CORP ORSA VENTURES CORP NA NA NA
20 1027599 25243T DIADEM RESOURCES LTD DIADEM RES LTD NA NA NA
21 1081630 42222H HEALTHGATE DATA CORP HEALTHGATE DATA CORP NA NA NA
22 1082278 29408K ENVIRONMENTAL SOLUTIONS WORLDWIDE INC ENVIRONMENTAL SOLUTIONS WORLDWIDE INC NA NA
23 1082518 133659 ARCTOS PETROLEUM CORP. CAMFLO INTL INC NA REORGANIZED AS ARCTOS PETE CORP 10/06/2004
24 1082518 133659 CAMFLO INTERNATIONAL INC CAMFLO INTL INC NA REORGANIZED AS ARCTOS PETE CORP 10/06/2004
25 1082518 133659 CAMFLO RESOURCES LTD CAMFLO INTL INC NA REORGANIZED AS ARCTOS PETE CORP 10/06/2004
26 1090014 38259N ESSEX CAPITAL CORP GOOLU INC NA NA NA
27 1090014 38259N GOOLU INC GOOLU INC NA NA NA
28 1092197 855699 STRATABASE COM STRATABASE NA FORMERLY STRATABASE COM TO 06/25/2001 & REORGANIZED AS
29 1092197 855699 STRATABASE STRATABASE NA FORMERLY STRATABASE COM TO 06/25/2001 & REORGANIZED AS
30 1102448 87971D DIRECTV BROADBAND INC TELOCITY INC NA NA NA
31 1102448 87971D TELOCITY DELAWARE INC TELOCITY INC NA NA NA
32 1102448 87971D TELOCITY INC TELOCITY INC NA NA NA
33 1157667 98583W YI WAM GROUP INC YI WAN GROUP INC NA NAME CHANGED TO U S NATIONAL TELECOME INC 09/26/2007
34 1157667 98583W YI WAN GROUP INC YI WAN GROUP INC NA NAME CHANGED TO U S NATIONAL TELECOME INC 09/26/2007
35 1288268 541095 LOGAN RESOURCES LTD LOGAN RES LTD CDA NA NA NA
36 1322952 77543T ROKWADER, INC. ROKWADER INC NA NA NA
37 1322952 77543T TRUE BLUE HOLDINGS, INC. ROKWADER INC NA NA NA
38 1355242 050751 AUDIO STOCKS, INC. AUDIOSTOCKS INC NA NAME CHANGED TO SHRINK NANOTECHNOLOGIES INC
39 1355242 050751 SHRINK NANOTECHNOLOGIES, INC. AUDIOSTOCKS INC NA NAME CHANGED TO SHRINK NANOTECHNOLOGIES INC
40 1357939 74345A IVECON CORP PROPER PWR & ENERGY INC NA NA NA
41 1357939 74345A PROPER POWER & ENERGY, INC. PROPER PWR & ENERGY INC NA NA NA
42 1502554 11283W BROOKFIELD RESIDENTIAL PROPERTIES INC. BROOKFIELD RESIDENTIAL PPTYS INC NA NA
43 1649624 691903 MIDLAND EXPLORATION INC. OXFORD RES INC NA NAME CHANGED TO MIDLAND EXPL INC 05/03/2005      SEE 59751U
bdcallen commented 4 years ago

Checking the other cases in six_w_789, it doesn't appear that 6 digit cusips could be derived from 9-digit cusips with three leading zeroes, from the evidence below

View(six_w_789 %>% 
         mutate(cusip9 = paste0('000', cusip), cusip6 = substr(cusip9, 1, 6)) %>% 
         inner_join(issuers) %>% left_join(ciks) %>% 
         select(cik, cusip, cusip9, company_name, issuer_name_1))
cik cusip cusip9 company_name issuer_name_1
1 6814 544118 000544118 COMFORCE CORP A B C CIGAR CO
2 6814 544118 000544118 LORI CORP A B C CIGAR CO
3 36840 336142 000336142 FIRST REAL ESTATE INVESTMENT TRUST OF NEW JERSEY AAG HLDG INC
4 81955 752185 000752185 RAND CAPITAL CORP ABC RAIL PRODS CORP
5 101265 901637 000901637 UNITED ILLUMINATING CO ABT INVT SER INC
6 310568 940570 000940570 MICROSEMI CORP AFG INVT TR
7 351917 902951 000902951 U S ENERGY SYSTEMS INC AGF HITECH FD LTD
8 351917 902951 000902951 U S ENVIROSYSTEMS INC /DE/ AGF HITECH FD LTD
9 356708 33615C 00033615C FIRST REGIONAL BANCORP AAG HLDG INC
10 701374 886506 000886506 PREMIER PARKS INC ADC TELECOMMUNICATIONS INC
11 701374 886506 000886506 SIX FLAGS ENTERTAINMENT CORP ADC TELECOMMUNICATIONS INC
12 701374 886506 000886506 SIX FLAGS INC ADC TELECOMMUNICATIONS INC
13 701374 886506 000886506 SIX FLAGS, INC. ADC TELECOMMUNICATIONS INC
14 701374 886506 000886506 TIERCO GROUP INC/DE/ ADC TELECOMMUNICATIONS INC
15 714562 320218 000320218 FIRST FINANCIAL CORP /IN/ A A CORP
16 723615 872595 000872595 TNR TECHNICAL INC ACS ENTERPRISES INC
17 725876 92672V 00092672V DRAFTDAY FANTASY SPORTS, INC. AGF GROWTH EQUITY FD LTD
18 725876 92672V 00092672V FUNCTION (X) INC. AGF GROWTH EQUITY FD LTD
19 725876 92672V 00092672V FUNCTION(X) INC. AGF GROWTH EQUITY FD LTD
20 725876 92672V 00092672V GATEWAY INDUSTRIES INC /CA/ AGF GROWTH EQUITY FD LTD
21 725876 92672V 00092672V GATEWAY INDUSTRIES INC /DE/ AGF GROWTH EQUITY FD LTD
22 725876 92672V 00092672V VIGGLE INC. AGF GROWTH EQUITY FD LTD
23 726293 82845R 00082845R AGRICON GLOBAL CORP ACA AQUARIUS 2006 1 LTD
24 726293 82845R 00082845R BAYHILL CAPITAL CORP ACA AQUARIUS 2006 1 LTD
25 726293 82845R 00082845R COGNIGEN NETWORKS INC ACA AQUARIUS 2006 1 LTD
26 726293 82845R 00082845R CRYPTOSIGN, INC. ACA AQUARIUS 2006 1 LTD
27 726293 82845R 00082845R NABUFIT GLOBAL, INC. ACA AQUARIUS 2006 1 LTD
28 726293 82845R 00082845R NEWBRIDGE GLOBAL VENTURES, INC. ACA AQUARIUS 2006 1 LTD
29 726293 82845R 00082845R SILVERTHORNE PRODUCTION CO ACA AQUARIUS 2006 1 LTD
30 726293 82845R 00082845R STRATEGABIZ, INC. ACA AQUARIUS 2006 1 LTD
31 811808 86606G 00086606G SOUTH BRANCH VALLEY BANCORP INC ACI INC
32 811808 86606G 00086606G SUMMIT FINANCIAL GROUP INC ACI INC
33 812128 800013 000800013 SANDERSON FARMS INC ACF INDS INC
34 820067 866013 000866013 SUMMIT BANK CORP ACI INC
35 821899 88101U 00088101U TERRACE VENTURES INC A CONSULTING TEAM INC
36 828064 77316P 00077316P ROCKFORD CORP ABKA INC
37 842635 552078 000552078 LYONDELL CHEMICAL CO ABC COAL INC
38 842635 552078 000552078 LYONDELL PETROCHEMICAL CO ABC COAL INC
39 846660 809340 000809340 SCOTSMAN INDUSTRIES INC ACA ABS 2004-1 LTD / ACA ABS
40 851064 90169P 00090169P ATLANTA GOLD INC. ABT INVT SER INC
41 851064 90169P 00090169P TWIN MINING CORP ABT INVT SER INC
42 852952 919610 000919610 WESTPOINT STEVENS INC ACM MANAGED INCOME FD INC
43 854860 969465 000969465 WILLIAMS CONTROLS INC AAZ CORP
44 873998 112986 000112986 INTERLOTT TECHNOLOGIES INC A & K FDG LLC
45 873998 112986 000112986 INTERNATIONAL LOTTERY INC A & K FDG LLC
46 908315 908315 000908315 GROW BIZ INTERNATIONAL INC ACS SYS INC
47 908315 908315 000908315 WINMARK CORP ACS SYS INC
48 918709 955104 000955104 ACT TELECONFERENCING INC ACT TELECONFERENCING INC
49 920521 02145R 00002145R ALTERNATIVE RESOURCES CORP A & A INTL INDS INC
50 929646 024848 000024848 EAST TEXAS FINANCIAL SERVICES INC A & B TRANSN SERVICES INC
51 930481 871951 000871951 SYMPHONIX DEVICES INC ACP INTL INC
52 936130 749927 000749927 RSI SYSTEMS INC/MN ABCO MNG LTD
53 936130 749927 000749927 VISEON INC ABCO MNG LTD
54 946343 92923B 00092923B WFS FINANCIAL INC ACM INSTL RESVS INC
55 1002531 889542 000889542 TOLLGRADE COMMUNICATIONS INC \PA\ A D DATA SYS INC
56 1005508 027716 000027716 YONKERS FINANCIAL CORP A & E CAP FDG INC
57 1013785 380755 000380755 GOLDBELT RESOURCES LTD ABCAM PLC
58 1020011 91702P 00091702P VIRGINIA GOLD MINES INC ACM GOVT SPECTRUM FD INC
59 1020011 91702P 00091702P VIRGINIA MINES INC. ACM GOVT SPECTRUM FD INC
60 1020011 91702P 00091702P VIRGINIA MINES INC ACM GOVT SPECTRUM FD INC
61 1035146 915200 000915200 INTRAWEST CORP ADT LTD
62 1065857 300867 000300867 EXCHANGE APPLICATIONS INC AAA U S GOVT MONEY MKT ACCOUNT
63 1072627 904103 000904103 KINGSWAY FINANCIAL SERVICES INC ACI CORP
64 1124160 989555 000989555 PRIVATE CAPITAL INVESTORS INC AGF LTD PARTNERSHIP 1997
65 1124160 989555 000989555 ZIM CORP AGF LTD PARTNERSHIP 1997
66 1137778 92923C 00092923C WCI COMMUNITIES INC ACM INSTL RESVS INC
67 1206264 88023U 00088023U TEMPUR PEDIC INTERNATIONAL INC A CO INC
68 1206264 88023U 00088023U TEMPUR SEALY INTERNATIONAL, INC. A CO INC
69 1206264 88023U 00088023U TWI HOLDINGS INC A CO INC
70 1360214 89363T 00089363T BYWATER RESOURCES, INC ADI ELECTRS INC
71 1360214 89363T 00089363T HARROW HEALTH, INC. ADI ELECTRS INC
72 1360214 89363T 00089363T IMPRIMIS PHARMACEUTICALS, INC. ADI ELECTRS INC
73 1360214 89363T 00089363T TRANSDEL PHARMACEUTICALS INC ADI ELECTRS INC
74 1371424 78033P 00078033P CENTRUS VENTURES INC. ABN AMRO MTG CORP
75 1371424 78033P 00078033P ROYAL MINES & MINERALS CORP ABN AMRO MTG CORP
76 1392972 74346Y 00074346Y PROS HOLDINGS, INC. ABCLO 2007-1 LTD
77 1413837 32026V 00032026V FIRST FOUNDATION INC. A A CORP
78 1413837 32026V 00032026V KELLER FINANCIAL GROUP A A CORP
79 1498021 30066A 00030066A EXAMWORKS GROUP, INC. AAA U S GOVT MONEY MKT ACCOUNT
80 1562733 87165D 00087165D ORO CAPITAL CORPORATION, INC. ACP INTL INC
81 1562733 87165D 00087165D SYNERGY CHC CORP. ACP INTL INC
82 1562733 87165D 00087165D SYNERGY STRIPS CORP. ACP INTL INC
bdcallen commented 4 years ago

In fact, I just looked at the matches with a leading '000' amongst the whole set; these were the only ones I could find

> cusip6s %>% mutate(cusip9 = paste0('000', cusip), cusip6 = substr(cusip9, 1, 6)) %>% inner_join(issuers) %>% left_join(ciks) %>% distinct(cik, cusip, cusip9, cusip6, company_name, issuer_name_1, issuer_name_2, issuer_adl_1, multiplicity) %>% filter(cik %in% c(1750, 918709, 1038727))
Joining, by = "cusip6"
Joining, by = c("cik", "company_name")
# A tibble: 3 x 9
      cik cusip  cusip9    cusip6 company_name             issuer_name_1            issuer_name_2 issuer_adl_1 multiplicity
    <int> <chr>  <chr>     <chr>  <chr>                    <chr>                    <chr>         <chr>               <int>
1    1750 361105 000361105 000361 AAR CORP                 AAR CORP                 NA            NA                      6
2  918709 955104 000955104 000955 ACT TELECONFERENCING INC ACT TELECONFERENCING INC NA            NA                     22
3 1038727 937102 000937102 000937 ABN AMRO HOLDING N V     ABN AMRO HLDG N V        NA            NA                      2
bdcallen commented 4 years ago

@iangow Like I did with my previous posts in #89, #86 and #87, I'm going to discuss some detail of what I did, and then just state where you can find my result. Further down in my notebook handle_cusip_cik_exceptions.ipynb, you can find some cells defining dataframes added_value_6s_above_10_w_issuers, added_value_6s_above_10_stocknames, added_value_6s_above_10_1pad_w_issuers, added_value_6s_above_10_1pad_w_stocknames, added_value_6s_above_10_2pad_w_issuers, added_value_6s_above_10_2pad_w_stocknames, added_value_6s_above_10_3pad_w_issuers and added_value_6s_above_10_3pad_w_stocknames. These are defined analogously to similar dataframes mentioned #89, #86 and #87, with the last two dataframes considering the additional case where we consider 3-padded cusips (ie. '5627733' becomes '0005627733').

Note that for the 1-padded cases, I take cusip to be the first six characters of cusip_raw padded with a zero on the left, for a similar reason that I did for the unpadded 7-digit cusips considered in #87. As the number of cases was relatively low, I also ignored approximate string matching.

You can find the result for the unpadded cusips using

SELECT * FROM edgar.cusip_cik_exceptions
WHERE LENGTH(cusip) = 6 AND LENGTH(cusip_raw) = 6
AND cusip = cusip_raw

the result for the 1-padded cusips using

SELECT * FROM edgar.cusip_cik_exceptions
WHERE LENGTH(cusip) = 6 AND LENGTH(cusip_raw) = 6
AND cusip != cusip_raw

the result for the 2-padded cusips using

SELECT * FROM edgar.cusip_cik_exceptions
WHERE LENGTH(cusip_raw) = 6 AND LENGTH(cusip) = 8

and the result for 3-padded cusips

SELECT * FROM edgar.cusip_cik_exceptions
WHERE LENGTH(cusip_raw) = 6 AND LENGTH(cusip) = 9

I believe we can close this issue now.