mccgr / edgar

Code to manage data related to SEC EDGAR
31 stars 15 forks source link

Address cusips with 8 digits #86

Closed bdcallen closed 4 years ago

bdcallen commented 4 years ago

@iangow As discussed in #84, this issue is for analysing cusips with 8 characters.

bdcallen commented 4 years ago
> cusip_cik %>% filter(cusip_length == 8) %>% distinct(cik, cusip) %>% count()
# Source:   lazy query [?? x 1]
# Database: postgres 11.0.7 [bdcallen@/var/run/postgresql:5432/crsp]
      n
  <dbl>
1  6156

> cusip_cik %>% filter(cusip_length == 8) %>% inner_join(cusip_cik, by = c('file_name', 'cik')) %>%
> filter(cusip_length.y == 9) %>% distinct(cik, cusip.x) %>% count()
# Source:   lazy query [?? x 1]
# Database: postgres 11.0.7 [bdcallen@/var/run/postgresql:5432/crsp]
      n
  <dbl>
1  2302

> cusip_cik %>% filter(cusip_length == 8) %>% 
> inner_join(cusip_cik, by = c('file_name', 'cik', 'cusip6')) %>% 
> filter(cusip_length.y == 9) %>% distinct(cik, cusip.x) %>% count()
# Source:   lazy query [?? x 1]
# Database: postgres 11.0.7 [bdcallen@/var/run/postgresql:5432/crsp]
      n
  <dbl>
1   880
iangow commented 4 years ago

@bdcallen

I think what you want to do is create data frame from the 9-digit CUSIPs, then drop the last digit from those CUSIPs. Then ask

In looking at 7-digit CUSIPs, I think that (pretty much) the only way that they would be valid would be if they were 8-digit CUSIPs with a leading "0" dropped. So I think for those, you'd repeat the analysis above, but converting both 8- and 9-digit CUSIPs into seven-digit CUSIPs.

It seems here that you are looking at 8- and 9-digit CUSIPs in the same filing. I think we already have an issue looking at whether we should only take the 9-digit CUSIP from a filing when valid. Do that issue before this one.

bdcallen commented 4 years ago

@iangow Sorry, from one of your last comments, I assumed you had put comments on silent for yourself, leaving it to message to you in your inbox once I had used '@iangow'. Hence, the undeveloped nature of my comment above (I was committing stuff to edit later). But, yes, I'm working towards something of the nature you describe in your response.

It seems here that you are looking at 8- and 9-digit CUSIPs in the same filing. I think we already have an issue looking at whether we should only take the 9-digit CUSIP from a filing when valid. Do that issue before this one.

Yes, I'll redo the analysis joining only on cik. I agree that will probably give us a better idea of the value of the lower digit cusips

bdcallen commented 4 years ago

@iangow

> cusip_cik <- cusip_cik %>% mutate(cusip8 = substr(cusip, 1, 8))

> cusip_cik %>% filter(cusip_length == 8) %>% 
> inner_join(cusip_cik, by = 'cik') %>% 
> filter(cusip_length.y == 9) %>% distinct(cik, cusip.x) %>% count()
# A tibble: 1 x 1
      n
  <int>
1  5896

> cusip_cik %>% filter(cusip_length == 8) %>% distinct(cik, cusip) %>% count()
# A tibble: 1 x 1
      n
  <int>
1  6156

> cusip_cik %>% filter(cusip_length == 8) %>% 
> inner_join(cusip_cik, by = 'cik') %>% 
> filter(cusip_length.y == 9)  %>% filter(cusip6.x == cusip6.y) %>% 
> distinct(cik, cusip.x) %>% count()
# A tibble: 1 x 1
      n
  <int>
1  3540

> cusip_cik %>% filter(cusip_length == 8) %>% 
> inner_join(cusip_cik, by = 'cik') %>% filter(cusip_length.y == 9)  %>% 
> filter(cusip8.x == cusip8.y) %>% distinct(cik, cusip.x) %>% count()
# A tibble: 1 x 1
      n
  <int>
1  3168

So this does give a better idea. It seems just over half of cases, the 8 digit cusip matches the cusip8 of a 9 digit cusip, at least in some filing.

bdcallen commented 4 years ago

@iangow I think the next step is to see how many match with other digits omitted by mistake (which I suspect is also a common problem)

iangow commented 4 years ago

No problem. I think I see messages here because I am "participating" in this issue. It's good if you're heading that way already.

bdcallen commented 4 years ago

@iangow

> cusip_cik <- cusip_cik %>% mutate(cusip7_1 = paste0(substr(cusip, 1, 7), substr(cusip, 9, 9)))
> cusip_cik <- cusip_cik %>% mutate(cusip6_2 = paste0(substr(cusip, 1, 6), substr(cusip, 8, 9)))
> cusip_cik <- cusip_cik %>% mutate(cusip5_3 = paste0(substr(cusip, 1, 5), substr(cusip, 7, 9)))
> cusip_cik <- cusip_cik %>% mutate(cusip4_4 = paste0(substr(cusip, 1, 4), substr(cusip, 6, 9)))
> cusip_cik <- cusip_cik %>% mutate(cusip3_5 = paste0(substr(cusip, 1, 3), substr(cusip, 5, 9)))
> cusip_cik <- cusip_cik %>% mutate(cusip2_6 = paste0(substr(cusip, 1, 2), substr(cusip, 4, 9)))
> cusip_cik <- cusip_cik %>% mutate(cusip1_7 = paste0(substr(cusip, 1, 1), substr(cusip, 3, 9)))

> cusip_cik %>% filter(cusip_length == 8) %>% inner_join(cusip_cik, by = 'cik') %>% filter(cusip_length.y == 9) %>% filter(cusip7_1.y == cusip.x) %>% distinct(cik, cusip.x) %>% count()
# A tibble: 1 x 1
      n
  <int>
1   537

> cusip_cik %>% filter(cusip_length == 8) %>% inner_join(cusip_cik, by = 'cik') %>% filter(cusip_length.y == 9) %>% filter(cusip6_2.y == cusip.x) %>% distinct(cik, cusip.x) %>% count()
# A tibble: 1 x 1
      n
  <int>
1   193 

> cusip_cik %>% filter(cusip_length == 8) %>% inner_join(cusip_cik, by = 'cik') %>% filter(cusip_length.y == 9) %>% filter(cusip5_3.y == cusip.x) %>% distinct(cik, cusip.x) %>% count()
# A tibble: 1 x 1
      n
  <int>
1   251
> cusip_cik %>% filter(cusip_length == 8) %>% inner_join(cusip_cik, by = 'cik') %>% filter(cusip_length.y == 9) %>% filter(cusip4_4.y == cusip.x) %>% distinct(cik, cusip.x) %>% count()
# A tibble: 1 x 1
      n
  <int>
1   415
> cusip_cik %>% filter(cusip_length == 8) %>% inner_join(cusip_cik, by = 'cik') %>% filter(cusip_length.y == 9) %>% filter(cusip3_5.y == cusip.x) %>% distinct(cik, cusip.x) %>% count()
# A tibble: 1 x 1
      n
  <int>
1   323
> cusip_cik %>% filter(cusip_length == 8) %>% inner_join(cusip_cik, by = 'cik') %>% filter(cusip_length.y == 9) %>% filter(cusip2_6.y == cusip.x) %>% distinct(cik, cusip.x) %>% count()
# A tibble: 1 x 1
      n
  <int>
1   225
> cusip_cik %>% filter(cusip_length == 8) %>% inner_join(cusip_cik, by = 'cik') %>% filter(cusip_length.y == 9) %>% filter(cusip1_7.y == cusip.x) %>% distinct(cik, cusip.x) %>% count()
# A tibble: 1 x 1
      n
  <int>
1   150
> cusip_cik %>% filter(cusip_length == 8) %>% inner_join(cusip_cik, by = 'cik') %>% filter(cusip_length.y == 9) %>% filter(cusip0_8.y == cusip.x) %>% distinct(cik, cusip.x) %>% count()
# A tibble: 1 x 1
      n
  <int>
1   338

> cusip_cik %>% filter(cusip_length == 8) %>% inner_join(cusip_cik, by = 'cik') %>% filter(cusip_length.y == 9) %>% filter(cusip8.y == cusip.x | cusip7_1.y == cusip.x | cusip6_2.y == cusip.x | cusip5_3.y == cusip.x | cusip4_4.y == cusip.x | cusip3_5.y == cusip.x | cusip2_6.y == cusip.x | cusip1_7.y == cusip.x | cusip0_8.y == cusip.x) %>% distinct(cik, cusip.x) %>% count()
# A tibble: 1 x 1
      n
  <int>
1  4662
> cusip_cik %>% filter(cusip_length == 8) %>% inner_join(cusip_cik, by = 'cik') %>% filter(cusip_length.y == 9) %>% distinct(cik, cusip.x) %>% count()
# A tibble: 1 x 1
      n
  <int>
1  5896

So 6156 cases of 8 digit cusips, 5896 of them join to 9 digit cusips under joins just under cik. Then doing the other possible 8 digit subsets of the 9 digit cusips, 4662 have matches out of those 5896. So it seems around 80% of cases where an 8 digit cusip is associated with a 9 digit one under joins under cik, there is in fact some digit missing (mostly the last digit, as we thought, but a substantial amount of time it's other digits)

iangow commented 4 years ago

If 5,696 of the 8-digit CUSIP-CIK combinations can be found as (equivalent) 9-digit CUSIP-CIK combinations, then the only value of using 8-digit CUSIP-CIK combinations is the 260 incremental cases (6,156 - 5,896). So you should check some of those … if they look good, then I think we should keep them. (Note that if the 5,896 includes joins on digits 2 through 9 [the only other possibility for an eight-digit CUSIP], then it's not a "valid eight-digit CUSIP" and we don't want this. A "valid eight-digit CUSIP" is one that is substr(cusip9, 1, 8) for some valid cusip9.)

Then, consider 7-digit CUSIPs relative to the combined 8- and 9-digit CUSIP-CIK combinations in the same way.

bdcallen commented 4 years ago
> unique_8_df <- cusip_cik %>% filter(cusip_length == 8) %>% 
> anti_join(cusip_cik %>% filter(cusip_length == 8) %>% 
> inner_join(cusip_cik, by = 'cik') %>% 
> filter(cusip_length.y == 9) %>% distinct(cik, cusip.x) %>% 
> mutate(cusip = cusip.x), by = c('cik', 'cusip')) %>% 
> distinct(cik, cusip) %>% mutate(cusip6 = substr(cusip, 1, 6))
> unique_8_df
# A tibble: 260 x 3
       cik cusip    cusip6
     <int> <chr>    <chr> 
 1 1038370 8716Y107 8716Y1
 2  902528 03215109 032151
 3 1068104 67588103 675881
 4   18568 2049H109 2049H1
 5   18568 15242510 152425
 6  764763 89585105 895851
 7  729213 74791410 747914
 8  903954 64092410 640924
 9  903954 74726X10 74726X
10  903954 74731K10 74731K
# … with 250 more rows

> unique_8_df %>% filter(cusip == 'SCHEDULE') %>% count()
# A tibble: 1 x 1
      n
  <int>
1    15

> unique_8_df <- unique_8_df %>% filter(cusip != 'SCHEDULE') 
> unique_8_df
# A tibble: 245 x 3
       cik cusip    cusip6
     <int> <chr>    <chr> 
 1 1038370 8716Y107 8716Y1
 2  902528 03215109 032151
 3 1068104 67588103 675881
 4   18568 2049H109 2049H1
 5   18568 15242510 152425
 6  764763 89585105 895851
 7  729213 74791410 747914
 8  903954 64092410 640924
 9  903954 74726X10 74726X
10  903954 74731K10 74731K
# … with 235 more rows

> unique_8_df %>% group_by(cik) %>% 
> summarise(num = n())  %>% filter(num > 1)
# A tibble: 9 x 2
      cik   num
    <int> <int>
1   18568     2
2   50361     2
3  353718     3
4  764763     2
5  903954    72
6  904081     2
7 1002819     2
8 1006262     2
9 1093367     3

> unique_8_df %>% distinct(cik, cusip6) %>% 
> group_by(cik) %>% 
> summarise(num = n()) %>% filter(num > 1)
# A tibble: 8 x 2
      cik   num
    <int> <int>
1   18568     2
2   50361     2
3  353718     3
4  764763     2
5  903954    72
6  904081     2
7 1006262     2
8 1093367     3

> unique_8_df %>% filter(cik == 903954)
# A tibble: 72 x 3
      cik cusip    cusip6
    <int> <chr>    <chr> 
 1 903954 64092410 640924
 2 903954 74726X10 74726X
 3 903954 74731K10 74731K
 4 903954 85254C10 85254C
 5 903954 89341610 893416
 6 903954 89674L10 89674L
 7 903954 91383710 913837
 8 903954 75190730 751907
 9 903954 61744634 617446
10 903954 00253910 002539
# … with 62 more rows

> unique_8_df %>% mutate(ncusip = cusip) %>% 
> inner_join(stocknames, by = 'ncusip') %>% 
> distinct(cik, ncusip) %>% count()
# A tibble: 1 x 1
      n
  <int>
1   158
> unique_8_df %>% mutate(ncusip = cusip) %>% 
> inner_join(stocknames, by = 'ncusip') %>% 
> distinct(cik, ncusip) %>% 
> filter(cik != 903954) %>% count()
# A tibble: 1 x 1
      n
  <int>
1    93

> ciks %>% inner_join(unique_8_df, by = 'cik') %>% 
> inner_join(issuers, by = 'cusip6') %>% 
> filter(company_name == issuer_name_1) %>% 
> distinct(cik, cusip) %>% count()
# A tibble: 1 x 1
      n
  <int>
1    46
> unique_8_df %>% distinct(cik, cusip) %>% count()
# A tibble: 1 x 1
      n
  <int>
1   245

> ciks %>% inner_join(unique_8_df, by = 'cik') %>% 
> inner_join(issuers, by = 'cusip6') %>% 
> distinct(cik, cusip) %>% 
> filter(cik != 903954) %>% count()
# A tibble: 1 x 1
      n
  <int>
1   135

> probs <- ciks %>% inner_join(unique_8_df, by = 'cik') %>% 
> inner_join(issuers, by = 'cusip6') %>% distinct(cik, cusip) %>% 
> filter(cik != 903954) %>% arrange(cik) %>% 
> group_by(cik) %>% summarise(num = n()) %>% 
> filter(num > 1)
> probs
# A tibble: 5 x 2
      cik   num
    <int> <int>
1  353718     3
2  904081     2
3 1002819     2
4 1006262     2
5 1093367     2

> ciks %>% inner_join(unique_8_df, by = 'cik') %>% 
> inner_join(issuers, by = 'cusip6') %>% 
> inner_join(probs, by = 'cik') %>% 
> select(cik, cusip, company_name, issuer_name_1)
# A tibble: 16 x 4
       cik cusip    company_name                 issuer_name_1               
     <int> <chr>    <chr>                        <chr>                       
 1 1002819 00891110 AIR CANADA /QUEBEC/          AIR CDA                     
 2 1002819 00891130 AIR CANADA /QUEBEC/          AIR CDA                     
 3 1093367 02932810 AMERICAN RIVERS OIL CO /DE/  AMERICAN RIVS OIL CO        
 4 1093367 02932010 AMERICAN RIVERS OIL CO /DE/  AMERICAN RICE INC DEL       
 5 1093367 02932810 AROC INC                     AMERICAN RIVS OIL CO        
 6 1093367 02932010 AROC INC                     AMERICAN RICE INC DEL       
 7  353718 48563610 CARL KARCHER ENTERPRISES INC KARCHER CARL ENTERPRISES INC
 8  353718 19121910 CARL KARCHER ENTERPRISES INC COCA COLA ENTERPRISES INC   
 9  353718 19104210 CARL KARCHER ENTERPRISES INC COBRA ELECTRS CORP          
10  353718 48563610 KARCHER CARL ENTERPRISES INC KARCHER CARL ENTERPRISES INC
11  353718 19121910 KARCHER CARL ENTERPRISES INC COCA COLA ENTERPRISES INC   
12  353718 19104210 KARCHER CARL ENTERPRISES INC COBRA ELECTRS CORP          
13 1006262 48625010 PROMIS SYSTEMS CORP LTD      KAUFMANN FD INC             
14 1006262 74341V10 PROMIS SYSTEMS CORP LTD      PROMIS SYS LTD              
15  904081 63577110 STECK VAUGHN PUBLISHING CORP NATIONAL ED CORP            
16  904081 85804510 STECK VAUGHN PUBLISHING CORP STECK-VAUGHN PUBG CORP      

> unique_8_df %>% distinct(cik, cusip6) %>% 
> group_by(cik) %>% summarise(num = n()) %>% filter(num == 1) %>% 
> inner_join(unique_8_df) %>% 
> inner_join(issuers, by = 'cusip6') %>% inner_join(ciks) %>% 
> select(cik, cusip, cusip6, issuer_name_1, company_name) %>% 
> arrange(cik)
Joining, by = "cik"
Joining, by = "cik"
# A tibble: 148 x 5
     cik cusip    cusip6 issuer_name_1             company_name                     
   <int> <chr>    <chr>  <chr>                     <chr>                            
 1 13006 17253C10 17253C CIRCA PHARMACEUTICALS INC CIRCA PHARMACEUTICALS INC        
 2 14177 10876310 108763 BRIDGFORD FOODS CORP      BRIDGFORD FOODS CORP             
 3 22896 20482G10 20482G COMPUSERVE CORP DEL       COMPUSERVE CORP                  
 4 28209 24955510 249555 DEPOSIT GTY CORP          DEPOSIT GUARANTY CORP            
 5 39838 36442410 364424 GALVESTON HOUSTON CO      GALVESTON HOUSTON CO             
 6 40675 37017210 370172 GENERAL KINETICS INC      GENERAL KINETICS INC             
 7 48544 44090300 440903 HOSPLEX SYS INC           HOSOI GARDEN MORTUARY INC        
 8 50253 45583010 455830 INDUSTRIAL ACOUSTICS INC  INDUSTRIAL ACOUSTICS CO INC      
 9 60549 54667680 546676 LOUISVILLE GAS & ELEC CO  LOUISVILLE GAS & ELECTRIC CO /KY/
10 65297 05873J10 05873J BALLYS GRAND INC          BALLYS GRAND INC /DE/            
# … with 138 more rows
bdcallen commented 4 years ago
> View(unique_8_df %>% distinct(cik, cusip6) %>% 
> group_by(cik) %>% summarise(num = n()) %>% filter(num == 1) %>% 
> inner_join(unique_8_df) %>% 
> inner_join(issuers, by = 'cusip6') %>% inner_join(ciks) %>% 
> select(cik, cusip, cusip6, issuer_name_1, company_name) %>% 
> arrange(cik))
cik cusip cusip6 issuer_name_1 company_name
1 13006 17253C10 17253C CIRCA PHARMACEUTICALS INC CIRCA PHARMACEUTICALS INC
2 14177 10876310 108763 BRIDGFORD FOODS CORP BRIDGFORD FOODS CORP
3 22896 20482G10 20482G COMPUSERVE CORP DEL COMPUSERVE CORP
4 28209 24955510 249555 DEPOSIT GTY CORP DEPOSIT GUARANTY CORP
5 39838 36442410 364424 GALVESTON HOUSTON CO GALVESTON HOUSTON CO
6 40675 37017210 370172 GENERAL KINETICS INC GENERAL KINETICS INC
7 48544 44090300 440903 HOSPLEX SYS INC HOSOI GARDEN MORTUARY INC
8 50253 45583010 455830 INDUSTRIAL ACOUSTICS INC INDUSTRIAL ACOUSTICS CO INC
9 60549 54667680 546676 LOUISVILLE GAS & ELEC CO LOUISVILLE GAS & ELECTRIC CO /KY/
10 65297 05873J10 05873J BALLYS GRAND INC BALLYS GRAND INC /DE/
11 74154 68267810 682678 ONEOK INC ONEOK INC
12 75748 69562410 695624 PAIN SUPPRESSION LABS INC PAIN SUPPRESSION LABS, INC.
13 81301 74714210 747142 PYRAMID COMMUNICATIONS INC NEW PYRAMID COMMUNICATIONS INC
14 97483 71891320 718913 PHOENIX RESOURCE COS INC PHOENIX RESOURCE COMPANIES INC
15 109870 37722620 377226 GLASSMASTER CO GLASSMASTER CO
16 225998 93881510 938815 WASHINGTON ENERGY CO WASHINGTON ENERGY CO
17 276886 53191120 531911 LIFETIME CORP LIFETIME CORP
18 316537 05384310 053843 AVOCA INC AVOCA INC
19 350305 83438010 834380 SOLV EX CORP SOLV EX CORP
20 350926 71365810 713658 PEREGRINE ENTMT LTD PEREGRINE ENTERTAINMENT LTD
21 350926 71365810 713658 PEREGRINE ENTMT LTD PEREGRINE ENTERTAINMENT
22 354827 23790310 237903 DATA MEASUREMENT CORP DATA MEASUREMENT CORP
23 357262 65881000 658810 NORTH CNTY BANCORP NORTH COUNTY BANCORP
24 712468 25059510 250595 DESOTO INC NARRAGANSETT FIRST FUND
25 712744 25390210 253902 DIGITAL COMMUNICATIONS ASSOC DIGITAL COMMUNICATIONS ASSOCIATES INC
26 717192 58977410 589774 MERIDIAN NATL CORP MERIDIAN NATIONAL CORP
27 719264 31946210 319462 FIRST CTZNS BANCSHARES INC FIRST CITIZENS BANCSHARES INC /TN/
28 725014 75449810 754498 RAWSON-KOENIG INC RAWSON KOENIG INC
29 727127 78403010 784030 SCS COMPUTE INC SCS COMPUTE INC
30 729213 74791410 747914 Q MED INC Q MED INC
31 729213 74791410 747914 Q MED INC QMED INC
32 737033 74835710 748357 QUESTECH INC QUESTECH INC
33 740074 55349510 553495 MSA RLTY CORP MSA REALTY CORP
34 761651 37248610 372486 GEORGE MASON BANKSHARES INC MASON GEORGE BANKSHARES INC
35 768863 88254510 882545 TEXAS MERIDIAN RES CORP TEXAS MERIDIAN RESOURCES LTD
36 777844 20992910 209929 COMPUSONICS VIDEO CORP COMPUSONICS VIDEO CORP
37 789577 89324510 893245 TRANS LEASING INTL INC TLI LIQUIDATING CORP
38 789577 89324510 893245 TRANS LEASING INTL INC TRANS LEASING INTERNATIONAL INC
39 789987 42990810 429908 HIGH CASH PARTNERS LP RESOURCES PENSION SHARES 5 LP
40 790025 14030810 140308 CAPITAL MEDIA GROUP LTD CAPITAL MEDIA GROUP LTD
41 791350 40641710 406417 HALSTEAD ENERGY CORP HALSTEAD ENERGY CORP
42 794984 42990810 429908 HIGH CASH PARTNERS LP HIGH CASH PARTNERS L P
43 799723 37936K00 37936K GLOBAL WRESTLING ALLIANCE INC BIONET TECHNOLOGIES INC
44 799723 37936K00 37936K GLOBAL WRESTLING ALLIANCE INC PRATT WYLCE & LORDS LTD
45 801121 45753810 457538 INMAC CORP INMAC CORP
46 803058 89151110 891511 TOTAL RESH CORP TOTAL RESEARCH CORP
47 808575 71913203 719132 PHOENIX RESTAURANT GROUP INC HUGHES RESOURCES INC
48 808575 71913203 719132 PHOENIX RESTAURANT GROUP INC PHOENIX RESOURCES TECHNOLOGIES INC
49 810578 79284810 792848 ST PAUL BANCORP INC ST PAUL BANCORP INC
50 810995 01880500 018805 ALLIANZ SOCIETAS EUROPAEA-SE ALLIED BANKSHARES INC
51 811213 35834019 358340 FREYMILLER TRUCKING INC FREYMILLER TRUCKING INC
52 814479 39682910 396829 GREENWICH FINL CORP GREENWICH FINANCIAL CORP
53 825288 68219P10 68219P ON TECHNOLOGY CORP INSURED MUNICIPALS INCOME TRUST SERIES 200
54 828360 82539610 825396 SHOWPOWER INC SHOWPOWER INC
55 837028 87236310 872363 TDX CORP TDX CORP
56 842180 05959110 059591 BANCO FRANCES DEL RIO DE LA BANCO BILBAO VIZCAYA ARGENTARIA S A
57 842180 05959110 059591 BANCO FRANCES DEL RIO DE LA BANCO BILBAO VIZCAYA ARGENTARIA, S.A.
58 843090 03253500 032535 ANAGRAM PLUS INC ANANGEL AMERICAN SHIPHOLDINGS LTD
59 846046 85232N10 85232N STACKPOLE LTD STACKPOLE CORP
60 848465 39015810 390158 GREAT BEAR TECHNOLOGY INC STARPRESS INC
61 849314 08179540 081795 BENETTON GROUP SPA BENETTON GROUP SPA
62 853927 02744L10 02744L AMERICAN MEDIA INC AMERICAN MEDIA OPERATIONS INC
63 859632 90249910 902499 TYSONS FINL CORP TYSONS FINANCIAL CORP
64 862450 03922910 039229 ARCADIAN CORP ARCADIAN CORP
65 863924 19248110 192481 COHO ENERGY INC COHO RESOURCES INC
66 868568 55920000 559200 MAGNA BANCORP INC MAGNA BANCORP INC
67 874758 14790410 147904 CASH CAN INC CASH CAN INC
68 875174 59373610 593736 MIAMI SUBS CORP MIAMI SUBS CORP
69 877210 86431310 864313 SUBMICRON SYS CORP SUBMICRON SYSTEMS CORP
70 877984 09161R10 09161R BIOMIRA INC BIOMIRA INC
71 881443 97374210 973742 WINDSOR REAL ESTATE INVT TR 8 N TANDEM TRUST
72 881443 97374210 973742 WINDSOR REAL ESTATE INVT TR 8 WINDSOR REAL ESTATE INVESTMENT TRUST 8
73 882264 03852L10 03852L ARAKIS ENERGY CORP ARAKIS ENERGY CORP
74 882321 35168810 351688 FRAME TECHNOLOGY INC FRAME TECHNOLOGY CORP/CA
75 887315 42209410 422094 HEADSTRONG GROUP INC HEADSTRONG GROUP INC
76 894541 08659410 086594 BETA WELL SVC INC BETA WELL SERVICE INC
77 897074 86422210 864222 STYLES ON VIDEO INC STYLES ON VIDEO INC
78 897878 00371480 003714 ABN AMRO NORTH AMER INC ABN AMRO BANK NV
79 897878 00371480 003714 ABN AMRO NORTH AMER INC ROYAL BANK OF SCOTLAND N.V.
80 897910 61735710 617357 MORGAN GRENFELL SMALLCAP FD MORGAN GRENFELL CAPITAL MANAGEMENT
81 900385 15043810 150438 CEDAR INCOME FD LTD AEGON USA INC
82 904900 08372L10 08372L BERG ELECTRS CORP BERG ELECTRONICS CORP /DE/
83 906391 12628810 126288 CSB FINL CORP CSB FINANCIAL CORP
84 909723 44159410 441594 HOULIHANS RESTAURANT GROUP INC HOULIHANS RESTAURANT GROUP INC
85 914344 05873110 058731 BALLY GAMING INTL INC BALLY GAMING INC
86 916536 40862310 408623 HAMMONS JOHN Q HOTELS INC HAMMONS JOHN Q HOTELS LP
87 916537 40862310 408623 HAMMONS JOHN Q HOTELS INC HAMMONS JOHN Q HOTELS FINANCE CORP
88 922738 86330P10 86330P STRESSGEN BIOTECHNOLOGIES STRESSGEN BIOTECHNOLOGIES CORP                          /FI
89 924835 88871110 888711 TIVOLI INDS INC TIVOLI INDUSTRIES INC
90 927027 02359100 023591 AMERCO INC SHOEN JAMES P
91 927040 48203317 482033 JUNO ACQUISITIONS INC AREMISSOFT CORP
92 927040 48203317 482033 JUNO ACQUISITIONS INC JUNO ACQUISITIONS INC
93 928507 12685910 126859 CABLEMAXX INC CABLEMAXX INC /DE
94 928539 64107410 641074 NESTOR INC WAND NESTOR INVESTMENTS L P ET AL
95 929757 87259109 872591 TNBANK OAK RIDGE TENN T NETIX INC
96 933425 46530610 465306 ITALIAN OVEN INC ACCREDITED BUSINESS CONSOLIDATORS CORP.
97 933425 46530610 465306 ITALIAN OVEN INC ITALIAN OVEN INC
98 933425 46530610 465306 ITALIAN OVEN INC STORE LIQUIDATION CO INC
99 943230 89816810 898168 TRUMP HOTELS CASINO RESORTS TAX EXEMPT SECURITIES TRUST NEW YORK TRUST 151
100 948753 69367410 693674 PXRE CORP PXRE CORP /FA/
101 1002819 00891110 008911 AIR CDA AIR CANADA /QUEBEC/
102 1002819 00891130 008911 AIR CDA AIR CANADA /QUEBEC/
103 1004278 35849710 358497 FRIENDLY ICE CREAM CORP NEW KEYSTONE INSTITUTIONAL TRUST
104 1010136 45192110 451921 IMAGEMATRIX CORP IMAGEMATRIX CORP
105 1010312 68267810 682678 ONEOK INC WESTAR CAPITAL INC
106 1010312 68267810 682678 ONEOK INC WESTAR INDUSTRIES INC
107 1013050 74257410 742574 PRINTRAK INTL INC PRINTRAK INTERNATIONAL INC
108 1021255 74037Q10 74037Q PREFERRED INCOME MGMT FD INC HOREJSI ENTERPRISES INC
109 1021255 74037Q10 74037Q PREFERRED INCOME MGMT FD INC HOREJSI INC
110 1022097 89990510 899905 TURBODYNE TECHNOLOGIES INC TURBODYNE TECHNOLGIES INC
111 1022097 89990510 899905 TURBODYNE TECHNOLOGIES INC TURBODYNE TECHNOLOGIES INC
112 1022097 89990510 899905 TURBODYNE TECHNOLOGIES INC TURBODYNE TECHNOLOGIES, INC
113 1030614 98901220 989012 ZANE INTERACTIVE PUBG INC ZANE INTERACTIVE PUBLISHING INC
114 1030814 90345610 903456 UBARTER COM INC INTERNATIONAL BARTER CORP
115 1030814 90345610 903456 UBARTER COM INC UBARTER COM INC
116 1035028 62907710 629077 NEI WEBWORLD INC NEI WEBWORLD INC
117 1040476 29439810 294398 EQUALITY BANCORP INC EQUALITY BANCORP INC
118 1040649 74871810 748718 QUINENCO S A QUINENCO SA
119 1041970 02914P20 02914P AMERICAN QUANTUM CYCLES INC AMERICAN QUANTUM CYCLES INC
120 1044278 35849710 358497 FRIENDLY ICE CREAM CORP NEW FRIENDLYS RESTAURANTS FRANCHISE INC
121 1055037 72345109 723451 PINNACLE BK JASPER ALA PINNACLE BUSINESS MANAGEMENT INC
122 1057437 30267410 302674 FVNB CORP FVNB CORP
123 1062023 60753310 607533 MODEM MEDIA INC OPTIMARK HOLDINGS INC
124 1062023 60753310 607533 MODEM MEDIA INC OPTIMARK TECHNOLOGIES INC
125 1064133 98144Q10 98144Q WORLD COMM ONLINE INC DEL WORLD COMMERCE ONLINE INC
126 1064990 86330P10 86330P STRESSGEN BIOTECHNOLOGIES STRESSGEN BIOTECHNOLOGIES CORP
127 1066114 87923P10 87923P TELE CENTRO OESTE CELULAR TELE CENTRO DESTE CELULAR PARTICIPACOES
128 1066114 87923P10 87923P TELE CENTRO OESTE CELULAR TELE CENTRO OESTE CELULAR PARTICIPACOES
129 1071355 74046M30 74046M PREMIER BRANDS INC CATHAYONE INC
130 1071355 74046M30 74046M PREMIER BRANDS INC PREMIER BRANDS INC/UT
131 1075710 15691510 156915 CERTICOM CORP CERTICOM CORP
132 1101734 62885310 628853 NCL HLDG ASA ARRASAS LTD
133 1106946 19419D10 19419D COLLECTIBLE CONCEPTS GROUP INC COLLECTIBLE CONCEPTS GROUP INC
134 1110072 57951710 579517 MCCLENDON TRANSN GROUP INC MCCLENDON TRANSPORTATION GROUP INC
135 1113227 90337P10 90337P US WATS INC CAPSULE COMMUNICATIONS INC DE
136 1113227 90337P10 90337P US WATS INC CAPSULE COMMUNICATIONS INC
137 1120802 36298R10 36298R GS AGRIFUELS CORP GS AGRIFUELS CORP
138 1120802 36298R10 36298R GS AGRIFUELS CORP HUGO INTERNATIONAL TELECOM INC
139 1200598 87927W10 87927W TELECOM ITALIA SPA TELECOM ITALIA SPA
140 1218267 92390110 923901 VERMILION RES LTD VERMILION RESOURCES LTD
141 1218991 97263M10 97263M WIMM-BILL-DANN FOODS OJSC YUSHVAEV GAVRIL A
142 1322387 58818320 588183 MERCHANDISE CREATIONS INC INTELLIGENTIAS, INC.
143 1322387 58818320 588183 MERCHANDISE CREATIONS INC MERCHANDISE CREATIONS, INC.
144 1347004 62847L10 62847L MYECHECK INC MYECHECK INC.
145 1347004 62847L10 62847L MYECHECK INC SEKOYA HOLDINGS LTD.
146 1374976 66651104 666511 NORTHLAND PWR INC NORTHPORT CAPITAL INC.
147 1374976 66651104 666511 NORTHLAND PWR INC NORTHPORT NETWORK SYSTEMS, INC.
148 1399815 66986W10 66986W NOVAMED INC DEL NOVAMED EYECARE RESEARCH, INC
bdcallen commented 4 years ago
> View(unique_8_df %>% distinct(cik, cusip6) %>% group_by(cik) %>% 
> summarise(num = n()) %>% filter(num == 1) %>% 
> inner_join(unique_8_df) %>% inner_join(issuers, by = 'cusip6') %>% 
> inner_join(ciks) %>% 
> select(cik, cusip, cusip6, issuer_name_1, company_name) %>% 
> arrange(cik) %>% filter(cik %in% soft_match_ciks))
cik cusip cusip6 issuer_name_1 company_name
1 808575 71913203 719132 PHOENIX RESTAURANT GROUP INC HUGHES RESOURCES INC
2 808575 71913203 719132 PHOENIX RESTAURANT GROUP INC PHOENIX RESOURCES TECHNOLOGIES INC
3 863924 19248110 192481 COHO ENERGY INC COHO RESOURCES INC
4 882321 35168810 351688 FRAME TECHNOLOGY INC FRAME TECHNOLOGY CORP/CA
5 897878 00371480 003714 ABN AMRO NORTH AMER INC ABN AMRO BANK NV
6 897878 00371480 003714 ABN AMRO NORTH AMER INC ROYAL BANK OF SCOTLAND N.V.
7 897910 61735710 617357 MORGAN GRENFELL SMALLCAP FD MORGAN GRENFELL CAPITAL MANAGEMENT
8 916537 40862310 408623 HAMMONS JOHN Q HOTELS INC HAMMONS JOHN Q HOTELS FINANCE CORP
9 922738 86330P10 86330P STRESSGEN BIOTECHNOLOGIES STRESSGEN BIOTECHNOLOGIES CORP                          /FI
10 928539 64107410 641074 NESTOR INC WAND NESTOR INVESTMENTS L P ET AL
11 1044278 35849710 358497 FRIENDLY ICE CREAM CORP NEW FRIENDLYS RESTAURANTS FRANCHISE INC
12 1055037 72345109 723451 PINNACLE BK JASPER ALA PINNACLE BUSINESS MANAGEMENT INC
13 1064990 86330P10 86330P STRESSGEN BIOTECHNOLOGIES STRESSGEN BIOTECHNOLOGIES CORP
14 1066114 87923P10 87923P TELE CENTRO OESTE CELULAR TELE CENTRO DESTE CELULAR PARTICIPACOES
15 1066114 87923P10 87923P TELE CENTRO OESTE CELULAR TELE CENTRO OESTE CELULAR PARTICIPACOES
16 1399815 66986W10 66986W NOVAMED INC DEL NOVAMED EYECARE RESEARCH, INC
> View(unique_8_df %>% distinct(cik, cusip6) %>% group_by(cik) %>% 
> summarise(num = n()) %>% filter(num == 1) %>% 
> inner_join(unique_8_df) %>% inner_join(issuers, by = 'cusip6') %>% 
> inner_join(ciks) %>% 
> select(cik, cusip, cusip6, issuer_name_1, company_name) %>% 
> arrange(cik) %>% filter(cik %in% not_a_match_ciks))
cik cusip cusip6 issuer_name_1 company_name
1 48544 44090300 440903 HOSPLEX SYS INC HOSOI GARDEN MORTUARY INC
2 712468 25059510 250595 DESOTO INC NARRAGANSETT FIRST FUND
3 789987 42990810 429908 HIGH CASH PARTNERS LP RESOURCES PENSION SHARES 5 LP
4 799723 37936K00 37936K GLOBAL WRESTLING ALLIANCE INC BIONET TECHNOLOGIES INC
5 799723 37936K00 37936K GLOBAL WRESTLING ALLIANCE INC PRATT WYLCE & LORDS LTD
6 810995 01880500 018805 ALLIANZ SOCIETAS EUROPAEA-SE ALLIED BANKSHARES INC
7 825288 68219P10 68219P ON TECHNOLOGY CORP INSURED MUNICIPALS INCOME TRUST SERIES 200
8 842180 05959110 059591 BANCO FRANCES DEL RIO DE LA BANCO BILBAO VIZCAYA ARGENTARIA S A
9 842180 05959110 059591 BANCO FRANCES DEL RIO DE LA BANCO BILBAO VIZCAYA ARGENTARIA, S.A.
10 843090 03253500 032535 ANAGRAM PLUS INC ANANGEL AMERICAN SHIPHOLDINGS LTD
11 848465 39015810 390158 GREAT BEAR TECHNOLOGY INC STARPRESS INC
12 900385 15043810 150438 CEDAR INCOME FD LTD AEGON USA INC
13 927027 02359100 023591 AMERCO INC SHOEN JAMES P
14 929757 87259109 872591 TNBANK OAK RIDGE TENN T NETIX INC
15 943230 89816810 898168 TRUMP HOTELS CASINO RESORTS TAX EXEMPT SECURITIES TRUST NEW YORK TRUST 151
16 1004278 35849710 358497 FRIENDLY ICE CREAM CORP NEW KEYSTONE INSTITUTIONAL TRUST
17 1010312 68267810 682678 ONEOK INC WESTAR CAPITAL INC
18 1010312 68267810 682678 ONEOK INC WESTAR INDUSTRIES INC
19 1021255 74037Q10 74037Q PREFERRED INCOME MGMT FD INC HOREJSI ENTERPRISES INC
20 1021255 74037Q10 74037Q PREFERRED INCOME MGMT FD INC HOREJSI INC
21 1062023 60753310 607533 MODEM MEDIA INC OPTIMARK HOLDINGS INC
22 1062023 60753310 607533 MODEM MEDIA INC OPTIMARK TECHNOLOGIES INC
23 1101734 62885310 628853 NCL HLDG ASA ARRASAS LTD
24 1113227 90337P10 90337P US WATS INC CAPSULE COMMUNICATIONS INC DE
25 1113227 90337P10 90337P US WATS INC CAPSULE COMMUNICATIONS INC
26 1218991 97263M10 97263M WIMM-BILL-DANN FOODS OJSC YUSHVAEV GAVRIL A
27 1374976 66651104 666511 NORTHLAND PWR INC NORTHPORT CAPITAL INC.
28 1374976 66651104 666511 NORTHLAND PWR INC NORTHPORT NETWORK SYSTEMS, INC.
bdcallen commented 4 years ago
> View(unique_8_df %>% distinct(cik, cusip6) %>% 
> group_by(cik) %>% summarise(num = n()) %>% 
> filter(num == 1) %>% inner_join(unique_8_df) %>% 
> mutate(ncusip = cusip) %>% 
> inner_join(stocknames, by = 'ncusip') %>% 
> inner_join(ciks) %>% 
> select(cik, ncusip, namedt, nameenddt, ticker, comnam, company_name))
cik ncusip namedt nameenddt ticker comnam company_name
1 13006 17253C10 1993-03-26 1995-07-17 RXC CIRCA PHARMACEUTICALS INC CIRCA PHARMACEUTICALS INC
2 14177 10876310 1972-12-14 2019-12-31 BRID BRIDGFORD FOODS CORP BRIDGFORD FOODS CORP
3 22896 20482G10 1996-04-19 1998-01-30 CSRV COMPUSERVE INC DEL COMPUSERVE CORP
4 28209 24955510 1972-12-14 1996-12-04 DEPS DEPOSIT GUARANTY CORP DEPOSIT GUARANTY CORP
5 28209 24955510 1996-12-05 1998-04-30 DEP DEPOSIT GUARANTY CORP DEPOSIT GUARANTY CORP
6 39838 36442410 1975-06-11 1979-11-11 GVTN GALVESTON HOUSTON CO GALVESTON HOUSTON CO
7 39838 36442410 1979-11-12 1995-02-28 GHX GALVESTON HOUSTON CO GALVESTON HOUSTON CO
8 40675 37017210 1986-08-25 1991-11-20 GKIE GENERAL KINETICS INC GENERAL KINETICS INC
9 40675 37017210 1991-11-21 1997-05-29 GKI GENERAL KINETICS INC GENERAL KINETICS INC
10 50253 45583010 1972-12-14 1998-10-29 IACI INDUSTRIAL ACOUSTICS INC INDUSTRIAL ACOUSTICS CO INC
11 65297 05873J10 1994-08-01 1998-03-26 BGLV BALLYS GRAND INC BALLYS GRAND INC /DE/
12 74154 68267810 1980-12-10 1997-11-27 OKE ONEOK INC ONEOK INC
13 75748 69562410 1986-09-09 1989-09-05 PAIN PAIN SUPPRESSION LABS INC PAIN SUPPRESSION LABS, INC.
14 97483 71891320 1992-05-13 1992-10-21 PHNI PHOENIX RESOURCE COS INC PHOENIX RESOURCE COMPANIES INC
15 97483 71891320 1992-10-22 1996-05-20 PHN PHOENIX RESOURCE COS INC PHOENIX RESOURCE COMPANIES INC
16 109870 37722620 1988-07-15 1994-05-23 GLMA GLASSMASTER COMPANY GLASSMASTER CO
17 109870 37722620 1994-05-24 1999-11-05 GLMA GLASSMASTER CO GLASSMASTER CO
18 225998 93881510 1978-08-10 1992-08-03 WECO WASHINGTON ENERGY CO WASHINGTON ENERGY CO
19 225998 93881510 1992-08-04 1997-02-10 WEG WASHINGTON ENERGY CO WASHINGTON ENERGY CO
20 276886 53191120 1989-07-06 1993-07-30 LFT LIFETIME CORP LIFETIME CORP
21 350305 83438010 1981-07-08 1997-07-11 SOLV SOLV EX CORP SOLV EX CORP
22 350926 71365810 1984-01-25 1985-10-07 MOVE PEREGRINE ENTERTAINMENT INC PEREGRINE ENTERTAINMENT LTD
23 350926 71365810 1984-01-25 1985-10-07 MOVE PEREGRINE ENTERTAINMENT INC PEREGRINE ENTERTAINMENT
24 354827 23790310 1986-12-18 1996-01-10 DMCB DATA MEASUREMENT CORP DATA MEASUREMENT CORP
25 712468 25059510 1968-01-02 1996-09-27 DSO DE SOTO INC NARRAGANSETT FIRST FUND
26 712744 25390210 1983-02-10 1987-12-10 DCAI DIGITAL COMMUNICATIONS ASSOC IN DIGITAL COMMUNICATIONS ASSOCIATES INC
27 712744 25390210 1987-12-11 1993-11-08 DCA DIGITAL COMMUNICATIONS ASSOC IN DIGITAL COMMUNICATIONS ASSOCIATES INC
28 717192 58977410 1986-05-13 1993-08-17 MRCO MERIDIAN NATIONAL CORP MERIDIAN NATIONAL CORP
29 725014 75449810 1987-09-17 1987-09-17 KOEN RAWSON KOENIG INC RAWSON KOENIG INC
30 725014 75449810 1987-09-18 1997-11-23 RAKO RAWSON KOENIG INC RAWSON KOENIG INC
31 727127 78403010 1986-09-03 1996-02-05 SCOM S C S COMPUTE INC SCS COMPUTE INC
32 729213 74791410 1984-01-13 2002-05-27 QEKG Q MED Q MED INC
33 729213 74791410 1984-01-13 2002-05-27 QEKG Q MED QMED INC
34 729213 74791410 2002-05-28 2008-05-20 QMED QMED INC Q MED INC
35 729213 74791410 2002-05-28 2008-05-20 QMED QMED INC QMED INC
36 737033 74835710 1984-05-14 1998-11-18 QTEC QUESTECH INC QUESTECH INC
37 740074 55349510 1984-10-01 1994-09-01 SSS M S A REALTY CORP MSA REALTY CORP
38 761651 37248610 1993-07-01 1998-04-01 GMBS GEORGE MASON BANKSHARES INC MASON GEORGE BANKSHARES INC
39 768863 88254510 1990-12-19 1997-04-02 TMR TEXAS MERIDIAN RESOURCES CORP TEXAS MERIDIAN RESOURCES LTD
40 768863 88254510 1997-04-03 1997-06-19 TMR TEXAS MERIDIAN RESOURCES CORP TEXAS MERIDIAN RESOURCES LTD
41 789577 89324510 1986-04-04 1997-12-18 TLII TRANS LEASING INTL INC TLI LIQUIDATING CORP
42 789577 89324510 1986-04-04 1997-12-18 TLII TRANS LEASING INTL INC TRANS LEASING INTERNATIONAL INC
43 791350 40641710 1995-03-13 1998-10-15 HSNR HALSTEAD ENERGY CORP HALSTEAD ENERGY CORP
44 801121 45753810 1986-10-20 1996-01-25 INMC INMAC CORP INMAC CORP
45 803058 89151110 1987-03-26 2001-11-02 TOTL TOTAL RESEARCH CORP TOTAL RESEARCH CORP
46 810578 79284810 1987-05-18 1999-10-01 SPBC ST PAUL BANCORP INC ST PAUL BANCORP INC
47 814479 39682910 1988-04-15 1994-01-31 GFCT GREENWICH FINANCIAL CORP GREENWICH FINANCIAL CORP
48 825288 68219P10 1995-08-01 2004-02-13 ONTC ON TECHNOLOGY CORP INSURED MUNICIPALS INCOME TRUST SERIES 200
49 828360 82539610 1998-06-16 2000-01-28 SHO SHOWPOWER INC SHOWPOWER INC
50 837028 87236310 1993-01-15 1995-09-01 TDXC T D X CORP TDX CORP
51 842180 05959110 1993-11-24 2001-02-27 BFR BANCO FRANCES DEL RIO DE LA PLA BANCO BILBAO VIZCAYA ARGENTARIA S A
52 842180 05959110 1993-11-24 2001-02-27 BFR BANCO FRANCES DEL RIO DE LA PLA BANCO BILBAO VIZCAYA ARGENTARIA, S.A.
53 849314 08179540 1989-06-09 2007-10-18 BNG BENETTON GROUP S P A BENETTON GROUP SPA
54 853927 02744L10 1994-11-21 1999-05-07 ENQ AMERICAN MEDIA INC AMERICAN MEDIA OPERATIONS INC
55 859632 90249910 1996-05-24 1998-02-27 TYFC TYSONS FINANCIAL CORP TYSONS FINANCIAL CORP
56 862450 03922910 1995-08-04 1997-03-06 ACA ARCADIAN CORP ARCADIAN CORP
57 863924 19248110 1993-10-04 1999-03-05 COHO COHO ENERGY INC COHO RESOURCES INC
58 875174 59373610 1991-10-03 1999-01-07 SUBS MIAMI SUBS CORP MIAMI SUBS CORP
59 877210 86431310 1993-09-01 1998-04-16 SUBM SUBMICRON SYSTEMS CORP SUBMICRON SYSTEMS CORP
60 877984 09161R10 1991-11-19 1999-03-14 BIOMF BIOMIRA INC BIOMIRA INC
61 877984 09161R10 1999-03-15 2007-12-10 BIOM BIOMIRA INC BIOMIRA INC
62 882264 03852L10 1993-06-10 1995-08-22 AKSEF ARAKIS ENERGY CORP ARAKIS ENERGY CORP
63 882264 03852L10 1995-08-23 1995-09-21 AKSEF ARAKIS ENERGY CORP ARAKIS ENERGY CORP
64 882264 03852L10 1995-09-22 1998-10-08 AKSEF ARAKIS ENERGY CORP ARAKIS ENERGY CORP
65 882321 35168810 1992-02-12 1995-10-27 FRAM FRAME TECHNOLOGY CORP FRAME TECHNOLOGY CORP/CA
66 894541 08659410 1993-01-29 1994-10-13 BWS BETA WELL SERVICE INC BETA WELL SERVICE INC
67 897074 86422210 1993-04-14 1994-12-27 SOV STYLES ON VIDEO INC STYLES ON VIDEO INC
68 897910 61735710 1987-05-08 2000-10-31 MGC MORGAN GRENFELL SMALLCAP FUND MORGAN GRENFELL CAPITAL MANAGEMENT
69 900385 15043810 1986-12-17 1989-10-01 CEDR CEDAR INCOME FD 1 LTD AEGON USA INC
70 900385 15043810 1989-10-02 2000-02-28 CEDR CEDAR INCOME FUND LTD AEGON USA INC
71 904900 08372L10 1996-03-01 1998-10-09 BEI BERG ELECTRONICS CORP BERG ELECTRONICS CORP /DE/
72 906391 12628810 1993-09-28 1996-04-30 COSB C S B FINANCIAL CORP CSB FINANCIAL CORP
73 914344 05873110 1991-11-11 1996-06-18 BGII BALLY GAMING INTERNATIONAL INC BALLY GAMING INC
74 916536 40862310 1994-11-17 2000-02-27 JQH HAMMONS JOHN Q HOTELS INC HAMMONS JOHN Q HOTELS LP
75 916536 40862310 2000-02-28 2005-09-16 JQH HAMMONS JOHN Q HOTELS INC HAMMONS JOHN Q HOTELS LP
76 916537 40862310 1994-11-17 2000-02-27 JQH HAMMONS JOHN Q HOTELS INC HAMMONS JOHN Q HOTELS FINANCE CORP
77 916537 40862310 2000-02-28 2005-09-16 JQH HAMMONS JOHN Q HOTELS INC HAMMONS JOHN Q HOTELS FINANCE CORP
78 924835 88871110 1994-09-21 1999-03-28 TVLI TIVOLI INDUSTRIES INC TIVOLI INDUSTRIES INC
79 928507 12685910 1993-11-19 1994-12-22 CMAX CABLEMAXX INC CABLEMAXX INC /DE
80 928507 12685910 1994-12-23 1996-02-23 CMAX CABLEMAXX HOLDINGS INC CABLEMAXX INC /DE
81 928539 64107410 1983-12-22 1991-12-19 NEST NESTOR INC WAND NESTOR INVESTMENTS L P ET AL
82 928539 64107410 1991-12-20 2005-06-30 NA NESTOR INC WAND NESTOR INVESTMENTS L P ET AL
83 933425 46530610 1995-11-21 1996-12-30 OVEN ITALIAN OVEN INC ACCREDITED BUSINESS CONSOLIDATORS CORP.
84 933425 46530610 1995-11-21 1996-12-30 OVEN ITALIAN OVEN INC ITALIAN OVEN INC
85 933425 46530610 1995-11-21 1996-12-30 OVEN ITALIAN OVEN INC STORE LIQUIDATION CO INC
86 943230 89816810 1995-06-07 2004-08-09 DJT TRUMP HOTELS & CASINO RESRTS IN TAX EXEMPT SECURITIES TRUST NEW YORK TRUST 151
87 948753 69367410 1994-05-24 1996-12-30 PXRE P X R E CORP PXRE CORP /FA/
88 948753 69367410 1996-12-31 1999-10-05 PXT P X R E CORP PXRE CORP /FA/
89 1002819 00891130 1995-11-16 2002-12-02 ACNAF AIR CANADA INC AIR CANADA /QUEBEC/
90 1004278 35849710 1997-11-14 2000-06-07 FRND FRIENDLY ICE CREAM CORP NEW KEYSTONE INSTITUTIONAL TRUST
91 1004278 35849710 2000-06-08 2007-08-30 FRN FRIENDLY ICE CREAM CORP NEW KEYSTONE INSTITUTIONAL TRUST
92 1010136 45192110 1996-06-04 1999-02-02 IMCX IMAGEMATRIX CORP IMAGEMATRIX CORP
93 1010312 68267810 1980-12-10 1997-11-27 OKE ONEOK INC WESTAR CAPITAL INC
94 1010312 68267810 1980-12-10 1997-11-27 OKE ONEOK INC WESTAR INDUSTRIES INC
95 1013050 74257410 1996-07-02 2000-11-13 AFIS PRINTRAK INTERNATIONAL INC PRINTRAK INTERNATIONAL INC
96 1021255 74037Q10 1993-02-12 1999-08-29 PFM PREFERRED INCOME MGT FD INC HOREJSI ENTERPRISES INC
97 1021255 74037Q10 1993-02-12 1999-08-29 PFM PREFERRED INCOME MGT FD INC HOREJSI INC
98 1022097 89990510 1997-03-26 1998-07-28 TRBDF TURBODYNE TECHNOLOGIES INC TURBODYNE TECHNOLGIES INC
99 1022097 89990510 1997-03-26 1998-07-28 TRBDF TURBODYNE TECHNOLOGIES INC TURBODYNE TECHNOLOGIES INC
100 1022097 89990510 1997-03-26 1998-07-28 TRBDF TURBODYNE TECHNOLOGIES INC TURBODYNE TECHNOLOGIES, INC
101 1022097 89990510 1998-07-29 1999-03-31 TRBD TURBODYNE TECHNOLOGIES INC TURBODYNE TECHNOLGIES INC
102 1022097 89990510 1998-07-29 1999-03-31 TRBD TURBODYNE TECHNOLOGIES INC TURBODYNE TECHNOLOGIES INC
103 1022097 89990510 1998-07-29 1999-03-31 TRBD TURBODYNE TECHNOLOGIES INC TURBODYNE TECHNOLOGIES, INC
104 1024627 M9888A10 1996-11-01 1998-08-05 ZAGIF ZAG INDUSTRIES LTD ZAG INDUSTRIES LTD
105 1035028 62907710 1997-05-21 1998-10-21 NEIP N E I WEBWORLD INC NEI WEBWORLD INC
106 1040476 29439810 1997-12-02 2000-11-15 EBI EQUALITY BANCORP INC EQUALITY BANCORP INC
107 1040649 74871810 1997-06-25 2007-01-19 LQ QUINENCO S A QUINENCO SA
108 1041970 02914P20 1999-09-08 2000-10-12 AFV AMERICAN QUANTUM CYCLES INC AMERICAN QUANTUM CYCLES INC
109 1044278 35849710 1997-11-14 2000-06-07 FRND FRIENDLY ICE CREAM CORP NEW FRIENDLYS RESTAURANTS FRANCHISE INC
110 1044278 35849710 2000-06-08 2007-08-30 FRN FRIENDLY ICE CREAM CORP NEW FRIENDLYS RESTAURANTS FRANCHISE INC
111 1057437 30267410 1998-09-16 2001-08-13 FVNB F V N B CORP FVNB CORP
112 1062023 60753310 1999-02-05 2000-06-05 MMPT MODEM MEDIA POPPE TYSON INC OPTIMARK HOLDINGS INC
113 1062023 60753310 1999-02-05 2000-06-05 MMPT MODEM MEDIA POPPE TYSON INC OPTIMARK TECHNOLOGIES INC
114 1062023 60753310 2000-06-06 2004-10-15 MMPT MODEM MEDIA INC OPTIMARK HOLDINGS INC
115 1062023 60753310 2000-06-06 2004-10-15 MMPT MODEM MEDIA INC OPTIMARK TECHNOLOGIES INC
116 1075710 15691510 2000-05-02 2002-06-17 CERT CERTICOM CORP CERTICOM CORP
117 1096300 M4591910 2000-08-02 2001-08-01 FLRE FLOWARE WIRELESS SYSTEMS INC FLOWARE WIRELESS SYSTEMS LTD
118 1101734 62885310 1999-07-09 2000-12-01 NRW N C L HOLDING ASA ARRASAS LTD
119 1113227 90337P10 1994-01-21 2000-05-03 USWI US WATS INC CAPSULE COMMUNICATIONS INC DE
120 1113227 90337P10 1994-01-21 2000-05-03 USWI US WATS INC CAPSULE COMMUNICATIONS INC
121 1135254 M7024010 2000-08-08 2019-12-31 MNDO MIND C T I LTD EISINGER MONICA
122 1135254 M7024010 2000-08-08 2019-12-31 MNDO MIND C T I LTD IANCU MONICA
123 1200598 87927W10 1997-07-21 2003-08-01 TI TELECOM ITALIA SPA TELECOM ITALIA SPA
124 1218991 97263M10 2002-02-08 2011-05-27 WBD WIMM BILL DANN FOODS O J S C YUSHVAEV GAVRIL A
125 1399815 66986W10 1999-08-18 2004-03-28 NOVA NOVAMED EYECARE INC NOVAMED EYECARE RESEARCH, INC
126 1399815 66986W10 2004-03-29 2010-05-31 NOVA NOVAMED INC NOVAMED EYECARE RESEARCH, INC
bdcallen commented 4 years ago
> View(unique_8_df %>% distinct(cik, cusip6) %>% 
> group_by(cik) %>% summarise(num = n()) %>% 
> filter(num == 1) %>% inner_join(unique_8_df) %>% 
> mutate(ncusip = cusip) %>% 
> inner_join(stocknames, by = 'ncusip') %>% inner_join(ciks) %>% 
> select(cik, ncusip, namedt, nameenddt, ticker, comnam, company_name) %>% 
> filter(cik %in% soft_match_stocknames_ciks))
cik ncusip namedt nameenddt ticker comnam company_name
1 22896 20482G10 1996-04-19 1998-01-30 CSRV COMPUSERVE INC DEL COMPUSERVE CORP
2 50253 45583010 1972-12-14 1998-10-29 IACI INDUSTRIAL ACOUSTICS INC INDUSTRIAL ACOUSTICS CO INC
3 350926 71365810 1984-01-25 1985-10-07 MOVE PEREGRINE ENTERTAINMENT INC PEREGRINE ENTERTAINMENT LTD
4 350926 71365810 1984-01-25 1985-10-07 MOVE PEREGRINE ENTERTAINMENT INC PEREGRINE ENTERTAINMENT
5 768863 88254510 1990-12-19 1997-04-02 TMR TEXAS MERIDIAN RESOURCES CORP TEXAS MERIDIAN RESOURCES LTD
6 768863 88254510 1997-04-03 1997-06-19 TMR TEXAS MERIDIAN RESOURCES CORP TEXAS MERIDIAN RESOURCES LTD
7 789577 89324510 1986-04-04 1997-12-18 TLII TRANS LEASING INTL INC TLI LIQUIDATING CORP
8 789577 89324510 1986-04-04 1997-12-18 TLII TRANS LEASING INTL INC TRANS LEASING INTERNATIONAL INC
9 853927 02744L10 1994-11-21 1999-05-07 ENQ AMERICAN MEDIA INC AMERICAN MEDIA OPERATIONS INC
10 863924 19248110 1993-10-04 1999-03-05 COHO COHO ENERGY INC COHO RESOURCES INC
11 897910 61735710 1987-05-08 2000-10-31 MGC MORGAN GRENFELL SMALLCAP FUND MORGAN GRENFELL CAPITAL MANAGEMENT
12 914344 05873110 1991-11-11 1996-06-18 BGII BALLY GAMING INTERNATIONAL INC BALLY GAMING INC
13 916536 40862310 1994-11-17 2000-02-27 JQH HAMMONS JOHN Q HOTELS INC HAMMONS JOHN Q HOTELS LP
14 916536 40862310 2000-02-28 2005-09-16 JQH HAMMONS JOHN Q HOTELS INC HAMMONS JOHN Q HOTELS LP
15 916537 40862310 1994-11-17 2000-02-27 JQH HAMMONS JOHN Q HOTELS INC HAMMONS JOHN Q HOTELS FINANCE CORP
16 916537 40862310 2000-02-28 2005-09-16 JQH HAMMONS JOHN Q HOTELS INC HAMMONS JOHN Q HOTELS FINANCE CORP
17 928539 64107410 1983-12-22 1991-12-19 NEST NESTOR INC WAND NESTOR INVESTMENTS L P ET AL
18 928539 64107410 1991-12-20 2005-06-30 NA NESTOR INC WAND NESTOR INVESTMENTS L P ET AL
19 1002819 00891130 1995-11-16 2002-12-02 ACNAF AIR CANADA INC AIR CANADA /QUEBEC/
20 1044278 35849710 1997-11-14 2000-06-07 FRND FRIENDLY ICE CREAM CORP NEW FRIENDLYS RESTAURANTS FRANCHISE INC
21 1044278 35849710 2000-06-08 2007-08-30 FRN FRIENDLY ICE CREAM CORP NEW FRIENDLYS RESTAURANTS FRANCHISE INC
22 1096300 M4591910 2000-08-02 2001-08-01 FLRE FLOWARE WIRELESS SYSTEMS INC FLOWARE WIRELESS SYSTEMS LTD
23 1399815 66986W10 1999-08-18 2004-03-28 NOVA NOVAMED EYECARE INC NOVAMED EYECARE RESEARCH, INC
24 1399815 66986W10 2004-03-29 2010-05-31 NOVA NOVAMED INC NOVAMED EYECARE RESEARCH, INC
> View(unique_8_df %>% distinct(cik, cusip6) %>% 
> group_by(cik) %>% summarise(num = n()) %>% 
> filter(num == 1) %>% inner_join(unique_8_df) %>% 
> mutate(ncusip = cusip) %>% inner_join(stocknames, by = 'ncusip') %>% 
> inner_join(ciks) %>% 
> select(cik, ncusip, namedt, nameenddt, ticker, comnam, company_name) %>% 
> filter(cik %in% not_a_match_stocknames_ciks))
cik ncusip namedt nameenddt ticker comnam company_name
1 712468 25059510 1968-01-02 1996-09-27 DSO DE SOTO INC NARRAGANSETT FIRST FUND
2 825288 68219P10 1995-08-01 2004-02-13 ONTC ON TECHNOLOGY CORP INSURED MUNICIPALS INCOME TRUST SERIES 200
3 842180 05959110 1993-11-24 2001-02-27 BFR BANCO FRANCES DEL RIO DE LA PLA BANCO BILBAO VIZCAYA ARGENTARIA S A
4 842180 05959110 1993-11-24 2001-02-27 BFR BANCO FRANCES DEL RIO DE LA PLA BANCO BILBAO VIZCAYA ARGENTARIA, S.A.
5 900385 15043810 1986-12-17 1989-10-01 CEDR CEDAR INCOME FD 1 LTD AEGON USA INC
6 900385 15043810 1989-10-02 2000-02-28 CEDR CEDAR INCOME FUND LTD AEGON USA INC
7 943230 89816810 1995-06-07 2004-08-09 DJT TRUMP HOTELS & CASINO RESRTS IN TAX EXEMPT SECURITIES TRUST NEW YORK TRUST 151
8 1004278 35849710 1997-11-14 2000-06-07 FRND FRIENDLY ICE CREAM CORP NEW KEYSTONE INSTITUTIONAL TRUST
9 1004278 35849710 2000-06-08 2007-08-30 FRN FRIENDLY ICE CREAM CORP NEW KEYSTONE INSTITUTIONAL TRUST
10 1010312 68267810 1980-12-10 1997-11-27 OKE ONEOK INC WESTAR CAPITAL INC
11 1010312 68267810 1980-12-10 1997-11-27 OKE ONEOK INC WESTAR INDUSTRIES INC
12 1021255 74037Q10 1993-02-12 1999-08-29 PFM PREFERRED INCOME MGT FD INC HOREJSI ENTERPRISES INC
13 1021255 74037Q10 1993-02-12 1999-08-29 PFM PREFERRED INCOME MGT FD INC HOREJSI INC
14 1062023 60753310 1999-02-05 2000-06-05 MMPT MODEM MEDIA POPPE TYSON INC OPTIMARK HOLDINGS INC
15 1062023 60753310 1999-02-05 2000-06-05 MMPT MODEM MEDIA POPPE TYSON INC OPTIMARK TECHNOLOGIES INC
16 1062023 60753310 2000-06-06 2004-10-15 MMPT MODEM MEDIA INC OPTIMARK HOLDINGS INC
17 1062023 60753310 2000-06-06 2004-10-15 MMPT MODEM MEDIA INC OPTIMARK TECHNOLOGIES INC
18 1101734 62885310 1999-07-09 2000-12-01 NRW N C L HOLDING ASA ARRASAS LTD
19 1113227 90337P10 1994-01-21 2000-05-03 USWI US WATS INC CAPSULE COMMUNICATIONS INC DE
20 1113227 90337P10 1994-01-21 2000-05-03 USWI US WATS INC CAPSULE COMMUNICATIONS INC
21 1135254 M7024010 2000-08-08 2019-12-31 MNDO MIND C T I LTD EISINGER MONICA
22 1135254 M7024010 2000-08-08 2019-12-31 MNDO MIND C T I LTD IANCU MONICA
23 1218991 97263M10 2002-02-08 2011-05-27 WBD WIMM BILL DANN FOODS O J S C YUSHVAEV GAVRIL A
bdcallen commented 4 years ago
crsp=# SELECT * FROM edgar.filings
crsp-# WHERE cik = 943230;
                  company_name                  | form_type |  cik   | date_filed |                 file_name
------------------------------------------------+-----------+--------+------------+--------------------------------------------
 TAX EXEMPT SECURITIES TRUST NEW YORK TRUST 151 | 24F-2NT   | 943230 | 2001-03-29 | edgar/data/943230/0001116679-01-000657.txt
 TAX EXEMPT SECURITIES TRUST NEW YORK TRUST 151 | 24F-2NT   | 943230 | 2000-03-31 | edgar/data/943230/0000903112-00-000708.txt
 TAX EXEMPT SECURITIES TRUST NEW YORK TRUST 151 | 24F-2NT   | 943230 | 1999-04-01 | edgar/data/943230/0000903112-99-000570.txt
 TAX EXEMPT SECURITIES TRUST NEW YORK TRUST 151 | S-6EL24   | 943230 | 1995-10-23 | edgar/data/943230/0000934850-95-003781.txt
 TAX EXEMPT SECURITIES TRUST NEW YORK TRUST 151 | 24F-2NT   | 943230 | 2002-03-27 | edgar/data/943230/0001116679-02-000831.txt
 TAX EXEMPT SECURITIES TRUST NEW YORK TRUST 151 | 485BPOS   | 943230 | 2002-11-27 | edgar/data/943230/0000045809-02-000091.txt
 TAX EXEMPT SECURITIES TRUST NEW YORK TRUST 151 | 24F-2NT   | 943230 | 2003-03-28 | edgar/data/943230/0001116679-03-001001.txt
 TAX EXEMPT SECURITIES TRUST NEW YORK TRUST 151 | 487       | 943230 | 1996-01-12 | edgar/data/943230/0000950130-96-000107.txt
 TAX EXEMPT SECURITIES TRUST NEW YORK TRUST 151 | 497       | 943230 | 1996-01-19 | edgar/data/943230/0000950130-96-000180.txt
 TAX EXEMPT SECURITIES TRUST NEW YORK TRUST 151 | 24F-2NT   | 943230 | 1997-02-28 | edgar/data/943230/0000903112-97-000315.txt
 TAX EXEMPT SECURITIES TRUST NEW YORK TRUST 151 | SC 13G    | 943230 | 1997-02-12 | edgar/data/943230/0000924355-97-000013.txt
 TAX EXEMPT SECURITIES TRUST NEW YORK TRUST 151 | 24F-2NT   | 943230 | 1998-02-27 | edgar/data/943230/0000903112-98-000440.txt
(12 rows)

@iangow I don't know if you've been looking at my very partially formed comments above so far, but apart from the hilarity of this example, I guess there is a rather pertinent question raised. Is the name of the issuer always meant to be equal to the subject company? Because this is the foundation upon which the whole mapping rests (we scrape the cik of the subject company, not the filer, assuming logically that that is the cik corresponding to the issuer). What is going on in cases like this?

To summarise what I have above so far, I had a look at the 260 odd cases for which an 8-character cusip did not map to a corresponding 9-character one. I defined them in the dataframe unique_8_df. I initially found that, apart from 15 entries where the cusip was equal to SCHEDULE, that the entries for the cusips seemed like they could be legitimate cusips, at least from first glance.

Then I looked at the multiplicities of entries for each cik. So there were a few with more than one cusip6 matched to each cik in this set, with the particularly large outlier of cik = 903954. I then tried matching to stocknames and cusipm.issuer, and it seems that, though some of these cases could be matched by company names, there a quite a few wrong matches by company name, so these perhaps should be chucked out.

I then tried doing a number of joins to stocknames and cusipm.issuer with the cases which had a single cusip6. As you can see from the initial joins, a substantial number do join. I then looked through the sets, and picked out, in both cases, what I deem "soft matches" (not a hard match, up to acronyms, but cases where perhaps the main name of a firm exists in each (e.g MORGAN GRENFELL), with perhaps different company abbreviations (ie. maybe INC instead of CORP, or perhaps INC is missing altogether)), and then the cases where the names, at a glance, clearly don't match (from which I found and investigated the amusing example above). These are the cases we should perhaps keep, assuming the cusip is correct despite in some cases the names not matching.

In addition to the interesting case above, though, there is this case

> issuers %>% filter(grepl('^HOSOI', issuer_name_1)) %>% select(cusip6, issuer_check, issuer_name_1)
# A tibble: 1 x 3
  cusip6 issuer_check issuer_name_1            
  <chr>  <chr>        <chr>                    
1 441047 8            HOSOI GARDEN MORTUARY INC
> cusip_cik %>% filter(cik == 48544) %>% select(file_name, cusip, check_digit, cik, company_name, formats, cusip6)
# A tibble: 2 x 7
  file_name                                   cusip    check_digit   cik company_name              formats cusip6
  <chr>                                       <chr>          <int> <int> <chr>                     <chr>   <chr> 
1 edgar/data/48544/0000950150-96-000080.txt   44090300           3 48544 HOSOI GARDEN MORTUARY INC AC      440903
2 edgar/data/1007155/0000950150-96-000080.txt 44090300           3 48544 HOSOI GARDEN MORTUARY INC AC      440903

where clearly the wrong cusip has been matched. This could very well be a typo, though it would be hard to turn 441047 into 440903 by just typing. Could 4409030011 be an old cusip? Do stocknames and cusipm.issuer contain all previously assigned cusips, or just those in current existence?

iangow commented 4 years ago

@bdcallen I can't really follow what you are doing above. Your code snippets are incomplete. For complete code, the best way is to use RStudio on your own computer and the reprex::reprex() function to render code on the clipboard, which can then be pasted here.

Sys.setenv(PGHOST = "10.101.13.99", PGDATABASE="crsp")
library(dplyr, warn.conflicts = FALSE)
library(DBI)

pg <- dbConnect(RPostgres::Postgres())

rs <- dbExecute(pg, "SET work_mem = '5GB'")
rs <- dbExecute(pg, "SET search_path TO edgar")

stocknames <- tbl(pg, sql("SELECT * FROM crsp.stocknames"))
cusip_cik <- tbl(pg, "cusip_cik_test")

cusip_9 <- 
    cusip_cik %>%
    filter(nchar(cusip)==9L, 
           right(cusip, 1L) == as.character(check_digit)) %>%
    group_by(cik, cusip) %>%
    summarize(num_filings = n()) %>%
    ungroup() %>%
    rename(cusip9 = cusip) %>%
    mutate(cusip = substr(cusip9, 1L, 8L)) %>%
    compute()

cusip_8 <- 
    cusip_cik %>%
    filter(nchar(cusip)==8L) %>%
    group_by(cik, cusip) %>%
    summarize(num_filings = n()) %>%
    ungroup() %>%
    compute()

cusip_8_extra <-
    cusip_8 %>%
    anti_join(cusip_9, by = c("cik", "cusip")) %>%
    arrange(desc(num_filings)) %>%
    compute()

cusip_8_extra %>%
    count()
#> # Source:   lazy query [?? x 1]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#>   n      
#>   <int64>
#> 1 1828

cusip_8_extra %>%
    filter(num_filings >= 10) %>%
    count()
#> # Source:   lazy query [?? x 1]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#>   n      
#>   <int64>
#> 1 198

cusip_8_extra %>%
    filter(num_filings >= 10)
#> # Source:     lazy query [?? x 3]
#> # Database:   postgres [igow@10.101.13.99:5432/crsp]
#> # Ordered by: desc(num_filings)
#>        cik cusip    num_filings
#>      <int> <chr>    <int64>    
#>  1  948642 87927W10 424        
#>  2  811419 02355926  86        
#>  3  897448 02311107  70        
#>  4  818813 91736108  60        
#>  5  320121 8796B200  60        
#>  6 1422109 09548108  56        
#>  7 1297401 26153103  54        
#>  8 1175872 89462609  54        
#>  9  922475 4506B109  54        
#> 10  726601 13974105  52        
#> # … with more rows

Created on 2020-06-19 by the reprex package (v0.3.0)

iangow commented 4 years ago

I don't think we care about CUSIP-CIK combinations that don't appear in a number of filings. I think a minimum of about 10 works best, as this eliminates a lot of errors that show up in one or two filings.

So I think we have 198 combinations to examine. Looking at the first one, I'd say it's a valid match:

Sys.setenv(PGHOST = "10.101.13.99", PGDATABASE="crsp")
library(dplyr, warn.conflicts = FALSE)
library(DBI)

pg <- dbConnect(RPostgres::Postgres())

rs <- dbExecute(pg, "SET work_mem = '5GB'")
rs <- dbExecute(pg, "SET search_path TO edgar")

stocknames <- tbl(pg, sql("SELECT * FROM crsp.stocknames"))
cusip_cik <- tbl(pg, "cusip_cik_test")

cusip_cik %>% 
    filter(cusip %~% '^87927W') %>% 
    count(cusip, cik, company_name)
#> # Source:   lazy query [?? x 4]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#> # Groups:   cusip, cik
#>   cusip        cik company_name         n      
#>   <chr>      <int> <chr>                <int64>
#> 1 87927W10  948642 TELECOM ITALIA S P A 398    
#> 2 87927W10  948642 TIM S.p.A.            26    
#> 3 87927W10 1200598 TELECOM ITALIA SPA     2

cusip_cik %>% 
    filter(cik == 948642) %>% 
    count(cusip)
#> # Source:   lazy query [?? x 2]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#>   cusip     n      
#>   <chr>     <int64>
#> 1 003497168   2    
#> 2 87927W10  424    
#> 3 87927Y201   8    
#> 4 T92778108   8    
#> 5 T92778124  12    
#> 6 <NA>       29

stocknames %>% 
    filter(ncusip %~% '^87927W') %>% 
    select(permno, ncusip, comnam)
#> # Source:   lazy query [?? x 3]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#>   permno ncusip   comnam            
#>    <int> <chr>    <chr>             
#> 1  81781 87927W10 TELECOM ITALIA SPA
#> 2  81782 87927W20 TELECOM ITALIA SPA

Created on 2020-06-19 by the reprex package (v0.3.0)

iangow commented 4 years ago

The next one is a bit more complicated. I'd guess that 02355926 is not a valid CUSIP. So we don't want it.

Sys.setenv(PGHOST = "10.101.13.99", PGDATABASE="crsp")
library(dplyr, warn.conflicts = FALSE)
library(DBI)

pg <- dbConnect(RPostgres::Postgres())

rs <- dbExecute(pg, "SET work_mem = '5GB'")
rs <- dbExecute(pg, "SET search_path TO edgar")

stocknames <- tbl(pg, sql("SELECT * FROM crsp.stocknames"))
cusip_cik <- tbl(pg, "cusip_cik_test")
issue <- tbl(pg, sql("SELECT * FROM cusipm.issue"))

cik <- 948642
cusip <- "02355926"
cusip6 <- paste0('^', substr(cusip, 1, 6))

cusip_cik %>% 
    filter(cusip %~% cusip6) %>% 
    count(cusip, cik, company_name)
#> # Source:   lazy query [?? x 4]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#> # Groups:   cusip, cik
#>   cusip        cik company_name               n      
#>   <chr>      <int> <chr>                      <int64>
#> 1 023559206 811419 AMERALIA INC                6     
#> 2 02355926  811419 AMERALIA INC               80     
#> 3 02355926  811419 Natural Resources USA Corp  6

cusip_cik %>% 
    filter(cik == cik) %>% 
    count(cusip)
#> # Source:   lazy query [?? x 2]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#>    cusip     n      
#>    <chr>     <int64>
#>  1 000000     2     
#>  2 00000000  12     
#>  3 000000000 50     
#>  4 000015228  2     
#>  5 000020671  6     
#>  6 0000206F1  6     
#>  7 000022551 48     
#>  8 000049331  2     
#>  9 000050811  2     
#> 10 00006425   6     
#> # … with more rows

stocknames %>% 
    filter(ncusip %~% cusip6) %>% 
    select(permno, ncusip, comnam)
#> # Source:   lazy query [?? x 3]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#>   permno ncusip   comnam      
#>    <int> <chr>    <chr>       
#> 1  11642 02355910 AMERALIA INC
#> 2  11642 02355920 AMERALIA INC
#> 3  11642 02355920 AMERALIA INC

issue %>% filter(issuer_num == cusip6)
#> # Source:   lazy query [?? x 69]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#> # … with 69 variables: issuer_num <chr>, issue_num <chr>, issue_check <chr>,
#> #   issue_desc_1 <chr>, issue_desc_2 <chr>, issue_adl_1 <chr>,
#> #   issue_adl_2 <chr>, issue_adl_3 <chr>, issue_adl_4 <chr>,
#> #   issue_status <chr>, dated_date <date>, maturity_date <date>,
#> #   partial_maturity <dbl>, rate <dbl>, issue_del_date <date>,
#> #   issue_transaction <chr>, issue_update_date <date>, cusip8 <chr>,
#> #   alternative_min_tax <chr>, bank_q <chr>, callable <chr>,
#> #   activity_date <date>, first_coupon_date <date>, init_pub_off <chr>,
#> #   payment_frequency <chr>, currency_code <chr>, domicile_code <chr>,
#> #   underwriter <chr>, us_cfi_code <chr>, closing_date <date>,
#> #   ticker_symbol <chr>, iso_cfi <chr>, depos_eligible <chr>, pre_refund <chr>,
#> #   refundable <chr>, remarketed <chr>, sinking_fund <chr>, taxable <chr>,
#> #   form <chr>, enhancements <chr>, fund_distrb_policy <chr>,
#> #   fund_inv_policy <chr>, fund_type <chr>, guarantee <chr>, income_type <chr>,
#> #   insured_by <chr>, ownership_restr <chr>, payment_status <chr>,
#> #   preferred_type <chr>, putable <chr>, rate_type <chr>, redemption <chr>,
#> #   source_doc <chr>, sponsoring <chr>, voting_rights <chr>,
#> #   warrant_assets <chr>, warrant_status <chr>, warrant_type <chr>,
#> #   where_traded <chr>, auditor <chr>, paying_agent <chr>, tender_agent <chr>,
#> #   xfer_agent_id <chr>, bond_counsel <chr>, financial_advisor <chr>,
#> #   competitive_sale_date <date>, sale_type <chr>, offering_amount <dbl>,
#> #   offering_amount_code <chr>

Created on 2020-06-19 by the reprex package (v0.3.0)

iangow commented 4 years ago

It seems we have 000000000 as a CUSIP for the previous case. We should probably kill these (obviously invalid) CUSIPs before populating the table.

bdcallen commented 4 years ago

So I have redefined cusip_cik to have a multiplicity variable.

cusip_cik <- cusip_cik %>% 
                    group_by(cik, cusip) %>% 
                    summarise(multiplicity = n()) %>% 
                    inner_join(cusip_cik) %>% ungroup()

cusip_cik <- cusip_cik %>% 
                    group_by(cik, cusip6) %>% 
                    summarise(n_cusip6 = n()) %>% 
                    inner_join(cusip_cik) %>% ungroup()

has_cusip9_match <- cusip_cik %>% filter(cusip_length == 8) %>% 
    inner_join(cusip_cik, by = 'cik') %>% 
    filter(cusip_length.y == 9) %>%
     filter(cusip.x == substr(cusip.y, 1, 8)) %>% distinct(cik, cusip.x) %>% 
     rename(cusip = cusip.x)

has_cusip9_match <- has_cusip9_match %>% 
                                   inner_join(cusip_cik) %>% 
                                   distinct(cik, cusip, cusip6, multiplicity, n_cusip6)

unique_8_df <- cusip_cik %>% filter(cusip_length == 8) %>% 
               anti_join(has_cusip9_match, by = c('cik', 'cusip')) %>% 
               distinct(cik, cusip) %>% mutate(cusip6 = substr(cusip, 1, 6))

unique_8_df <- unique_8_df %>% 
                         inner_join(cusip_cik) %>% 
                         distinct(cik, cusip, cusip6, multiplicity, n_cusip6)

> unique_8_df %>% mutate(ncusip = cusip) %>% 
   inner_join(stocknames) %>% filter(multiplicity >= 10) %>% 
   distinct(cik, cusip) %>% head(10)
Joining, by = c("cusip", "ncusip")
# A tibble: 10 x 2
      cik cusip   
    <int> <chr>   
 1  26537 23371210
 2  45791 70355510
 3  97483 71891320
 4  99982 74100410
 5  99982 74100420
 6 723616 28511410
 7 729213 74791410
 8 737033 74835710
 9 750556 22463310
10 750556 22463320

> unique_8_df %>% mutate(ncusip = cusip) %>% inner_join(stocknames) %>% 
   filter(multiplicity >= 10) %>% 
   distinct(cik, cusip) %>% tail(14)
Joining, by = c("cusip", "ncusip")
# A tibble: 14 x 2
       cik cusip   
     <int> <chr>   
 1  750556 69149730
 2  750556 80653010
 3  750556 89235610
 4  791164 42219B10
 5  812482 80190410
 6  829499 46072310
 7  854873 00760G10
 8  882264 03852L10
 9  904081 63577110
10  904900 08372L10
11  948642 87927W10
12 1013050 74257410
13 1093367 02932810
14 1125259 14365830

> stockn_or_issuer <- bind_rows(issuer_matches, stockname_matches %>%
                                    anti_join(issuer_matches)) %>% 
                                    arrange(cik, cusip, cusip6)
Joining, by = c("cik", "cusip", "cusip6", "multiplicity", "n_cusip6", "company_name")
> stockn_or_issuer
# A tibble: 1,181 x 8
     cik cusip    cusip6 multiplicity n_cusip6 company_name                     issuer_name_1              comnam
   <int> <chr>    <chr>         <int>    <int> <chr>                            <chr>                      <chr> 
 1  2135 48625010 486250            4        4 AFFILIATED COMPUTER SERVICES INC KAUFMANN FD INC            NA    
 2  2491 36465410 364654           16       16 ALLIANCE GAMING CORP             GAMING & TECHNOLOGY INC    NA    
 3  3327 01306801 013068            2      326 ALBERTO CULVER CO                ALBERTO CULVER CO          NA    
 4  3941 01941107 019411            6       42 ALLIED PRODUCTS CORP /DE/        ALLIED PRODS CORP DEL      NA    
 5  3952 19483106 194831            2        2 ALLIED RESEARCH CORP             COLLINS & AIKMAN GROUP INC NA    
 6  4457 02359100 023591            6        6 AMERCO /NV/                      AMERCO INC                 NA    
 7  4707 02489810 024898            2        2 AMERICAN CAPITAL CORP            AMERICAN CAP CORP          NA    
 8  5009 02604201 026042            4       22 AMERICAN FILTRONA CORP           AMERICAN FILTRONA CORP     NA    
 9  5177 23426703 234267            2        2 AMDURA CORP                      DAL PETE CO                NA    
10  5981 30371108 303711            2        2 AMERICAN VANGUARD CORP           FAIRCHILD INDS INC         NA    
# … with 1,171 more rows
> stockn_or_issuer %>% filter(multiplicity >= 10)
# A tibble: 116 x 8
     cik cusip    cusip6 multiplicity n_cusip6 company_name                     issuer_name_1            comnam
   <int> <chr>    <chr>         <int>    <int> <chr>                            <chr>                    <chr> 
 1  2491 36465410 364654           16       16 ALLIANCE GAMING CORP             GAMING & TECHNOLOGY INC  NA    
 2  8038 04635208 046352           11       11 ASTREX INC                       ASTRALIS LTD             NA    
 3 16496 13152106 131521           72       72 CALPROP CORP                     CALVERT CASH RESVS       NA    
 4 26537 23371210 233712           12       12 DAEDALUS ENTERPRISES INC         DAEDALUS ENTERPRISES INC NA    
 5 33073 29403700 294037           22       94 VISKASE COMPANIES INC            ENVIRODYNE INDS INC      NA    
 6 45791 70355510 703555           10       36 HARRIS PAUL STORES INC           PAUL HARRIS STORES INC   NA    
 7 57139 5024709C 502470           26       52 LABARGE INC                      LA BARGE INC             NA    
 8 69067 62632010 626320           18       38 MUNSINGWEAR INC                  MUNSINGWEAR INC          NA    
 9 69067 62632010 626320           18       38 PREMIUMWEAR INC                  MUNSINGWEAR INC          NA    
10 70415 63713010 637130           10       20 NATIONAL PATENT DEVELOPMENT CORP NATIONAL PATENT DEV CORP NA    
# … with 106 more rows
> stockn_or_issuer %>% filter(n_cusip6 >= 10)
# A tibble: 382 x 8
     cik cusip    cusip6 multiplicity n_cusip6 company_name                         issuer_name_1             comnam
   <int> <chr>    <chr>         <int>    <int> <chr>                                <chr>                     <chr> 
 1  2491 36465410 364654           16       16 ALLIANCE GAMING CORP                 GAMING & TECHNOLOGY INC   NA    
 2  3327 01306801 013068            2      326 ALBERTO CULVER CO                    ALBERTO CULVER CO         NA    
 3  3941 01941107 019411            6       42 ALLIED PRODUCTS CORP /DE/            ALLIED PRODS CORP DEL     NA    
 4  5009 02604201 026042            4       22 AMERICAN FILTRONA CORP               AMERICAN FILTRONA CORP    NA    
 5  6260 03237116 032371            4       88 ANACOMP INC                          ANACOMP INC IND           NA    
 6  8038 04635208 046352           11       11 ASTREX INC                           ASTRALIS LTD              NA    
 7  8497 04621M02 04621M            4       18 ASSURANCEAMERICA CORP                ASSURANCEAMERICA CORP     NA    
 8 12779 09600513 096005            4       34 BLUE RIDGE REAL ESTATE CO            BLUE RIDGE REAL ESTATE CO NA    
 9 14637 11522300 115223            1       51 BROWN & SHARPE MANUFACTURING CO /DE/ BROWN & SHARPE MFG CO     NA    
10 14803 11566021 115660            2      136 BROWN TOM INC /DE                    BROWN TOM INC             NA    
# … with 372 more rows
iangow commented 4 years ago

@bdcallen I think what you need to do here is to investigate some more of the 289 CUSIP-CIK combinations I identified above. I examined two of these; I think we might want to do ~10-12 in total. Once you've investigated the issues we can figure out what to do with them.

bdcallen commented 4 years ago

@iangow I see the essential difference between our analyses now. I previously selected for 8-digit cusips which under inner joining back onto cusip_cik by cik, did not have any other record that had a 9 digit cusip, for that cik. Whereas, you have more or less done the same thing, but with the additional constraint of matching the 8-digit cusip to the first 8 digits of a record with a 9-digit cusip (I essentially repeated the approach in the post just above). Hence, the great disparity in the numbers of 8-digit cusips to look at before taking the multiplicities of the cik-cusip combinations into account (245 versus around 3000). After imposing a multiplicity of 10 or above, the number I had reduces to 20

has_cusip9_any <- cusip_cik %>% filter(cusip_length == 8) %>% 
    inner_join(cusip_cik, by = 'cik') %>% 
    filter(cusip_length.y == 9) %>% distinct(cik, cusip.x) %>% 
    rename(cusip = cusip.x)

eight_no_cusip9_df <- cusip_cik %>% filter(cusip_length == 8) %>% 
    anti_join(has_cusip9_any, by = c('cik', 'cusip')) %>% 
    distinct(cik, cusip) %>% mutate(cusip6 = substr(cusip, 1, 6))

eight_no_cusip9_df <- eight_no_cusip9_df %>% inner_join(cusip_cik) %>% distinct(cik, cusip, cusip6, multiplicity, n_cusip6)

> eight_no_cusip9_df %>% filter(multiplicity >= 10)
# A tibble: 20 x 5
       cik cusip    cusip6 multiplicity n_cusip6
     <int> <chr>    <chr>         <int>    <int>
 1   74154 68267810 682678           18       18
 2   97483 71891320 718913           20       20
 3  719264 31946210 319462           15       15
 4  729213 74791410 747914           12       12
 5  737033 74835710 748357           14       16
 6  754128 06859200 068592           20       20
 7  777844 20992910 209929           12       14
 8  842180 05959110 059591           12       12
 9  882264 03852L10 03852L           18       18
10  895742 9900307A 990030           10       10
11  904081 63577110 635771           12       12
12  904900 08372L10 08372L           14       14
13  924835 88871110 888711           10       10
14 1010566 6911N103 6911N1           34       34
15 1013050 74257410 742574           10       10
16 1038370 8716Y107 8716Y1           10       10
17 1064133 98144Q10 98144Q           12       12
18 1093367 02932810 029328           12       12
19 1113227 90337P10 90337P           34       34
20 1520358 57660106 576601           40       40

I think I made the initial join that way, on the assumption that any 9-digit cusip candidate is preferable for a given cik than an 8-digit one. Also, if an 8 digit cusip is essentially a 9-digit one, but with say the 4th digit omitted by mistake, that will not be picked up when we anti-join only the 9-digit cases where the first 8 digits match the 8-digit cusip.

bdcallen commented 4 years ago

Here, in my more recent version of unique_8_df, I had 287 candidates after selecting for the required multiplicity

> unique_8_df %>% filter(multiplicity >= 10)
# A tibble: 287 x 5
     cik cusip    cusip6 multiplicity n_cusip6
   <int> <chr>    <chr>         <int>    <int>
 1  2491 36465410 364654           16       16
 2  3545 01630104 016301           12       12
 3  5133 02375204 023752           12       12
 4  6281 32654105 326541           10       10
 5  6292 SEECOVER SEECOV           14       14
 6  6715 PAGE2OF4 PAGE2O           12       12
 7  8038 04635208 046352           11       11
 8  9346 57755100 577551           10       10
 9 10119 68509603 685096           14       14
10 11544 84423102 844231           12       12
# … with 277 more rows

Then doing

stockname_matches <- unique_8_df %>% mutate(ncusip = cusip) %>% inner_join(stocknames) %>% distinct(cik, cusip)
issuer_matches <- unique_8_df %>% inner_join(issuers) %>% distinct(cik, cusip)

stockn_or_issuer <- bind_rows(issuer_matches, stockname_matches %>% anti_join(issuer_matches))
> stockn_or_issuer %>% inner_join(cusip_cik) %>% 
   filter(multiplicity >= 10) %>% 
   distinct(cusip, cik) %>% count()
Joining, by = c("cik", "cusip")
# A tibble: 1 x 1
      n
  <int>
1   112

I get that 112 of the 287 candidates can be joined to either stocknames or issuers.

bdcallen commented 4 years ago

@iangow I guess in some of my recent posts, I have tried to look at how defining how we match 8-digit cusip records to 9-digit ones changes things. For instance, should we eliminate the cases which can join to any 9-digit cusip? Or just those 9-digit cases where the multiplicity is more than or equal to 10?

> cusip8s <- cusip_cik %>% filter(cusip_length == 8)
> cusip8s %>% inner_join(cusip_cik, by = 'cik') %>% 
> filter(cusip_length.y == 9) %>% distinct(cik, cusip.x, multiplicity.x) %>% 
> rename(cusip = cusip.x, multiplicity = multiplicity.x) %>% 
> filter(multiplicity >= 10) %>% count() 
# A tibble: 1 x 1       
        n   
 <int> 
1   764 
> cusip8s %>% inner_join(cusip_cik, by = 'cik') %>% 
> filter(cusip_length.y == 9 && multiplicity.y >= 10) %>% 
> distinct(cik, cusip.x, multiplicity.x) %>% 
> rename(cusip = cusip.x, multiplicity = multiplicity.x) %>% 
> filter(multiplicity >= 10) %>% count() 
# A tibble: 1 x 1       
        n   
 <int> 
1   784

 

bdcallen commented 4 years ago

@iangow Can you explain this to me, especially when we both independently found there to be excess 8-digit cases in other analyses above?

cusip8s <- cusip_cik %>% filter(cusip_length == 8)
eight_w_valid9 <- cusip8s %>% 
                             inner_join(cusip_cik, by = 'cik') %>% 
                             filter(cusip_length.y == 9 && multiplicity.y >= 10) %>% 
                             distinct(cik, cusip.x, multiplicity.x) %>% 
                             rename(cusip = cusip.x, multiplicity = multiplicity.x)

> cusip8s %>% anti_join(eight_w_valid9)
Joining, by = c("cik", "cusip", "multiplicity")
# A tibble: 0 x 20
# … with 20 variables: cik <int>, cusip6 <chr>, n_cusip6 <int>, cusip <chr>, multiplicity <int>, file_name <chr>, check_digit <int>, company_name <chr>,
#   formats <chr>, cusip_length <int>, num_cusip_unique_char <dbl>, cusip8 <chr>, cusip7_1 <chr>, cusip6_2 <chr>, cusip5_3 <chr>, cusip4_4 <chr>, cusip3_5 <chr>,
#   cusip2_6 <chr>, cusip1_7 <chr>, cusip0_8 <chr>
bdcallen commented 4 years ago

Oh, I have joined just by cik this time initially, not by file_name and cik (for eight_w_valid9).

bdcallen commented 4 years ago

Also, not joining by file_name like I did initially meant many clearly unrelated 8's and 9's were matched. Doing this this time fixed the problem

cusip8s <- cusip_cik %>% filter(cusip_length == 8)
eight_w_valid9 <- cusip8s %>% inner_join(cusip_cik, by = 'cik') %>% 
                  filter(cusip_length.y == 9 & multiplicity.y >= 10 
                         & substr(cusip.y, 9L, 9L) == as.character(check_digit.y) 
                         & substr(cusip.y, 1, 8) == cusip.x) %>% 
                  distinct(cik, cusip.x, multiplicity.x) %>% 
                  rename(cusip = cusip.x, multiplicity = multiplicity.x)

> cusip8s %>% anti_join(eight_w_valid9)
Joining, by = c("cik", "cusip", "multiplicity")
# A tibble: 16,250 x 20
     cik cusip6 n_cusip6 cusip multiplicity file_name check_digit company_name formats cusip_length num_cusip_uniqu… cusip8 cusip7_1 cusip6_2 cusip5_3 cusip4_4
   <int> <chr>     <int> <chr>        <int> <chr>           <int> <chr>        <chr>          <int>            <dbl> <chr>  <chr>    <chr>    <chr>    <chr>   
 1  1923 037881        2 0378…            2 edgar/da…           7 ABRAMS INDU… C                  8                6 03788… 0378810  0378816  0378806  0378106 
 2  1923 037881        2 0378…            2 edgar/da…           7 ABRAMS INDU… C                  8                6 03788… 0378810  0378816  0378806  0378106 
 3  1961 98159J        5 9815…            1 edgar/da…           1 Worlds.com,… D                  8                6 98159… 98159J1  98159J0  9815910  9815J10 
 4  2062 004162        4 0041…            4 edgar/da…           9 ACMAT CORP   C                  8                6 00416… 0041620  0041627  0041607  0041207 
 5  2062 004162        4 0041…            4 edgar/da…           9 ACMAT CORP   C                  8                6 00416… 0041620  0041627  0041607  0041207 
 6  2062 004162        4 0041…            4 edgar/da…           9 ACMAT CORP   C                  8                6 00416… 0041620  0041627  0041607  0041207 
 7  2062 004162        4 0041…            4 edgar/da…           9 ACMAT CORP   C                  8                6 00416… 0041620  0041627  0041607  0041207 
 8  2093 004724        4 0047…            2 edgar/da…           0 ACME METALS… D                  8                5 00472… 0047241  0047240  0047210  0047410 
 9  2093 004724        4 0047…            2 edgar/da…           0 ACME METALS… D                  8                5 00472… 0047241  0047240  0047210  0047410 
10  2135 486250        4 4862…            4 edgar/da…           3 AFFILIATED … C                  8                7 48625… 4862501  4862500  4862510  4862010 
# … with 16,240 more rows, and 4 more variables: cusip3_5 <chr>, cusip2_6 <chr>, cusip1_7 <chr>, cusip0_8 <chr>

> cusip8s %>% anti_join(eight_w_valid9) %>% distinct(cusip, cik) %>% count()
Joining, by = c("cik", "cusip", "multiplicity")
# A tibble: 1 x 1
      n
  <int>
1  3485

> cusip8s %>% anti_join(eight_w_valid9) %>% 
+     filter(multiplicity >= 10) %>% distinct(cusip, cik) %>% count()
Joining, by = c("cik", "cusip", "multiplicity")
# A tibble: 1 x 1
      n
  <int>
1   335

Will repeat the other steps

iangow commented 4 years ago

OK. Please replace comments above that are erroneous so that when finalize, we only have what's relevant.

bdcallen commented 4 years ago
> match_stocknames <- cusip8s %>% anti_join(eight_w_valid9) %>% 
     filter(multiplicity >= 10) %>% 
     distinct(cusip, cik) %>% mutate(ncusip = cusip) %>% 
     inner_join(stocknames) %>% distinct(cusip, cik)
Joining, by = c("cik", "cusip", "multiplicity")
Joining, by = c("cusip", "ncusip")
> match_stocknames %>% count()
# A tibble: 1 x 1
      n
  <int>
1    48

> match_issuers <- cusip8s %>% anti_join(eight_w_valid9) %>% 
                          filter(multiplicity >= 10) %>% distinct(cusip, cik) %>% 
                          mutate(cusip6 = substr(cusip, 1, 6)) %>% 
                          inner_join(issuers) %>% distinct(cusip, cik)

> match_issuers %>% count()
# A tibble: 1 x 1
      n
  <int>
1   157

match_either <- bind_rows(match_issuers, match_stocknames %>% 
                          anti_join(match_issuers, by = c('cik', 'cusip')))

> match_either %>% count()
# A tibble: 1 x 1
      n
  <int>
1   159

> 100.00 * 159/335
[1] 47.46269

> match_either %>% inner_join(cusip_cik, by = 'cik') %>% 
+     filter(cusip_length == 9 & multiplicity.y < 10 & 
+                multiplicity.x + multiplicity.y >= 10 & 
+                substr(cusip.y, 9L, 9L) == as.character(check_digit) & 
+                substr(cusip.y, 1, 8) == cusip.x) %>% distinct(cik, cusip.x) %>% count()
# A tibble: 1 x 1
      n
  <int>
1    46

So I had 335, as opposed to 289, cases due to the constraint of having the 9-digit cusips have a multiplicity of at least 10. After fixing the AND operator mistake, just under half of the cases can be joined to either crsp.stocknames or cusipm.issuer (represented by match_either).

bdcallen commented 4 years ago

@iangow Here's a look at the first few entries with names for those cases that match cusipm.issuer (note I have changed the variable issuer_num to cusip6 is issuers all way through my analysis for convenience)

> match_issuers %>% left_join(issuers) %>% left_join(ciks) %>% distinct(cik, cusip, cusip6, issuer_name_1, company_name, issuer_adl_1, issuer_adl_2, issuer_type) %>% head(20)
Joining, by = "cusip6"
Joining, by = "cik"
# A tibble: 20 x 8
     cik cusip    cusip6 issuer_name_1            company_name              issuer_adl_1                  issuer_adl_2                   issuer_type
   <int> <chr>    <chr>  <chr>                    <chr>                     <chr>                         <chr>                          <chr>      
 1  2491 36465410 364654 GAMING & TECHNOLOGY INC  ALLIANCE GAMING CORP      NAME CHANGED TO UNITED GAMING INC 12/08/1988      SEE 910475 C          
 2  2491 36465410 364654 GAMING & TECHNOLOGY INC  BALLY TECHNOLOGIES, INC.  NAME CHANGED TO UNITED GAMING INC 12/08/1988      SEE 910475 C          
 3  8038 04635208 046352 ASTRALIS LTD             ASTREX INC                FORMERLY ASTRALIS             PHARMACEUTICALS LTD 12/10/2001 C          
 4 16496 13152106 131521 CALVERT CASH RESVS       CALPROP CORP              NAME CHANGED TO MONEY MGMT    PLUS 09/10/1984     SEE 609918 C          
 5 26537 23371210 233712 DAEDALUS ENTERPRISES INC ARGON ST, INC.            NAME CHANGED TO SENSYS        TECHNOLOGIES INC 06/09/1998    C          
 6 26537 23371210 233712 DAEDALUS ENTERPRISES INC DAEDALUS ENTERPRISES INC  NAME CHANGED TO SENSYS        TECHNOLOGIES INC 06/09/1998    C          
 7 26537 23371210 233712 DAEDALUS ENTERPRISES INC SENSYS TECHNOLOGIES INC   NAME CHANGED TO SENSYS        TECHNOLOGIES INC 06/09/1998    C          
 8 26537 23371210 233712 DAEDALUS ENTERPRISES INC SENSYTECH INC             NAME CHANGED TO SENSYS        TECHNOLOGIES INC 06/09/1998    C          
 9 33073 29403700 294037 ENVIRODYNE INDS INC      ENVIRODYNE INDUSTRIES INC FORMERLY ENVIRODYNE INC TO    03/03/1978 AND NAME CHANGED TO C          
10 33073 29403700 294037 ENVIRODYNE INDS INC      VISKASE COMPANIES INC     FORMERLY ENVIRODYNE INC TO    03/03/1978 AND NAME CHANGED TO C          
11 38264 34986210 349862 FORWARD INDS INC N Y     FORWARD INDUSTRIES INC    NA                            NA                             C          
12 38264 34986210 349862 FORWARD INDS INC N Y     FORWARD INDUSTRIES, INC.  NA                            NA                             C          
13 45791 70355510 703555 PAUL HARRIS STORES INC   HARRIS PAUL STORES INC    NA                            NA                             C          
14 45791 70355510 703555 PAUL HARRIS STORES INC   PAUL HARRIS STORES INC    NA                            NA                             C          
15 57139 5024709C 502470 LA BARGE INC             LABARGE INC               NA                            NA                             C          
16 59963 25474510 254745 DISTINCTIVE DEVICES INC  DISTINCTIVE DEVICES INC   NA                            NA                             C          
17 65270 59152010 591520 METHODE ELECTRS INC      METHODE ELECTRONICS INC   NA                            NA                             C          
18 69067 62632010 626320 MUNSINGWEAR INC          MUNSINGWEAR INC           NAME CHANGED TO PREMIUMWEAR   INC 09/06/1996      SEE 740909 C          
19 69067 62632010 626320 MUNSINGWEAR INC          PREMIUMWEAR INC           NAME CHANGED TO PREMIUMWEAR   INC 09/06/1996      SEE 740909 C          
20 70415 63713010 637130 NATIONAL PATENT DEV CORP GP STRATEGIES CORP        NAME CHANGED TO GP STRATEGIES CORP 03/09/1998 SEE 36225V     C   
bdcallen commented 4 years ago

Seems like the first entry above is wrong.

> issuers %>% filter(grepl("^ALLIANCE GAMING CORP", issuer_name_1)) %>% mutate(issuer_name = paste0(na.omit(issuer_name_1, issuer_name_2, issuer_name_3), collapse = ''), issuer_adl = paste0(na.omit(issuer_adl_1, issuer_adl_2, issuer_adl_3), collapse = '')) %>% select(cusip6, issuer_name, issuer_adl, issuer_status)
# A tibble: 1 x 4
  cusip6 issuer_name          issuer_adl            issuer_status
  <chr>  <chr>                <chr>                 <chr>        
1 01859P ALLIANCE GAMING CORP NAME CHANGED TO BALLY D            
> issuers %>% filter(cusip6 == '05874B') %>% mutate(issuer_name = paste0(na.omit(issuer_name_1, issuer_name_2, issuer_name_3), collapse = ''), issuer_adl = paste0(na.omit(issuer_adl_1, issuer_adl_2, issuer_adl_3), collapse = '')) %>% select(cusip6, issuer_name, issuer_adl, issuer_status)
# A tibble: 1 x 4
  cusip6 issuer_name            issuer_adl issuer_status
  <chr>  <chr>                  <chr>      <chr>        
1 05874B BALLY TECHNOLOGIES INC ""         A            
> issuers %>% filter(grepl("^GAMING & TECHNOLOGY INC", issuer_name_1)) %>% mutate(issuer_name = paste0(na.omit(issuer_name_1, issuer_name_2, issuer_name_3)), issuer_adl = paste0(na.omit(issuer_adl_1, issuer_adl_2, issuer_adl_3))) %>% select(cusip6, issuer_name, issuer_adl, issuer_status)
# A tibble: 1 x 4
  cusip6 issuer_name             issuer_adl                    issuer_status
  <chr>  <chr>                   <chr>                         <chr>        
1 364654 GAMING & TECHNOLOGY INC NAME CHANGED TO UNITED GAMING D 
bdcallen commented 4 years ago
> match_issuers %>% left_join(issuers) %>% left_join(ciks) %>% distinct(cik, cusip, cusip6, issuer_name_1, company_name, issuer_adl_1, issuer_adl_2, issuer_type) %>% filter(company_name == issuer_name_1)
Joining, by = "cusip6"
Joining, by = "cik"
# A tibble: 41 x 8
      cik cusip    cusip6 issuer_name_1            company_name             issuer_adl_1                 issuer_adl_2                   issuer_type
    <int> <chr>    <chr>  <chr>                    <chr>                    <chr>                        <chr>                          <chr>      
 1  26537 23371210 233712 DAEDALUS ENTERPRISES INC DAEDALUS ENTERPRISES INC NAME CHANGED TO SENSYS       TECHNOLOGIES INC 06/09/1998    C          
 2  45791 70355510 703555 PAUL HARRIS STORES INC   PAUL HARRIS STORES INC   NA                           NA                             C          
 3  59963 25474510 254745 DISTINCTIVE DEVICES INC  DISTINCTIVE DEVICES INC  NA                           NA                             C          
 4  69067 62632010 626320 MUNSINGWEAR INC          MUNSINGWEAR INC          NAME CHANGED TO PREMIUMWEAR  INC 09/06/1996      SEE 740909 C          
 5  74154 68267810 682678 ONEOK INC                ONEOK INC                REORGANIZED AS ONEOK INC NEW 11/28/1997                     C          
 6 216039 40009510 400095 GRUBB & ELLIS CO         GRUBB & ELLIS CO         NA                           NA                             C          
 7 319156 77742900 777429 ROSELAND OIL & GAS INC   ROSELAND OIL & GAS INC   REORGANIZED AS CUBIC ENERGY  INC  10/04/1999 SEE 229675     C          
 8 720851 64107410 641074 NESTOR INC               NESTOR INC               NA                           NA                             C          
 9 729213 74791410 747914 Q MED INC                Q MED INC                NA                           NA                             C          
10 737033 74835710 748357 QUESTECH INC             QUESTECH INC             NA                           NA                             C          
# … with 31 more rows

> match_issuers %>% left_join(issuers) %>% left_join(ciks) %>% distinct(cik, cusip, cusip6, issuer_name_1, company_name, issuer_adl_1, issuer_adl_2, issuer_type) %>% filter(company_name == issuer_name_1) %>% distinct(cusip, cik) %>% count()
Joining, by = "cusip6"
Joining, by = "cik"
# A tibble: 1 x 1
      n
  <int>
1    41

So 41 of the of the 159 in match_issuers can be called definitive matches by company_name == issuer_name_1. This provides a lower bound on how good the matches in match_issuers are (it could be better than this, knowing how inconsistent different databases can be in writing the name of some company).

bdcallen commented 4 years ago

@iangow So I've done this (getting rid of spaces in both names got 3 or 4 more matches)

> View(match_issuers %>% 
             anti_join(match_issuers %>% 
                            left_join(issuers) %>% 
                            left_join(ciks) %>% 
                            distinct(cik, cusip, cusip6, issuer_name_1, 
               company_name, issuer_adl_1, issuer_adl_2, issuer_type) %>%                  
                            mutate(collapsed_issuer_name = gsub(' ', '', issuer_name_1), 
       collapsed_company_name = gsub(' ', '', company_name)) %>% 
                        filter(collapsed_issuer_name == collapsed_company_name) %>% 
                         distinct(cusip, cik, company_name, issuer_name_1)) %>% 
                 left_join(issuers) %>% left_join(ciks) %>% 
     distinct(cik, cusip, cusip6, issuer_name_1, company_name, issuer_adl_1, issuer_adl_2, issuer_type))
cik cusip cusip6 issuer_name_1 company_name issuer_adl_1 issuer_adl_2 issuer_type
1 2491 36465410 364654 GAMING & TECHNOLOGY INC ALLIANCE GAMING CORP NAME CHANGED TO UNITED GAMING INC 12/08/1988      SEE 910475 C
2 2491 36465410 364654 GAMING & TECHNOLOGY INC BALLY TECHNOLOGIES, INC. NAME CHANGED TO UNITED GAMING INC 12/08/1988      SEE 910475 C
3 8038 04635208 046352 ASTRALIS LTD ASTREX INC FORMERLY ASTRALIS PHARMACEUTICALS LTD 12/10/2001 C
4 16496 13152106 131521 CALVERT CASH RESVS CALPROP CORP NAME CHANGED TO MONEY MGMT PLUS 09/10/1984     SEE 609918 C
5 33073 29403700 294037 ENVIRODYNE INDS INC ENVIRODYNE INDUSTRIES INC FORMERLY ENVIRODYNE INC TO 03/03/1978 AND NAME CHANGED TO C
6 33073 29403700 294037 ENVIRODYNE INDS INC VISKASE COMPANIES INC FORMERLY ENVIRODYNE INC TO 03/03/1978 AND NAME CHANGED TO C
7 38264 34986210 349862 FORWARD INDS INC N Y FORWARD INDUSTRIES INC NA NA C
8 38264 34986210 349862 FORWARD INDS INC N Y FORWARD INDUSTRIES, INC. NA NA C
9 65270 59152010 591520 METHODE ELECTRS INC METHODE ELECTRONICS INC NA NA C
10 70415 63713010 637130 NATIONAL PATENT DEV CORP GP STRATEGIES CORP NAME CHANGED TO GP STRATEGIES CORP 03/09/1998 SEE 36225V C
11 70415 63713010 637130 NATIONAL PATENT DEV CORP NATIONAL PATENT DEVELOPMENT CORP NAME CHANGED TO GP STRATEGIES CORP 03/09/1998 SEE 36225V C
12 72843 66526286 665262 NORTHERN IND PUB SVC CO NORTHERN INDIANA PUBLIC SERVICE CO NA NA C
13 74783 68417600 684176 ORANGE-CO INC ACCEPTANCE INSURANCE COMPANIES INC NAME CHANGED TO STONERIDGE RES INC 04/22/1987      SEE 861839 C
14 89140 81773232 817732 SERVOTRONICS INC SERVOTRONICS INC /DE/ NA NA C
15 97483 71891320 718913 PHOENIX RESOURCE COS INC PHOENIX RESOURCE COMPANIES INC NA NA C
16 99982 74100410 741004 PRESIDENTIAL RLTY CORP NEW TRUST CO OF NEW JERSEY NA NA C
17 99982 74100420 741004 PRESIDENTIAL RLTY CORP NEW TRUST CO OF NEW JERSEY NA NA C
18 100441 90212010 902120 TYCO INTL LTD TYCO INTERNATIONAL LTD FORMERLY TYCO LABS INC TO 11/10/1993 & REORGANIZED AS C
19 105096 94356710 943567 WAVEPHORE INC WAXMAN INDUSTRIES INC NAME CHANGED TO WAVO CORP 06/23/1999          SEE 944027 C
20 203248 84402810 844028 SOUTHERN UN CO SOUTHERN UNION CO NA NA C
21 215403 15505106 155051 CENTRAL RR CO N J CENTRAL RESERVE LIFE CORP REORGANIZED AS CENTRAL JERSEY INDS INC 08/15/1979 C
22 215403 15505106 155051 CENTRAL RR CO N J CERES GROUP INC REORGANIZED AS CENTRAL JERSEY INDS INC 08/15/1979 C
23 704874 60520010 605200 MISSION WEST PPTYS MISSION WEST PROPERTIES/NEW/ REORGANIZED AS MISSION WEST PPTYS INC MD 12/30/1998 C
24 715355 40273410 402734 GULFSTREAM AEROSPACE CORP NEW GULFSTREAM AEROSPACE CORP NA NA C
25 719264 31946210 319462 FIRST CTZNS BANCSHARES INC FIRST CITIZENS BANCSHARES INC /TN/ NA NA C
26 721237 65488410 654884 NOBEL ED DYNAMICS INC NOBEL EDUCATION DYNAMICS INC NAME CHANGED TO NOBEL LEARNING CMNTYS INC 12/07/1998 C
27 721237 65488410 654884 NOBEL ED DYNAMICS INC NOBEL LEARNING COMMUNITIES INC NAME CHANGED TO NOBEL LEARNING CMNTYS INC 12/07/1998 C
28 723616 28511410 285114 ELECTRO KINETIC SYS INC CRYSTAL PROPERTIES HOLDINGS, INC. NAME CHANGED TO STERLING MEDIA CAP GROUP INC 07/07/2000 C
29 723616 28511410 285114 ELECTRO KINETIC SYS INC ELECTRO KINETIC SYSTEMS INC NAME CHANGED TO STERLING MEDIA CAP GROUP INC 07/07/2000 C
30 723616 28511410 285114 ELECTRO KINETIC SYS INC STERLING CAPITAL INVESTMENT GROUP INC NAME CHANGED TO STERLING MEDIA CAP GROUP INC 07/07/2000 C
31 723616 28511410 285114 ELECTRO KINETIC SYS INC STERLING MEDIA CAPITAL GROUP INC NAME CHANGED TO STERLING MEDIA CAP GROUP INC 07/07/2000 C
32 724024 28882108 288821 ELLIS GABLES APT BLDG CORP AMERICAN PHYSICIANS SERVICE GROUP INC NA NA C
33 733289 37935510 379355 GLOBAL NAT RES INC GLOBAL NATURAL RESOURCES INC /NJ/ NA NA C
34 743872 43761400 437614 HOMESTAKE MNG CO HOMESTAKE MINING CO /DE/ NAME CHANGED TO BARRICK GOLD CORP 12/04/2001 SEE 067901 C
35 750556 22463310 224633 CRAWFORD & CO SUNTRUST BANKS INC NA NA C
36 750556 22463320 224633 CRAWFORD & CO SUNTRUST BANKS INC NA NA C
37 750556 69149730 691497 OXFORD INDS INC SUNTRUST BANKS INC NA NA C
38 750556 78125810 781258 RUDDICK CORP SUNTRUST BANKS INC NAME CHANGED TO HARRIS TEETER SUPERMARKETS INC 04/02/2012 C
39 750556 80653010 806530 SCHERER HEALTHCARE INC SUNTRUST BANKS INC ACQUIRED BY STERICYCLE INC 01/09/2003 C
40 750556 89235610 892356 TRACTOR SUPPLY CO SUNTRUST BANKS INC NA NA C
41 757011 90329405 903294 U S GULF HLDGS INC USG CORP NAME CHANGED TO CANCOM TECHNOLOGIES INC 04/29/1997 C
42 771498 07159310 071593 BATTLE MTN GOLD CO BATTLE MOUNTAIN GOLD CO NA NA C
43 774487 03077110 030771 AMES W R CO AMERIWOOD INDUSTRIES INTERNATIONAL CORP ACQUIRED BY RUCKER CO NA C
44 781902 94091200 940912 WATERFORD CAP MGMT INC INTERNATIONAL BALER CORP REORGANIZED AS CPI PLASTICS GOUP LTD 09/21/1998 SEE 12615N C
45 781902 94091200 940912 WATERFORD CAP MGMT INC WASTE TECHNOLOGY CORP REORGANIZED AS CPI PLASTICS GOUP LTD 09/21/1998 SEE 12615N C
46 787849 98502114 985021 YARDVILLE NATL BANCORP YARDVILLE NATIONAL BANCORP NA NA C
47 789879 29356300 293563 ENS BIO LOGICALS INC ENSURGE INC NAME CHANGED TO ENSCOR INC 06/03/1987          SEE 29356L C
48 789879 29356300 293563 ENS BIO LOGICALS INC ISHOPPER COM INC NAME CHANGED TO ENSCOR INC 06/03/1987          SEE 29356L C
49 789879 29356300 293563 ENS BIO LOGICALS INC SUNWALKER DEVELOPMENT INC NAME CHANGED TO ENSCOR INC 06/03/1987          SEE 29356L C
50 791050 02882810 028828 AMERICAN PASSAGE MKTG CORP PHOTOWORKS INC /WA NAME CHANGED TO SEATTLE FILMWORKS INC 02/24/1989 C
51 791050 02882810 028828 AMERICAN PASSAGE MKTG CORP SEATTLE FILMWORKS INC NAME CHANGED TO SEATTLE FILMWORKS INC 02/24/1989 C
52 791164 42219B10 42219B HEALTH MGMT INC DEL HEALTH MANAGEMENT INC/DE NA NA C
53 791164 42219B10 42219B HEALTH MGMT INC DEL HOMECARE MANAGEMENT INC NA NA C
54 792161 57751400 577514 MAUNA LOA MACADAMIA PARTNERS MAUNA LOA MACADAMIA PARTNERS LP NAME CHANGED TO ML MACADAMIA ORCHARDS L P 09/16/1998 C
55 792161 57751400 577514 MAUNA LOA MACADAMIA PARTNERS ML MACADAMIA ORCHARDS L P NAME CHANGED TO ML MACADAMIA ORCHARDS L P 09/16/1998 C
56 792161 57751400 577514 MAUNA LOA MACADAMIA PARTNERS ROYAL HAWAIIAN ORCHARDS, L.P. NAME CHANGED TO ML MACADAMIA ORCHARDS L P 09/16/1998 C
57 804138 02711807 027118 AMERICAN LD LEASE INC AMERICAN LAND LEASE INC NA NA C
58 804138 02711807 027118 AMERICAN LD LEASE INC ASSET INVESTORS CORP NA NA C
59 811641 45252306 452523 IMMERSION HUMAN INTERFACE CORP IMMUCELL CORP /DE/ NA NA C
60 817135 08606103 086061 BERTUCCIS CORP ALLIANCE HEALTHCARE SERVICES, INC NA NA C
61 817135 08606103 086061 BERTUCCIS CORP ALLIANCE IMAGING INC /DE/ NA NA C
62 829499 46072310 460723 INTERSTATE BAKERIES CORP NEW INTERSTATE BAKERIES CORP/DE/ MERGED INTO INTERSTATE BRANDS CORP 06/02/1990 C
63 831532 45070710 450707 IVI PUBG INC ATRIX INTERNATIONAL INC REORGANIZED AS ONHEALTH NETWORK CO 06/16/1998 C
64 833203 24231700 242317 DEALS ARE GOOD INC LINKON CORP INC NAME CHANGED TO LINKON CORP 10/06/1990          SEE 535902 C
65 833203 24231700 242317 DEALS ARE GOOD INC LINKON CORP NAME CHANGED TO LINKON CORP 10/06/1990          SEE 535902 C
66 833203 24231700 242317 DEALS ARE GOOD INC PACKETPORT COM NAME CHANGED TO LINKON CORP 10/06/1990          SEE 535902 C
67 833203 24231700 242317 DEALS ARE GOOD INC POCKETPORT COM NAME CHANGED TO LINKON CORP 10/06/1990          SEE 535902 C
68 833203 24231700 242317 DEALS ARE GOOD INC WYNDSTORM CORP NAME CHANGED TO LINKON CORP 10/06/1990          SEE 535902 C
69 840260 53803230 538032 LIVE ENTMT INC LIVE ENTERTAINMENT INC NAME CHANGED TO CANCER TREATMENT CTRS INC 07/30/2002 C
70 841528 75952U10 75952U RELIASTAR FINL CORP NWNL COMPANIES INC MERGED INTO ING GROEP N.V. 09/01/2000 C
71 841528 75952U10 75952U RELIASTAR FINL CORP RELIASTAR FINANCIAL CORP MERGED INTO ING GROEP N.V. 09/01/2000 C
72 842180 05959110 059591 BANCO FRANCES DEL RIO DE LA BANCO BILBAO VIZCAYA ARGENTARIA S A NAME CHANGED TO BBVA BANCO FRANCES S A 10/04/2000 C
73 842180 05959110 059591 BANCO FRANCES DEL RIO DE LA BANCO BILBAO VIZCAYA ARGENTARIA, S.A. NAME CHANGED TO BBVA BANCO FRANCES S A 10/04/2000 C
74 845609 38070110 380701 GOLD RIV HOTEL & CASINO CORP GOLD RIVER HOTEL & CASINO CORP NA NA C
75 846615 46624B30 46624B JPY HLDGS LTD JPS INDUSTRIES INC NA NA C
76 846615 46624B30 46624B JPY HLDGS LTD JPS TEXTILE GROUP INC /DE/ NA NA C
77 849502 75190710 751907 RAMTRON INTL CORP RAMTRON INTERNATIONAL CORP NA NA C
78 849547 91826107 918261 VLPS LTG SVCS INTL INC BLACK BOX CORP NA NA C
79 849547 91826107 918261 VLPS LTG SVCS INTL INC MB HOLDINGS INC NA NA C
80 850693 18490102 184901 CLEARLY CDN BEVERAGE CORP ALLERGAN INC NA NA C
81 854873 00760G10 00760G AEGIS CONSUMER FDG GROUP INC RALLYS HAMBURGERS INC NA NA C
82 858339 74342A10 74342A PROMUS COS INC CAESARS ENTERTAINMENT CORP NAME CHANGED TO HARRAHS ENTMT INC 06/30/1995      SEE 413619 C
83 858339 74342A10 74342A PROMUS COS INC HARRAHS ENTERTAINMENT INC NAME CHANGED TO HARRAHS ENTMT INC 06/30/1995      SEE 413619 C
84 858339 74342A10 74342A PROMUS COS INC PROMUS COMPANIES INC NAME CHANGED TO HARRAHS ENTMT INC 06/30/1995      SEE 413619 C
85 869709 26831802 268318 EDG CAP INC EDG CAPITAL INC NAME CHANGED TO ISOTOPE SOLUTIONS GROUP INC 11/14/2001 C
86 869709 26831802 268318 EDG CAP INC ISOTOPE SOLUTIONS GROUP INC NAME CHANGED TO ISOTOPE SOLUTIONS GROUP INC 11/14/2001 C
87 872248 68393020 683930 OPTIMA PETE CORP OPTIMA PETROLEUM CORP REORGANIZED AS PETROQUEST ENERGY INC 09/04/1998 C
88 872248 68393020 683930 OPTIMA PETE CORP PETROQUEST ENERGY INC REORGANIZED AS PETROQUEST ENERGY INC 09/04/1998 C
89 874737 59512710 595127 MICRONICS COMPUTERS INC MICRONICS COMPUTERS INC /CA NA NA C
90 875459 92656M10 92656M VIDEO LOTTERY TECHNOLOGIES INC POWERHOUSE TECHNOLOGIES INC /DE NAME CHANGED TO POWERHOUSE TECHNOLOGIES INC C
91 875459 92656M10 92656M VIDEO LOTTERY TECHNOLOGIES INC VIDEO LOTTERY TECHNOLOGIES INC/DE NAME CHANGED TO POWERHOUSE TECHNOLOGIES INC C
92 879585 04907910 049079 ATLANTIC TELE-NETWORK INC ATLANTIC TELE NETWORK INC /DE NA NA C
93 879585 04907910 049079 ATLANTIC TELE-NETWORK INC ATN INTERNATIONAL, INC. NA NA C
94 882159 23922010 239220 DAW TECHNOLOGIES INC DAW TECHNOLOGIES INC /UT NA NA C
95 882235 48625010 486250 KAUFMANN FD INC LINCARE HOLDINGS INC NA NA C
96 882291 00808Y09 00808Y AETHLON MED INC AETHLON MEDICAL INC NA NA C
97 882291 00808Y09 00808Y AETHLON MED INC BISHOP EQUITIES INC NA NA C
98 884802 91382810 913828 UNIVERSAL SEISMIC ASSOC INC UNIVERSAL SEISMIC ASSOCIATES INC NAME CHANGED TO POCKETOP CORP 11/14/2006          SEE 73035T C
99 885942 84476910 844769 SOUTHWEST BANCSHARES INC DEL SOUTHWEST BANCSHARES INC /NEW/ NA NA C
100 887353 81804310 818043 SEVEN-UP / RC BOTTLING CO SEVEN UP RC BOTTLING COMPANY OF SOUTHERN CALIFORNIA INC NA NA C
101 888999 59114210 591142 METALCLAD CORP METRA BIOSYSTEMS INC NAME CHANGED TO XNTRX CORP 07/03/2002          SEE 293850 C
102 893813 70757610 707576 PENN PAC CORP PENN OCTANE CORP NA NA C
103 895422 74037Q10 74037Q PREFERRED INCOME MGMT FD INC BOULDER TOTAL RETURN FUND INC NAME CHANGED TO BOULDER TOTAL RETURN FD INC 08/27/1999 C
104 895422 74037Q10 74037Q PREFERRED INCOME MGMT FD INC PREFERRED INCOME MANAGEMENT FUND INC NAME CHANGED TO BOULDER TOTAL RETURN FD INC 08/27/1999 C
105 897448 02311107 023111 AMARIN CORP PLC AMARIN CORP PLC\UK NA NA C
106 897448 02311107 023111 AMARIN CORP PLC AMARIN PHARMACEUTICALS PLC NA NA C
107 897448 02311107 023111 AMARIN CORP PLC ETHICAL HOLDINGS PLC NA NA C
108 904081 63577110 635771 NATIONAL ED CORP STECK VAUGHN PUBLISHING CORP NA NA C
109 904900 08372L10 08372L BERG ELECTRS CORP BERG ELECTRONICS CORP /DE/ NA NA C
110 910468 48625010 486250 KAUFMANN FD INC FLOORING AMERICA INC NA NA C
111 910468 48625010 486250 KAUFMANN FD INC MAXIM GROUP INC / NA NA C
112 915197 43153101 431531 HILLCREST RES LTD HILITE INDUSTRIES INC NA NA C
113 919239 38095610 380956 GOLDCORP INC NEW GOLDCORP INC NA NA C
114 924835 88871110 888711 TIVOLI INDS INC TIVOLI INDUSTRIES INC NA NA C
115 928421 57723100 577231 MATZEL & MUMFORD MTG FDG INC MATTSON TECHNOLOGY INC NA NA C
116 943324 20741M03 20741M CONMED HEALTHCARE MGMT INC CONMED HEALTHCARE MANAGEMENT, INC. NA NA C
117 943324 20741M03 20741M CONMED HEALTHCARE MGMT INC PACE HEALTH MANAGEMENT SYSTEMS INC NA NA C
118 946842 48625010 486250 KAUFMANN FD INC AMBASSADORS INTERNATIONAL INC NA NA C
119 1005757 12634109 126341 CS FIRST BOSTON INVT FDS INC CSG SYSTEMS INTERNATIONAL INC NAME CHANGED TO BEA INVT FDS INC 06/13/1995      SEE 05538D C
120 1006835 82617012 826170 SIEBEL SYS INC SIEBEL SYSTEMS INC REORGANIZED AS SIEBEL JANNA ARRANGEMENT INC 11/14/2000 C
121 1013050 74257410 742574 PRINTRAK INTL INC PRINTRAK INTERNATIONAL INC NA NA C
122 1014383 68340210 683402 ONYX SOFTWARE CORP ONYX SOFTWARE CORP/WA NA NA C
123 1017172 53226107 532261 LIGHTSHIP TANKERS IV LLC AUTHORIZE.NET HOLDINGS, INC. NA NA C
124 1017172 53226107 532261 LIGHTSHIP TANKERS IV LLC LIGHTBRIDGE INC NA NA C
125 1032462 90297110 902971 U S RESTAURANT PPTYS INC FF-TSY HOLDING CO II, LLC REORGANIZED AS TRUSTSTREET PPTYS INC 02/24/2005 C
126 1032462 90297110 902971 U S RESTAURANT PPTYS INC TRUSTREET PROPERTIES INC REORGANIZED AS TRUSTSTREET PPTYS INC 02/24/2005 C
127 1032462 90297110 902971 U S RESTAURANT PPTYS INC U S RESTAURANT PROPERTIES INC REORGANIZED AS TRUSTSTREET PPTYS INC 02/24/2005 C
128 1035991 36188107 361881 GMD BONDHOLDER TR FIRST ROBINSON FINANCIAL CORP NA NA C
129 1039684 68267810 682678 ONEOK INC ONEOK INC /NEW/ REORGANIZED AS ONEOK INC NEW 11/28/1997 C
130 1039684 68267810 682678 ONEOK INC WAI INC REORGANIZED AS ONEOK INC NEW 11/28/1997 C
131 1041858 74936101 749361 RCN CORP RCN CORP /DE/ NA NA C
132 1042618 65365610 653656 NICE SYS LTD NICE SYSTEMS LTD  /ADR/ NA NA C
133 1043044 50194100 501941 LDF INC LNR PROPERTY CORP NAME CHANGED TO FIRST CHICAGO BANCORP 11/22/2006  SEE 31948J C
134 1044979 45004F10 45004F IRI INTL CORP IRI INTERNATIONAL CORP NA NA C
135 1050122 68197104 681971 OMNI ADVANTAGE INC 1 800 CONTACTS INC NAME CHANGED TO GO CALL INC 03/16/1998          SEE 380163 C
136 1061763 44934106 449341 IC WKS INC DEL HUNGRY MINDS INC /DE/ NA NA C
137 1061763 44934106 449341 IC WKS INC DEL IDG BOOKS WORLDWIDE INC NA NA C
138 1064133 98144Q10 98144Q WORLD COMM ONLINE INC DEL WORLD COMMERCE ONLINE INC NA NA C
139 1065059 9033E108 9033E1 U S BK NATL ASSN MINNEAPOLIS CENTRUS ENERGY CORP NA NA C
140 1065059 9033E108 9033E1 U S BK NATL ASSN MINNEAPOLIS USEC INC NA NA C
141 1069258 33385515 333855 FIRST NATL BK SOUTHN MD KRATOS DEFENSE & SECURITY SOLUTIONS, INC. ACQUIRED BY FIRST MD BANCORP 08/15/1983 C
142 1069258 33385515 333855 FIRST NATL BK SOUTHN MD WIRELESS FACILITIES INC ACQUIRED BY FIRST MD BANCORP 08/15/1983 C
143 1075046 89732918 897329 TROY FINL CORP TROY FINANCIAL CORP NA NA C
144 1093367 02932810 029328 AMERICAN RIVS OIL CO AMERICAN RIVERS OIL CO /DE/ REORGANIZED AS AROC INC 12/08/1999 C
145 1093367 02932810 029328 AMERICAN RIVS OIL CO AROC INC REORGANIZED AS AROC INC 12/08/1999 C
146 1113227 90337P10 90337P US WATS INC CAPSULE COMMUNICATIONS INC DE REORGANIZED AS CAPSULE COMMUNICATIONS INC 04/27/2000 C
147 1113227 90337P10 90337P US WATS INC CAPSULE COMMUNICATIONS INC REORGANIZED AS CAPSULE COMMUNICATIONS INC 04/27/2000 C
148 1120193 63110318 631103 NASDAQ OMX GROUP INC NASDAQ OMX GROUP, INC. FORMERLY NASDAQ STK MKT INC TO 02/27/2008 C
149 1120193 63110318 631103 NASDAQ OMX GROUP INC NASDAQ STOCK MARKET INC FORMERLY NASDAQ STK MKT INC TO 02/27/2008 C
150 1120193 63110318 631103 NASDAQ OMX GROUP INC NASDAQ, INC. FORMERLY NASDAQ STK MKT INC TO 02/27/2008 C
151 1125259 14365830 143658 CARNIVAL CORP CARNIVAL PLC FORMERLY CARNIVAL CRUISE LINES INC TO 04/18/1994 C
152 1125259 14365830 143658 CARNIVAL CORP P&O PRINCESS CRUISES PLC FORMERLY CARNIVAL CRUISE LINES INC TO 04/18/1994 C
153 1165002 96175104 961751 WESTWOOD FD INC WESTWOOD HOLDINGS GROUP INC NAME CHANGED TO EAST WEST GROWTH STK FD INC 12/27/1973 C
154 1175872 89462609 894626 TREASURE IS RTY TR TREASURE ISLAND ROYALTY TRUST NA NA C
155 1192549 38056012 380560 GOLD CITY INDS LTD GOLD CITY INDUSTRIES LTD NA NA C
156 1288359 68882302 688823 OTELCO INC OTELCO INC. NA NA C
157 1288359 68882302 688823 OTELCO INC RURAL LEC ACQUISITION LLC NA NA C
158 1310094 00430L10 00430L ACCENTIA BIOPHARMACEUTICALS ACCENTIA BIOPHARMACEUTICALS INC NA NA C
159 1314755 37250B14 37250B GENTIUM S P A GENTIUM S.P.A. NA NA C
160 1372414 00786106 007861 AEROPLAN INCOME FD AEROHIVE NETWORKS INC NA NA C
161 1372414 00786106 007861 AEROPLAN INCOME FD AEROHIVE NETWORKS, INC NA NA C
162 1382911 68616404 686164 OREXIGEN THERAPEUTICS INC OREXIGEN THERAPEUTICS, INC. NA NA C
163 1389168 87612G10 87612G TARGA RES CORP TARGA RESOURCES, INC. NA NA C
164 1389168 87612G10 87612G TARGA RES CORP TRI RESOURCES INC. NA NA C
165 1422109 09548108 095481 BLUE FISH ENTMT INC BLUE EARTH, INC. NAME CHANGED TO PACIFIC GOLD CORP 09/08/2003     SEE 69433M C
166 1422109 09548108 095481 BLUE FISH ENTMT INC CHERRY TANKERS INC. NAME CHANGED TO PACIFIC GOLD CORP 09/08/2003     SEE 69433M C
167 1422109 09548108 095481 BLUE FISH ENTMT INC CHERRY TANKERS INC NAME CHANGED TO PACIFIC GOLD CORP 09/08/2003     SEE 69433M C
168 1422109 09548108 095481 BLUE FISH ENTMT INC GENESIS FLUID SOLUTIONS HOLDINGS, INC. NAME CHANGED TO PACIFIC GOLD CORP 09/08/2003     SEE 69433M C
169 1504937 92025W10 92025W VALOR GOLD CORP FELAFEL CORP NA NA C
170 1504937 92025W10 92025W VALOR GOLD CORP VALOR GOLD CORP. NA NA C
171 1504937 92025W10 92025W VALOR GOLD CORP VAPORIN, INC. NA NA C
bdcallen commented 4 years ago

@iangow

So I've organized some of the rows into soft matches, and rows where company_name and issuer_name_1 clearly do not match.

> View(match_issuers %>% filter(cik %in% soft_matches_i) %>% 
                   distinct(cik, cusip) %>% 
                   mutate(cusip6 = substr(cusip, 1, 6)) %>% 
                   left_join(issuers) %>% left_join(ciks) %>% 
                   distinct(cik, cusip, cusip6, issuer_name_1, company_name, issuer_adl_1, issuer_adl_2, issuer_type))
cik cusip cusip6 issuer_name_1 company_name issuer_adl_1 issuer_adl_2 issuer_type
1 215403 15505106 155051 CENTRAL RR CO N J CENTRAL RESERVE LIFE CORP REORGANIZED AS CENTRAL JERSEY INDS INC 08/15/1979 C
2 215403 15505106 155051 CENTRAL RR CO N J CERES GROUP INC REORGANIZED AS CENTRAL JERSEY INDS INC 08/15/1979 C
3 757011 90329405 903294 U S GULF HLDGS INC USG CORP NAME CHANGED TO CANCOM TECHNOLOGIES INC 04/29/1997 C
4 804138 02711807 027118 AMERICAN LD LEASE INC AMERICAN LAND LEASE INC NA NA C
5 804138 02711807 027118 AMERICAN LD LEASE INC ASSET INVESTORS CORP NA NA C
6 842180 05959110 059591 BANCO FRANCES DEL RIO DE LA BANCO BILBAO VIZCAYA ARGENTARIA S A NAME CHANGED TO BBVA BANCO FRANCES S A 10/04/2000 C
7 842180 05959110 059591 BANCO FRANCES DEL RIO DE LA BANCO BILBAO VIZCAYA ARGENTARIA, S.A. NAME CHANGED TO BBVA BANCO FRANCES S A 10/04/2000 C
8 893813 70757610 707576 PENN PAC CORP PENN OCTANE CORP NA NA C
9 1113227 90337P10 90337P US WATS INC CAPSULE COMMUNICATIONS INC DE REORGANIZED AS CAPSULE COMMUNICATIONS INC 04/27/2000 C
10 1113227 90337P10 90337P US WATS INC CAPSULE COMMUNICATIONS INC REORGANIZED AS CAPSULE COMMUNICATIONS INC 04/27/2000 C
11 1125259 14365830 143658 CARNIVAL CORP CARNIVAL PLC FORMERLY CARNIVAL CRUISE LINES INC TO 04/18/1994 C
12 1125259 14365830 143658 CARNIVAL CORP P&O PRINCESS CRUISES PLC FORMERLY CARNIVAL CRUISE LINES INC TO 04/18/1994 C
13 1165002 96175104 961751 WESTWOOD FD INC WESTWOOD HOLDINGS GROUP INC NAME CHANGED TO EAST WEST GROWTH STK FD INC 12/27/1973 C
14 1175872 89462609 894626 TREASURE IS RTY TR TREASURE ISLAND ROYALTY TRUST NA NA C
15 1389168 87612G10 87612G TARGA RES CORP TARGA RESOURCES, INC. NA NA C
16 1389168 87612G10 87612G TARGA RES CORP TRI RESOURCES INC. NA NA C

Of these, I think AMERICAN LAND LEASE INC (CIK 804138), TREASURE ISLAND ROYALTY TRUST (CIK 1175872) are matches given the extensive acronymization in the issuer names, and CAPSULE COMMUNICATIONS INC DE (CIK 1113227) I would say is a match due to the issuer_adl columns, which seems to be columns which state some of the security's history. Also, looking here I would say BANCO BILBAO VIZCAYA ARGENTARIA S A (CIK 842180) is a match. So that's four additional matches. Below are the cases which, going be names and or history, do not seem to match at all.

> View(match_issuers %>% filter(cik %in% not_hard_matches_i & !(cik %in% soft_matches_i)) %>% 
                   distinct(cik, cusip) %>% 
                   mutate(cusip6 = substr(cusip, 1, 6)) %>% 
                   left_join(issuers) %>% left_join(ciks) %>% 
                   distinct(cik, cusip, cusip6, issuer_name_1, company_name, issuer_adl_1, issuer_adl_2, issuer_type))
cik cusip cusip6 issuer_name_1 company_name issuer_adl_1 issuer_adl_2 issuer_type
1 2491 36465410 364654 GAMING & TECHNOLOGY INC ALLIANCE GAMING CORP NAME CHANGED TO UNITED GAMING INC 12/08/1988      SEE 910475 C
2 2491 36465410 364654 GAMING & TECHNOLOGY INC BALLY TECHNOLOGIES, INC. NAME CHANGED TO UNITED GAMING INC 12/08/1988      SEE 910475 C
3 16496 13152106 131521 CALVERT CASH RESVS CALPROP CORP NAME CHANGED TO MONEY MGMT PLUS 09/10/1984     SEE 609918 C
4 74783 68417600 684176 ORANGE-CO INC ACCEPTANCE INSURANCE COMPANIES INC NAME CHANGED TO STONERIDGE RES INC 04/22/1987      SEE 861839 C
5 99982 74100410 741004 PRESIDENTIAL RLTY CORP NEW TRUST CO OF NEW JERSEY NA NA C
6 99982 74100420 741004 PRESIDENTIAL RLTY CORP NEW TRUST CO OF NEW JERSEY NA NA C
7 105096 94356710 943567 WAVEPHORE INC WAXMAN INDUSTRIES INC NAME CHANGED TO WAVO CORP 06/23/1999          SEE 944027 C
8 724024 28882108 288821 ELLIS GABLES APT BLDG CORP AMERICAN PHYSICIANS SERVICE GROUP INC NA NA C
9 750556 22463310 224633 CRAWFORD & CO SUNTRUST BANKS INC NA NA C
10 750556 22463320 224633 CRAWFORD & CO SUNTRUST BANKS INC NA NA C
11 750556 69149730 691497 OXFORD INDS INC SUNTRUST BANKS INC NA NA C
12 750556 78125810 781258 RUDDICK CORP SUNTRUST BANKS INC NAME CHANGED TO HARRIS TEETER SUPERMARKETS INC 04/02/2012 C
13 750556 80653010 806530 SCHERER HEALTHCARE INC SUNTRUST BANKS INC ACQUIRED BY STERICYCLE INC 01/09/2003 C
14 750556 89235610 892356 TRACTOR SUPPLY CO SUNTRUST BANKS INC NA NA C
15 774487 03077110 030771 AMES W R CO AMERIWOOD INDUSTRIES INTERNATIONAL CORP ACQUIRED BY RUCKER CO NA C
16 781902 94091200 940912 WATERFORD CAP MGMT INC INTERNATIONAL BALER CORP REORGANIZED AS CPI PLASTICS GOUP LTD 09/21/1998 SEE 12615N C
17 781902 94091200 940912 WATERFORD CAP MGMT INC WASTE TECHNOLOGY CORP REORGANIZED AS CPI PLASTICS GOUP LTD 09/21/1998 SEE 12615N C
18 789879 29356300 293563 ENS BIO LOGICALS INC ENSURGE INC NAME CHANGED TO ENSCOR INC 06/03/1987          SEE 29356L C
19 789879 29356300 293563 ENS BIO LOGICALS INC ISHOPPER COM INC NAME CHANGED TO ENSCOR INC 06/03/1987          SEE 29356L C
20 789879 29356300 293563 ENS BIO LOGICALS INC SUNWALKER DEVELOPMENT INC NAME CHANGED TO ENSCOR INC 06/03/1987          SEE 29356L C
21 791050 02882810 028828 AMERICAN PASSAGE MKTG CORP PHOTOWORKS INC /WA NAME CHANGED TO SEATTLE FILMWORKS INC 02/24/1989 C
22 791050 02882810 028828 AMERICAN PASSAGE MKTG CORP SEATTLE FILMWORKS INC NAME CHANGED TO SEATTLE FILMWORKS INC 02/24/1989 C
23 811641 45252306 452523 IMMERSION HUMAN INTERFACE CORP IMMUCELL CORP /DE/ NA NA C
24 817135 08606103 086061 BERTUCCIS CORP ALLIANCE HEALTHCARE SERVICES, INC NA NA C
25 817135 08606103 086061 BERTUCCIS CORP ALLIANCE IMAGING INC /DE/ NA NA C
26 831532 45070710 450707 IVI PUBG INC ATRIX INTERNATIONAL INC REORGANIZED AS ONHEALTH NETWORK CO 06/16/1998 C
27 833203 24231700 242317 DEALS ARE GOOD INC LINKON CORP INC NAME CHANGED TO LINKON CORP 10/06/1990          SEE 535902 C
28 833203 24231700 242317 DEALS ARE GOOD INC LINKON CORP NAME CHANGED TO LINKON CORP 10/06/1990          SEE 535902 C
29 833203 24231700 242317 DEALS ARE GOOD INC PACKETPORT COM NAME CHANGED TO LINKON CORP 10/06/1990          SEE 535902 C
30 833203 24231700 242317 DEALS ARE GOOD INC POCKETPORT COM NAME CHANGED TO LINKON CORP 10/06/1990          SEE 535902 C
31 833203 24231700 242317 DEALS ARE GOOD INC WYNDSTORM CORP NAME CHANGED TO LINKON CORP 10/06/1990          SEE 535902 C
32 846615 46624B30 46624B JPY HLDGS LTD JPS INDUSTRIES INC NA NA C
33 846615 46624B30 46624B JPY HLDGS LTD JPS TEXTILE GROUP INC /DE/ NA NA C
34 849547 91826107 918261 VLPS LTG SVCS INTL INC BLACK BOX CORP NA NA C
35 849547 91826107 918261 VLPS LTG SVCS INTL INC MB HOLDINGS INC NA NA C
36 850693 18490102 184901 CLEARLY CDN BEVERAGE CORP ALLERGAN INC NA NA C
37 854873 00760G10 00760G AEGIS CONSUMER FDG GROUP INC RALLYS HAMBURGERS INC NA NA C
38 882235 48625010 486250 KAUFMANN FD INC LINCARE HOLDINGS INC NA NA C
39 888999 59114210 591142 METALCLAD CORP METRA BIOSYSTEMS INC NAME CHANGED TO XNTRX CORP 07/03/2002          SEE 293850 C
40 904081 63577110 635771 NATIONAL ED CORP STECK VAUGHN PUBLISHING CORP NA NA C
41 910468 48625010 486250 KAUFMANN FD INC FLOORING AMERICA INC NA NA C
42 910468 48625010 486250 KAUFMANN FD INC MAXIM GROUP INC / NA NA C
43 915197 43153101 431531 HILLCREST RES LTD HILITE INDUSTRIES INC NA NA C
44 928421 57723100 577231 MATZEL & MUMFORD MTG FDG INC MATTSON TECHNOLOGY INC NA NA C
45 1005757 12634109 126341 CS FIRST BOSTON INVT FDS INC CSG SYSTEMS INTERNATIONAL INC NAME CHANGED TO BEA INVT FDS INC 06/13/1995      SEE 05538D C
46 1017172 53226107 532261 LIGHTSHIP TANKERS IV LLC AUTHORIZE.NET HOLDINGS, INC. NA NA C
47 1017172 53226107 532261 LIGHTSHIP TANKERS IV LLC LIGHTBRIDGE INC NA NA C
48 1035991 36188107 361881 GMD BONDHOLDER TR FIRST ROBINSON FINANCIAL CORP NA NA C
49 1043044 50194100 501941 LDF INC LNR PROPERTY CORP NAME CHANGED TO FIRST CHICAGO BANCORP 11/22/2006  SEE 31948J C
50 1050122 68197104 681971 OMNI ADVANTAGE INC 1 800 CONTACTS INC NAME CHANGED TO GO CALL INC 03/16/1998          SEE 380163 C
51 1061763 44934106 449341 IC WKS INC DEL HUNGRY MINDS INC /DE/ NA NA C
52 1061763 44934106 449341 IC WKS INC DEL IDG BOOKS WORLDWIDE INC NA NA C
53 1065059 9033E108 9033E1 U S BK NATL ASSN MINNEAPOLIS CENTRUS ENERGY CORP NA NA C
54 1065059 9033E108 9033E1 U S BK NATL ASSN MINNEAPOLIS USEC INC NA NA C
55 1069258 33385515 333855 FIRST NATL BK SOUTHN MD KRATOS DEFENSE & SECURITY SOLUTIONS, INC. ACQUIRED BY FIRST MD BANCORP 08/15/1983 C
56 1069258 33385515 333855 FIRST NATL BK SOUTHN MD WIRELESS FACILITIES INC ACQUIRED BY FIRST MD BANCORP 08/15/1983 C
57 1372414 00786106 007861 AEROPLAN INCOME FD AEROHIVE NETWORKS INC NA NA C
58 1372414 00786106 007861 AEROPLAN INCOME FD AEROHIVE NETWORKS, INC NA NA C
59 1422109 09548108 095481 BLUE FISH ENTMT INC BLUE EARTH, INC. NAME CHANGED TO PACIFIC GOLD CORP 09/08/2003     SEE 69433M C
60 1422109 09548108 095481 BLUE FISH ENTMT INC CHERRY TANKERS INC. NAME CHANGED TO PACIFIC GOLD CORP 09/08/2003     SEE 69433M C
61 1422109 09548108 095481 BLUE FISH ENTMT INC CHERRY TANKERS INC NAME CHANGED TO PACIFIC GOLD CORP 09/08/2003     SEE 69433M C
62 1422109 09548108 095481 BLUE FISH ENTMT INC GENESIS FLUID SOLUTIONS HOLDINGS, INC. NAME CHANGED TO PACIFIC GOLD CORP 09/08/2003     SEE 69433M C
bdcallen commented 4 years ago

And these are the remaining rows, which I deemed to be matches on first sight.

> View(match_issuers %>% filter(!cik %in% not_hard_matches_i) %>% 
                   distinct(cik, cusip) %>% 
                   mutate(cusip6 = substr(cusip, 1, 6)) %>% 
                   left_join(issuers) %>% left_join(ciks) %>% 
                   distinct(cik, cusip, cusip6, issuer_name_1, company_name, issuer_adl_1, issuer_adl_2, issuer_type))
cik cusip cusip6 issuer_name_1 company_name issuer_adl_1 issuer_adl_2 issuer_type
1 8038 04635208 046352 ASTRALIS LTD ASTREX INC FORMERLY ASTRALIS PHARMACEUTICALS LTD 12/10/2001 C
2 26537 23371210 233712 DAEDALUS ENTERPRISES INC ARGON ST, INC. NAME CHANGED TO SENSYS TECHNOLOGIES INC 06/09/1998 C
3 26537 23371210 233712 DAEDALUS ENTERPRISES INC DAEDALUS ENTERPRISES INC NAME CHANGED TO SENSYS TECHNOLOGIES INC 06/09/1998 C
4 26537 23371210 233712 DAEDALUS ENTERPRISES INC SENSYS TECHNOLOGIES INC NAME CHANGED TO SENSYS TECHNOLOGIES INC 06/09/1998 C
5 26537 23371210 233712 DAEDALUS ENTERPRISES INC SENSYTECH INC NAME CHANGED TO SENSYS TECHNOLOGIES INC 06/09/1998 C
6 33073 29403700 294037 ENVIRODYNE INDS INC ENVIRODYNE INDUSTRIES INC FORMERLY ENVIRODYNE INC TO 03/03/1978 AND NAME CHANGED TO C
7 33073 29403700 294037 ENVIRODYNE INDS INC VISKASE COMPANIES INC FORMERLY ENVIRODYNE INC TO 03/03/1978 AND NAME CHANGED TO C
8 38264 34986210 349862 FORWARD INDS INC N Y FORWARD INDUSTRIES INC NA NA C
9 38264 34986210 349862 FORWARD INDS INC N Y FORWARD INDUSTRIES, INC. NA NA C
10 45791 70355510 703555 PAUL HARRIS STORES INC HARRIS PAUL STORES INC NA NA C
11 45791 70355510 703555 PAUL HARRIS STORES INC PAUL HARRIS STORES INC NA NA C
12 57139 5024709C 502470 LA BARGE INC LABARGE INC NA NA C
13 59963 25474510 254745 DISTINCTIVE DEVICES INC DISTINCTIVE DEVICES INC NA NA C
14 65270 59152010 591520 METHODE ELECTRS INC METHODE ELECTRONICS INC NA NA C
15 69067 62632010 626320 MUNSINGWEAR INC MUNSINGWEAR INC NAME CHANGED TO PREMIUMWEAR INC 09/06/1996      SEE 740909 C
16 69067 62632010 626320 MUNSINGWEAR INC PREMIUMWEAR INC NAME CHANGED TO PREMIUMWEAR INC 09/06/1996      SEE 740909 C
17 70415 63713010 637130 NATIONAL PATENT DEV CORP GP STRATEGIES CORP NAME CHANGED TO GP STRATEGIES CORP 03/09/1998 SEE 36225V C
18 70415 63713010 637130 NATIONAL PATENT DEV CORP NATIONAL PATENT DEVELOPMENT CORP NAME CHANGED TO GP STRATEGIES CORP 03/09/1998 SEE 36225V C
19 72843 66526286 665262 NORTHERN IND PUB SVC CO NORTHERN INDIANA PUBLIC SERVICE CO NA NA C
20 74154 68267810 682678 ONEOK INC ONEOK INC REORGANIZED AS ONEOK INC NEW 11/28/1997 C
21 89140 81773232 817732 SERVOTRONICS INC SERVOTRONICS INC /DE/ NA NA C
22 97483 71891320 718913 PHOENIX RESOURCE COS INC PHOENIX RESOURCE COMPANIES INC NA NA C
23 100441 90212010 902120 TYCO INTL LTD TYCO INTERNATIONAL LTD FORMERLY TYCO LABS INC TO 11/10/1993 & REORGANIZED AS C
24 203248 84402810 844028 SOUTHERN UN CO SOUTHERN UNION CO NA NA C
25 216039 40009510 400095 GRUBB & ELLIS CO GRUBB & ELLIS CO NA NA C
26 319156 77742900 777429 ROSELAND OIL & GAS INC CUBIC ENERGY INC REORGANIZED AS CUBIC ENERGY INC  10/04/1999 SEE 229675 C
27 319156 77742900 777429 ROSELAND OIL & GAS INC CUBIC ENERGY LLC REORGANIZED AS CUBIC ENERGY INC  10/04/1999 SEE 229675 C
28 319156 77742900 777429 ROSELAND OIL & GAS INC HILLTOP ENERGY, LLC. REORGANIZED AS CUBIC ENERGY INC  10/04/1999 SEE 229675 C
29 319156 77742900 777429 ROSELAND OIL & GAS INC ROSELAND OIL & GAS INC REORGANIZED AS CUBIC ENERGY INC  10/04/1999 SEE 229675 C
30 704874 60520010 605200 MISSION WEST PPTYS MISSION WEST PROPERTIES/NEW/ REORGANIZED AS MISSION WEST PPTYS INC MD 12/30/1998 C
31 715355 40273410 402734 GULFSTREAM AEROSPACE CORP NEW GULFSTREAM AEROSPACE CORP NA NA C
32 719264 31946210 319462 FIRST CTZNS BANCSHARES INC FIRST CITIZENS BANCSHARES INC /TN/ NA NA C
33 720851 64107410 641074 NESTOR INC NESTOR INC NA NA C
34 721237 65488410 654884 NOBEL ED DYNAMICS INC NOBEL EDUCATION DYNAMICS INC NAME CHANGED TO NOBEL LEARNING CMNTYS INC 12/07/1998 C
35 721237 65488410 654884 NOBEL ED DYNAMICS INC NOBEL LEARNING COMMUNITIES INC NAME CHANGED TO NOBEL LEARNING CMNTYS INC 12/07/1998 C
36 723616 28511410 285114 ELECTRO KINETIC SYS INC CRYSTAL PROPERTIES HOLDINGS, INC. NAME CHANGED TO STERLING MEDIA CAP GROUP INC 07/07/2000 C
37 723616 28511410 285114 ELECTRO KINETIC SYS INC ELECTRO KINETIC SYSTEMS INC NAME CHANGED TO STERLING MEDIA CAP GROUP INC 07/07/2000 C
38 723616 28511410 285114 ELECTRO KINETIC SYS INC STERLING CAPITAL INVESTMENT GROUP INC NAME CHANGED TO STERLING MEDIA CAP GROUP INC 07/07/2000 C
39 723616 28511410 285114 ELECTRO KINETIC SYS INC STERLING MEDIA CAPITAL GROUP INC NAME CHANGED TO STERLING MEDIA CAP GROUP INC 07/07/2000 C
40 729213 74791410 747914 Q MED INC Q MED INC NA NA C
41 729213 74791410 747914 Q MED INC QMED INC NA NA C
42 733289 37935510 379355 GLOBAL NAT RES INC GLOBAL NATURAL RESOURCES INC /NJ/ NA NA C
43 737033 74835710 748357 QUESTECH INC QUESTECH INC NA NA C
44 743872 43761400 437614 HOMESTAKE MNG CO HOMESTAKE MINING CO /DE/ NAME CHANGED TO BARRICK GOLD CORP 12/04/2001 SEE 067901 C
45 748212 73179110 731791 POLYPHASE CORP OVERHILL CORP NAME CHANGED TO OVERHILL CORP 03/20/2001   SEE 690211 C
46 748212 73179110 731791 POLYPHASE CORP POLYPHASE CORP NAME CHANGED TO OVERHILL CORP 03/20/2001   SEE 690211 C
47 748212 73179110 731791 POLYPHASE CORP TREECON RESOURCES INC NAME CHANGED TO OVERHILL CORP 03/20/2001   SEE 690211 C
48 761676 15091710 150917 CELLULAR COMMUNICATIONS INC CELLULAR COMMUNICATIONS INC MERGED INTO AIRTOUCH COMMUNICATIONS INC 08/16/1996 C
49 771498 07159310 071593 BATTLE MTN GOLD CO BATTLE MOUNTAIN GOLD CO NA NA C
50 777844 20992910 209929 COMPUSONICS VIDEO CORP COMPUSONICS VIDEO CORP NA NA C
51 787849 98502114 985021 YARDVILLE NATL BANCORP YARDVILLE NATIONAL BANCORP NA NA C
52 791164 42219B10 42219B HEALTH MGMT INC DEL HEALTH MANAGEMENT INC/DE NA NA C
53 791164 42219B10 42219B HEALTH MGMT INC DEL HOMECARE MANAGEMENT INC NA NA C
54 792161 57751400 577514 MAUNA LOA MACADAMIA PARTNERS MAUNA LOA MACADAMIA PARTNERS LP NAME CHANGED TO ML MACADAMIA ORCHARDS L P 09/16/1998 C
55 792161 57751400 577514 MAUNA LOA MACADAMIA PARTNERS ML MACADAMIA ORCHARDS L P NAME CHANGED TO ML MACADAMIA ORCHARDS L P 09/16/1998 C
56 792161 57751400 577514 MAUNA LOA MACADAMIA PARTNERS ROYAL HAWAIIAN ORCHARDS, L.P. NAME CHANGED TO ML MACADAMIA ORCHARDS L P 09/16/1998 C
57 804154 53680840 536808 LITHIUM TECHNOLOGY CORP LITHIUM TECHNOLOGY CORP NA NA C
58 811419 02355926 023559 AMERALIA INC AMERALIA INC NAME CHANGED TO NATURAL RES USA CORP  09/14/2010 C
59 811419 02355926 023559 AMERALIA INC NATURAL RESOURCES USA CORP NAME CHANGED TO NATURAL RES USA CORP  09/14/2010 C
60 812482 80190410 801904 SANTA FE GAMING CORP ARCHON CORP NAME CHANGED TO ARCHON CORP 05/11/2001          SEE 03957P C
61 812482 80190410 801904 SANTA FE GAMING CORP SAHARA GAMING CORP NAME CHANGED TO ARCHON CORP 05/11/2001          SEE 03957P C
62 812482 80190410 801904 SANTA FE GAMING CORP SANTA FE GAMING CORP NAME CHANGED TO ARCHON CORP 05/11/2001          SEE 03957P C
63 812906 40990017 409900 HANCOCK FABRICS INC HANCOCK FABRICS INC NA NA C
64 814577 02312010 023120 AMAX GOLD INC AMAX GOLD INC NAME CHANGED TO KINAM GOLD CORP 09/18/1998     SEE 494482 C
65 814577 02312010 023120 AMAX GOLD INC KINAM GOLD INC NAME CHANGED TO KINAM GOLD CORP 09/18/1998     SEE 494482 C
66 817644 86024C20 86024C STEVENS INTERNATIONAL INC STEVENS GRAPHICS CORP FORMERLY STEVENS GRAPHICS CORP 05/23/1995 C
67 817644 86024C20 86024C STEVENS INTERNATIONAL INC STEVENS INTERNATIONAL INC FORMERLY STEVENS GRAPHICS CORP 05/23/1995 C
68 829499 46072310 460723 INTERSTATE BAKERIES CORP NEW INTERSTATE BAKERIES CORP/DE/ MERGED INTO INTERSTATE BRANDS CORP 06/02/1990 C
69 840260 53803230 538032 LIVE ENTMT INC LIVE ENTERTAINMENT INC NAME CHANGED TO CANCER TREATMENT CTRS INC 07/30/2002 C
70 841528 75952U10 75952U RELIASTAR FINL CORP NWNL COMPANIES INC MERGED INTO ING GROEP N.V. 09/01/2000 C
71 841528 75952U10 75952U RELIASTAR FINL CORP RELIASTAR FINANCIAL CORP MERGED INTO ING GROEP N.V. 09/01/2000 C
72 845609 38070110 380701 GOLD RIV HOTEL & CASINO CORP GOLD RIVER HOTEL & CASINO CORP NA NA C
73 846609 68230310 682303 ONBANCORP INC ONBANCORP INC NA NA C
74 849502 75190710 751907 RAMTRON INTL CORP RAMTRON INTERNATIONAL CORP NA NA C
75 858339 74342A10 74342A PROMUS COS INC CAESARS ENTERTAINMENT CORP NAME CHANGED TO HARRAHS ENTMT INC 06/30/1995      SEE 413619 C
76 858339 74342A10 74342A PROMUS COS INC HARRAHS ENTERTAINMENT INC NAME CHANGED TO HARRAHS ENTMT INC 06/30/1995      SEE 413619 C
77 858339 74342A10 74342A PROMUS COS INC PROMUS COMPANIES INC NAME CHANGED TO HARRAHS ENTMT INC 06/30/1995      SEE 413619 C
78 862025 90337P10 90337P US WATS INC US WATS INC REORGANIZED AS CAPSULE COMMUNICATIONS INC 04/27/2000 C
79 869709 26831802 268318 EDG CAP INC EDG CAPITAL INC NAME CHANGED TO ISOTOPE SOLUTIONS GROUP INC 11/14/2001 C
80 869709 26831802 268318 EDG CAP INC ISOTOPE SOLUTIONS GROUP INC NAME CHANGED TO ISOTOPE SOLUTIONS GROUP INC 11/14/2001 C
81 872248 68393020 683930 OPTIMA PETE CORP OPTIMA PETROLEUM CORP REORGANIZED AS PETROQUEST ENERGY INC 09/04/1998 C
82 872248 68393020 683930 OPTIMA PETE CORP PETROQUEST ENERGY INC REORGANIZED AS PETROQUEST ENERGY INC 09/04/1998 C
83 874737 59512710 595127 MICRONICS COMPUTERS INC MICRONICS COMPUTERS INC /CA NA NA C
84 875459 92656M10 92656M VIDEO LOTTERY TECHNOLOGIES INC POWERHOUSE TECHNOLOGIES INC /DE NAME CHANGED TO POWERHOUSE TECHNOLOGIES INC C
85 875459 92656M10 92656M VIDEO LOTTERY TECHNOLOGIES INC VIDEO LOTTERY TECHNOLOGIES INC/DE NAME CHANGED TO POWERHOUSE TECHNOLOGIES INC C
86 875622 09093116 090931 BIOSPECIFICS TECHNOLOGIES CORP BIOSPECIFICS TECHNOLOGIES CORP NA NA C
87 879585 04907910 049079 ATLANTIC TELE-NETWORK INC ATLANTIC TELE NETWORK INC /DE NA NA C
88 879585 04907910 049079 ATLANTIC TELE-NETWORK INC ATN INTERNATIONAL, INC. NA NA C
89 880320 89472901 894729 TREMONT ADVISERS INC TREMONT ADVISERS INC ACQUIRED BY PRIVATE INVESTORS 10/01/2001 C
90 882159 23922010 239220 DAW TECHNOLOGIES INC DAW TECHNOLOGIES INC /UT NA NA C
91 882264 03852L10 03852L ARAKIS ENERGY CORP ARAKIS ENERGY CORP FORMERLY ARAKIS MNG CORP TO 09/12/1988 & FORMERLY ARAKIS C
92 882291 00808Y09 00808Y AETHLON MED INC AETHLON MEDICAL INC NA NA C
93 882291 00808Y09 00808Y AETHLON MED INC BISHOP EQUITIES INC NA NA C
94 884802 91382810 913828 UNIVERSAL SEISMIC ASSOC INC UNIVERSAL SEISMIC ASSOCIATES INC NAME CHANGED TO POCKETOP CORP 11/14/2006          SEE 73035T C
95 885942 84476910 844769 SOUTHWEST BANCSHARES INC DEL SOUTHWEST BANCSHARES INC /NEW/ NA NA C
96 887353 81804310 818043 SEVEN-UP / RC BOTTLING CO SEVEN UP RC BOTTLING COMPANY OF SOUTHERN CALIFORNIA INC NA NA C
97 888470 890910AB 890910 TOPS APPLIANCE CITY INC TOPS APPLIANCE CITY INC NA NA C
98 890568 69340T10 69340T PLD TELEKOM INC PETERSBURG LONG DISTANCE INC REORGANIZED AS METROMEDIA INTL GROUP INC 09/30/1999 C
99 890568 69340T10 69340T PLD TELEKOM INC PLD TELEKOM INC REORGANIZED AS METROMEDIA INTL GROUP INC 09/30/1999 C
100 895422 74037Q10 74037Q PREFERRED INCOME MGMT FD INC BOULDER TOTAL RETURN FUND INC NAME CHANGED TO BOULDER TOTAL RETURN FD INC 08/27/1999 C
101 895422 74037Q10 74037Q PREFERRED INCOME MGMT FD INC PREFERRED INCOME MANAGEMENT FUND INC NAME CHANGED TO BOULDER TOTAL RETURN FD INC 08/27/1999 C
102 897448 02311107 023111 AMARIN CORP PLC AMARIN CORP PLC\UK NA NA C
103 897448 02311107 023111 AMARIN CORP PLC AMARIN PHARMACEUTICALS PLC NA NA C
104 897448 02311107 023111 AMARIN CORP PLC ETHICAL HOLDINGS PLC NA NA C
105 904900 08372L10 08372L BERG ELECTRS CORP BERG ELECTRONICS CORP /DE/ NA NA C
106 907111 36870710 368707 GENEMEDICINE INC GENEMEDICINE INC NA NA C
107 908598 00817R03 00817R AETRIUM INC AETRIUM INC NA NA C
108 908598 00817R03 00817R AETRIUM INC ATRM HOLDINGS, INC. NA NA C
109 919239 38095610 380956 GOLDCORP INC NEW GOLDCORP INC NA NA C
110 924835 88871110 888711 TIVOLI INDS INC TIVOLI INDUSTRIES INC NA NA C
111 928375 98159L10 98159L WORLDWIDE PETROMOLY INC DUTCH GOLD RESOURCES INC REORGANIZED AS SMALL TOWN RADIO INC 05/29/2002 C
112 928375 98159L10 98159L WORLDWIDE PETROMOLY INC OGDEN MCDONALD & CO REORGANIZED AS SMALL TOWN RADIO INC 05/29/2002 C
113 928375 98159L10 98159L WORLDWIDE PETROMOLY INC SMALL TOWN RADIO INC REORGANIZED AS SMALL TOWN RADIO INC 05/29/2002 C
114 928375 98159L10 98159L WORLDWIDE PETROMOLY INC WORLDWIDE PETROMOLY INC REORGANIZED AS SMALL TOWN RADIO INC 05/29/2002 C
115 932352 91913E02 91913E VALENTIS INC MEGABIOS CORP NAME CHANGED TO URIGEN PHARMACEUTICALS INC 07/19/2007 C
116 932352 91913E02 91913E VALENTIS INC URIGEN PHARMACEUTICALS, INC. NAME CHANGED TO URIGEN PHARMACEUTICALS INC 07/19/2007 C
117 932352 91913E02 91913E VALENTIS INC VALENTIS INC NAME CHANGED TO URIGEN PHARMACEUTICALS INC 07/19/2007 C
118 933157 78069310 780693 ROYAL SILVER MINES INC AURORA OIL & GAS CORP NAME CHANGED TO CADENCE RES CORP 05/02/2001     SEE 12738N C
119 933157 78069310 780693 ROYAL SILVER MINES INC CADENCE RESOURCES CORP LP NAME CHANGED TO CADENCE RES CORP 05/02/2001     SEE 12738N C
120 933157 78069310 780693 ROYAL SILVER MINES INC CADENCE RESOURCES CORP NAME CHANGED TO CADENCE RES CORP 05/02/2001     SEE 12738N C
121 933157 78069310 780693 ROYAL SILVER MINES INC ROYAL SILVER MINES INC NAME CHANGED TO CADENCE RES CORP 05/02/2001     SEE 12738N C
122 937252 14166N20 14166N CAREADVANTAGE INC CAREADVANTAGE INC NA NA C
123 943324 20741M03 20741M CONMED HEALTHCARE MGMT INC CONMED HEALTHCARE MANAGEMENT, INC. NA NA C
124 943324 20741M03 20741M CONMED HEALTHCARE MGMT INC PACE HEALTH MANAGEMENT SYSTEMS INC NA NA C
125 946661 37989P10 37989P GLYCOGENESYS INC GLYCOGENESY  INC NA NA C
126 946661 37989P10 37989P GLYCOGENESYS INC GLYCOGENESYS  INC NA NA C
127 946661 37989P10 37989P GLYCOGENESYS INC IGG INTERNATIONAL INC NA NA C
128 946661 37989P10 37989P GLYCOGENESYS INC SAFESCIENCE INC NA NA C
129 946842 48625010 486250 KAUFMANN FD INC AMBASSADORS INTERNATIONAL INC NA NA C
130 948642 87927W10 87927W TELECOM ITALIA SPA TELECOM ITALIA S P A REORGANIZED AS TELECOM ITALIA S P A NEW 04/08/2003 C
131 948642 87927W10 87927W TELECOM ITALIA SPA TIM S.P.A. REORGANIZED AS TELECOM ITALIA S P A NEW 04/08/2003 C
132 948850 92330P10 92330P VENTURI PARTNERS INC COMSYS IT PARTNERS INC NAME CHANGED TO COMSYS IT PARTNERS INC 09/30/2004 C
133 948850 92330P10 92330P VENTURI PARTNERS INC PERSONNEL GROUP OF AMERICA INC NAME CHANGED TO COMSYS IT PARTNERS INC 09/30/2004 C
134 948850 92330P10 92330P VENTURI PARTNERS INC VENTURI PARTNERS INC NAME CHANGED TO COMSYS IT PARTNERS INC 09/30/2004 C
135 1002658 23437N10 23437N DALEEN TECHNOLOGIES INC DALEEN TECHNOLOGIES INC NA NA C
136 1006835 82617012 826170 SIEBEL SYS INC SIEBEL SYSTEMS INC REORGANIZED AS SIEBEL JANNA ARRANGEMENT INC 11/14/2000 C
137 1013050 74257410 742574 PRINTRAK INTL INC PRINTRAK INTERNATIONAL INC NA NA C
138 1014383 68340210 683402 ONYX SOFTWARE CORP ONYX SOFTWARE CORP/WA NA NA C
139 1032462 90297110 902971 U S RESTAURANT PPTYS INC FF-TSY HOLDING CO II, LLC REORGANIZED AS TRUSTSTREET PPTYS INC 02/24/2005 C
140 1032462 90297110 902971 U S RESTAURANT PPTYS INC TRUSTREET PROPERTIES INC REORGANIZED AS TRUSTSTREET PPTYS INC 02/24/2005 C
141 1032462 90297110 902971 U S RESTAURANT PPTYS INC U S RESTAURANT PROPERTIES INC REORGANIZED AS TRUSTSTREET PPTYS INC 02/24/2005 C
142 1033030 29879610 298796 EUROTECH LTD EUROTECH LTD NA NA C
143 1039684 68267810 682678 ONEOK INC ONEOK INC /NEW/ REORGANIZED AS ONEOK INC NEW 11/28/1997 C
144 1039684 68267810 682678 ONEOK INC WAI INC REORGANIZED AS ONEOK INC NEW 11/28/1997 C
145 1041858 74936101 749361 RCN CORP RCN CORP /DE/ NA NA C
146 1042618 65365610 653656 NICE SYS LTD NICE SYSTEMS LTD  /ADR/ NA NA C
147 1044979 45004F10 45004F IRI INTL CORP IRI INTERNATIONAL CORP NA NA C
148 1064133 98144Q10 98144Q WORLD COMM ONLINE INC DEL WORLD COMMERCE ONLINE INC NA NA C
149 1075046 89732918 897329 TROY FINL CORP TROY FINANCIAL CORP NA NA C
150 1093367 02932810 029328 AMERICAN RIVS OIL CO AMERICAN RIVERS OIL CO /DE/ REORGANIZED AS AROC INC 12/08/1999 C
151 1093367 02932810 029328 AMERICAN RIVS OIL CO AROC INC REORGANIZED AS AROC INC 12/08/1999 C
152 1094348 29015110 290151 ELOYALTY CORP ELOYALTY CORP NAME CHANGED TO MATTERSIGHT CORP 05/31/2011 SEE 577097 C
153 1094348 29015110 290151 ELOYALTY CORP MATTERSIGHT CORP NAME CHANGED TO MATTERSIGHT CORP 05/31/2011 SEE 577097 C
154 1105503 52886Q10 52886Q LEXENT INC LEXENT INC NA NA C
155 1120193 63110318 631103 NASDAQ OMX GROUP INC NASDAQ OMX GROUP, INC. FORMERLY NASDAQ STK MKT INC TO 02/27/2008 C
156 1120193 63110318 631103 NASDAQ OMX GROUP INC NASDAQ STOCK MARKET INC FORMERLY NASDAQ STK MKT INC TO 02/27/2008 C
157 1120193 63110318 631103 NASDAQ OMX GROUP INC NASDAQ, INC. FORMERLY NASDAQ STK MKT INC TO 02/27/2008 C
158 1156388 08915P01 08915P BIG 5 SPORTING GOODS CORP BIG 5 HOLDINGS CORP NA NA C
159 1156388 08915P01 08915P BIG 5 SPORTING GOODS CORP BIG 5 SPORTING GOODS CORP NA NA C
160 1177326 55377820 553778 MT ULTIMATE HEALTHCARE CORP JAVA JUICE NET NAME CHANGED TO MONARCH STAFFING INC  03/31/2006 C
161 1177326 55377820 553778 MT ULTIMATE HEALTHCARE CORP MONARCH STAFFING, INC. NAME CHANGED TO MONARCH STAFFING INC  03/31/2006 C
162 1177326 55377820 553778 MT ULTIMATE HEALTHCARE CORP MT ULTIMATE HEALTHCARE CORP NAME CHANGED TO MONARCH STAFFING INC  03/31/2006 C
163 1192549 38056012 380560 GOLD CITY INDS LTD GOLD CITY INDUSTRIES LTD NA NA C
164 1262279 32077108 320771 FIRST MARBLEHEAD CORP FIRST MARBLEHEAD CORP NA NA C
165 1288359 68882302 688823 OTELCO INC OTELCO INC. NA NA C
166 1288359 68882302 688823 OTELCO INC RURAL LEC ACQUISITION LLC NA NA C
167 1310094 00430L10 00430L ACCENTIA BIOPHARMACEUTICALS ACCENTIA BIOPHARMACEUTICALS INC NA NA C
168 1314755 37250B14 37250B GENTIUM S P A GENTIUM S.P.A. NA NA C
169 1382911 68616404 686164 OREXIGEN THERAPEUTICS INC OREXIGEN THERAPEUTICS, INC. NA NA C
170 1504937 92025W10 92025W VALOR GOLD CORP FELAFEL CORP NA NA C
171 1504937 92025W10 92025W VALOR GOLD CORP VALOR GOLD CORP. NA NA C
172 1504937 92025W10 92025W VALOR GOLD CORP VAPORIN, INC. NA NA C
173 1547530 02462418 024624 AMERICAN BOARDING CO AMERICAN BOARDING CO NA NA C
174 1547530 02462418 024624 AMERICAN BOARDING CO MICROLIN BIO, INC NA NA C
bdcallen commented 4 years ago

@iangow

> match_issuers %>% filter(cik %in% not_hard_matches_i & !(cik %in% soft_matches_i)) %>% 
+                   distinct(cik, cusip) %>% 
+                   mutate(cusip6 = substr(cusip, 1, 6)) %>% 
+                   left_join(issuers) %>% left_join(ciks) %>% 
+                   distinct(cik, cusip) %>% count()
Joining, by = "cusip6"
Joining, by = "cik"
# A tibble: 1 x 1
      n
  <int>
1    41

> match_issuers %>% filter(cik %in% soft_matches_i) %>% 
+                   distinct(cik, cusip) %>% 
+                   mutate(cusip6 = substr(cusip, 1, 6)) %>% 
+                   left_join(issuers) %>% left_join(ciks) %>% 
+                   distinct(cik, cusip) %>% count()
Joining, by = "cusip6"
Joining, by = "cik"
# A tibble: 1 x 1
      n
  <int>
1    10

> match_issuers %>% filter(!cik %in% not_hard_matches_i) %>% 
+                   distinct(cik, cusip) %>% 
+                   mutate(cusip6 = substr(cusip, 1, 6)) %>% 
+                   left_join(issuers) %>% left_join(ciks) %>% 
+                   distinct(cik, cusip) %>% count()
Joining, by = "cusip6"
Joining, by = "cik"
# A tibble: 1 x 1
      n
  <int>
1   106

> match_stocknames %>% anti_join(match_issuers) %>% distinct(cik, cusip) %>% 
+                        mutate(ncusip = cusip) %>% 
+                        left_join(stocknames, by = 'ncusip') %>% left_join(ciks) %>% select(cik, cusip.x, company_name, comnam) %>% rename(cusip = cusip.x)
Joining, by = c("cusip", "cik")
Joining, by = "cik"
# A tibble: 3 x 4
      cik cusip    company_name         comnam                
    <int> <chr>    <chr>                <chr>                 
1  874516 M2051410 BVR TECHNOLOGIES LTD B V R TECHNOLOGIES LTD
2  874516 M2051410 BVR TECHNOLOGIES LTD B V R TECHNOLOGIES LTD
3 1078276 G6116R10 MIH LTD              M I H LTD TORTOLA     

So we have 10 matches in the "soft match set", 4 of which I determined in a post above that are actually matches, 41 which are not matches by company name or history, and 106 which are matches. Also, out of the two matches in match_stocknames which are not also in match_issuers, is seems both are legitimate matches. So out of the set of 335 8 digit cusip matches with a multiplicity more than 10, 159 matches join onto either crsp.stocknames or cusipm.issuer. Out of these 159, 112 can be said to be good matches, with another 6 with a question mark.

Then there are 41 that seem to be bad matches, though it is possible that these entries correspond to older cusips and/or company names, which might not be in either of crsp.stocknames or cusipm.issuer (this is why I'd like to know how comprehensive and up to date these are). Are there any other databases that these could be joined onto?

So it seems, if the mentioned databases are comprehensive, then 335 - 159 = 176 of the matches have cusips which do not exist in any major database, and are probably harmless (as they won't join). Then 112 of the remaining 159 are good matches, 6 might be good matches, and 41 are bad matches which could cause harm (as the wrong cusip is potentially matched).

bdcallen commented 4 years ago

@iangow Some of the 8-digit cusips seemed to be "padded"

> cusip8s %>% anti_join(eight_w_valid9) %>% filter(multiplicity >= 10) %>% distinct(cusip, cik) %>% filter(grepl('(SEECOVER|SCHEDULE|PAGE[0-9]OF[0-9])', cusip))
Joining, by = c("cik", "cusip", "multiplicity")
# A tibble: 20 x 2
   cusip        cik
   <chr>      <int>
 1 SEECOVER    6292
 2 PAGE2OF4    6715
 3 SEECOVER   42582
 4 SEECOVER   67887
 5 SEECOVER  310056
 6 SEECOVER  351998
 7 SEECOVER  731657
 8 SEECOVER  864683
 9 SEECOVER  906347
10 SEECOVER  922204
11 SCHEDULE  943861
12 SEECOVER  944075
13 SCHEDULE 1016100
14 SEECOVER 1043961
15 SEECOVER 1051514
16 SEECOVER 1053374
17 PAGE2OF2 1061069
18 SEECOVER 1090431
19 SCHEDULE 1157624
20 SEECOVER 1239819

cases_8 <- cusip8s %>% anti_join(eight_w_valid9) %>% filter(multiplicity >= 10) %>% distinct(cusip, cik) %>% filter(!grepl('(SEECOVER|SCHEDULE|PAGE[0-9]OF[0-9])', cusip))

cases_8 <- cases_8 %>% mutate(cusip_padded = paste0('0', cusip))
cases_8 <- cases_8 %>% mutate(cusip6 = substr(cusip_padded, 1, 6), ncusip = substr(cusip_padded, 1, 8))

cases_8 <- cases_8 %>% select(cik, cusip, cusip_padded, ncusip, cusip6)

> cases_8
# A tibble: 315 x 5
     cik cusip    cusip_padded ncusip   cusip6
   <int> <chr>    <chr>        <chr>    <chr> 
 1  2491 36465410 036465410    03646541 036465
 2  3545 01630104 001630104    00163010 001630
 3  5133 02375204 002375204    00237520 002375
 4  6281 32654105 032654105    03265410 032654
 5  8038 04635208 004635208    00463520 004635
 6  9346 57755100 057755100    05775510 057755
 7 10119 68509603 068509603    06850960 068509
 8 11544 84423102 084423102    08442310 084423
 9 12208 90572207 090572207    09057220 090572
10 16496 13152106 013152106    01315210 013152
# … with 305 more rows

> cases_8 %>% inner_join(stocknames, by = 'ncusip') %>% left_join(ciks, by = 'cik') %>% select(cik, cusip.x, cusip_padded, ncusip, comnam, company_name) %>% rename(cusip = cusip.x)
# A tibble: 88 x 6
     cik cusip    cusip_padded ncusip   comnam                     company_name              
   <int> <chr>    <chr>        <chr>    <chr>                      <chr>                     
 1  6281 32654105 032654105    03265410 ANALOG DEVICES INC         ANALOG DEVICES INC        
 2  6281 32654105 032654105    03265410 ANALOG DEVICES INC         ANALOG DEVICES INC        
 3  6281 32654105 032654105    03265410 ANALOG DEVICES INC         ANALOG DEVICES INC        
 4  9346 57755100 057755100    05775510 BALDWIN & LYONS INC        BALDWIN & LYONS INC       
 5  9346 57755100 057755100    05775510 BALDWIN & LYONS INC        PROTECTIVE INSURANCE CORP 
 6  9346 57755100 057755100    05775510 BALDWIN & LYONS INC        BALDWIN & LYONS INC       
 7  9346 57755100 057755100    05775510 BALDWIN & LYONS INC        PROTECTIVE INSURANCE CORP 
 8 10119 68509603 068509603    06850960 BARRINGER TECHNOLOGIES INC BARRINGER TECHNOLOGIES INC
 9 11544 84423102 084423102    08442310 BERKLEY W R CORP           BERKLEY W R CORP          
10 11544 84423102 084423102    08442310 BERKLEY W R CORP           BERKLEY W R CORP          
# … with 78 more rows

> cases_8 %>% inner_join(issuers) %>% left_join(ciks, by = 'cik') %>% select(cik, cusip, cusip_padded, cusip6, company_name, issuer_name_1, issuer_adl_1, issuer_adl_2, )
Joining, by = "cusip6"
# A tibble: 128 x 8
     cik cusip    cusip_padded cusip6 company_name               issuer_name_1              issuer_adl_1                  issuer_adl_2
   <int> <chr>    <chr>        <chr>  <chr>                      <chr>                      <chr>                         <chr>       
 1  3545 01630104 001630104    001630 ALICO INC                  ALR TECHNOLOGIES INC       NA                            NA          
 2  3545 01630104 001630104    001630 ALICO, INC.                ALR TECHNOLOGIES INC       NA                            NA          
 3  6281 32654105 032654105    032654 ANALOG DEVICES INC         ANALOG DEVICES INC         NA                            NA          
 4  9346 57755100 057755100    057755 BALDWIN & LYONS INC        BALDWIN & LYONS INC        NA                            NA          
 5  9346 57755100 057755100    057755 PROTECTIVE INSURANCE CORP  BALDWIN & LYONS INC        NA                            NA          
 6 10119 68509603 068509603    068509 BARRINGER TECHNOLOGIES INC BARRINGER TECHNOLOGIES INC FORMERLY BARRINGER RES INC TO 02/12/1991  
 7 11544 84423102 084423102    084423 BERKLEY W R CORP           BERKLEY W R CORP           NA                            NA          
 8 12208 90572207 090572207    090572 BIO RAD LABORATORIES INC   BIO RAD LABORATORIES INC   FORMERLY BIO RAD LABS TO      04/25/1975  
 9 12208 90572207 090572207    090572 BIO-RAD LABORATORIES, INC. BIO RAD LABORATORIES INC   FORMERLY BIO RAD LABS TO      04/25/1975  
10 26537 23371210 023371210    023371 ARGON ST, INC.             AMCAL INC                  NA                            NA          
# … with 118 more rows
bdcallen commented 4 years ago
> cases_8 %>% inner_join(issuers) %>% left_join(ciks, by = 'cik') %>% distinct(cik, cusip) %>% count()
Joining, by = "cusip6"
# A tibble: 1 x 1
      n
  <int>
1    78
> cases_8 %>% inner_join(stocknames, by = 'ncusip') %>% left_join(ciks, by = 'cik') %>% distinct(cik, cusip.x) %>% count()
# A tibble: 1 x 1
      n
  <int>
1    38
> cases_8 %>% inner_join(stocknames, by = 'ncusip') %>% left_join(ciks, by = 'cik') %>% distinct(cik, cusip.x) %>% rename(cusip = cusip.x) %>% anti_join(cases_8 %>% inner_join(issuers) %>% left_join(ciks, by = 'cik') %>% distinct(cik, cusip))
Joining, by = "cusip6"
Joining, by = c("cik", "cusip")
# A tibble: 0 x 2
# … with 2 variables: cik <int>, cusip <chr>

There seem to be 78 candidates which could potentially be matches if we left-pad with a zero.

bdcallen commented 4 years ago

Obviously, it might be a good idea to have a function in R which calculates the check digit reliably, so that we can further test the padded cusips

get_cusip_digit_value <- function(cusip_digit) {

    digits <- c('0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'A', 'B', 'C', 'D', 
                     'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 
                     'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '*', '@', '#')

    result = which(digits == cusip_digit) - 1

    return(result)

}

get_cusip_digit_value <- Vectorize(get_cusip_digit_value)

calculate_cusip_check_digit <- function(cusip) {

    digit_str <- paste0(as.character(get_cusip_digit_value(
                 unlist(str_split(substr(cusip, 1, 8), ''))) * c(1, 2, 1, 2, 1, 2, 1, 2)))

    digits_to_sum <- as.integer(unlist(str_split(digit_str, '')))

    result <- (10 - sum(digits_to_sum)) %% 10

    return(result)

}

calculate_cusip_check_digit <- Vectorize(calculate_cusip_check_digit)

> cusip_cik %>% filter(cusip_length == 9) %>% 
   distinct(cik, cusip, check_digit) %>% 
   mutate(test = calculate_cusip_check_digit(cusip))
# A tibble: 49,691 x 4
     cik cusip     check_digit  test
   <int> <chr>           <int> <dbl>
 1    20 082073107           7     7
 2    20 320195100           0     0
 3    20 482730108           8     8
 4  1750 000361105           5     5
 5  1750 232109108           8     8
 6  1750 36110513G           3     3
 7  1761 894120104           4     4
 8  1761 894120203           3     3
 9  1800 002824100           0     0
10  1853 003050101           1     1
# … with 49,681 more rows
> cusip_cik %>% filter(cusip_length == 9) %>% 
   distinct(cik, cusip, check_digit) %>% 
   mutate(test = calculate_cusip_check_digit(cusip)) %>% 
   filter(check_digit == test) %>% count()
# A tibble: 1 x 1
      n
  <int>
1 49691
bdcallen commented 4 years ago
> cases_8 %>% inner_join(issuers) %>% 
   left_join(ciks, by = 'cik') %>% 
  distinct(cik, cusip) %>% count()
Joining, by = "cusip6"
# A tibble: 1 x 1
      n
  <int>
1    78
> cases_8 %>% inner_join(issuers) %>% 
   left_join(ciks, by = 'cik') %>% 
   mutate(check_digit = calculate_cusip_check_digit(cusip_padded)) %>% 
   filter(substr(cusip_padded, 9, 9) == as.character(check_digit)) %>% 
   distinct(cik, cusip) %>% count()
Joining, by = "cusip6"
# A tibble: 1 x 1
      n
  <int>
1    43

So there are still 43 potential cases for padded cusips, once check digit contraint is applied.

bdcallen commented 4 years ago
> cases_8 <- cases_8 %>% mutate(check_digit = calculate_cusip_check_digit(cusip_padded))
> cases_8
# A tibble: 315 x 6
     cik cusip    cusip_padded ncusip   cusip6 check_digit
   <int> <chr>    <chr>        <chr>    <chr>        <dbl>
 1  2491 36465410 036465410    03646541 036465           7
 2  3545 01630104 001630104    00163010 001630           2
 3  5133 02375204 002375204    00237520 002375           2
 4  6281 32654105 032654105    03265410 032654           5
 5  8038 04635208 004635208    00463520 004635           7
 6  9346 57755100 057755100    05775510 057755           0
 7 10119 68509603 068509603    06850960 068509           3
 8 11544 84423102 084423102    08442310 084423           2
 9 12208 90572207 090572207    09057220 090572           7
10 16496 13152106 013152106    01315210 013152           3
# … with 305 more rows
> cases_8 %>% filter(substr(cusip_padded, 9, 9) == as.character(check_digit))
# A tibble: 69 x 6
      cik cusip    cusip_padded ncusip   cusip6 check_digit
    <int> <chr>    <chr>        <chr>    <chr>        <dbl>
 1   6281 32654105 032654105    03265410 032654           5
 2   9346 57755100 057755100    05775510 057755           0
 3  10119 68509603 068509603    06850960 068509           3
 4  11544 84423102 084423102    08442310 084423           2
 5  12208 90572207 090572207    09057220 090572           7
 6  21610 75907105 075907105    07590710 075907           5
 7  46195 62540109 062540109    06254010 062540           9
 8  70415 63713010 063713010    06371301 063713           0
 9  72843 66526286 066526286    06652628 066526           6
10 100441 90212010 090212010    09021201 090212           0
# … with 59 more rows

So there are 69 8-digit cusips which are potentially 9-digit cusips when padded with a 0 from the left. And 43 of these map onto either stocknames or issuers (note: I have already established that all potential matches to stocknames also match onto issuers in this set in a previous post).

bdcallen commented 4 years ago
View(cases_8 %>% inner_join(issuers) %>% 
            left_join(ciks, by = 'cik') %>% 
            filter(substr(cusip_padded, 9, 9) == as.character(check_digit)) %>% 
            distinct(cik, cusip, cusip_padded, cusip6, check_digit, company_name, 
                         issuer_name_1, issuer_name_2, issuer_name_3, issuer_adl_1)

gives

cik cusip cusip_padded cusip6 check_digit company_name issuer_name_1 issuer_name_2 issuer_name_3 issuer_adl_1
1 6281 32654105 032654105 032654 5 ANALOG DEVICES INC ANALOG DEVICES INC NA NA NA
2 9346 57755100 057755100 057755 0 BALDWIN & LYONS INC BALDWIN & LYONS INC NA NA NA
3 9346 57755100 057755100 057755 0 PROTECTIVE INSURANCE CORP BALDWIN & LYONS INC NA NA NA
4 10119 68509603 068509603 068509 3 BARRINGER TECHNOLOGIES INC BARRINGER TECHNOLOGIES INC NA NA FORMERLY BARRINGER RES INC TO
5 11544 84423102 084423102 084423 2 BERKLEY W R CORP BERKLEY W R CORP NA NA NA
6 12208 90572207 090572207 090572 7 BIO RAD LABORATORIES INC BIO RAD LABORATORIES INC NA NA FORMERLY BIO RAD LABS TO
7 12208 90572207 090572207 090572 7 BIO-RAD LABORATORIES, INC. BIO RAD LABORATORIES INC NA NA FORMERLY BIO RAD LABS TO
8 46195 62540109 062540109 062540 9 BANCORP HAWAII INC BANK HAWAII CORP NA NA NA
9 46195 62540109 062540109 062540 9 BANK OF HAWAII CORP BANK HAWAII CORP NA NA NA
10 46195 62540109 062540109 062540 9 PACIFIC CENTURY FINANCIAL CORP BANK HAWAII CORP NA NA NA
11 206030 54615109 054615109 054615 9 AXSYS TECHNOLOGIES INC AXSYS TECHNOLOGIES INC NA NA NA
12 206030 54615109 054615109 054615 9 VERNITRON CORP AXSYS TECHNOLOGIES INC NA NA NA
13 718976 55442107 055442107 055442 7 BGS SYSTEMS INC BGS SYS INC NA NA NA
14 724024 28882108 028882108 028882 8 AMERICAN PHYSICIANS SERVICE GROUP INC AMERICAN PHYSICIANS SVC GROUP INC NA NA
15 750556 69149730 069149730 069149 0 SUNTRUST BANKS INC BARTELL MEDIA CORP NA NA MERGED INTO DOWNE
16 771497 00957100 000957100 000957 0 ABM INDUSTRIES INC /DE/ ABM INDS INC NA NA NA
17 771497 00957100 000957100 000957 0 ABM INDUSTRIES INC ABM INDS INC NA NA NA
18 771497 00957100 000957100 000957 0 AMERICAN BUILDING MAINTENANCE INDUSTRIES INC ABM INDS INC NA NA NA
19 796486 06848105 006848105 006848 5 ADELPHIA COMMUNICATIONS CORP ADELPHIA COMMUNICATIONS CORP NA NA NA
20 799233 42234104 042234104 042234 4 HEARTLAND EXPRESS INC ARMISTICE RES CORP NA NA NAME CHANGED TO KERR MINES INC
21 814580 0753T105 00753T105 00753T 5 ADVANCED MARKETING SERVICES INC ADVANCED MARKETING SVCS INC NA NA NA
22 818813 91736108 091736108 091736 8 BITSTREAM INC. BITSTREAM INC NA NA NA
23 818813 91736108 091736108 091736 8 BITSTREAM INC BITSTREAM INC NA NA NA
24 830736 90649104 090649104 090649 4 ALPHA INNOTECH CORP BIOPOOL INTL INC NA NA NAME CHANGED TO XTRANA INC
25 830736 90649104 090649104 090649 4 BIOPOOL INTERNATIONAL INC BIOPOOL INTL INC NA NA NAME CHANGED TO XTRANA INC
26 830736 90649104 090649104 090649 4 XTRANA INC BIOPOOL INTL INC NA NA NAME CHANGED TO XTRANA INC
27 834306 90945106 090945106 090945 6 ALERE SAN DIEGO, INC. BIOSITE INC NA NA FORMERLY BIOSITE DIAGNOSTICS
28 834306 90945106 090945106 090945 6 BIOSITE DIAGNOSTICS INC BIOSITE INC NA NA FORMERLY BIOSITE DIAGNOSTICS
29 834306 90945106 090945106 090945 6 BIOSITE INC BIOSITE INC NA NA FORMERLY BIOSITE DIAGNOSTICS
30 840401 04380911 004380911 004380 1 ACCUHEALTH INC ACCUHEALTH INC NA NA NA
31 847935 19615103 019615103 019615 3 ALLION HEALTHCARE INC ALLION HEALTHCARE INC NA NA NA
32 847935 19615103 019615103 019615 3 CARE GROUP INC ALLION HEALTHCARE INC NA NA NA
33 849547 91826107 091826107 091826 7 BLACK BOX CORP BLACK BOX CORP DEL NA NA NA
34 849547 91826107 091826107 091826 7 MB HOLDINGS INC BLACK BOX CORP DEL NA NA NA
35 850693 18490102 018490102 018490 2 ALLERGAN INC ALLERGAN INC NA NA NA
36 882291 00808Y09 000808Y09 000808 9 AETHLON MEDICAL INC ACA ABS 2004-1 LTD NA NA NA
37 882291 00808Y09 000808Y09 000808 9 BISHOP EQUITIES INC ACA ABS 2004-1 LTD NA NA NA
38 888711 06698104 006698104 006698 4 BANYAN SYSTEMS INC ADDMASTER CORP NA NA NA
39 888711 06698104 006698104 006698 4 EPRESENCE INC ADDMASTER CORP NA NA NA
40 895385 40228108 040228108 040228 8 ARGOSY GAMING CO ARGOSY GAMING CO NA NA NA
41 898437 35255108 035255108 035255 8 ANIKA RESEARCH INC ANIKA THERAPEUTICS INC NA NA FORMERLY ANIKA RESH INC TO
42 898437 35255108 035255108 035255 8 ANIKA THERAPEUTICS INC ANIKA THERAPEUTICS INC NA NA FORMERLY ANIKA RESH INC TO
43 898437 35255108 035255108 035255 8 ANIKA THERAPEUTICS, INC. ANIKA THERAPEUTICS INC NA NA FORMERLY ANIKA RESH INC TO
44 900708 78137106 078137106 078137 6 BELL MICROPRODUCTS INC BELL MICROPRODUCTS INC NA NA NA
45 902600 10196103 010196103 010196 3 AKSYS LTD AKSYS LTD NA NA NA
46 902791 68463108 068463108 068463 8 BARRETT BUSINESS SERVICES INC BARRETT BUSINESS SVCS INC NA NA NA
47 935226 82047200 082047200 082047 0 BENIHANA INC BENIHANA INC NA NA NA
48 940332 32515108 032515108 032515 8 ANADIGICS INC ANADIGICS INC NA NA NA
49 1002658 23437N10 023437N10 023437 0 DALEEN TECHNOLOGIES INC AMEE INC NA NA NAME CHANGED TO AQUAPLAN INC
50 1013238 38505103 038505103 038505 3 ARADIGM CORP ARADIGM CORP NA NA NA
51 1028087 25353103 025353103 025353 3 AMERICAN DENTAL PARTNERS INC AMERICAN DENTAL PARTNERS INC NA NA NA
52 1032462 90297110 090297110 090297 0 FF-TSY HOLDING CO II, LLC BILTMORE FDS NA NA NAME CHANGED TO WACHOVIA FDS
53 1032462 90297110 090297110 090297 0 TRUSTREET PROPERTIES INC BILTMORE FDS NA NA NAME CHANGED TO WACHOVIA FDS
54 1032462 90297110 090297110 090297 0 U S RESTAURANT PROPERTIES INC BILTMORE FDS NA NA NAME CHANGED TO WACHOVIA FDS
55 1061069 0534P109 00534P109 00534P 9 AVALON HOLDINGS CORP ADAGIO III CLO PLC NA NA NA
56 1085734 93679108 093679108 093679 8 BLOCKBUSTER INC BLOCKBUSTER INC NA NA NAME CHANGED TO BB LIQUIDATING
57 1089531 0504X207 00504X207 00504X 7 ACTIVEWORLDS COM INC ACTIVEWORLDS CORP NA NA NAME CHANGED TO KINGOLD
58 1089531 0504X207 00504X207 00504X 7 ACTIVEWORLDS CORP ACTIVEWORLDS CORP NA NA NAME CHANGED TO KINGOLD
59 1089531 0504X207 00504X207 00504X 7 KINGOLD JEWELRY, INC. ACTIVEWORLDS CORP NA NA NAME CHANGED TO KINGOLD
60 1116094 54923107 054923107 054923 7 AXCAN PHARMA INC AXCAN PHARMA INC NA NA NA
61 1117057 27297100 027297100 027297 0 AMERICAN LORAIN CORP AMERICAN LORAIN CORP NA NA NA
62 1117057 27297100 027297100 027297 0 MILLENNIUM QUEST INC AMERICAN LORAIN CORP NA NA NA
63 1117057 27297100 027297100 027297 0 PLANET GREEN HOLDINGS CORP. AMERICAN LORAIN CORP NA NA NA
64 1162315 23177108 023177108 023177 8 AMBASSADORS GROUP INC AMBASSADORS GROUP INC NA NA NA
65 1324105 8883T200 08883T200 08883T 0 BIDZ.COM, INC. BIDZ COM INC NA NA NA
66 1344413 15384100 015384100 015384 0 ALEXZA PHARMACEUTICALS INC. ALEXZA PHARMACEUTICALS INC NA NA NA
67 1518749 5501E201 05501E201 05501E 1 AZTECA ACQUISITION CORP AZTECA ACQUISITION CORP NA NA REORGANIZED AS HEMISPHERE

The matches from these are pretty good. The only clear non-matches are these ones

not_matches <- c(750556, 799233, 882291, 888711, 1002658, 1032462, 1061069)
View(cases_8 %>% filter(cik %in% not_matches) %>% 
        inner_join(issuers) %>% 
            left_join(ciks, by = 'cik') %>% 
            filter(substr(cusip_padded, 9, 9) == as.character(check_digit)) %>% 
            distinct(cik, cusip, cusip_padded, cusip6, check_digit, company_name, 
                         issuer_name_1, issuer_name_2, issuer_name_3, issuer_adl_1)
cik cusip cusip_padded cusip6 check_digit company_name issuer_name_1 issuer_name_2 issuer_name_3 issuer_adl_1
1 750556 69149730 069149730 069149 0 SUNTRUST BANKS INC BARTELL MEDIA CORP NA NA MERGED INTO DOWNE
2 799233 42234104 042234104 042234 4 HEARTLAND EXPRESS INC ARMISTICE RES CORP NA NA NAME CHANGED TO KERR MINES INC
3 882291 00808Y09 000808Y09 000808 9 AETHLON MEDICAL INC ACA ABS 2004-1 LTD NA NA NA
4 882291 00808Y09 000808Y09 000808 9 BISHOP EQUITIES INC ACA ABS 2004-1 LTD NA NA NA
5 888711 06698104 006698104 006698 4 BANYAN SYSTEMS INC ADDMASTER CORP NA NA NA
6 888711 06698104 006698104 006698 4 EPRESENCE INC ADDMASTER CORP NA NA NA
7 1002658 23437N10 023437N10 023437 0 DALEEN TECHNOLOGIES INC AMEE INC NA NA NAME CHANGED TO AQUAPLAN INC
8 1032462 90297110 090297110 090297 0 FF-TSY HOLDING CO II, LLC BILTMORE FDS NA NA NAME CHANGED TO WACHOVIA FDS
9 1032462 90297110 090297110 090297 0 TRUSTREET PROPERTIES INC BILTMORE FDS NA NA NAME CHANGED TO WACHOVIA FDS
10 1032462 90297110 090297110 090297 0 U S RESTAURANT PROPERTIES INC BILTMORE FDS NA NA NAME CHANGED TO WACHOVIA FDS
11 1061069 0534P109 00534P109 00534P 9 AVALON HOLDINGS CORP ADAGIO III CLO PLC NA NA NA

So it seems we have 43 - 7 = 36 good matches from this analysis with 8-digit cusips padded with zeroes from the left.

bdcallen commented 4 years ago

> cases_8 %>% filter(!cik %in% not_matches) %>%  
               inner_join(issuers) %>%  
               left_join(ciks, by = 'cik') %>%  
               filter(substr(cusip_padded, 9, 9) == as.character(check_digit)) %>%  
               distinct(cik, cusip, cusip_padded, cusip6, check_digit, company_name,  
               issuer_name_1, issuer_name_2, issuer_name_3, issuer_adl_1, 
               issuer_adl_2, issuer_adl_3, issuer_status) %>% 
               inner_join(match_issuers %>% inner_join(issuers), by = c('cik', 'cusip')) %>% 
               select(cik, cusip, cusip_padded, check_digit, company_name, 
               issuer_name_1.x, issuer_name_1.y) %>% 
               rename(issuer_name_1_padded = issuer_name_1.x, 
                            issuer_name_1_unpadded = issuer_name_1.y) 
Joining, by = "cusip6" Joining, by = "cusip6" 
# A tibble: 4 x 7
      cik cusip    cusip_padded check_digit company_name                          issuer_name_1_padded          issuer_name_1_unpadded
        <int> <chr>    <chr>              <dbl> <chr>                                 <chr>                         <chr>                      
1 724024 28882108 028882108              8 AMERICAN PHYSICIANS SERVICE GROUP INC AMERICAN PHYSICIANS SVC GROUP ELLIS GABLES APT BLDG CORP 
2 849547 91826107 091826107              7 BLACK BOX CORP                        BLACK BOX CORP DEL            VLPS LTG SVCS INTL INC     
3 849547 91826107 091826107              7 MB HOLDINGS INC                       BLACK BOX CORP DEL            VLPS LTG SVCS INTL INC     
4 850693 18490102 018490102              2 ALLERGAN INC                          ALLERGAN INC                  CLEARLY CDN BEVERAGE CORP

Padding fixes 3 matches from the bad matches in match_issuers.

There is no overlap with the initial cases that matched stocknames

> cases_8 %>% filter(!cik %in% not_matches) %>% 
               inner_join(issuers) %>% 
               left_join(ciks, by = 'cik') %>% 
               filter(substr(cusip_padded, 9, 9) == as.character(check_digit)) %>% 
               distinct(cik, cusip, cusip_padded, cusip6, check_digit, company_name, 
                           issuer_name_1, issuer_name_2, issuer_name_3, issuer_adl_1, 
                             issuer_adl_2, issuer_adl_3, issuer_status)  %>% 
                             inner_join(match_stocknames)
Joining, by = "cusip6"
Joining, by = c("cik", "cusip")
# A tibble: 0 x 13
# … with 13 variables: cik <int>, cusip <chr>, cusip_padded <chr>, cusip6 <chr>, check_digit <dbl>, company_name <chr>, issuer_name_1 <chr>, issuer_name_2 <chr>,
#   issuer_name_3 <chr>, issuer_adl_1 <chr>, issuer_adl_2 <chr>, issuer_adl_3 <chr>, issuer_status <chr>
bdcallen commented 4 years ago

Here, I'm looking at cases where a record with an 8-digit cusip also has a valid 9-digit one, and subsequently, in how many cases can the 8-digit cusips push the 9-digit cases over the threshold of 10 for the multiplicity (ie. multiplicity(cusip8) + multiplicity(cusip9) >= 10). One can see that there are 465 cases in total, with 206 satisfying the constraint.

> cusip8s %>% inner_join(cusip_cik, by = 'cik') %>% filter(cusip_length.y == 9) %>% filter(cusip.x == substr(cusip.y, 1, 8), substr(cusip.y, 9, 9) == as.character(check_digit.y)) %>% distinct(cik, cusip.x, multiplicity.x, cusip.y, multiplicity.y) %>% filter(multiplicity.y < 10) %>%  distinct(cik, cusip.x) %>% count()
# A tibble: 1 x 1
      n
  <int>
1   465
> cusip8s %>% inner_join(cusip_cik, by = 'cik') %>% filter(cusip_length.y == 9) %>% filter(cusip.x == substr(cusip.y, 1, 8), substr(cusip.y, 9, 9) == as.character(check_digit.y)) %>% distinct(cik, cusip.x, multiplicity.x, cusip.y, multiplicity.y) %>% filter(multiplicity.y < 10) %>% filter(multiplicity.x + multiplicity.y >= 10) %>% distinct(cik, cusip.x) %>% count()
# A tibble: 1 x 1
      n
  <int>
1   206

I then looked to see if these cases matched issuers and stocknames. Below, you can see that 201 of the 206 cases matched issuers.

> cusip8s %>% inner_join(cusip_cik, by = 'cik') %>% filter(cusip_length.y == 9) %>% filter(cusip.x == substr(cusip.y, 1, 8), substr(cusip.y, 9, 9) == as.character(check_digit.y)) %>% distinct(cik, cusip.x, multiplicity.x, cusip.y, multiplicity.y) %>% filter(multiplicity.y < 10) %>% filter(multiplicity.x + multiplicity.y >= 10) %>% distinct(cik, cusip.x) %>% rename(cusip = cusip.x) %>% mutate(cusip6 = substr(cusip, 1, 6)) %>% inner_join(issuers) %>% left_join(ciks) %>% distinct(cik, cusip) %>% count()
Joining, by = "cusip6"
Joining, by = "cik"
# A tibble: 1 x 1
      n
  <int>
1   201

Then calling that matched dataframe m89_issuers, I looked at what matched stocknames but not issuers. It turns out that there are 4 such cases, and by comparing company_name and comnam, you can see that all 4 of them are good matches

m89_issuers <- cusip8s %>% inner_join(cusip_cik, by = 'cik') %>% filter(cusip_length.y == 9) %>% filter(cusip.x == substr(cusip.y, 1, 8), substr(cusip.y, 9, 9) == as.character(check_digit.y)) %>% distinct(cik, cusip.x, multiplicity.x, cusip.y, multiplicity.y) %>% filter(multiplicity.y < 10) %>% filter(multiplicity.x + multiplicity.y >= 10) %>% distinct(cik, cusip.x) %>% rename(cusip = cusip.x) %>% mutate(cusip6 = substr(cusip, 1, 6)) %>% inner_join(issuers) %>% left_join(ciks) %>% distinct(cik, cusip)

> cusip8s %>% inner_join(cusip_cik, by = 'cik') %>% filter(cusip_length.y == 9) %>% filter(cusip.x == substr(cusip.y, 1, 8), substr(cusip.y, 9, 9) == as.character(check_digit.y)) %>% distinct(cik, cusip.x, multiplicity.x, cusip.y, multiplicity.y) %>% filter(multiplicity.y < 10) %>% filter(multiplicity.x + multiplicity.y >= 10) %>% anti_join(m89_issuers)
Joining, by = "cik"
# A tibble: 5 x 5
      cik cusip.x  multiplicity.x cusip.y   multiplicity.y
    <int> <chr>             <int> <chr>              <int>
1  874516 M2051410             18 M20514101              7
2 1078276 G6116R10             24 G6116R101              4
3 1294649 93166110              6 931661102              6
4 1609550 45773010              2 457730109              8
5 1710340 29772L10              2 29772L108              8

> cusip8s %>% inner_join(cusip_cik, by = 'cik') %>% filter(cusip_length.y == 9) %>% filter(cusip.x == substr(cusip.y, 1, 8), substr(cusip.y, 9, 9) == as.character(check_digit.y)) %>% distinct(cik, cusip.x, multiplicity.x, cusip.y, multiplicity.y) %>% filter(multiplicity.y < 10) %>% filter(multiplicity.x + multiplicity.y >= 10) %>% distinct(cik, cusip.x) %>% rename(cusip = cusip.x) %>% mutate(ncusip = cusip) %>% inner_join(stocknames) %>% left_join(ciks) %>% anti_join(m89_issuers) %>% distinct(cik, cusip, company_name, comnam)
Joining, by = c("cusip", "ncusip")
Joining, by = "cik"
Joining, by = c("cik", "cusip")
# A tibble: 4 x 4
      cik cusip    company_name                  comnam                     
    <int> <chr>    <chr>                         <chr>                      
1  874516 M2051410 BVR TECHNOLOGIES LTD          B V R TECHNOLOGIES LTD     
2 1078276 G6116R10 MIH LTD                       M I H LTD TORTOLA          
3 1609550 45773010 INSPIRE MEDICAL SYSTEMS, INC. INSPIRE MEDICAL SYSTEMS INC
4 1710340 29772L10 ETON PHARMACEUTICALS, INC.    ETON PHARMACEUTICALS INC   

So 205 of the 206 cases of interest match either issuers or stocknames, with cik = 1294649, cusip = 93166110 the only case not match either. I then looked at the matches in issuers to see how the company_name compared with issuer_name_1 (or issuer_adl_1 in some cases, as that often contains old/new names). These were the only bad matches I found, corresponding to 7 distinct (cik, cusip) pairs

> m89_issuers %>% mutate(cusip6 = substr(cusip, 1, 6)) %>% inner_join(issuers) %>% left_join(ciks) %>% distinct(cik, cusip, company_name, issuer_name_1, issuer_name_2, issuer_name_3,  issuer_adl_1) %>% filter(cik %in% c(882304, 882800, 909109, 915781, 946842, 1011632, 1363589))
Joining, by = "cusip6"
Joining, by = "cik"
# A tibble: 9 x 7
      cik cusip    company_name                           issuer_name_1            issuer_name_2 issuer_name_3 issuer_adl_1
    <int> <chr>    <chr>                                  <chr>                    <chr>         <chr>         <chr>       
1  882304 48625010 ACCESS HEALTH INC                      KAUFMANN FD INC          NA            NA            NA          
2  882800 46985840 AMERICAN EAGLE GROUP INC               JACOR COMMUNICATIONS INC NA            NA            NA          
3  882800 46985840 PACIFIC VENTURES GROUP, INC.           JACOR COMMUNICATIONS INC NA            NA            NA          
4  909109 48625010 PHILADELPHIA CONSOLIDATED HOLDING CORP KAUFMANN FD INC          NA            NA            NA          
5  915781 48625010 RTW INC /MN/                           KAUFMANN FD INC          NA            NA            NA          
6  946842 48625010 AMBASSADORS INTERNATIONAL INC          KAUFMANN FD INC          NA            NA            NA          
7 1011632 46985840 PERITUS SOFTWARE SERVICES INC          JACOR COMMUNICATIONS INC NA            NA            NA          
8 1363589 55035110 CARILION CLINIC                        LUNA INNOVATIONS INC     NA            NA            NA          
9 1363589 55035110 CARILION HEALTH SYSTEM                 LUNA INNOVATIONS INC     NA            NA            NA          

So it seems that the 8-digit cusips can also provide substantial value by augmenting the data amongst 9-digit cusips (which have the correct check digit), though there are a small number of wrong matches to look out for.

iangow commented 4 years ago

@bdcallen I simply can't process this much information. Can you please pull it together into one or two comments so that we can decide what to do here.

I think indeed we want to know how many CUSIP-CIKs combinations get pushed over the threshold (e.g., 10 filings) by adding 8-digit CUSIPs to 9-digit CUSIPs. We may also want to consider how many cases then create issues along the lines considered in #90.

iangow commented 4 years ago

@bdcallen As I mentioned in May, it's hard for me to process this much detail. As far as I can tell we have about 180 firms (now that I'm using cusip_cik_test as the starting point rather than cusip_cik). I think you have done some analysis of potential matches and identified ones that look good (and ones that don't). Rather than trying to "brute force" this will clever coding, the easiest approach is probably to create a spreadsheet with the candidate matches and a classification into "good" and "bad". We could then use that table to override data about eight-digit matches found in the "raw" cusip_cik when making cusip_cik_test (as I am calling it for the moment).

I earlier made a Google Sheets document that could be used to that end. You could populate that in a number of ways (often I have just exported using readr::write_csv and then copy-pasted). It might make sense to do something similar with seven-digit CUSIPs.

bdcallen commented 4 years ago

@iangow Like I did with my previous post in #89, I'm going to discuss some detail of what I did, and then just state where you can find my result. Further down in my notebook handle_cusip_cik_exceptions.ipynb, you can find some cells defining dataframes added_value_8s_above_10_w_issuers, added_value_8s_above_10_stocknames_only, added_value_8s_above_10_padded_w_issuers and added_value_8s_above_10_padded_stocknames. The dataframes containing "issuers" in the name are those that join onto cusipm.issuer, and those containing "stocknames" are those that join onto crsp.stocknames but not cusipm.issuer. The latter two dataframes with "padded" in the name are where we set cusip to be the cusip_raw but padded with a zero on the left; ie. if the cusip_raw is '95825910', then cusip is '095825910'. I didn't bother with approximate string matching for all these cases, as the number of cases to consider was low (so sim_index_norm and sim_index_max are null for these entries).

You can find the result for the unpadded cusips using

SELECT * FROM edgar.cusip_cik_exceptions
WHERE LENGTH(cusip_raw) = 8 AND cusip = cusip_raw

and the result for the padded cusips using

SELECT * FROM edgar.cusip_cik_exceptions
WHERE LENGTH(cusip_raw) = 8 AND cusip != cusip_raw

I believe we can close this issue now.