Examine and handle 'wrong' 9-digit cusip matches

@iangow Last night, I drew a random sample of 500 cik-cusip combinations with valid 9-digit cusips and joined them to issuers to see how many were good matches and how many were bad.

> cusip9s <- cusip_cik %>% filter(cusip_length == 9)
> m9_issuers <- cusip9s %>% 
        filter(multiplicity >= 10 & 
              substr(cusip, 9, 9) == as.character(check_digit)) %>% 
        distinct(cik, cusip, cusip6) %>% inner_join(issuers) %>% 
        left_join(ciks) %>% 
        distinct(cik, cusip, cusip6, company_name, issuer_name_1, issuer_name_2, 
                     issuer_name_3, issuer_adl_1, issuer_adl_2, issuer_adl_3, issuer_adl_4)
Joining, by = "cusip6"
Joining, by = "cik"
> cusip9s %>% filter(multiplicity >= 10 & 
        substr(cusip, 9, 9) == as.character(check_digit)) %>% 
        distinct(cik, cusip, cusip6) %>% count()
# A tibble: 1 x 1
      n
  <int>
1 22307
> m9_issuers %>% distinct(cusip, cik) %>% count()
# A tibble: 1 x 1
      n
  <int>
1 19748

> rand_samp <- sample(1:19748, 500)

> View(m9_issuers %>% distinct(cusip, cik) %>% slice(rand_samp) %>% inner_join(m9_issuers))
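For reference, the filter on `substr(cusip, 9, 9) == as.character(check_digit)` keeps only cusips whose ninth character is a valid check digit. Below is a minimal sketch of the standard CUSIP check-digit calculation, assuming the `check_digit` column in the table was computed the same way. `cusip_check_digit()` is a hypothetical helper written for illustration; it is not taken from the repository code.

```r
# Hypothetical helper: computes the check digit from the first 8 characters
# of a CUSIP using the standard "modulus 10 double-add-double" scheme.
cusip_check_digit <- function(cusip) {
  chars <- strsplit(toupper(substr(cusip, 1, 8)), "")[[1]]
  # Digits map to 0-9, letters A-Z to 10-35, and * @ # to 36-38
  alphabet <- c(0:9, LETTERS, "*", "@", "#")
  values <- match(chars, alphabet) - 1
  # Double the value in every second position (2, 4, 6, 8)
  even_pos <- seq_along(values) %% 2 == 0
  values[even_pos] <- values[even_pos] * 2
  # Add the tens and units digits of each value
  total <- sum(values %/% 10 + values %% 10)
  (10 - total %% 10) %% 10
}

# Compare the computed digit with the ninth character of a cusip
cusip <- "929903102"
is_valid <- substr(cusip, 9, 9) == as.character(cusip_check_digit(cusip))
```

Note that a valid check digit only indicates that the string is a well-formed CUSIP; it says nothing about whether the cusip belongs to the cik it was matched to.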

Having looked through the whole set, I found 10 clearly bad matches (listed in bad_match_ciks) and 4 toss-ups (in soft_match_ciks).

bad_match_ciks <- c(839470, 37643, 842638, 1107421, 1013785, 1267753, 798738, 930548, 1421601, 30697)

soft_match_ciks <- c(1396502, 1383054, 1117603, 1006249)

Doing

View(m9_issuers %>% filter(cik %in% bad_match_ciks))

one gets

	cik	cusip	cusip6	company_name	issuer_name_1	issuer_name_2	issuer_name_3	issuer_adl_1	issuer_adl_2	issuer_adl_3	issuer_adl_4
1	30697	769667106	769667	TRIARC COMPANIES INC	RIVUS BD FD	NA	NA	NAME CHANGED TO CUTWATER	SELECT INCOME FD 12/09/2011	SEE 232229	NA
2	30697	769667106	769667	WENDY'S CO	RIVUS BD FD	NA	NA	NAME CHANGED TO CUTWATER	SELECT INCOME FD 12/09/2011	SEE 232229	NA
3	30697	769667106	769667	WENDY'S/ARBY'S GROUP, INC.	RIVUS BD FD	NA	NA	NAME CHANGED TO CUTWATER	SELECT INCOME FD 12/09/2011	SEE 232229	NA
4	30697	895927101	895927	TRIARC COMPANIES INC	TRIARC COS INC	NA	NA	REORGANIZED AS WENDYS / ARBYS	GROUP INC 09/29/2008	NA	NA
5	30697	895927101	895927	WENDY'S CO	TRIARC COS INC	NA	NA	REORGANIZED AS WENDYS / ARBYS	GROUP INC 09/29/2008	NA	NA
6	30697	895927101	895927	WENDY'S/ARBY'S GROUP, INC.	TRIARC COS INC	NA	NA	REORGANIZED AS WENDYS / ARBYS	GROUP INC 09/29/2008	NA	NA
7	30697	895927309	895927	TRIARC COMPANIES INC	TRIARC COS INC	NA	NA	REORGANIZED AS WENDYS / ARBYS	GROUP INC 09/29/2008	NA	NA
8	30697	895927309	895927	WENDY'S CO	TRIARC COS INC	NA	NA	REORGANIZED AS WENDYS / ARBYS	GROUP INC 09/29/2008	NA	NA
9	30697	895927309	895927	WENDY'S/ARBY'S GROUP, INC.	TRIARC COS INC	NA	NA	REORGANIZED AS WENDYS / ARBYS	GROUP INC 09/29/2008	NA	NA
10	30697	929903102	929903	TRIARC COMPANIES INC	WACHOVIA CORP NEW	NA	NA	MERGED INTO WELLS FARGO & CO	NEW 01/01/2009	NA	NA
11	30697	929903102	929903	WENDY'S CO	WACHOVIA CORP NEW	NA	NA	MERGED INTO WELLS FARGO & CO	NEW 01/01/2009	NA	NA
12	30697	929903102	929903	WENDY'S/ARBY'S GROUP, INC.	WACHOVIA CORP NEW	NA	NA	MERGED INTO WELLS FARGO & CO	NEW 01/01/2009	NA	NA
13	30697	950587105	950587	TRIARC COMPANIES INC	WENDYS / ARBYS GROUP INC	NA	NA	NAME CHANGED TO WENDYS CO	07/05/2011	SEE 95058W	NA
14	30697	950587105	950587	WENDY'S CO	WENDYS / ARBYS GROUP INC	NA	NA	NAME CHANGED TO WENDYS CO	07/05/2011	SEE 95058W	NA
15	30697	950587105	950587	WENDY'S/ARBY'S GROUP, INC.	WENDYS / ARBYS GROUP INC	NA	NA	NAME CHANGED TO WENDYS CO	07/05/2011	SEE 95058W	NA
16	30697	95058W100	95058W	TRIARC COMPANIES INC	WENDYS CO	NA	NA	NA	NA	NA	NA
17	30697	95058W100	95058W	WENDY'S CO	WENDYS CO	NA	NA	NA	NA	NA	NA
18	30697	95058W100	95058W	WENDY'S/ARBY'S GROUP, INC.	WENDYS CO	NA	NA	NA	NA	NA	NA
19	37643	341135101	341135	FLORIDA PUBLIC UTILITIES CO	FLORIDA PUB UTILS CO	NA	NA	NA	NA	NA	NA
20	37643	929903102	929903	FLORIDA PUBLIC UTILITIES CO	WACHOVIA CORP NEW	NA	NA	MERGED INTO WELLS FARGO & CO	NEW 01/01/2009	NA	NA
21	798738	811183102	811183	SCUDDER NEW ASIA FUND INC	SCUDDER NEW ASIA FD INC	NA	NA	NA	NA	NA	NA
22	798738	929903102	929903	SCUDDER NEW ASIA FUND INC	WACHOVIA CORP NEW	NA	NA	MERGED INTO WELLS FARGO & CO	NEW 01/01/2009	NA	NA
23	839470	203744107	203744	URANIUM RESOURCES INC /DE/	COMMUNITY MED TRANS INC	NA	NA	NA	NA	NA	NA
24	839470	203744107	203744	WESTWATER RESOURCES, INC.	COMMUNITY MED TRANS INC	NA	NA	NA	NA	NA	NA
25	839470	916901309	916901	URANIUM RESOURCES INC /DE/	URANIUM RES INC	NA	NA	NA	NA	NA	NA
26	839470	916901309	916901	WESTWATER RESOURCES, INC.	URANIUM RES INC	NA	NA	NA	NA	NA	NA
27	839470	916901507	916901	URANIUM RESOURCES INC /DE/	URANIUM RES INC	NA	NA	NA	NA	NA	NA
28	839470	916901507	916901	WESTWATER RESOURCES, INC.	URANIUM RES INC	NA	NA	NA	NA	NA	NA
29	839470	916901606	916901	URANIUM RESOURCES INC /DE/	URANIUM RES INC	NA	NA	NA	NA	NA	NA
30	839470	916901606	916901	WESTWATER RESOURCES, INC.	URANIUM RES INC	NA	NA	NA	NA	NA	NA
31	842638	238108203	238108	VERSUS TECHNOLOGY INC	DATARAM CORP	NA	NA	NA	NA	NA	NA
32	842638	925313108	925313	VERSUS TECHNOLOGY INC	VERSUS TECHNOLOGY INC	NA	NA	NA	NA	NA	NA
33	930548	411465107	411465	RECKSON ASSOCIATES REALTY CORP	HARBOR BANKSHARES CORP	NA	NA	NA	NA	NA	NA
34	930548	75621K106	75621K	RECKSON ASSOCIATES REALTY CORP	RECKSON ASSOCS RLTY CORP	NA	NA	NA	NA	NA	NA
35	930548	75621K304	75621K	RECKSON ASSOCIATES REALTY CORP	RECKSON ASSOCS RLTY CORP	NA	NA	NA	NA	NA	NA
36	930548	94856P102	94856P	RECKSON ASSOCIATES REALTY CORP	WEEKS CORP	NA	NA	REORGANIZED AS DUKE-WEEKS	REALTY CORP TO 07/02/1999	NA	NA
37	1013785	23077R100	23077R	GOLDBELT RESOURCES LTD	CUMBERLAND RES LTD	NA	NA	NA	NA	NA	NA
38	1013785	380755405	380755	GOLDBELT RESOURCES LTD	GOLDBELT RES LTD	NA	NA	FORMERLY GOLBELT MINES INC TO	07/15/1991	NA	NA
39	1013785	380755959	380755	GOLDBELT RESOURCES LTD	GOLDBELT RES LTD	NA	NA	FORMERLY GOLBELT MINES INC TO	07/15/1991	NA	NA
40	1107421	89365K206	89365K	EASYWEB INC	TRANSGENOMIC INC	NA	NA	NA	NA	NA	NA
41	1107421	89365K206	89365K	ZIOPHARM ONCOLOGY INC	TRANSGENOMIC INC	NA	NA	NA	NA	NA	NA
42	1107421	98973P101	98973P	EASYWEB INC	ZIOPHARM ONCOLOGY INC	NA	NA	NA	NA	NA	NA
43	1107421	98973P101	98973P	ZIOPHARM ONCOLOGY INC	ZIOPHARM ONCOLOGY INC	NA	NA	NA	NA	NA	NA
44	1267753	21988G619	21988G	LEHMAN ABS CORP BCKD TR CRTS TOYS R US DB BCK SE 01-31	CORPORATE BACKED TR CTFS	NA	USE 21988K FOR EQUITY ISSUES	NA	NA	NA	NA
45	1421601	44920E104	44920E	WESTMOUNTAIN GOLD, INC.	IA GLOBAL INC	NA	NA	MERGED INTO ASURA DEV GROUP	INC 10/01/2012 SEE 04650E	NA	NA
46	1421601	44920E104	44920E	WESTMOUNTAIN INDEX ADVISOR INC	IA GLOBAL INC	NA	NA	MERGED INTO ASURA DEV GROUP	INC 10/01/2012 SEE 04650E	NA	NA
47	1421601	96110W203	96110W	WESTMOUNTAIN GOLD, INC.	WESTMOUNTAIN INDEX ADVISOR INC	NA	NA	NAME CHANGED TO WESTMOUNTAIN	GOLD INC 02/28/2012 SEE	96111A	NA
48	1421601	96110W203	96110W	WESTMOUNTAIN INDEX ADVISOR INC	WESTMOUNTAIN INDEX ADVISOR INC	NA	NA	NAME CHANGED TO WESTMOUNTAIN	GOLD INC 02/28/2012 SEE	96111A	NA

The soft matches are

> m9_issuers %>% filter(cik %in% soft_match_ciks)
# A tibble: 14 x 11
       cik cusip   cusip6 company_name                      issuer_name_1        issuer_name_2 issuer_name_3 issuer_adl_1        issuer_adl_2 issuer_adl_3 issuer_adl_4
     <int> <chr>   <chr>  <chr>                             <chr>                <chr>         <chr>         <chr>               <chr>        <chr>        <chr>       
 1 1006249 464287… 464287 BARCLAYS GLOBAL FUND ADVISORS     ISHARES TR           NA            NA            FOR FUTURE ISSUES … NA           NA           NA          
 2 1006249 464287… 464287 BLACKROCK FUND ADVISORS           ISHARES TR           NA            NA            FOR FUTURE ISSUES … NA           NA           NA          
 3 1006249 464287… 464287 BARCLAYS GLOBAL FUND ADVISORS     ISHARES TR           NA            NA            FOR FUTURE ISSUES … NA           NA           NA          
 4 1006249 464287… 464287 BLACKROCK FUND ADVISORS           ISHARES TR           NA            NA            FOR FUTURE ISSUES … NA           NA           NA          
 5 1006249 464287… 464287 BARCLAYS GLOBAL FUND ADVISORS     ISHARES TR           NA            NA            FOR FUTURE ISSUES … NA           NA           NA          
 6 1006249 464287… 464287 BLACKROCK FUND ADVISORS           ISHARES TR           NA            NA            FOR FUTURE ISSUES … NA           NA           NA          
 7 1117603 29081M… 29081M EMBRAER BRAZILIAN AVIATION CO INC EMBRAER-EMPRESA BRA… AERONAUTICA … NA            NAME CHANGED TO EM… 11/23/2010   SEE 29082A   NA          
 8 1117603 29081M… 29081M EMBRAER BRAZILIAN AVIATION CO     EMBRAER-EMPRESA BRA… AERONAUTICA … NA            NAME CHANGED TO EM… 11/23/2010   SEE 29082A   NA          
 9 1117603 29082A… 29082A EMBRAER BRAZILIAN AVIATION CO INC EMBRAER S A          NA            NA            NA                  NA           NA           NA          
10 1117603 29082A… 29082A EMBRAER BRAZILIAN AVIATION CO     EMBRAER S A          NA            NA            NA                  NA           NA           NA          
11 1383054 73936B… 73936B INVESCO DB SILVER FUND            POWERSHARES DB MULT… COMMODITY TR  NA            NA                  NA           NA           NA          
12 1383054 73936B… 73936B POWERSHARES DB SILVER FUND        POWERSHARES DB MULT… COMMODITY TR  NA            NA                  NA           NA           NA          
13 1396502 41013P… 41013P JOHN HANCOCK TAX-ADVANTAGED GLOB… HANCOCK JOHN INVT TR NA            NA            NA                  NA           NA           NA          
14 1396502 41013P… 41013P JOHN HANCOCK TAX-ADVANTAGED GLOB… HANCOCK JOHN INVT TR NA            NA            NA                  NA           NA           NA

So it seems we have a wrong-match rate of around 2-3 percent (10 clear bad matches, or 14 including the toss-ups, out of 500), even when the cusips are constrained to have a multiplicity of at least 10. As you can see in the bad matches, some issuer names (like Wachovia Corp) turn up repeatedly. In these cases the recurring name usually seems to belong to the filer of the filing, not the subject. So perhaps we should extract the filer's cik and company_name along with those of the subject, to enable the correct comparison.
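As a rough check on the 2-3 percent figure, here is a minimal sketch of the arithmetic, with exact binomial confidence intervals from base R's `binom.test()`. It assumes that each flagged cik corresponds to exactly one bad (or toss-up) pair among the 500 sampled pairs, which is an assumption on my part rather than something verified against the sample.

```r
n_sampled <- 500  # sampled cik-cusip pairs
n_bad <- 10       # clearly bad matches
n_soft <- 4       # toss-ups

# Point estimates: bad only, and bad plus toss-ups
rate_bad <- n_bad / n_sampled
rate_bad_soft <- (n_bad + n_soft) / n_sampled

# Exact (Clopper-Pearson) 95% confidence intervals for each rate
ci_bad <- binom.test(n_bad, n_sampled)$conf.int
ci_bad_soft <- binom.test(n_bad + n_soft, n_sampled)$conf.int

print(c(rate_bad = rate_bad, rate_bad_soft = rate_bad_soft))
print(ci_bad)
print(ci_bad_soft)
```

With a sample of 500 the intervals are fairly wide, so the rate in the full set of pairs could plausibly sit somewhat outside the 2-3 percent range.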

Anything that requires parsing should be done later.

@bdcallen Please fix this issue so that the code is valid reprex code, not incomplete code that cannot be re-run. Also, please point the new code at cusip_cik_test, not cusip_cik, so that we're only addressing issues that remain in the data after resolving earlier issues.

@iangow I have spent some time learning the reprex package. Here's the code, pointed at cusip_cik_test. Note that I have included bad_match_cusips and soft_match_cusips this time, which I should have done in the tables above.

library(dplyr, warn.conflicts = FALSE)
library(DBI)

pg <- dbConnect(RPostgres::Postgres())

cusip_cik_test <- 
  tbl(pg, sql('SELECT * FROM edgar.cusip_cik_test')) %>% 
  group_by(cik, cusip) %>% 
  # Add multiplicity as a column
  mutate(multiplicity = n()) %>% 
  # Add cusip_length as a column
  mutate(cusip_length = nchar(cusip)) %>%
  # Add cusip6 as a column
  mutate(cusip6 = substr(cusip, 1, 6)) %>%
  group_by(cik)

issuers <- 
  tbl(pg, sql('SELECT * FROM cusipm.issuer')) %>% 
  rename(cusip6 = issuer_num) # change issuer_num to cusip6

ciks <- tbl(pg, sql('SELECT * FROM edgar.ciks'))

cusip9s <- cusip_cik_test %>% filter(cusip_length == 9)
m9_issuers <- 
  cusip9s %>%
  filter(multiplicity >= 10 &
           substr(cusip, 9, 9) == as.character(check_digit)) %>%
  distinct(cik, cusip, cusip6) %>% inner_join(issuers) %>%
  left_join(ciks) %>%
  distinct(cik, cusip, cusip6, company_name, issuer_name_1, issuer_name_2,
           issuer_name_3, issuer_adl_1, issuer_adl_2, issuer_adl_3, issuer_adl_4)
#> Joining, by = "cusip6"
#> Joining, by = "cik"

cusip9s %>% 
  filter(multiplicity >= 10,
         substr(cusip, 9, 9) == as.character(check_digit)) %>%
  distinct(cik, cusip, cusip6) %>% 
  count()
#> # Source:   lazy query [?? x 2]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#>      cik n      
#>    <int> <int64>
#>  1    20 1      
#>  2  1750 1      
#>  3  1800 1      
#>  4  1923 2      
#>  5  1961 1      
#>  6  1985 1      
#>  7  2034 1      
#>  8  2062 1      
#>  9  2070 1      
#> 10  2098 1      
#> # … with more rows

m9_issuers %>% 
  distinct(cusip, cik) %>%
  count()
#> # Source:   lazy query [?? x 2]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#>      cik n      
#>    <int> <int64>
#>  1    20 1      
#>  2  1750 1      
#>  3  1800 1      
#>  4  1923 2      
#>  5  1961 1      
#>  6  1985 1      
#>  7  2034 1      
#>  8  2062 1      
#>  9  2070 1      
#> 10  2098 1      
#> # … with more rows

bad_match_ciks <- c(839470, 37643, 842638, 1107421, 1013785, 
                    1267753, 798738, 930548, 1421601, 30697)

bad_match_cusips <- c('203744107', '929903102', '238108203',
                      '89365K206', '23077R100', '21988G619',
                      '929903102', '94856P102', '44920E104', 
                      '769667106')

bad_match_df <- tibble(cik = bad_match_ciks, cusip = bad_match_cusips)

soft_match_ciks <- c(1396502, 1383054, 1117603, 1006249)

soft_match_cusips <- c('41013P749', '73936B309',
                       '29082A107', '464287481')

soft_match_df <- tibble(cik = soft_match_ciks, 
                        cusip = soft_match_cusips)

m9_issuers %>% 
  inner_join(bad_match_df, copy = TRUE) %>% 
  select(cik, cusip, cusip6, company_name,
         issuer_name_1, issuer_name_2)
#> Joining, by = c("cik", "cusip")
#> # Source:   lazy query [?? x 6]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#> # Groups:   cik
#>       cik cusip  cusip6 company_name              issuer_name_1    issuer_name_2
#>     <int> <chr>  <chr>  <chr>                     <chr>            <chr>        
#> 1  839470 20374… 203744 URANIUM RESOURCES INC /D… COMMUNITY MED T… <NA>         
#> 2  839470 20374… 203744 WESTWATER RESOURCES, INC. COMMUNITY MED T… <NA>         
#> 3 1267753 21988… 21988G LEHMAN ABS CORP BCKD TR … CORPORATE BACKE… <NA>

m9_issuers %>% 
  inner_join(soft_match_df, copy = TRUE) %>% 
  select(cik, cusip, cusip6, company_name, 
         issuer_name_1, issuer_name_2)
#> Joining, by = c("cik", "cusip")
#> # Source:   lazy query [?? x 6]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#> # Groups:   cik
#>       cik cusip   cusip6 company_name             issuer_name_1    issuer_name_2
#>     <int> <chr>   <chr>  <chr>                    <chr>            <chr>        
#> 1 1006249 464287… 464287 BARCLAYS GLOBAL FUND AD… ISHARES TR       <NA>         
#> 2 1006249 464287… 464287 BLACKROCK FUND ADVISORS  ISHARES TR       <NA>         
#> 3 1117603 29082A… 29082A EMBRAER BRAZILIAN AVIAT… EMBRAER S A      <NA>         
#> 4 1117603 29082A… 29082A EMBRAER BRAZILIAN AVIAT… EMBRAER S A      <NA>         
#> 5 1383054 73936B… 73936B INVESCO DB SILVER FUND   POWERSHARES DB … COMMODITY TR 
#> 6 1383054 73936B… 73936B POWERSHARES DB SILVER F… POWERSHARES DB … COMMODITY TR 
#> 7 1396502 41013P… 41013P JOHN HANCOCK TAX-ADVANT… HANCOCK JOHN IN… <NA>         
#> 8 1396502 41013P… 41013P JOHN HANCOCK TAX-ADVANT… HANCOCK JOHN IN… <NA>

Created on 2020-06-21 by the reprex package (v0.3.0)
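Note that the two `count()` calls in the reprex return one row per cik rather than a single total, because cusip_cik_test ends with `group_by(cik)` and that grouping carries through to `count()`. A minimal sketch of the difference on toy data (the values below are made up for illustration), assuming a single overall count is what is wanted:

```r
library(dplyr, warn.conflicts = FALSE)

# Toy stand-in for cusip_cik_test, grouped by cik as in the reprex
toy <- tibble(cik = c(1, 1, 2), cusip = c("A", "B", "C")) %>%
  group_by(cik)

# Grouped: count() returns one row per cik
per_cik <- toy %>%
  distinct(cik, cusip) %>%
  count()

# Ungrouped: count() returns a single row with the number of distinct pairs
total <- toy %>%
  ungroup() %>%
  distinct(cik, cusip) %>%
  count()
```

The same `ungroup()` call should apply to the lazy database tables, since dbplyr translates it as well.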

I still have the previously defined m9_issuers in my RStudio session; here's what you get with the bad and soft match cusips added.

> m9_issuers %>% inner_join(bad_match_df) %>% select(cik, cusip, cusip6, company_name, issuer_name_1, issuer_name_2)
Joining, by = c("cik", "cusip")
# A tibble: 15 x 6
       cik cusip     cusip6 company_name                                           issuer_name_1            issuer_name_2
     <dbl> <chr>     <chr>  <chr>                                                  <chr>                    <chr>        
 1   30697 769667106 769667 TRIARC COMPANIES INC                                   RIVUS BD FD              NA           
 2   30697 769667106 769667 WENDY'S CO                                             RIVUS BD FD              NA           
 3   30697 769667106 769667 WENDY'S/ARBY'S GROUP, INC.                             RIVUS BD FD              NA           
 4   37643 929903102 929903 FLORIDA PUBLIC UTILITIES CO                            WACHOVIA CORP NEW        NA           
 5  798738 929903102 929903 SCUDDER NEW ASIA FUND INC                              WACHOVIA CORP NEW        NA           
 6  839470 203744107 203744 URANIUM RESOURCES INC /DE/                             COMMUNITY MED TRANS INC  NA           
 7  839470 203744107 203744 WESTWATER RESOURCES, INC.                              COMMUNITY MED TRANS INC  NA           
 8  842638 238108203 238108 VERSUS TECHNOLOGY INC                                  DATARAM CORP             NA           
 9  930548 94856P102 94856P RECKSON ASSOCIATES REALTY CORP                         WEEKS CORP               NA           
10 1013785 23077R100 23077R GOLDBELT RESOURCES LTD                                 CUMBERLAND RES LTD       NA           
11 1107421 89365K206 89365K EASYWEB INC                                            TRANSGENOMIC INC         NA           
12 1107421 89365K206 89365K ZIOPHARM ONCOLOGY INC                                  TRANSGENOMIC INC         NA           
13 1267753 21988G619 21988G LEHMAN ABS CORP BCKD TR CRTS TOYS R US DB BCK SE 01-31 CORPORATE BACKED TR CTFS NA           
14 1421601 44920E104 44920E WESTMOUNTAIN GOLD, INC.                                IA GLOBAL INC            NA           
15 1421601 44920E104 44920E WESTMOUNTAIN INDEX ADVISOR INC                         IA GLOBAL INC            NA           
> m9_issuers %>% inner_join(soft_match_df) %>% select(cik, cusip, cusip6, company_name, issuer_name_1, issuer_name_2)
Joining, by = c("cik", "cusip")
# A tibble: 8 x 6
      cik cusip     cusip6 company_name                                              issuer_name_1               issuer_name_2
    <dbl> <chr>     <chr>  <chr>                                                     <chr>                       <chr>        
1 1006249 464287481 464287 BARCLAYS GLOBAL FUND ADVISORS                             ISHARES TR                  NA           
2 1006249 464287481 464287 BLACKROCK FUND ADVISORS                                   ISHARES TR                  NA           
3 1117603 29082A107 29082A EMBRAER BRAZILIAN AVIATION CO INC                         EMBRAER S A                 NA           
4 1117603 29082A107 29082A EMBRAER BRAZILIAN AVIATION CO                             EMBRAER S A                 NA           
5 1383054 73936B309 73936B INVESCO DB SILVER FUND                                    POWERSHARES DB MULTI-SECTOR COMMODITY TR 
6 1383054 73936B309 73936B POWERSHARES DB SILVER FUND                                POWERSHARES DB MULTI-SECTOR COMMODITY TR 
7 1396502 41013P749 41013P JOHN HANCOCK TAX-ADVANTAGED GLOBAL SHAREHOLDER YIELD FUND HANCOCK JOHN INVT TR        NA           
8 1396502 41013P749 41013P JOHN HANCOCK TAX-ADVANTAGED GLOBAL YIELD FUND             HANCOCK JOHN INVT TR        NA

So indeed, some of the bad matches have gone away in cusip_cik_test.

@bdcallen Add confirmed "bad matches" to the spreadsheet here. Perhaps add a bad_matches tab. Please include details in a column explaining why a match is bad (e.g., "filer used its own CUSIP rather than issuer's"). We can sort out the process for incorporating this information later. Only include matches in the cusip_cik_test table, not ones that have already been handled elsewhere.

@iangow Ok, I have committed my Jupyter notebook, handle_cusip_cik_exceptions.ipynb, which is designed to update a table I made earlier called cusip_cik_exceptions. It contains code defining the dataframes I used to help identify the wrong matches. For this issue the most important ones are valid9s_above_10_w_issuers, which contains the pairs that can be joined onto cusipm.issuer (and perhaps crsp.stocknames as well), and nines_w_stocknames, which contains the pairs that can be joined onto crsp.stocknames but not cusipm.issuer. For the former, I calculated a variable called sim_index_norm, an index between 0 and 1 based on the Levenshtein distance between the company names in valid9s_above_10_w_issuers and the corresponding names in cusipm.issuer, where 1 means a complete string match and 0 a complete non-match. I then calculated sim_index_max, the maximum of sim_index_norm within each (cik, cusip) group, and chose to look at the rows in

valid9s_above_10_w_issuers %>% filter(sim_index_max < 0.8)
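To illustrate the idea, here is a minimal sketch in base R of a normalized Levenshtein similarity and the per-pair maximum. The normalization (dividing the distance by the length of the longer name) is an assumption, since the notebook's exact formula is not shown in this thread, and `name_similarity()` is a hypothetical helper. The example rows are the cik 1107421 rows from the bad-match table above.

```r
# Hypothetical helper: 1 for identical names, 0 for a complete non-match.
# Assumes normalization by the length of the longer string.
name_similarity <- function(a, b) {
  a <- toupper(a)
  b <- toupper(b)
  # adist() from utils computes the Levenshtein (edit) distance
  d <- mapply(function(x, y) adist(x, y)[1, 1], a, b, USE.NAMES = FALSE)
  1 - d / pmax(nchar(a), nchar(b))
}

pairs <- data.frame(
  cik = rep(1107421, 4),
  cusip = c("89365K206", "89365K206", "98973P101", "98973P101"),
  company_name = c("EASYWEB INC", "ZIOPHARM ONCOLOGY INC",
                   "EASYWEB INC", "ZIOPHARM ONCOLOGY INC"),
  issuer_name_1 = c("TRANSGENOMIC INC", "TRANSGENOMIC INC",
                    "ZIOPHARM ONCOLOGY INC", "ZIOPHARM ONCOLOGY INC"),
  stringsAsFactors = FALSE
)

pairs$sim_index_norm <- name_similarity(pairs$company_name, pairs$issuer_name_1)

# Maximum similarity within each (cik, cusip) pair
pair_key <- paste(pairs$cik, pairs$cusip)
pairs$sim_index_max <- ave(pairs$sim_index_norm, pair_key, FUN = max)

# Pairs where no company name comes close to the issuer name
flagged <- pairs[pairs$sim_index_max < 0.8, ]
```

Taking the maximum per pair means a cik that has had several names is not flagged as long as one of those names matches the issuer, which is why 98973P101 drops out here while 89365K206 remains.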

I then did something similar for nines_w_stocknames. The full details of how I did this are rather lengthy and, in my opinion, more appropriately covered in documentation on cusip_cik_exceptions. So for the purposes of this post, I will just state where you can find my results. The rows corresponding to the 9-digit cusips analysed can be found with the query

SELECT * FROM edgar.cusip_cik_exceptions
WHERE LENGTH(cusip_raw) = 9

Here, cusip_raw is always the raw cusip from cusip_cik_test, which in this case is just equal to cusip. (For some of the cases covered in the other issues on 6-, 7- and 8-digit cusips, this is not so. For instance, where I analysed cusips left-padded with a zero, cusip is the modified cusip. I'll explain more in the other issues.)

One of the most important columns here is valid_match, which is set to TRUE if a match is correct, FALSE if incorrect, and NULL if undecided. I chose to include the TRUE cases from those analysed so that in future we do not have to cover the same ground, i.e., we can do

valid9s_above_10_w_issuers %>% filter(sim_index_max < 0.8) %>% anti_join(cusip_cik_exceptions)
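A minimal sketch of how the `anti_join()` step removes pairs that have already been reviewed, using toy local tibbles in place of the database tables. The explicit `by = c("cik", "cusip")` is an assumption about which columns the join should use; the toy tables here are not the real ones.

```r
library(dplyr, warn.conflicts = FALSE)

# Toy stand-in for the low-similarity candidates still to be reviewed
candidates <- tibble(
  cik = c(842638, 1107421),
  cusip = c("238108203", "89365K206")
)

# Toy stand-in for cusip_cik_exceptions: pairs already reviewed
exceptions <- tibble(
  cik = 842638,
  cusip = "238108203",
  valid_match = FALSE
)

# Keep only candidates with no matching (cik, cusip) row in exceptions
remaining <- candidates %>%
  anti_join(exceptions, by = c("cik", "cusip"))
```

Because reviewed pairs are dropped whatever their valid_match value, recording the TRUE cases as well as the FALSE ones keeps them from resurfacing in later passes.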

I'm going to make similar posts on issues #86, #87 and #88, and then open a new issue for making documentation on cusip_cik_exceptions.

I believe we can close this issue now.

mccgr / edgar

Examine and handle 'wrong' 9-digit cusip matches #89