mccgr / edgar

Code to manage data related to SEC EDGAR
31 stars 15 forks source link

Examine and handle 'wrong' 9-digit cusip matches #89

Closed bdcallen closed 4 years ago

bdcallen commented 4 years ago

@iangow Last night, I did a random sample of 500 cik-cusip combinations with valid 9-digit cusips, and then joined them to issuers to see how many were good matches and how many bad.

> cusip9s <- cusip_cik %>% filter(cusip_length == 9)
> m9_issuers <- cusip9s %>% 
        filter(multiplicity >= 10 & 
              substr(cusip, 9, 9) == as.character(check_digit)) %>% 
        distinct(cik, cusip, cusip6) %>% inner_join(issuers) %>% 
        left_join(ciks) %>% 
        distinct(cik, cusip, cusip6, company_name, issuer_name_1, issuer_name_2, 
                     issuer_name_3, issuer_adl_1, issuer_adl_2, issuer_adl_3, issuer_adl_4)
Joining, by = "cusip6"
Joining, by = "cik"
> cusip9s %>% filter(multiplicity >= 10 & 
        substr(cusip, 9, 9) == as.character(check_digit)) %>% 
        distinct(cik, cusip, cusip6) %>% count()
# A tibble: 1 x 1
      n
  <int>
1 22307
> m9_issuers %>% distinct(cusip, cik) %>% count()
# A tibble: 1 x 1
      n
  <int>
1 19748

> rand_samp <- sample(1:19748, 500)

> View(m9_issuers %>% distinct(cusip, cik) %>% slice(rand_samp) %>% inner_join(m9_issuers))

Have looked through the whole, set, I found 10 clearly bad matches (from bad_match_ciks), and 4 toss-ups (in soft_match_ciks).

bad_match_ciks <- c(839470, 37643, 842638, 1107421, 1013785, 1267753, 798738, 930548, 1421601, 30697)

soft_match_ciks <- c(1396502, 1383054, 1117603, 1006249)

Doing

View(m9_issuers %>% filter(cik %in% bad_match_ciks))

one gets

cik cusip cusip6 company_name issuer_name_1 issuer_name_2 issuer_name_3 issuer_adl_1 issuer_adl_2 issuer_adl_3 issuer_adl_4
1 30697 769667106 769667 TRIARC COMPANIES INC RIVUS BD FD NA NA NAME CHANGED TO CUTWATER SELECT INCOME FD 12/09/2011 SEE 232229 NA
2 30697 769667106 769667 WENDY'S CO RIVUS BD FD NA NA NAME CHANGED TO CUTWATER SELECT INCOME FD 12/09/2011 SEE 232229 NA
3 30697 769667106 769667 WENDY'S/ARBY'S GROUP, INC. RIVUS BD FD NA NA NAME CHANGED TO CUTWATER SELECT INCOME FD 12/09/2011 SEE 232229 NA
4 30697 895927101 895927 TRIARC COMPANIES INC TRIARC COS INC NA NA REORGANIZED AS WENDYS / ARBYS GROUP INC 09/29/2008 NA NA
5 30697 895927101 895927 WENDY'S CO TRIARC COS INC NA NA REORGANIZED AS WENDYS / ARBYS GROUP INC 09/29/2008 NA NA
6 30697 895927101 895927 WENDY'S/ARBY'S GROUP, INC. TRIARC COS INC NA NA REORGANIZED AS WENDYS / ARBYS GROUP INC 09/29/2008 NA NA
7 30697 895927309 895927 TRIARC COMPANIES INC TRIARC COS INC NA NA REORGANIZED AS WENDYS / ARBYS GROUP INC 09/29/2008 NA NA
8 30697 895927309 895927 WENDY'S CO TRIARC COS INC NA NA REORGANIZED AS WENDYS / ARBYS GROUP INC 09/29/2008 NA NA
9 30697 895927309 895927 WENDY'S/ARBY'S GROUP, INC. TRIARC COS INC NA NA REORGANIZED AS WENDYS / ARBYS GROUP INC 09/29/2008 NA NA
10 30697 929903102 929903 TRIARC COMPANIES INC WACHOVIA CORP NEW NA NA MERGED INTO WELLS FARGO & CO NEW 01/01/2009 NA NA
11 30697 929903102 929903 WENDY'S CO WACHOVIA CORP NEW NA NA MERGED INTO WELLS FARGO & CO NEW 01/01/2009 NA NA
12 30697 929903102 929903 WENDY'S/ARBY'S GROUP, INC. WACHOVIA CORP NEW NA NA MERGED INTO WELLS FARGO & CO NEW 01/01/2009 NA NA
13 30697 950587105 950587 TRIARC COMPANIES INC WENDYS / ARBYS GROUP INC NA NA NAME CHANGED TO WENDYS CO 07/05/2011 SEE 95058W NA
14 30697 950587105 950587 WENDY'S CO WENDYS / ARBYS GROUP INC NA NA NAME CHANGED TO WENDYS CO 07/05/2011 SEE 95058W NA
15 30697 950587105 950587 WENDY'S/ARBY'S GROUP, INC. WENDYS / ARBYS GROUP INC NA NA NAME CHANGED TO WENDYS CO 07/05/2011 SEE 95058W NA
16 30697 95058W100 95058W TRIARC COMPANIES INC WENDYS CO NA NA NA NA NA NA
17 30697 95058W100 95058W WENDY'S CO WENDYS CO NA NA NA NA NA NA
18 30697 95058W100 95058W WENDY'S/ARBY'S GROUP, INC. WENDYS CO NA NA NA NA NA NA
19 37643 341135101 341135 FLORIDA PUBLIC UTILITIES CO FLORIDA PUB UTILS CO NA NA NA NA NA NA
20 37643 929903102 929903 FLORIDA PUBLIC UTILITIES CO WACHOVIA CORP NEW NA NA MERGED INTO WELLS FARGO & CO NEW 01/01/2009 NA NA
21 798738 811183102 811183 SCUDDER NEW ASIA FUND INC SCUDDER NEW ASIA FD INC NA NA NA NA NA NA
22 798738 929903102 929903 SCUDDER NEW ASIA FUND INC WACHOVIA CORP NEW NA NA MERGED INTO WELLS FARGO & CO NEW 01/01/2009 NA NA
23 839470 203744107 203744 URANIUM RESOURCES INC /DE/ COMMUNITY MED TRANS INC NA NA NA NA NA NA
24 839470 203744107 203744 WESTWATER RESOURCES, INC. COMMUNITY MED TRANS INC NA NA NA NA NA NA
25 839470 916901309 916901 URANIUM RESOURCES INC /DE/ URANIUM RES INC NA NA NA NA NA NA
26 839470 916901309 916901 WESTWATER RESOURCES, INC. URANIUM RES INC NA NA NA NA NA NA
27 839470 916901507 916901 URANIUM RESOURCES INC /DE/ URANIUM RES INC NA NA NA NA NA NA
28 839470 916901507 916901 WESTWATER RESOURCES, INC. URANIUM RES INC NA NA NA NA NA NA
29 839470 916901606 916901 URANIUM RESOURCES INC /DE/ URANIUM RES INC NA NA NA NA NA NA
30 839470 916901606 916901 WESTWATER RESOURCES, INC. URANIUM RES INC NA NA NA NA NA NA
31 842638 238108203 238108 VERSUS TECHNOLOGY INC DATARAM CORP NA NA NA NA NA NA
32 842638 925313108 925313 VERSUS TECHNOLOGY INC VERSUS TECHNOLOGY INC NA NA NA NA NA NA
33 930548 411465107 411465 RECKSON ASSOCIATES REALTY CORP HARBOR BANKSHARES CORP NA NA NA NA NA NA
34 930548 75621K106 75621K RECKSON ASSOCIATES REALTY CORP RECKSON ASSOCS RLTY CORP NA NA NA NA NA NA
35 930548 75621K304 75621K RECKSON ASSOCIATES REALTY CORP RECKSON ASSOCS RLTY CORP NA NA NA NA NA NA
36 930548 94856P102 94856P RECKSON ASSOCIATES REALTY CORP WEEKS CORP NA NA REORGANIZED AS DUKE-WEEKS REALTY CORP TO 07/02/1999 NA NA
37 1013785 23077R100 23077R GOLDBELT RESOURCES LTD CUMBERLAND RES LTD NA NA NA NA NA NA
38 1013785 380755405 380755 GOLDBELT RESOURCES LTD GOLDBELT RES LTD NA NA FORMERLY GOLBELT MINES INC TO 07/15/1991 NA NA
39 1013785 380755959 380755 GOLDBELT RESOURCES LTD GOLDBELT RES LTD NA NA FORMERLY GOLBELT MINES INC TO 07/15/1991 NA NA
40 1107421 89365K206 89365K EASYWEB INC TRANSGENOMIC INC NA NA NA NA NA NA
41 1107421 89365K206 89365K ZIOPHARM ONCOLOGY INC TRANSGENOMIC INC NA NA NA NA NA NA
42 1107421 98973P101 98973P EASYWEB INC ZIOPHARM ONCOLOGY INC NA NA NA NA NA NA
43 1107421 98973P101 98973P ZIOPHARM ONCOLOGY INC ZIOPHARM ONCOLOGY INC NA NA NA NA NA NA
44 1267753 21988G619 21988G LEHMAN ABS CORP BCKD TR CRTS TOYS R US DB BCK SE 01-31 CORPORATE BACKED TR CTFS NA USE 21988K FOR EQUITY ISSUES NA NA NA NA
45 1421601 44920E104 44920E WESTMOUNTAIN GOLD, INC. IA GLOBAL INC NA NA MERGED INTO ASURA DEV GROUP INC 10/01/2012 SEE 04650E NA NA
46 1421601 44920E104 44920E WESTMOUNTAIN INDEX ADVISOR INC IA GLOBAL INC NA NA MERGED INTO ASURA DEV GROUP INC 10/01/2012 SEE 04650E NA NA
47 1421601 96110W203 96110W WESTMOUNTAIN GOLD, INC. WESTMOUNTAIN INDEX ADVISOR INC NA NA NAME CHANGED TO WESTMOUNTAIN GOLD INC  02/28/2012 SEE 96111A NA
48 1421601 96110W203 96110W WESTMOUNTAIN INDEX ADVISOR INC WESTMOUNTAIN INDEX ADVISOR INC NA NA NAME CHANGED TO WESTMOUNTAIN GOLD INC  02/28/2012 SEE 96111A NA

The soft matches are

> m9_issuers %>% filter(cik %in% soft_match_ciks)
# A tibble: 14 x 11
       cik cusip   cusip6 company_name                      issuer_name_1        issuer_name_2 issuer_name_3 issuer_adl_1        issuer_adl_2 issuer_adl_3 issuer_adl_4
     <int> <chr>   <chr>  <chr>                             <chr>                <chr>         <chr>         <chr>               <chr>        <chr>        <chr>       
 1 1006249 464287… 464287 BARCLAYS GLOBAL FUND ADVISORS     ISHARES TR           NA            NA            FOR FUTURE ISSUES … NA           NA           NA          
 2 1006249 464287… 464287 BLACKROCK FUND ADVISORS           ISHARES TR           NA            NA            FOR FUTURE ISSUES … NA           NA           NA          
 3 1006249 464287… 464287 BARCLAYS GLOBAL FUND ADVISORS     ISHARES TR           NA            NA            FOR FUTURE ISSUES … NA           NA           NA          
 4 1006249 464287… 464287 BLACKROCK FUND ADVISORS           ISHARES TR           NA            NA            FOR FUTURE ISSUES … NA           NA           NA          
 5 1006249 464287… 464287 BARCLAYS GLOBAL FUND ADVISORS     ISHARES TR           NA            NA            FOR FUTURE ISSUES … NA           NA           NA          
 6 1006249 464287… 464287 BLACKROCK FUND ADVISORS           ISHARES TR           NA            NA            FOR FUTURE ISSUES … NA           NA           NA          
 7 1117603 29081M… 29081M EMBRAER BRAZILIAN AVIATION CO INC EMBRAER-EMPRESA BRA… AERONAUTICA … NA            NAME CHANGED TO EM… 11/23/2010   SEE 29082A   NA          
 8 1117603 29081M… 29081M EMBRAER BRAZILIAN AVIATION CO     EMBRAER-EMPRESA BRA… AERONAUTICA … NA            NAME CHANGED TO EM… 11/23/2010   SEE 29082A   NA          
 9 1117603 29082A… 29082A EMBRAER BRAZILIAN AVIATION CO INC EMBRAER S A          NA            NA            NA                  NA           NA           NA          
10 1117603 29082A… 29082A EMBRAER BRAZILIAN AVIATION CO     EMBRAER S A          NA            NA            NA                  NA           NA           NA          
11 1383054 73936B… 73936B INVESCO DB SILVER FUND            POWERSHARES DB MULT… COMMODITY TR  NA            NA                  NA           NA           NA          
12 1383054 73936B… 73936B POWERSHARES DB SILVER FUND        POWERSHARES DB MULT… COMMODITY TR  NA            NA                  NA           NA           NA          
13 1396502 41013P… 41013P JOHN HANCOCK TAX-ADVANTAGED GLOB… HANCOCK JOHN INVT TR NA            NA            NA                  NA           NA           NA          
14 1396502 41013P… 41013P JOHN HANCOCK TAX-ADVANTAGED GLOB… HANCOCK JOHN INVT TR NA            NA            NA                  NA           NA           N

So it seems we have a wrong match error rate of around 2-3 percent, even with the constraint of the cusips having a multiplicity of at least 10. As you can see in the bad matches, some company names from issuers are quite common (like Wachovia Corp). Here, these common filers seem to be usually the filer of the filing, not the subject. So perhaps we should be extracting the filer cik and company_name along with those of the subject, to enable the correct comparison.

iangow commented 4 years ago

Anything that requires parsing should be done later.

iangow commented 4 years ago

@bdcallen Please fix this issue so that the code is valid reprex code, not incomplete code that cannot be re-run. Also, please point the new code at cusip_cik_test, not cusip_cik, so that we're only addressing issues that remain in the data after resolving earlier issues.

bdcallen commented 4 years ago

@iangow So I have spent some time learning the reprex package. Here's the code, pointed at cusip_cik_test. Note that I have included bad_match_cusips and soft_match_cusips this time, which I should have done previously in the tables above

library(dplyr, warn.conflicts = FALSE)
library(DBI)

pg <- dbConnect(RPostgres::Postgres())

cusip_cik_test <- 
  tbl(pg, sql('SELECT * FROM edgar.cusip_cik_test')) %>% 
  group_by(cik, cusip) %>% 
  mutate(multiplicity = n()) %>% 
  # Add multiplicity as a column
  # Add cusip_length as a column
  mutate(cusip_length = nchar(cusip))%>%
  # Add cusip6 as a column
  mutate(cusip6 = substr(cusip, 1, 6)) %>%
  group_by(cik)

issuers <- 
  tbl(pg, sql('SELECT * FROM cusipm.issuer')) %>% 
  rename(cusip6 = issuer_num) # change issuer_num to cusip6

ciks <- tbl(pg, sql('SELECT * FROM edgar.ciks'))

cusip9s <- cusip_cik_test %>% filter(cusip_length == 9)
m9_issuers <- 
  cusip9s %>%
  filter(multiplicity >= 10 &
           substr(cusip, 9, 9) == as.character(check_digit)) %>%
  distinct(cik, cusip, cusip6) %>% inner_join(issuers) %>%
  left_join(ciks) %>%
  distinct(cik, cusip, cusip6, company_name, issuer_name_1, issuer_name_2,
           issuer_name_3, issuer_adl_1, issuer_adl_2, issuer_adl_3, issuer_adl_4)
#> Joining, by = "cusip6"
#> Joining, by = "cik"

cusip9s %>% 
  filter(multiplicity >= 10,
         substr(cusip, 9, 9) == as.character(check_digit)) %>%
  distinct(cik, cusip, cusip6) %>% 
  count()
#> # Source:   lazy query [?? x 2]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#>      cik n      
#>    <int> <int64>
#>  1    20 1      
#>  2  1750 1      
#>  3  1800 1      
#>  4  1923 2      
#>  5  1961 1      
#>  6  1985 1      
#>  7  2034 1      
#>  8  2062 1      
#>  9  2070 1      
#> 10  2098 1      
#> # … with more rows

m9_issuers %>% 
  distinct(cusip, cik) %>%
  count()
#> # Source:   lazy query [?? x 2]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#>      cik n      
#>    <int> <int64>
#>  1    20 1      
#>  2  1750 1      
#>  3  1800 1      
#>  4  1923 2      
#>  5  1961 1      
#>  6  1985 1      
#>  7  2034 1      
#>  8  2062 1      
#>  9  2070 1      
#> 10  2098 1      
#> # … with more rows

bad_match_ciks <- c(839470, 37643, 842638, 1107421, 1013785, 
                    1267753, 798738, 930548, 1421601, 30697)

bad_match_cusips <- c('203744107', '929903102', '238108203',
                      '89365K206', '23077R100', '21988G619',
                      '929903102', '94856P102', '44920E104', 
                      '769667106')

bad_match_df <- tibble(cik = bad_match_ciks, cusip = bad_match_cusips)

soft_match_ciks <- c(1396502, 1383054, 1117603, 1006249)

soft_match_cusips <- c('41013P749', '73936B309',
                       '29082A107', '464287481')

soft_match_df <- tibble(cik = soft_match_ciks, 
                        cusip = soft_match_cusips)

m9_issuers %>% 
  inner_join(bad_match_df, copy = TRUE) %>% 
  select(cik, cusip, cusip6, company_name,
         issuer_name_1, issuer_name_2)
#> Joining, by = c("cik", "cusip")
#> # Source:   lazy query [?? x 6]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#> # Groups:   cik
#>       cik cusip  cusip6 company_name              issuer_name_1    issuer_name_2
#>     <int> <chr>  <chr>  <chr>                     <chr>            <chr>        
#> 1  839470 20374… 203744 URANIUM RESOURCES INC /D… COMMUNITY MED T… <NA>         
#> 2  839470 20374… 203744 WESTWATER RESOURCES, INC. COMMUNITY MED T… <NA>         
#> 3 1267753 21988… 21988G LEHMAN ABS CORP BCKD TR … CORPORATE BACKE… <NA>

m9_issuers %>% 
  inner_join(soft_match_df, copy = TRUE) %>% 
  select(cik, cusip, cusip6, company_name, 
         issuer_name_1, issuer_name_2)
#> Joining, by = c("cik", "cusip")
#> # Source:   lazy query [?? x 6]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#> # Groups:   cik
#>       cik cusip   cusip6 company_name             issuer_name_1    issuer_name_2
#>     <int> <chr>   <chr>  <chr>                    <chr>            <chr>        
#> 1 1006249 464287… 464287 BARCLAYS GLOBAL FUND AD… ISHARES TR       <NA>         
#> 2 1006249 464287… 464287 BLACKROCK FUND ADVISORS  ISHARES TR       <NA>         
#> 3 1117603 29082A… 29082A EMBRAER BRAZILIAN AVIAT… EMBRAER S A      <NA>         
#> 4 1117603 29082A… 29082A EMBRAER BRAZILIAN AVIAT… EMBRAER S A      <NA>         
#> 5 1383054 73936B… 73936B INVESCO DB SILVER FUND   POWERSHARES DB … COMMODITY TR 
#> 6 1383054 73936B… 73936B POWERSHARES DB SILVER F… POWERSHARES DB … COMMODITY TR 
#> 7 1396502 41013P… 41013P JOHN HANCOCK TAX-ADVANT… HANCOCK JOHN IN… <NA>         
#> 8 1396502 41013P… 41013P JOHN HANCOCK TAX-ADVANT… HANCOCK JOHN IN… <NA>

Created on 2020-06-21 by the reprex package (v0.3.0)

I still have the previously defined m9_issuers in my RStudio, here's what you get with the bad and soft match cusips added.

> m9_issuers %>% inner_join(bad_match_df) %>% select(cik, cusip, cusip6, company_name, issuer_name_1, issuer_name_2)
Joining, by = c("cik", "cusip")
# A tibble: 15 x 6
       cik cusip     cusip6 company_name                                           issuer_name_1            issuer_name_2
     <dbl> <chr>     <chr>  <chr>                                                  <chr>                    <chr>        
 1   30697 769667106 769667 TRIARC COMPANIES INC                                   RIVUS BD FD              NA           
 2   30697 769667106 769667 WENDY'S CO                                             RIVUS BD FD              NA           
 3   30697 769667106 769667 WENDY'S/ARBY'S GROUP, INC.                             RIVUS BD FD              NA           
 4   37643 929903102 929903 FLORIDA PUBLIC UTILITIES CO                            WACHOVIA CORP NEW        NA           
 5  798738 929903102 929903 SCUDDER NEW ASIA FUND INC                              WACHOVIA CORP NEW        NA           
 6  839470 203744107 203744 URANIUM RESOURCES INC /DE/                             COMMUNITY MED TRANS INC  NA           
 7  839470 203744107 203744 WESTWATER RESOURCES, INC.                              COMMUNITY MED TRANS INC  NA           
 8  842638 238108203 238108 VERSUS TECHNOLOGY INC                                  DATARAM CORP             NA           
 9  930548 94856P102 94856P RECKSON ASSOCIATES REALTY CORP                         WEEKS CORP               NA           
10 1013785 23077R100 23077R GOLDBELT RESOURCES LTD                                 CUMBERLAND RES LTD       NA           
11 1107421 89365K206 89365K EASYWEB INC                                            TRANSGENOMIC INC         NA           
12 1107421 89365K206 89365K ZIOPHARM ONCOLOGY INC                                  TRANSGENOMIC INC         NA           
13 1267753 21988G619 21988G LEHMAN ABS CORP BCKD TR CRTS TOYS R US DB BCK SE 01-31 CORPORATE BACKED TR CTFS NA           
14 1421601 44920E104 44920E WESTMOUNTAIN GOLD, INC.                                IA GLOBAL INC            NA           
15 1421601 44920E104 44920E WESTMOUNTAIN INDEX ADVISOR INC                         IA GLOBAL INC            NA           
> m9_issuers %>% inner_join(soft_match_df) %>% select(cik, cusip, cusip6, company_name, issuer_name_1, issuer_name_2)
Joining, by = c("cik", "cusip")
# A tibble: 8 x 6
      cik cusip     cusip6 company_name                                              issuer_name_1               issuer_name_2
    <dbl> <chr>     <chr>  <chr>                                                     <chr>                       <chr>        
1 1006249 464287481 464287 BARCLAYS GLOBAL FUND ADVISORS                             ISHARES TR                  NA           
2 1006249 464287481 464287 BLACKROCK FUND ADVISORS                                   ISHARES TR                  NA           
3 1117603 29082A107 29082A EMBRAER BRAZILIAN AVIATION CO INC                         EMBRAER S A                 NA           
4 1117603 29082A107 29082A EMBRAER BRAZILIAN AVIATION CO                             EMBRAER S A                 NA           
5 1383054 73936B309 73936B INVESCO DB SILVER FUND                                    POWERSHARES DB MULTI-SECTOR COMMODITY TR 
6 1383054 73936B309 73936B POWERSHARES DB SILVER FUND                                POWERSHARES DB MULTI-SECTOR COMMODITY TR 
7 1396502 41013P749 41013P JOHN HANCOCK TAX-ADVANTAGED GLOBAL SHAREHOLDER YIELD FUND HANCOCK JOHN INVT TR        NA           
8 1396502 41013P749 41013P JOHN HANCOCK TAX-ADVANTAGED GLOBAL YIELD FUND             HANCOCK JOHN INVT TR        NA     

So indeed, some of the bad matches have gone away in cusip_cik_text

iangow commented 4 years ago

@bdcallen Add confirmed "bad matches" to the spreadsheet here. Perhaps add a bad_matches tab. Please include details in a column explaining why a match is bad (e.g., "filer used its own CUSIP rather than issuer's"). We can sort out the process for incorporating this information later. Only include matches in the cusip_cik_test table, not ones that have already been handled elsewhere.

bdcallen commented 4 years ago

@iangow Ok, I have committed my jupyter notebook, handle_cusip_cik_exceptions.ipynb, which is designed to be used to update a table I made earlier called cusip_cik_exceptions. It contains code to define dataframes which I used to help identify the wrong matches; in this issue the most important ones are valid9s_above_10_w_issuers, which contains the pairs which can join onto cusipm.issuer (and perhaps crsp.stocknames as well), and nines_w_stocknames, which contains the pairs which can join onto crsp.stocknames but not cusipm.issuer. With the former, I calculated a variable called sim_index_norm, which is the an index between 0 and 1 based on the Levenshtein distance between the company names in valid9s_above_10_w_issuers and the corresponding names in cusipm.issuer, where 1 means a complete string match, and 0 a complete non-match. I then calculated sim_index_max, which is the maximum of sim_index_norm when grouping by (cik, cusip). I then choose to look at the rows in

valid9s_above_10_w_issuers %>% filter(sim_index_max < 0.8)

I then did something similar for nines_w_stocknames. The extended details of how I did this and more is rather lengthy, and in my opinion, more appropriately covered in some documentation on cusip_cik_exceptions. So for the purposes of this post, I would like to just state where you can find my result. The rows which correspond to the 9-digit cusips analysed can be found with the query

SELECT * FROM edgar.cusip_cik_exceptions
WHERE LENGTH(cusip_raw) = 9

Here, cusip_raw is always the raw cusip from cusip_cik_test, which in this case is just equal to cusip (for the some of the cases covered in some of the other issues on 6,7 and 8 digit cusips, this is not the case. For instance in cases where I analysed cusips left-padded with a zero, cusip is the modified cusip. I'll explain more in the other issues).

One of the most important columns here is the field valid_match, which is set to TRUE if a match is correct, FALSE if incorrect, and NULL if undecided. I chose to include the TRUE cases here from those analysed so that in future we do not have to cover the same ground, ie. we can do

valid9s_above_10_w_issuers %>% filter(sim_index_max < 0.8) %>% anti_join(cusip_cik_exceptions)

I'm going to make similar posts on issues #86, #87 and #88, and then open a new issue for making documentation on cusip_cik_exceptions.

I believe we can close this issue now.