Closed bdcallen closed 4 years ago
Anything that requires parsing should be done later.
@bdcallen Please fix this issue so that the code is valid reprex
code, not incomplete code that cannot be re-run. Also, please point the new code at cusip_cik_test
, not cusip_cik
, so that we're only addressing issues that remain in the data after resolving earlier issues.
@iangow So I have spent some time learning the reprex package. Here's the code, pointed at cusip_cik_test
. Note that I have included bad_match_cusips
and soft_match_cusips
this time, which I should have done previously in the tables above
library(dplyr, warn.conflicts = FALSE)
library(DBI)
pg <- dbConnect(RPostgres::Postgres())
cusip_cik_test <-
tbl(pg, sql('SELECT * FROM edgar.cusip_cik_test')) %>%
group_by(cik, cusip) %>%
mutate(multiplicity = n()) %>%
# Add multiplicity as a column
# Add cusip_length as a column
mutate(cusip_length = nchar(cusip))%>%
# Add cusip6 as a column
mutate(cusip6 = substr(cusip, 1, 6)) %>%
group_by(cik)
issuers <-
tbl(pg, sql('SELECT * FROM cusipm.issuer')) %>%
rename(cusip6 = issuer_num) # change issuer_num to cusip6
ciks <- tbl(pg, sql('SELECT * FROM edgar.ciks'))
cusip9s <- cusip_cik_test %>% filter(cusip_length == 9)
m9_issuers <-
cusip9s %>%
filter(multiplicity >= 10 &
substr(cusip, 9, 9) == as.character(check_digit)) %>%
distinct(cik, cusip, cusip6) %>% inner_join(issuers) %>%
left_join(ciks) %>%
distinct(cik, cusip, cusip6, company_name, issuer_name_1, issuer_name_2,
issuer_name_3, issuer_adl_1, issuer_adl_2, issuer_adl_3, issuer_adl_4)
#> Joining, by = "cusip6"
#> Joining, by = "cik"
cusip9s %>%
filter(multiplicity >= 10,
substr(cusip, 9, 9) == as.character(check_digit)) %>%
distinct(cik, cusip, cusip6) %>%
count()
#> # Source: lazy query [?? x 2]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#> cik n
#> <int> <int64>
#> 1 20 1
#> 2 1750 1
#> 3 1800 1
#> 4 1923 2
#> 5 1961 1
#> 6 1985 1
#> 7 2034 1
#> 8 2062 1
#> 9 2070 1
#> 10 2098 1
#> # … with more rows
m9_issuers %>%
distinct(cusip, cik) %>%
count()
#> # Source: lazy query [?? x 2]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#> cik n
#> <int> <int64>
#> 1 20 1
#> 2 1750 1
#> 3 1800 1
#> 4 1923 2
#> 5 1961 1
#> 6 1985 1
#> 7 2034 1
#> 8 2062 1
#> 9 2070 1
#> 10 2098 1
#> # … with more rows
bad_match_ciks <- c(839470, 37643, 842638, 1107421, 1013785,
1267753, 798738, 930548, 1421601, 30697)
bad_match_cusips <- c('203744107', '929903102', '238108203',
'89365K206', '23077R100', '21988G619',
'929903102', '94856P102', '44920E104',
'769667106')
bad_match_df <- tibble(cik = bad_match_ciks, cusip = bad_match_cusips)
soft_match_ciks <- c(1396502, 1383054, 1117603, 1006249)
soft_match_cusips <- c('41013P749', '73936B309',
'29082A107', '464287481')
soft_match_df <- tibble(cik = soft_match_ciks,
cusip = soft_match_cusips)
m9_issuers %>%
inner_join(bad_match_df, copy = TRUE) %>%
select(cik, cusip, cusip6, company_name,
issuer_name_1, issuer_name_2)
#> Joining, by = c("cik", "cusip")
#> # Source: lazy query [?? x 6]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#> # Groups: cik
#> cik cusip cusip6 company_name issuer_name_1 issuer_name_2
#> <int> <chr> <chr> <chr> <chr> <chr>
#> 1 839470 20374… 203744 URANIUM RESOURCES INC /D… COMMUNITY MED T… <NA>
#> 2 839470 20374… 203744 WESTWATER RESOURCES, INC. COMMUNITY MED T… <NA>
#> 3 1267753 21988… 21988G LEHMAN ABS CORP BCKD TR … CORPORATE BACKE… <NA>
m9_issuers %>%
inner_join(soft_match_df, copy = TRUE) %>%
select(cik, cusip, cusip6, company_name,
issuer_name_1, issuer_name_2)
#> Joining, by = c("cik", "cusip")
#> # Source: lazy query [?? x 6]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#> # Groups: cik
#> cik cusip cusip6 company_name issuer_name_1 issuer_name_2
#> <int> <chr> <chr> <chr> <chr> <chr>
#> 1 1006249 464287… 464287 BARCLAYS GLOBAL FUND AD… ISHARES TR <NA>
#> 2 1006249 464287… 464287 BLACKROCK FUND ADVISORS ISHARES TR <NA>
#> 3 1117603 29082A… 29082A EMBRAER BRAZILIAN AVIAT… EMBRAER S A <NA>
#> 4 1117603 29082A… 29082A EMBRAER BRAZILIAN AVIAT… EMBRAER S A <NA>
#> 5 1383054 73936B… 73936B INVESCO DB SILVER FUND POWERSHARES DB … COMMODITY TR
#> 6 1383054 73936B… 73936B POWERSHARES DB SILVER F… POWERSHARES DB … COMMODITY TR
#> 7 1396502 41013P… 41013P JOHN HANCOCK TAX-ADVANT… HANCOCK JOHN IN… <NA>
#> 8 1396502 41013P… 41013P JOHN HANCOCK TAX-ADVANT… HANCOCK JOHN IN… <NA>
Created on 2020-06-21 by the reprex package (v0.3.0)
I still have the previously defined m9_issuers
in my RStudio, here's what you get with the bad and soft match cusips added.
> m9_issuers %>% inner_join(bad_match_df) %>% select(cik, cusip, cusip6, company_name, issuer_name_1, issuer_name_2)
Joining, by = c("cik", "cusip")
# A tibble: 15 x 6
cik cusip cusip6 company_name issuer_name_1 issuer_name_2
<dbl> <chr> <chr> <chr> <chr> <chr>
1 30697 769667106 769667 TRIARC COMPANIES INC RIVUS BD FD NA
2 30697 769667106 769667 WENDY'S CO RIVUS BD FD NA
3 30697 769667106 769667 WENDY'S/ARBY'S GROUP, INC. RIVUS BD FD NA
4 37643 929903102 929903 FLORIDA PUBLIC UTILITIES CO WACHOVIA CORP NEW NA
5 798738 929903102 929903 SCUDDER NEW ASIA FUND INC WACHOVIA CORP NEW NA
6 839470 203744107 203744 URANIUM RESOURCES INC /DE/ COMMUNITY MED TRANS INC NA
7 839470 203744107 203744 WESTWATER RESOURCES, INC. COMMUNITY MED TRANS INC NA
8 842638 238108203 238108 VERSUS TECHNOLOGY INC DATARAM CORP NA
9 930548 94856P102 94856P RECKSON ASSOCIATES REALTY CORP WEEKS CORP NA
10 1013785 23077R100 23077R GOLDBELT RESOURCES LTD CUMBERLAND RES LTD NA
11 1107421 89365K206 89365K EASYWEB INC TRANSGENOMIC INC NA
12 1107421 89365K206 89365K ZIOPHARM ONCOLOGY INC TRANSGENOMIC INC NA
13 1267753 21988G619 21988G LEHMAN ABS CORP BCKD TR CRTS TOYS R US DB BCK SE 01-31 CORPORATE BACKED TR CTFS NA
14 1421601 44920E104 44920E WESTMOUNTAIN GOLD, INC. IA GLOBAL INC NA
15 1421601 44920E104 44920E WESTMOUNTAIN INDEX ADVISOR INC IA GLOBAL INC NA
> m9_issuers %>% inner_join(soft_match_df) %>% select(cik, cusip, cusip6, company_name, issuer_name_1, issuer_name_2)
Joining, by = c("cik", "cusip")
# A tibble: 8 x 6
cik cusip cusip6 company_name issuer_name_1 issuer_name_2
<dbl> <chr> <chr> <chr> <chr> <chr>
1 1006249 464287481 464287 BARCLAYS GLOBAL FUND ADVISORS ISHARES TR NA
2 1006249 464287481 464287 BLACKROCK FUND ADVISORS ISHARES TR NA
3 1117603 29082A107 29082A EMBRAER BRAZILIAN AVIATION CO INC EMBRAER S A NA
4 1117603 29082A107 29082A EMBRAER BRAZILIAN AVIATION CO EMBRAER S A NA
5 1383054 73936B309 73936B INVESCO DB SILVER FUND POWERSHARES DB MULTI-SECTOR COMMODITY TR
6 1383054 73936B309 73936B POWERSHARES DB SILVER FUND POWERSHARES DB MULTI-SECTOR COMMODITY TR
7 1396502 41013P749 41013P JOHN HANCOCK TAX-ADVANTAGED GLOBAL SHAREHOLDER YIELD FUND HANCOCK JOHN INVT TR NA
8 1396502 41013P749 41013P JOHN HANCOCK TAX-ADVANTAGED GLOBAL YIELD FUND HANCOCK JOHN INVT TR NA
So indeed, some of the bad matches have gone away in cusip_cik_text
@bdcallen Add confirmed "bad matches" to the spreadsheet here. Perhaps add a bad_matches
tab. Please include details in a column explaining why a match is bad (e.g., "filer used its own CUSIP rather than issuer's"). We can sort out the process for incorporating this information later. Only include matches in the cusip_cik_test
table, not ones that have already been handled elsewhere.
@iangow Ok, I have committed my jupyter notebook, handle_cusip_cik_exceptions.ipynb
, which is designed to be used to update a table I made earlier called cusip_cik_exceptions
. It contains code to define dataframes which I used to help identify the wrong matches; in this issue the most important ones are valid9s_above_10_w_issuers
, which contains the pairs which can join onto cusipm.issuer
(and perhaps crsp.stocknames
as well), and nines_w_stocknames
, which contains the pairs which can join onto crsp.stocknames
but not cusipm.issuer
. With the former, I calculated a variable called sim_index_norm
, which is the an index between 0 and 1 based on the Levenshtein distance between the company names in valid9s_above_10_w_issuers
and the corresponding names in cusipm.issuer
, where 1 means a complete string match, and 0 a complete non-match. I then calculated sim_index_max
, which is the maximum of sim_index_norm
when grouping by (cik, cusip). I then choose to look at the rows in
valid9s_above_10_w_issuers %>% filter(sim_index_max < 0.8)
I then did something similar for nines_w_stocknames
. The extended details of how I did this and more is rather lengthy, and in my opinion, more appropriately covered in some documentation on cusip_cik_exceptions
. So for the purposes of this post, I would like to just state where you can find my result. The rows which correspond to the 9-digit cusips analysed can be found with the query
SELECT * FROM edgar.cusip_cik_exceptions
WHERE LENGTH(cusip_raw) = 9
Here, cusip_raw
is always the raw cusip from cusip_cik_test
, which in this case is just equal to cusip
(for the some of the cases covered in some of the other issues on 6,7 and 8 digit cusips, this is not the case. For instance in cases where I analysed cusips left-padded with a zero, cusip
is the modified cusip. I'll explain more in the other issues).
One of the most important columns here is the field valid_match
, which is set to TRUE if a match is correct, FALSE if incorrect, and NULL if undecided. I chose to include the TRUE cases here from those analysed so that in future we do not have to cover the same ground, ie. we can do
valid9s_above_10_w_issuers %>% filter(sim_index_max < 0.8) %>% anti_join(cusip_cik_exceptions)
I'm going to make similar posts on issues #86, #87 and #88, and then open a new issue for making documentation on cusip_cik_exceptions
.
I believe we can close this issue now.
@iangow Last night, I did a random sample of 500 cik-cusip combinations with valid 9-digit cusips, and then joined them to issuers to see how many were good matches and how many bad.
Have looked through the whole, set, I found 10 clearly bad matches (from bad_match_ciks), and 4 toss-ups (in soft_match_ciks).
Doing
one gets
The soft matches are
So it seems we have a wrong match error rate of around 2-3 percent, even with the constraint of the cusips having a multiplicity of at least 10. As you can see in the bad matches, some company names from
issuers
are quite common (like Wachovia Corp). Here, these common filers seem to be usually the filer of the filing, not the subject. So perhaps we should be extracting the filer cik and company_name along with those of the subject, to enable the correct comparison.