Closed iangow closed 4 years ago
@iangow
> cusip_cik %>% distinct(cik) %>% count()
# A tibble: 1 x 1
n
<int>
1 26898
> cusip_cik %>% filter(cusip_length == 9) %>% distinct(cik) %>% count()
# A tibble: 1 x 1
n
<int>
1 25033
At a simple level, restricting to nine-digit CUSIPs loses matches for 1,865 ciks (26,898 versus 25,033). Of course, this doesn't yet take frequency into account.
@iangow The discrepancy shrinks once we restrict to cases with a multiplicity of at least 10
> cusip_cik %>% filter(multiplicity >= 10) %>% distinct(cik) %>% count()
# A tibble: 1 x 1
n
<int>
1 17618
> cusip_cik %>% filter(multiplicity >= 10) %>% filter(cusip_length == 9) %>% distinct(cik) %>% count()
# A tibble: 1 x 1
n
<int>
1 16983
So 635 ciks are lost (17,618 versus 16,983) if we restrict to nine-digit CUSIPs.
@iangow Here's the analogous work with cusips
> cusip_cik %>% filter(multiplicity >= 10) %>% filter(cusip_length == 9) %>% distinct(cusip) %>% count()
# A tibble: 1 x 1
n
<int>
1 22222
> cusip_cik %>% filter(multiplicity >= 10) %>% distinct(cusip) %>% count()
# A tibble: 1 x 1
n
<int>
1 23334
@iangow Here's what we lose for cik-cusip pairs with a multiplicity of at least 10
> cusip_cik %>% filter(multiplicity >= 10) %>% distinct(cusip, cik) %>% count()
# A tibble: 1 x 1
n
<int>
1 26548
> cusip_cik %>% filter(multiplicity >= 10) %>% filter(cusip_length == 9) %>% distinct(cusip, cik) %>% count()
# A tibble: 1 x 1
n
<int>
1 22869
@iangow Here's the distinct cik-cusip6 pair analysis
> cusip_cik %>% filter(multiplicity >= 10) %>% filter(cusip_length == 9) %>% distinct(cusip6, cik) %>% count()
# A tibble: 1 x 1
n
<int>
1 20471
> cusip_cik %>% filter(multiplicity >= 10) %>% distinct(cusip6, cik) %>% count()
# A tibble: 1 x 1
n
<int>
1 23488
Going by cusip6, restricting to nine-digit CUSIPs loses 3,017 cik-cusip6 pairs (23,488 versus 20,471), compared with 3,679 cik-cusip pairs (26,548 versus 22,869) when looking at the full cusips.
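To keep the arithmetic in one place, here is a small Python sketch that recomputes the losses from the distinct counts printed above (all with multiplicity >= 10). The dictionary keys and the `losses` helper are names made up for this illustration; only the counts come from the output in this thread.

```python
# Distinct counts reported above for rows with multiplicity >= 10:
# (all CUSIP lengths, nine-character CUSIPs only).
counts = {
    "cik": (17618, 16983),
    "cusip": (23334, 22222),
    "cusip-cik pair": (26548, 22869),
    "cusip6-cik pair": (23488, 20471),
}


def losses(counts):
    """Return, per key, how many distinct values the nine-character filter drops."""
    return {key: all_n - nine_n for key, (all_n, nine_n) in counts.items()}


lost = losses(counts)
for key, n in lost.items():
    all_n = counts[key][0]
    print(f"{key}: {n} lost ({n / all_n:.1%} of {all_n})")
```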
@iangow
> dist_cus6_w9 <- cusip_cik %>% filter(multiplicity >= 10) %>% filter(cusip_length == 9) %>% distinct(cusip6, cik)
> dist_cus6 <- cusip_cik %>% filter(multiplicity >= 10) %>% distinct(cusip6, cik)
> dist_cus6 %>% anti_join(dist_cus6_w9)
Joining, by = c("cik", "cusip6")
# A tibble: 3,017 x 2
cik cusip6
<int> <chr>
1 20 NA
2 2034 NA
3 2135 NA
4 2488 NA
5 2491 364654
6 2491 NA
7 2809 NA
8 3000 NA
9 3327 NA
10 3449 NA
# … with 3,007 more rows
> dist_cus6 %>% anti_join(dist_cus6_w9) %>% count()
Joining, by = c("cik", "cusip6")
# A tibble: 1 x 1
n
<int>
1 3017
> dist_cus6 %>% anti_join(dist_cus6_w9) %>% filter(is.na(cusip6)) %>% count()
Joining, by = c("cik", "cusip6")
# A tibble: 1 x 1
n
<int>
1 2544
So the vast majority (2,544 of 3,017) of the cik-cusip6 pairs eliminated by requiring a nine-digit CUSIP with a multiplicity of at least 10 are cases where cusip6 is NA.
@iangow
> dist_cus6 %>% anti_join(dist_cus6_w9) %>% filter(!is.na(cusip6)) %>% inner_join(ciks) %>% inner_join(issuers)
Joining, by = c("cik", "cusip6")
Joining, by = "cik"
Joining, by = "cusip6"
# A tibble: 341 x 25
cik cusip6 company_name issuer_check issuer_name_1 issuer_name_2 issuer_name_3 issuer_adl_1 issuer_adl_2 issuer_adl_3 issuer_adl_4 issuer_sort_key issuer_type
<int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 2491 364654 ALLIANCE GA… 4 GAMING & TEC… NA NA NAME CHANGE… INC 12/08/1… NA NA GAMING & TECHN… C
2 2491 364654 BALLY TECHN… 4 GAMING & TEC… NA NA NAME CHANGE… INC 12/08/1… NA NA GAMING & TECHN… C
3 6814 544118 COMFORCE CO… 3 LORI CORP NA NA REORGANIZED… 12/01/1995 NA NA LORI CORP … C
4 6814 544118 LORI CORP 3 LORI CORP NA NA REORGANIZED… 12/01/1995 NA NA LORI CORP … C
5 8038 046352 ASTREX INC 1 ASTRALIS LTD NA NA FORMERLY AS… PHARMACEUTI… NA NA ASTRALIS LTD … C
6 16496 131521 CALPROP CORP 7 CALVERT CASH… NA NA NAME CHANGE… PLUS 09/10/… NA NA CALVERT CASH R… C
7 26537 233712 ARGON ST, I… 9 DAEDALUS ENT… NA NA NAME CHANGE… TECHNOLOGIE… SEE 81726S NA DAEDALUS ENTER… C
8 26537 233712 DAEDALUS EN… 9 DAEDALUS ENT… NA NA NAME CHANGE… TECHNOLOGIE… SEE 81726S NA DAEDALUS ENTER… C
9 26537 233712 SENSYS TECH… 9 DAEDALUS ENT… NA NA NAME CHANGE… TECHNOLOGIE… SEE 81726S NA DAEDALUS ENTER… C
10 26537 233712 SENSYTECH I… 9 DAEDALUS ENT… NA NA NAME CHANGE… TECHNOLOGIE… SEE 81726S NA DAEDALUS ENTER… C
# … with 331 more rows, and 12 more variables: issuer_status <chr>, issuer_del_date <date>, issuer_transaction <chr>, issuer_state_code <chr>,
# issuer_update_date <date>, cabre_id <chr>, cabre_status <chr>, lei_cici <chr>, legal_entity_name <chr>, previous_name <chr>, entry_date <date>,
# cp_institution_type <chr>
> dist_cus6 %>% anti_join(dist_cus6_w9) %>% filter(!is.na(cusip6)) %>% inner_join(ciks) %>% inner_join(issuers) %>% distinct(cik, cusip6)
Joining, by = c("cik", "cusip6")
Joining, by = "cik"
Joining, by = "cusip6"
# A tibble: 188 x 2
cik cusip6
<int> <chr>
1 2491 364654
2 6814 544118
3 8038 046352
4 16496 131521
5 26537 233712
6 34408 307001
7 59963 254745
8 70415 637130
9 72843 665262
10 74154 682678
# … with 178 more rows
So 188 of the 473 remaining non-NA pairs (about 40%) join onto the issuers table by cusip6. Next we want to check name matching.
@iangow Just a reminder: in my work on this, the issuers dataframe is essentially cusipm.issuers, with the field issuer_num renamed to cusip6 so that the tables of interest can be joined easily.
@bdcallen We don't seem to be making progress here (on this whole CUSIP-CIK project). So please put it aside for now and I will take a look at it when I get a chance. (I don't follow how your code is building on the code I posted above. This was [deliberately] meant to be a very focused issue asking a narrow question.)
OK. So it seems that in 9 of 10 cases examined, throwing out the eight-digit CUSIPs from a filing where there is a valid nine-digit CUSIP is either harmless (because the eight-digit CUSIP is the same except for the check digit) or helpful (because the eight-digit CUSIP is wrong).
The one exception from the sample is the CUSIP associated with Dimark Inc below. The nine-digit CUSIP recovered there appears to be another case where the filing entity mistakenly supplied its own CUSIP. There is no way to fix this through parsing (we have correctly parsed a bad filing).
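For reference, the check-digit test used here can be reproduced outside the database. The following is a minimal Python sketch of the standard CUSIP check-digit calculation (modulus 10, "double-add-double"); the function names are hypothetical, and it is an assumption that the check_digit column in cusip_cik is computed this way. The two eight-character values are the ones passed to the stocknames query below, and the results agree with the check_digit values shown for the Dimark filing (9 and 3).

```python
def cusip_check_digit(base: str) -> int:
    """Check digit for the first eight characters of a CUSIP."""
    if len(base) != 8:
        raise ValueError("expected the first eight characters of a CUSIP")
    total = 0
    for position, char in enumerate(base.upper(), start=1):
        if char in "0123456789":
            value = int(char)
        elif "A" <= char <= "Z":
            value = ord(char) - ord("A") + 10  # A=10, B=11, ..., Z=35
        elif char in "*@#":
            value = {"*": 36, "@": 37, "#": 38}[char]
        else:
            raise ValueError(f"invalid CUSIP character: {char!r}")
        if position % 2 == 0:  # double every second character
            value *= 2
        total += value // 10 + value % 10  # add the digits of the value
    return (10 - total % 10) % 10


def is_valid_cusip9(cusip: str) -> bool:
    """True if cusip has nine characters and its last one is the check digit."""
    return len(cusip) == 9 and cusip[8] == str(cusip_check_digit(cusip[:8]))


print(cusip_check_digit("57166010"))  # 9
print(cusip_check_digit("41619610"))  # 3
```

Because a mistakenly supplied CUSIP is still a well-formed CUSIP, a check like this cannot flag the Dimark case.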
So on the basis of the above, we should delete all other CUSIPs in filings where there is a valid nine-digit CUSIP. We could do that in the Python code (once we get back to it).
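A rough sketch of what that rule might look like in Python, mirroring the dplyr logic below rather than the project's actual parsing code (which is not shown in this thread). The function name, the row layout (dicts with file_name, cusip and check_digit keys) and the file names in the toy data are all assumptions made for illustration.

```python
def give_primacy_to_valid_cusip9(rows):
    """Drop every other CUSIP from any filing that has a valid nine-digit CUSIP.

    A row counts as a valid nine-digit CUSIP when its cusip has nine
    characters and the ninth equals the row's check_digit.
    """
    rows = list(rows)

    def is_valid9(row):
        cusip = row["cusip"]
        return len(cusip) == 9 and cusip[8] == str(row["check_digit"])

    # Filings containing at least one valid nine-digit CUSIP.
    files_with_valid9 = {row["file_name"] for row in rows if is_valid9(row)}

    # Keep valid nine-digit rows; keep other rows only for filings without one.
    return [
        row
        for row in rows
        if is_valid9(row) or row["file_name"] not in files_with_valid9
    ]


# Toy data with hypothetical file names.
rows = [
    {"file_name": "filing-a.txt", "cusip": "037833100", "check_digit": 0},
    {"file_name": "filing-a.txt", "cusip": "03783310", "check_digit": 0},
    {"file_name": "filing-b.txt", "cusip": "41619610", "check_digit": 3},
]
kept = give_primacy_to_valid_cusip9(rows)
print([(row["file_name"], row["cusip"]) for row in kept])
```

Filings without any valid nine-digit CUSIP pass through unchanged, so eight-digit CUSIPs are only discarded where a better candidate exists in the same filing.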
library(dplyr, warn.conflicts = FALSE)
library(DBI)
pg <- dbConnect(RPostgres::Postgres())
rs <- dbExecute(pg, "SET search_path TO edgar")
cusip_cik <- tbl(pg, "cusip_cik")
# Nine-character CUSIPs whose ninth character matches the check digit
valid_cusip9s <-
  cusip_cik %>%
  filter(nchar(cusip) == 9) %>%
  filter(substr(cusip, 9, 9) == as.character(check_digit)) %>%
  compute()
# Eight-character CUSIPs from filings that also have a valid nine-character CUSIP
other_cusips_8 <-
  cusip_cik %>%
  semi_join(valid_cusip9s, by = "file_name") %>%
  anti_join(valid_cusip9s, by = c("file_name", "cusip")) %>%
  filter(nchar(cusip) == 8L) %>%
  compute()
other_cusips_8 %>%
  select(file_name)
#> # Source: lazy query [?? x 1]
#> # Database: postgres [igow@192.168.1.192:5434/crsp]
#> file_name
#> <chr>
#> 1 edgar/data/752714/0001089355-00-000190.txt
#> 2 edgar/data/894253/0000950134-00-001129.txt
#> 3 edgar/data/932290/0000922423-99-000556.txt
#> 4 edgar/data/110471/0000922907-00-000188.txt
#> 5 edgar/data/1197708/0000950152-02-007550.txt
#> 6 edgar/data/883899/0000950134-96-000458.txt
#> 7 edgar/data/1052752/0001144204-19-007744.txt
#> 8 edgar/data/1345523/0001019687-13-004045.txt
#> 9 edgar/data/754737/0000754737-04-000035.txt
#> 10 edgar/data/1308161/0001193125-04-195185.txt
#> # … with more rows
cusip_cik %>% filter(file_name == "edgar/data/752714/0001089355-00-000190.txt")
#> # Source: lazy query [?? x 6]
#> # Database: postgres [igow@192.168.1.192:5434/crsp]
#> file_name cusip check_digit cik company_name formats
#> <chr> <chr> <int> <int> <chr> <chr>
#> 1 edgar/data/752714/000108935… 580589… 9 752714 MCGRATH RENTC… AB
#> 2 edgar/data/752714/000108935… 505891… 3 752714 MCGRATH RENTC… C
cusip_cik %>% filter(file_name == "edgar/data/894253/0000950134-00-001129.txt")
#> # Source: lazy query [?? x 6]
#> # Database: postgres [igow@192.168.1.192:5434/crsp]
#> file_name cusip check_digit cik company_name formats
#> <chr> <chr> <int> <int> <chr> <chr>
#> 1 edgar/data/894253/00009501… 714265… 5 894253 PEROT SYSTEMS … ABD
#> 2 edgar/data/894253/00009501… 142651… 8 894253 PEROT SYSTEMS … C
cusip_cik %>% filter(file_name == "edgar/data/932290/0000922423-99-000556.txt")
#> # Source: lazy query [?? x 6]
#> # Database: postgres [igow@192.168.1.192:5434/crsp]
#> file_name cusip check_digit cik company_name formats
#> <chr> <chr> <int> <int> <chr> <chr>
#> 1 edgar/data/932290/000092242… 886027… 1 932290 THRUSTMASTER … A
#> 2 edgar/data/932290/000092242… 886027… 1 932290 THRUSTMASTER … C
cusip_cik %>% filter(file_name == "edgar/data/110471/0000922907-00-000188.txt")
#> # Source: lazy query [?? x 6]
#> # Database: postgres [igow@192.168.1.192:5434/crsp]
#> file_name cusip check_digit cik company_name formats
#> <chr> <chr> <int> <int> <chr> <chr>
#> 1 edgar/data/110471/00009… 97809… 3 110471 WOLVERINE WORLD WI… ABC
#> 2 edgar/data/110471/00009… 97809… 1 110471 WOLVERINE WORLD WI… D
cusip_cik %>% filter(file_name == "edgar/data/1197708/0000950152-02-007550.txt")
#> # Source: lazy query [?? x 6]
#> # Database: postgres [igow@192.168.1.192:5434/crsp]
#> file_name cusip check_digit cik company_name formats
#> <chr> <chr> <int> <int> <chr> <chr>
#> 1 edgar/data/1197708/000095… 892081… 0 906110 TOWN & COUNTRY … A
#> 2 edgar/data/1197708/000095… 892081… 1 906110 TOWN & COUNTRY … C
cusip_cik %>% filter(file_name == "edgar/data/883899/0000950134-96-000458.txt")
#> # Source: lazy query [?? x 6]
#> # Database: postgres [igow@192.168.1.192:5434/crsp]
#> file_name cusip check_digit cik company_name formats
#> <chr> <chr> <int> <int> <chr> <chr>
#> 1 edgar/data/883899/0000950134-… 571660… 9 883899 DIMARK INC A
#> 2 edgar/data/883899/0000950134-… 416196… 3 883899 DIMARK INC C
stocknames <- tbl(pg, sql("SELECT * FROM crsp.stocknames"))
stocknames %>% filter(ncusip %in% c("57166010", "41619610"))
#> # Source: lazy query [?? x 16]
#> # Database: postgres [igow@192.168.1.192:5434/crsp]
#> permno permco namedt nameenddt cusip ncusip ticker comnam hexcd exchcd
#> <int> <int> <date> <date> <chr> <chr> <chr> <chr> <dbl> <dbl>
#> 1 70148 8483 1986-08-14 1986-08-18 2542… 57166… MGSI MARS … 2 3
#> 2 70148 8483 1986-08-19 1992-04-26 2542… 57166… WMD MARS … 2 2
#> 3 79903 30030 1993-11-04 1998-05-05 4161… 41619… HHS HARTE… 1 1
#> 4 79903 30030 1998-05-06 2018-01-31 4161… 41619… HHS HARTE… 1 1
#> # … with 6 more variables: siccd <int64>, shrcd <int64>, shrcls <chr>,
#> # st_date <date>, end_date <date>, namedum <dbl>
filings <- tbl(pg, "filings")
filings %>% filter(cik == 883899) %>% select(company_name) %>% distinct()
#> # Source: lazy query [?? x 1]
#> # Database: postgres [igow@192.168.1.192:5434/crsp]
#> company_name
#> <chr>
#> 1 DIMARK INC
cusip_cik %>% filter(file_name == "edgar/data/1052752/0001144204-19-007744.txt")
#> # Source: lazy query [?? x 6]
#> # Database: postgres [igow@192.168.1.192:5434/crsp]
#> file_name cusip check_digit cik company_name formats
#> <chr> <chr> <int> <int> <chr> <chr>
#> 1 edgar/data/1052752/00011… 374297… 9 1052752 GETTY REALTY CO… C
#> 2 edgar/data/1052752/00011… 374297… 9 1052752 GETTY REALTY CO… ABD
cusip_cik %>% filter(file_name == "edgar/data/1345523/0001019687-13-004045.txt")
#> # Source: lazy query [?? x 6]
#> # Database: postgres [igow@192.168.1.192:5434/crsp]
#> file_name cusip check_digit cik company_name formats
#> <chr> <chr> <int> <int> <chr> <chr>
#> 1 edgar/data/1345523/00010… 636375… 7 1023844 NATIONAL HOLDIN… ABC
#> 2 edgar/data/1345523/00010… 363751… 4 1023844 NATIONAL HOLDIN… D
cusip_cik %>% filter(file_name == "edgar/data/754737/0000754737-04-000035.txt")
#> # Source: lazy query [?? x 6]
#> # Database: postgres [igow@192.168.1.192:5434/crsp]
#> file_name cusip check_digit cik company_name formats
#> <chr> <chr> <int> <int> <chr> <chr>
#> 1 edgar/data/754737/0000754737-… 499183… 4 754737 SCANA CORP ACD
#> 2 edgar/data/754737/0000754737-… 499183… 4 754737 SCANA CORP C
cusip_cik %>% filter(file_name == "edgar/data/1308161/0001193125-04-195185.txt")
#> # Source: lazy query [?? x 6]
#> # Database: postgres [igow@192.168.1.192:5434/crsp]
#> file_name cusip check_digit cik company_name formats
#> <chr> <chr> <int> <int> <chr> <chr>
#> 1 edgar/data/1308161/0001… 35138T… 7 1.07e6 FOX ENTERTAINMENT… ABD
#> 2 edgar/data/1308161/0001… 3513T1… 1 1.07e6 FOX ENTERTAINMENT… C
Created on 2020-06-19 by the reprex package (v0.3.0)
For now, we should make a "new" version of cusip_cik (say cusip_cik_test) that implements this and evaluate remaining issues (e.g., #90, #86, #87) against that version.
By "giving primacy", I mean dropping everything else from any filing that has a valid 9-digit CUSIP.
Below is some starter code. It seems that we don't have any filings with multiple valid 9-digit CUSIPs, so this might help to solve the "multiple CUSIPs" issue (see #77).
Created on 2020-04-22 by the reprex package (v0.3.0)