Closed bdcallen closed 4 years ago
@bdcallen Just FYI, changed my status for this repository to "not watching". So if you want me to see a comment here, just "mention" me. But this means you can structure comments on issues without worrying about it pinging me, etc. (eg, you can edit, delete comments).
> cusip_cik %>% filter(cusip_length == 7) %>% distinct(cik, cusip) %>% count()
# Source: lazy query [?? x 1]
# Database: postgres 11.0.7 [bdcallen@/var/run/postgresql:5432/crsp]
n
<dbl>
1 613
> cusip_cik %>% filter(cusip_length == 7) %>% inner_join(cusip_cik, by = c('file_name', 'cik')) %>% filter(cusip_length.y == 9) %>% distinct(cik, cusip.x) %>% count()
# Source: lazy query [?? x 1]
# Database: postgres 11.0.7 [bdcallen@/var/run/postgresql:5432/crsp]
n
<dbl>
1 106
> cusip_cik %>%
> filter(cusip_length == 7) %>%
> inner_join(cusip_cik, by = c('file_name', 'cik', 'cusip6')) %>%
> filter(cusip_length.y == 9) %>% distinct(cik, cusip.x) %>% count()
# Source: lazy query [?? x 1]
# Database: postgres 11.0.7 [bdcallen@/var/run/postgresql:5432/crsp]
n
<dbl>
1 19
> cusip_cik %>% filter(cusip_length == 7) %>% inner_join(cusip_cik, by = c('file_name', 'cik')) %>% filter(cusip_length.y == 8) %>% distinct(cik, cusip.x) %>% count()
# Source: lazy query [?? x 1]
# Database: postgres 11.0.7 [bdcallen@/var/run/postgresql:5432/crsp]
n
<dbl>
1 47
> cusip_cik %>% filter(cusip_length == 7) %>%
> inner_join(cusip_cik, by = c('file_name', 'cik', 'cusip6')) %>%
> filter(cusip_length.y == 8) %>%
> distinct(cik, cusip.x) %>% count()
# Source: lazy query [?? x 1]
# Database: postgres 11.0.7 [bdcallen@/var/run/postgresql:5432/crsp]
n
<dbl>
1 14
> cusip7s <- cusip_cik %>% filter(cusip_length == 7)
> cusip7s
# A tibble: 3,833 x 20
cik cusip6 n_cusip6 cusip multiplicity file_name check_digit company_name formats cusip_length num_cusip_uniqu… cusip8 cusip7_1 cusip6_2 cusip5_3 cusip4_4
<int> <chr> <int> <chr> <int> <chr> <int> <chr> <chr> <int> <dbl> <chr> <chr> <chr> <chr> <chr>
1 2062 461620 2 4616… 2 edgar/da… 7 ACMAT CORP ACD 7 6 46162… 4616207 461620 461627 461607
2 2062 461620 2 4616… 2 edgar/da… 7 ACMAT CORP ACD 7 6 46162… 4616207 461620 461627 461607
3 2145 504110 2 5041… 2 edgar/da… 8 ACTION INDU… C 7 5 50411… 5041108 504110 504118 504108
4 2145 504110 2 5041… 2 edgar/da… 8 ACTION INDU… C 7 5 50411… 5041108 504110 504118 504108
5 2491 364541 2 3645… 2 edgar/da… 3 ALLIANCE GA… C 7 6 36454… 3645410 364541 364540 364510
6 2491 364541 2 3645… 2 edgar/da… 3 ALLIANCE GA… C 7 6 36454… 3645410 364541 364540 364510
7 2809 847410 6 8474… 6 edgar/da… 8 AGNICO EAGL… ACD 7 5 84741… 8474108 847410 847418 847408
8 2809 847410 6 8474… 6 edgar/da… 8 AGNICO EAGL… ACD 7 5 84741… 8474108 847410 847418 847408
9 2809 847410 6 8474… 6 edgar/da… 8 AGNICO EAGL… ACD 7 5 84741… 8474108 847410 847418 847408
10 2809 847410 6 8474… 6 edgar/da… 8 AGNICO EAGL… ACD 7 5 84741… 8474108 847410 847418 847408
# … with 3,823 more rows, and 4 more variables: cusip3_5 <chr>, cusip2_6 <chr>, cusip1_7 <chr>, cusip0_8 <chr>
> cusip7s %>% filter(multiplicity >= 10) %>% distinct(cik, cusip) %>% count()
# A tibble: 1 x 1
n
<int>
1 111
> seven_w_89 <- cusip7s %>% inner_join(cusip_cik, by = 'cik') %>%
filter(cusip_length.y >= 8) %>% distinct(cik, cusip.x, multiplicity.x) %>%
rename(cusip = cusip.x, multiplicity = multiplicity.x) %>%
filter(multiplicity >= 10) %>% distinct(cik, cusip)
> cusip7s %>% anti_join(seven_w_89) %>% distinct(cik, cusip) %>% count()
Joining, by = c("cik", "cusip")
# A tibble: 1 x 1
n
<int>
1 508
> cusip7s %>% anti_join(seven_w_89) %>% filter(multiplicity >= 10) %>% distinct(cik, cusip) %>% count()
Joining, by = c("cik", "cusip")
# A tibble: 1 x 1
n
<int>
1 6
cases_7 <- cusip7s %>% anti_join(seven_w_89) %>%
filter(multiplicity >= 10) %>%
distinct(cik, cusip)
cases_7 <- cases_7 %>% mutate(cusip6 = substr(cusip, 1, 6), cusip_1pad = paste0('0', cusip),
check_digit_1pad = calculate_cusip_check_digit(cusip_1pad),
cusip_2pad = paste0('00', cusip),
check_digit_2pad = calculate_cusip_check_digit(cusip_2pad),
cusip6_1pad = substr(cusip_1pad, 1, 6),
cusip6_2pad = substr(cusip_2pad, 1, 6))
> cases_7 %>% mutate(ncusip = cusip_1pad) %>% inner_join(stocknames)
Joining, by = c("cusip", "ncusip")
# A tibble: 0 x 24
# … with 24 variables: cik <int>, cusip <chr>, cusip6 <chr>, cusip_1pad <chr>, check_digit_1pad <dbl>, cusip_2pad <chr>, check_digit_2pad <dbl>, cusip6_1pad <chr>,
# cusip6_2pad <chr>, ncusip <chr>, permno <int>, permco <int>, namedt <date>, nameenddt <date>, ticker <chr>, comnam <chr>, hexcd <dbl>, exchcd <dbl>, siccd <dbl>,
# shrcd <dbl>, shrcls <chr>, st_date <date>, end_date <date>, namedum <dbl>
> cases_7 %>% mutate(ncusip = substr(cusip_2pad, 1, 8)) %>% inner_join(stocknames)
Joining, by = c("cusip", "ncusip")
# A tibble: 0 x 24
# … with 24 variables: cik <int>, cusip <chr>, cusip6 <chr>, cusip_1pad <chr>, check_digit_1pad <dbl>, cusip_2pad <chr>, check_digit_2pad <dbl>, cusip6_1pad <chr>,
# cusip6_2pad <chr>, ncusip <chr>, permno <int>, permco <int>, namedt <date>, nameenddt <date>, ticker <chr>, comnam <chr>, hexcd <dbl>, exchcd <dbl>, siccd <dbl>,
# shrcd <dbl>, shrcls <chr>, st_date <date>, end_date <date>, namedum <dbl>
> cases_7 %>% inner_join(issuers) %>% left_join(ciks) %>% select(cik, cusip, cusip6, company_name, issuer_name_1, issuer_adl_1, issuer_status)
Joining, by = "cusip6"
Joining, by = "cik"
# A tibble: 5 x 7
cik cusip cusip6 company_name issuer_name_1 issuer_adl_1 issuer_status
<int> <chr> <chr> <chr> <chr> <chr> <chr>
1 74702 6838361 683836 OPTICAL RADIATION CORP OPTICAL RADIATION CORP NA A
2 896717 5949490 594949 COMMONWEALTH ASSOCIATES GROWTH FUND INC MICROCAP FD INC NA A
3 896717 5949490 594949 MICROCAP FUND INC MICROCAP FD INC NA A
4 896717 5949490 594949 MICROCAP FUND IND MICROCAP FD INC NA A
5 1022259 74157E1 74157E PRIME SERVICE INC PRIME SVC INC NA A
> cases_7 %>% rename(cusip6_orig = cusip6, cusip6 = cusip6_1pad) %>% inner_join(issuers) %>% left_join(ciks) %>% select(cik, cusip, cusip6, company_name, issuer_name_1, issuer_adl_1, issuer_status)
Joining, by = "cusip6"
Joining, by = "cik"
# A tibble: 6 x 7
cik cusip cusip6 company_name issuer_name_1 issuer_adl_1 issuer_status
<int> <chr> <chr> <chr> <chr> <chr> <chr>
1 74702 6838361 068383 OPTICAL RADIATION CORP BARRE TR CO VT MERGED INTO MERCHANTS NATL BK D
2 314590 6777450 067774 SASOL LTD/ADR/ BARNES & NOBLE INC NA A
3 314590 6777450 067774 SASOL LTD BARNES & NOBLE INC NA A
4 896717 5949490 059494 COMMONWEALTH ASSOCIATES GROWTH FUND INC BANC AMER ALTERNATIVE LN TR NA A
5 896717 5949490 059494 MICROCAP FUND INC BANC AMER ALTERNATIVE LN TR NA A
6 896717 5949490 059494 MICROCAP FUND IND BANC AMER ALTERNATIVE LN TR NA A
> cases_7 %>% rename(cusip6_orig = cusip6, cusip6 = cusip6_2pad) %>% inner_join(issuers) %>% left_join(ciks) %>% select(cik, cusip, cusip6, company_name, issuer_name_1, issuer_adl_1, issuer_status)
Joining, by = "cusip6"
Joining, by = "cik"
# A tibble: 1 x 7
cik cusip cusip6 company_name issuer_name_1 issuer_adl_1 issuer_status
<int> <chr> <chr> <chr> <chr> <chr> <chr>
1 1022259 74157E1 007415 PRIME SERVICE INC ADVANCE COSMETIC & ANTI-AGING NA A
Here is a set of matches with two leading zeros added to the left of the 7 digit cusip.
> View(cusip7s %>% inner_join(cusip_cik, by = 'cik') %>%
+ filter(cusip_length.y == 9 &
cusip.x == substr(cusip.y, nchar(cusip.y) - 6, nchar(cusip.y))) %>%
distinct(cik, cusip.x, cusip.y) %>%
+ filter(grepl('^[0]{2}', cusip.y)) %>%
mutate(check_digit = calculate_cusip_check_digit(cusip.y)) %>%
filter(substr(cusip.y, 9, 9) == as.character(check_digit)) %>%
mutate(ncusip = substr(cusip.y, 1, 8)) %>%
inner_join(stocknames) %>% left_join(ciks) %>%
distinct(cik, cusip.x, cusip.y, ncusip, company_name, comnam))
cik | cusip.x | cusip.y | ncusip | company_name | comnam | |
---|---|---|---|---|---|---|
1 | 2062 | 4616207 | 004616207 | 00461620 | ACMAT CORP | ACMAT CORP |
2 | 2145 | 5041108 | 005041108 | 00504110 | ACTION INDUSTRIES INC | ACTION INDUSTRIES INC |
3 | 2809 | 8474108 | 008474108 | 00847410 | AGNICO EAGLE MINES LTD | AGNICO EAGLE MINES LTD |
4 | 60911 | 7548100 | 007548100 | 00754810 | ADVANCED TECHNICAL PRODUCTS INC | ADVANCED TECHNICAL PRODUCTS INC |
5 | 60911 | 7548100 | 007548100 | 00754810 | LUNN INDUSTRIES INC /DE/ | ADVANCED TECHNICAL PRODUCTS INC |
6 | 74783 | 4308102 | 004308102 | 00430810 | ACCEPTANCE INSURANCE COMPANIES INC | ACCEPTANCE INSURANCE COS INC |
7 | 319126 | 1058106 | 001058106 | 00105810 | AFP IMAGING CORP | A F P IMAGING CORP |
8 | 353184 | 9224304 | 009224304 | 00922430 | AIR T INC | AIR TRANSPORTATION HOLDING INC |
9 | 353184 | 9224304 | 009224304 | 00922430 | AIR TRANSPORTATION HOLDING CO INC | AIR TRANSPORTATION HOLDING INC |
10 | 718877 | 4930202 | 004930202 | 00493020 | ACTIVISION BLIZZARD, INC. | ACTIVISION INC NEW |
11 | 718877 | 4930202 | 004930202 | 00493020 | ACTIVISION INC /NY | ACTIVISION INC NEW |
12 | 733269 | 5125109 | 005125109 | 00512510 | ACXIOM CORP | ACXIOM CORP |
13 | 733269 | 5125109 | 005125109 | 00512510 | ACXIOM HOLDINGS, INC. | ACXIOM CORP |
14 | 733269 | 5125109 | 005125109 | 00512510 | LIVERAMP HOLDINGS, INC. | ACXIOM CORP |
15 | 733269 | 5125109 | 005125109 | 00512510 | ACXIOM CORP | ACXIOM HOLDINGS INC |
16 | 733269 | 5125109 | 005125109 | 00512510 | ACXIOM HOLDINGS, INC. | ACXIOM HOLDINGS INC |
17 | 733269 | 5125109 | 005125109 | 00512510 | LIVERAMP HOLDINGS, INC. | ACXIOM HOLDINGS INC |
18 | 804212 | 9363102 | 009363102 | 00936310 | AIRGAS INC | AIRGAS INC |
19 | 849116 | 4403101 | 004403101 | 00440310 | ACE CASH EXPRESS INC/TX | ACE CASH EXPRESS INC |
20 | 879796 | 1930205 | 001930205 | 00193020 | ARI NETWORK SERVICES INC /WI | A R I NETWORK SERVICES INC |
21 | 899629 | 4239109 | 004239109 | 00423910 | ACADIA REALTY TRUST | ACADIA REALTY TRUST |
22 | 899629 | 4239109 | 004239109 | 00423910 | MARK CENTERS TRUST | ACADIA REALTY TRUST |
23 | 907687 | 4934105 | 004934105 | 00493410 | ACTEL CORP | ACTEL CORP |
24 | 908598 | 817R103 | 00817R103 | 00817R10 | AETRIUM INC | AETRIUM INC |
25 | 908598 | 817R103 | 00817R103 | 00817R10 | ATRM HOLDINGS, INC. | AETRIUM INC |
26 | 909276 | 504A108 | 00504A108 | 00504A10 | ACTIONPOINT INC | ACTIONPOINT INC |
27 | 909276 | 504A108 | 00504A108 | 00504A10 | CAPTIVA SOFTWARE CORP | ACTIONPOINT INC |
28 | 909276 | 504A108 | 00504A108 | 00504A10 | CORNERSTONE IMAGING INC | ACTIONPOINT INC |
29 | 909276 | 504A108 | 00504A108 | 00504A10 | INPUT SOFTWARE INC | ACTIONPOINT INC |
30 | 912601 | 4936100 | 004936100 | 00493610 | ACRES GAMING INC | ACRES GAMING INC |
31 | 931784 | 2420107 | 002420107 | 00242010 | APPLIED VOICE TECHNOLOGY INC /WA/ | AVT CORPORATION |
32 | 931784 | 2420107 | 002420107 | 00242010 | AVT CORP | AVT CORPORATION |
33 | 931784 | 2420107 | 002420107 | 00242010 | CAPTARIS INC | AVT CORPORATION |
34 | 934549 | 3881307 | 003881307 | 00388130 | ACACIA RESEARCH CORP | ACACIA RESEARCH CORP |
35 | 935503 | 8363103 | 008363103 | 00836310 | AG-CHEM EQUIPMENT CO INC | AG CHEM EQUIPMENT INC |
36 | 943003 | 8916108 | 008916108 | 00891610 | AGRIUM INC | AGRIUM INC |
37 | 1002225 | 7974108 | 007974108 | 00797410 | ADVENT SOFTWARE INC /DE/ | ADVENT SOFTWARE INC |
38 | 1021080 | 2553105 | 002553105 | 00255310 | ABACUS DIRECT CORP | ABACUS DIRECT CORP |
39 | 1086844 | 9367103 | 009367103 | 00936710 | AIRGATE PCS INC /DE/ | AIRGATE P C S INC |
40 | 1086844 | 9367103 | 009367103 | 00936710 | AIRGATE WIRELESS INC | AIRGATE P C S INC |
41 | 1168213 | 7865108 | 007865108 | 00786510 | AEROPOSTALE INC | AEROPOSTALE INC |
42 | 1168213 | 7865108 | 007865108 | 00786510 | ARO LIQUIDATION, INC. | AEROPOSTALE INC |
All of the distinct cik-cusip combinations seem to be legitimate matches here. Furthermore, defining dataframes with the matches to cusipm.issuer
and to crsp.stocknames
respectively as m7_issuers
and m7_stocknames
m7_issuers <- cusip7s %>% inner_join(cusip_cik, by = 'cik') %>%
filter(cusip_length.y == 9 &
cusip.x == substr(cusip.y, nchar(cusip.y) - 6, nchar(cusip.y))) %>%
distinct(cik, cusip.x, cusip.y) %>%
filter(grepl('^[0]{2}', cusip.y)) %>%
mutate(check_digit = calculate_cusip_check_digit(cusip.y)) %>%
filter(substr(cusip.y, 9, 9) == as.character(check_digit)) %>%
mutate(cusip6 = substr(cusip.y, 1, 6)) %>%
inner_join(issuers) %>% left_join(ciks) %>%
distinct(cik, cusip.x, cusip.y, cusip6, company_name, issuer_name_1)
m7_stocknames <- cusip7s %>% inner_join(cusip_cik, by = 'cik') %>%
filter(cusip_length.y == 9 &
cusip.x == substr(cusip.y, nchar(cusip.y) - 6, nchar(cusip.y))) %>%
distinct(cik, cusip.x, cusip.y) %>%
filter(grepl('^[0]{2}', cusip.y)) %>%
mutate(check_digit = calculate_cusip_check_digit(cusip.y)) %>%
filter(substr(cusip.y, 9, 9) == as.character(check_digit)) %>%
mutate(ncusip = substr(cusip.y, 1, 8)) %>% inner_join(stocknames) %>%
left_join(ciks) %>%
distinct(cik, cusip.x, cusip.y, ncusip, company_name, comnam)
We see that there are 25 distinct matches
> m7_issuers %>%
distinct(cik, cusip.x) %>% count()
# A tibble: 1 x 1
n
<int>
1 25
and further that every match in m7_issuers
is in m7_stocknames
and vice versa.
> m7_stocknames %>% distinct(cik, cusip.x) %>%
rename(cusip = cusip.x) %>% anti_join(m7_issuers %>%
distinct(cik, cusip.x) %>% rename(cusip = cusip.x))
Joining, by = c("cik", "cusip")
# A tibble: 0 x 2
# … with 2 variables: cik <int>, cusip <chr>
> m7_issuers %>% distinct(cik, cusip.x) %>%
rename(cusip = cusip.x) %>%
anti_join(m7_stocknames %>% distinct(cik, cusip.x) %>% rename(cusip = cusip.x))
Joining, by = c("cik", "cusip")
# A tibble: 0 x 2
# … with 2 variables: cik <int>, cusip <chr>
@iangow Like I did with my previous posts in #89 and #86, I'm going to discuss some detail of what I did, and then just state where you can find my result. Further down in my notebook handle_cusip_cik_exceptions.ipynb
, you can find some cells defining dataframes
added_value_7s_above_10_w_issuers
, added_value_7s_above_10_stocknames
, added_value_7s_above_10_1pad_w_issuers
, added_value_7s_above_10_1pad_w_stocknames
, added_value_7s_above_10_2pad_w_issuers
and added_value_7s_above_10_2pad_w_stocknames
. These are defined analogously to similar dataframes mentioned in #89 and #86, with the two dataframes containing "1pad" being rows where we consider cusip
to be cusip_raw
padded with a single zero on the left (ie. '5627733' becomes '05627733'), and those containing "2pad" being cases where we consider cusip
to be cusip_raw
padded with two zeros on the left (ie. '5627733' becomes '005627733').
Note also, that in the first two dataframes corresponding to the unpadded cases, I take cusip
to be the first six characters of cusip_raw
. The reason I did this was because the relation of the seventh digit to the cusip6 is not consistent: sometimes seventh digit is a check digit (like those found in the issuer_check
field in cusipm.issuer
), sometimes it is shorthand for some digit with two zeros (ie. if the digit is '1', the 2-digit issuer number (7th and 8th digits of a 9-digit cusip) is '10', with the check digit being '0', so short for '100'). So I took the cusip6 to be the cusip
for the sake of simplicity and consistency. As the number of cases was relatively low, I also ignored approximate string matching.
You can find the result for the unpadded cusips using
SELECT * FROM edgar.cusip_cik_exceptions
WHERE LENGTH(cusip_raw) = 7 AND LENGTH(cusip) = 6
the result for the 1-padded cusips using
SELECT * FROM edgar.cusip_cik_exceptions
WHERE LENGTH(cusip_raw) = 7 AND LENGTH(cusip) = 8
and the result for the 2-padded cusips using
SELECT * FROM edgar.cusip_cik_exceptions
WHERE LENGTH(cusip_raw) = 7 AND LENGTH(cusip) = 9
I believe we can close this issue now.
@iangow As discussed in #84, this issue is for analysing cusips with 7 characters.