Closed iangow closed 4 years ago
@iangow I've already examined a few of these bad cusips. In getting the full set of bad cusips, the query I would make is
crsp=> SELECT COUNT(*) FROM edgar.cusip_cik
WHERE (LENGTH(cusip) = 9
AND right(cusip, 1) != CAST(check_digit AS CHARACTER))
OR LENGTH(cusip) <= 8
OR (LENGTH(cusip) = 9
AND right(cusip, 1) !~ '[0-9]');
count
-------
70949
(1 row)
Here, I'm including the cases with cusip length 8, although in principle the check digit can be used to complete those cases. For cusip lengths less than 8, the check digit cannot be determined, unless one can deduce the issue identifier, which corresponds to the 7th and 8th digits (I should actually change check_digit to be null for all these cases.)
@iangow A common issue with bad cusips is stopwords, like common
, stock
, pages
, schedule
and so on.
crsp=> SELECT cusip, COUNT(*) AS freq FROM edgar.cusip_cik
WHERE cusip !~'[0-9]'
GROUP BY cusip
ORDER BY freq DESC;
cusip | freq
-----------+------
SEECOVER | 1182
SCHEDULE | 395
MMONSTOCK | 213
PPLICABLE | 163
PPLIEDFOR | 60
AVAILABLE | 56
COMMON | 50
ROBERT | 34
XXXXXXXXX | 30
SIGNATURE | 24
CHERYL | 10
HELLMAN | 10
PAGEOF | 10
COMCLA | 8
SUBVTGSHS | 8
DEPRCPT | 8
WILLIAM | 6
DOMECQPLC | 6
APPLIED | 6
ACUSIP | 6
RREDSTOCK | 6
LINTEREST | 6
AMENDMENT | 4
GLBLDIVID | 4
PRISESINC | 4
SUBVTGSH | 4
FORCOMMON | 4
CLASUBVTG | 4
THECUSIP | 4
UNITED | 4
EPERSHARE | 4
IACWIEXUS | 4
TBCCORP | 2
PFDCONV | 2
AFRICRAND | 2
AGEMENTCO | 2
ANNUAL | 2
ANTSHARES | 2
ARYSHARES | 2
AUGUST | 2
BAYCITY | 2
CLASSB | 2
COMNEW | 2
CONVPREF | 2
EASSIGNED | 2
ECOMPLETE | 2
FEBRUARY | 2
FORTHE | 2
HERBERT | 2
MITHTRUST | 2
NDJUBSTIN | 2
NORMAN | 2
NOVEMBER | 2
OCTOBER | 2
OLEUMCORP | 2
PARVALUE | 2
PAXSON | 2
PIERCE | 2
REITKBSGI | 2
REITSSSRH | 2
SCHECULE | 2
SCHEDULEA | 2
SECURITY | 2
SHIRLEY | 2
STOCKNONE | 2
VOTING | 1
(66 rows)
@iangow Particularly bad cusips are the ones for which, as alluded to above, have less than 8 characters, as we need 8 characters to reliably calculate a check digit.
crsp=> SELECT COUNT(*) FROM edgar.cusip_cik
WHERE LENGTH(cusip) <= 7
crsp-> AND cusip ~ '[0-9]';
count
-------
14534
(1 row)
@iangow
crsp=> SELECT COUNT(*) FROM edgar.cusip_cik
crsp-> WHERE LENGTH(cusip) = 9
crsp-> AND RIGHT(cusip, 1)
crsp-> != CAST(check_digit AS CHARACTER);
count
-------
23808
(1 row)
crsp=> SELECT COUNT(*) FROM edgar.cusip_cik
WHERE LENGTH(cusip) = 8;
count
-------
32426
(1 row)
crsp=> SELECT COUNT(*) FROM edgar.cusip_cik
WHERE LENGTH(cusip) < 8;
count
-------
14715
(1 row)
crsp=> SELECT 14715 + 32426 + 23808;
?column?
----------
70949
(1 row)
So 32426 have eight characters. The vast majority are probably good cusips in the sense that we can calculate a check digit for them and they contain no stopwords.
@iangow A common issue with bad cusips is stopwords, like
common
,stock
,pages
,schedule
and so on.crsp=> SELECT cusip, COUNT(*) AS freq FROM edgar.cusip_cik WHERE cusip !~'[0-9]' GROUP BY cusip ORDER BY freq DESC;
I think we might exclude these when scraping the files. We could delete the ones we have already.
Also, we could use the check digit when scraping. We could delete the ones we have (it might make sense to delete them from the "processed" table if we can see a fix that would work on a second pass).
Interestingly the format varies for these:
library(dplyr, warn.conflicts = FALSE)
library(DBI)
pg <- dbConnect(RPostgres::Postgres())
rs <- dbExecute(pg, "SET search_path TO edgar")
cusip_cik <- tbl(pg, "cusip_cik")
cusip_cik %>%
filter(cusip %!~% '[0-9]') %>%
count(cusip, formats) %>%
arrange(desc(n))
#> # Source: lazy query [?? x 3]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#> # Groups: cusip
#> # Ordered by: desc(n)
#> cusip formats n
#> <chr> <chr> <int64>
#> 1 SEECOVER D 1164
#> 2 SCHEDULE C 340
#> 3 MMONSTOCK AB 175
#> 4 PPLICABLE AB 137
#> 5 PPLIEDFOR A 60
#> 6 SCHEDULE D 55
#> 7 AVAILABLE AB 52
#> 8 MMONSTOCK A 38
#> 9 ROBERT D 34
#> 10 COMMON C 28
#> # … with more rows
Created on 2019-12-10 by the reprex package (v0.3.0)
@bdcallen
I think the rule of thumb we want to use with this table is: If it's listed as a CUSIP in a filing then we want it in the table, even if it's six digits, etc. But I would guess if we looked at the filings yielding the SEECOVER
, SCHEDULE
, etc., options above then these are not listed as CUSIPs, so we don't want them.
This table should be a little bit "raw data"; subsequent work might clarify six-digit or seven-digit CUSIPs (in the former case, these may be just the issuer component, in the latter, they may drop "leading zeros" … even though CUSIPs are not numbers, so the zeroes aren't really "leading").
But we shouldn't go crazy trying to solve all problems.
@iangow Just read the reference you sent me here. I've actually been wondering if a subset of these cases are different types of numbers given erroneously as cusip numbers (like gvkey and permno). In fact, I know for a fact that some filings of these types quote Sedol numbers instead of Cusip numbers (usually these appear in similar regex's to those for cusips, but with SEDOL
replacing CUSIP
in the regex). I think alternative numbers is a hypothesis worth exploring for some cases.
@iangow Also, would WRDS
contain the issue identifiers (7th and 8th digit), as well as a description of the security? We could perhaps use that to deduce the correct 9 digit cusip for cases where the number of characters is less than 8.
If we could easily flag them as SEDOLs, then perhaps we could do that. But perhaps only worthwhile if pretty common. Easier would be to skip numbers that we flag as SEDOLs.
@iangow Also, would
WRDS
contain the issue identifiers (7th and 8th digit), as well as a description of the security? We could perhaps use that to deduce the correct 9 digit cusip for cases where the number of characters is less than 8.
There would be more than one in general. As mentioned, if we can confirm (somehow) that the provided CUSIP is six digits, then we should just collect that for this table. Any back-end clean-up using other data sources should be done elsewhere. You don't want a table that is a mix of raw data and processed-in-database data.
I think this issue only makes sense when the CUSIPs have nine digits. So I'm closing this one in favor of #85.
One option would be to ignore these. But probably worth checking some to see if there are easy fixes:
Created on 2019-12-05 by the reprex package (v0.3.0)