mccgr / edgar

Code to manage data related to SEC EDGAR
31 stars 15 forks source link

Examine bad CUSIPs #64

Closed iangow closed 4 years ago

iangow commented 4 years ago

One option would be to ignore these. But probably worth checking some to see if there are easy fixes:

library(dplyr, warn.conflicts = FALSE)
library(DBI)

pg <- dbConnect(RPostgres::Postgres())

rs <- dbExecute(pg, "SET search_path TO edgar")

cusip_cik <- tbl(pg, "cusip_cik")
bad_cusips <-
  cusip_cik %>%
  mutate(bad_cusip = case_when(! right(cusip, 1L) %~% '[0-9]' ~ TRUE,
                               as.integer(right(cusip, 1L)) != check_digit ~ TRUE,
                               TRUE ~ FALSE)) %>%
  filter(bad_cusip) %>%
  select(-bad_cusip) %>%
  compute()

bad_cusips %>% count()
#> # Source:   lazy query [?? x 1]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#>   n      
#>   <int64>
#> 1 66311

Created on 2019-12-05 by the reprex package (v0.3.0)

bdcallen commented 4 years ago

@iangow I've already examined a few of these bad cusips. In getting the full set of bad cusips, the query I would make is

crsp=> SELECT COUNT(*) FROM edgar.cusip_cik 
WHERE (LENGTH(cusip) = 9
AND right(cusip, 1) != CAST(check_digit AS CHARACTER))
OR LENGTH(cusip) <= 8
OR (LENGTH(cusip) = 9 
AND right(cusip, 1) !~ '[0-9]');
 count 
-------
 70949
(1 row)

Here, I'm including the cases with cusip length 8, although in principle the check digit can be used to complete those cases. For cusip lengths less than 8, the check digit cannot be determined, unless one can deduce the issue identifier, which corresponds to the 7th and 8th digits (I should actually change check_digit to be null for all these cases.)

bdcallen commented 4 years ago

@iangow A common issue with bad cusips is stopwords, like common, stock, pages, schedule and so on.

crsp=> SELECT cusip, COUNT(*) AS freq FROM edgar.cusip_cik 
WHERE cusip !~'[0-9]'
GROUP BY cusip
ORDER BY freq DESC;
   cusip   | freq 
-----------+------
 SEECOVER  | 1182
 SCHEDULE  |  395
 MMONSTOCK |  213
 PPLICABLE |  163
 PPLIEDFOR |   60
 AVAILABLE |   56
 COMMON    |   50
 ROBERT    |   34
 XXXXXXXXX |   30
 SIGNATURE |   24
 CHERYL    |   10
 HELLMAN   |   10
 PAGEOF    |   10
 COMCLA    |    8
 SUBVTGSHS |    8
 DEPRCPT   |    8
 WILLIAM   |    6
 DOMECQPLC |    6
 APPLIED   |    6
 ACUSIP    |    6
 RREDSTOCK |    6
 LINTEREST |    6
 AMENDMENT |    4
 GLBLDIVID |    4
 PRISESINC |    4
 SUBVTGSH  |    4
 FORCOMMON |    4
 CLASUBVTG |    4
 THECUSIP  |    4
 UNITED    |    4
 EPERSHARE |    4
 IACWIEXUS |    4
 TBCCORP   |    2
 PFDCONV   |    2
 AFRICRAND |    2
 AGEMENTCO |    2
 ANNUAL    |    2
 ANTSHARES |    2
 ARYSHARES |    2
 AUGUST    |    2
 BAYCITY   |    2
 CLASSB    |    2
 COMNEW    |    2
 CONVPREF  |    2
 EASSIGNED |    2
 ECOMPLETE |    2
 FEBRUARY  |    2
 FORTHE    |    2
 HERBERT   |    2
 MITHTRUST |    2
 NDJUBSTIN |    2
 NORMAN    |    2
 NOVEMBER  |    2
 OCTOBER   |    2
 OLEUMCORP |    2
 PARVALUE  |    2
 PAXSON    |    2
 PIERCE    |    2
 REITKBSGI |    2
 REITSSSRH |    2
 SCHECULE  |    2
 SCHEDULEA |    2
 SECURITY  |    2
 SHIRLEY   |    2
 STOCKNONE |    2
 VOTING    |    1
(66 rows)
bdcallen commented 4 years ago

@iangow Particularly bad cusips are the ones for which, as alluded to above, have less than 8 characters, as we need 8 characters to reliably calculate a check digit.

crsp=> SELECT COUNT(*) FROM edgar.cusip_cik
WHERE LENGTH(cusip) <= 7
crsp-> AND cusip ~ '[0-9]';
 count 
-------
 14534
(1 row)
bdcallen commented 4 years ago

@iangow

crsp=> SELECT COUNT(*) FROM edgar.cusip_cik
crsp-> WHERE LENGTH(cusip) = 9
crsp-> AND RIGHT(cusip, 1) 
crsp-> != CAST(check_digit AS CHARACTER);
 count 
-------
 23808
(1 row)

crsp=> SELECT COUNT(*) FROM edgar.cusip_cik
WHERE LENGTH(cusip) = 8;
 count 
-------
 32426
(1 row)

crsp=> SELECT COUNT(*) FROM edgar.cusip_cik
WHERE LENGTH(cusip) < 8;
 count 
-------
 14715
(1 row)

crsp=> SELECT 14715 + 32426 + 23808;
 ?column? 
----------
    70949
(1 row)

So 32426 have eight characters. The vast majority are probably good cusips in the sense that we can calculate a check digit for them and they contain no stopwords.

iangow commented 4 years ago

@iangow A common issue with bad cusips is stopwords, like common, stock, pages, schedule and so on.


crsp=> SELECT cusip, COUNT(*) AS freq FROM edgar.cusip_cik 
WHERE cusip !~'[0-9]'
GROUP BY cusip
ORDER BY freq DESC;

I think we might exclude these when scraping the files. We could delete the ones we have already.

iangow commented 4 years ago

Also, we could use the check digit when scraping. We could delete the ones we have (it might make sense to delete them from the "processed" table if we can see a fix that would work on a second pass).

iangow commented 4 years ago

Interestingly the format varies for these:

library(dplyr, warn.conflicts = FALSE)
library(DBI)
pg <- dbConnect(RPostgres::Postgres())
rs <- dbExecute(pg, "SET search_path TO edgar")
cusip_cik <- tbl(pg, "cusip_cik")

cusip_cik %>%
  filter(cusip %!~% '[0-9]') %>%
  count(cusip, formats) %>%
  arrange(desc(n))
#> # Source:     lazy query [?? x 3]
#> # Database:   postgres [igow@10.101.13.99:5432/crsp]
#> # Groups:     cusip
#> # Ordered by: desc(n)
#>    cusip     formats n      
#>    <chr>     <chr>   <int64>
#>  1 SEECOVER  D       1164   
#>  2 SCHEDULE  C        340   
#>  3 MMONSTOCK AB       175   
#>  4 PPLICABLE AB       137   
#>  5 PPLIEDFOR A         60   
#>  6 SCHEDULE  D         55   
#>  7 AVAILABLE AB        52   
#>  8 MMONSTOCK A         38   
#>  9 ROBERT    D         34   
#> 10 COMMON    C         28   
#> # … with more rows

Created on 2019-12-10 by the reprex package (v0.3.0)

iangow commented 4 years ago

@bdcallen

I think the rule of thumb we want to use with this table is: If it's listed as a CUSIP in a filing then we want it in the table, even if it's six digits, etc. But I would guess if we looked at the filings yielding the SEECOVER, SCHEDULE, etc., options above then these are not listed as CUSIPs, so we don't want them.

This table should be a little bit "raw data"; subsequent work might clarify six-digit or seven-digit CUSIPs (in the former case, these may be just the issuer component, in the latter, they may drop "leading zeros" … even though CUSIPs are not numbers, so the zeroes aren't really "leading").

iangow commented 4 years ago

But we shouldn't go crazy trying to solve all problems.

bdcallen commented 4 years ago

@iangow Just read the reference you sent me here. I've actually been wondering if a subset of these cases are different types of numbers given erroneously as cusip numbers (like gvkey and permno). In fact, I know for a fact that some filings of these types quote Sedol numbers instead of Cusip numbers (usually these appear in similar regex's to those for cusips, but with SEDOL replacing CUSIP in the regex). I think alternative numbers is a hypothesis worth exploring for some cases.

bdcallen commented 4 years ago

@iangow Also, would WRDS contain the issue identifiers (7th and 8th digit), as well as a description of the security? We could perhaps use that to deduce the correct 9 digit cusip for cases where the number of characters is less than 8.

iangow commented 4 years ago

If we could easily flag them as SEDOLs, then perhaps we could do that. But perhaps only worthwhile if pretty common. Easier would be to skip numbers that we flag as SEDOLs.

iangow commented 4 years ago

@iangow Also, would WRDS contain the issue identifiers (7th and 8th digit), as well as a description of the security? We could perhaps use that to deduce the correct 9 digit cusip for cases where the number of characters is less than 8.

There would be more than one in general. As mentioned, if we can confirm (somehow) that the provided CUSIP is six digits, then we should just collect that for this table. Any back-end clean-up using other data sources should be done elsewhere. You don't want a table that is a mix of raw data and processed-in-database data.

iangow commented 4 years ago

I think this issue only makes sense when the CUSIPs have nine digits. So I'm closing this one in favor of #85.