mccgr / edgar

Code to manage data related to SEC EDGAR

Address Cusips with more than 9 characters #81

Open bdcallen opened 4 years ago

bdcallen commented 4 years ago

@iangow

> We get different results when CUSIPs have more than 9 characters. I think the solution here is to write the regular expressions to capture the full CUSIP and to ditch CUSIPs that are too long to be valid. Capturing part of a twelve-digit CUSIP is just a recipe for bad matches in my view.

> We have some CUSIPs that are actually ISINs (these will appear as "bad CUSIPs" if we apply the idea in (4) above). I'm not sure we want to bother with these. Only if we had a number of valid ISINs that yielded CUSIP-CIK matches that we wouldn't otherwise get would we get into the business of detecting ISINs and converting them to CUSIPs.

These points are quoted from #76. I think an alternative option would be to keep the full raw character string and to have a separate field that classifies whether it is a CUSIP or not. That would give us the opportunity to check whether the ISINs contain the correct CUSIPs, if we so choose.
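To sketch what that "keep the raw string, classify it separately" idea might look like: a minimal Python classifier, with assumed (not the repo's actual) patterns — a 9-character alphanumeric string with a digit check digit for a CUSIP, and a 12-character string starting with a 2-letter country code for an ISIN.

```python
import re

# Assumed patterns; the repo's actual regular expressions may differ.
CUSIP_RE = re.compile(r"^[0-9A-Z]{8}[0-9]$")         # 9 chars, digit check digit
ISIN_RE = re.compile(r"^[A-Z]{2}[0-9A-Z]{9}[0-9]$")  # 12 chars, 2-letter country code

def classify(raw):
    """Return 'cusip', 'isin', or 'invalid' for a raw identifier string."""
    s = raw.strip().upper()
    if CUSIP_RE.match(s):
        return "cusip"
    if ISIN_RE.match(s):
        return "isin"
    return "invalid"

print(classify("037833100"))      # Apple CUSIP  -> cusip
print(classify("US0378331005"))   # Apple ISIN   -> isin
print(classify("0378331005123"))  # 13 characters -> invalid
```

Anything classified `invalid` could then be dropped, while `isin` rows would stay available in case we later decide to mine them for embedded CUSIPs.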

iangow commented 4 years ago

It seems we'd have to tweak the regular expressions to capture these. I'm not sure that this would be worth the effort, as we'd need to filter out a lot of rubbish that we'd be adding.

library(dplyr, warn.conflicts = FALSE)
library(DBI)

Sys.setenv(PGHOST = "10.101.13.99", PGDATABASE = "crsp")

pg <- dbConnect(RPostgres::Postgres())
rs <- dbExecute(pg, "SET search_path TO edgar")

cusip_cik <- tbl(pg, "cusip_cik")

cusip_cik %>% filter(nchar(cusip) > 9)
#> # Source:   lazy query [?? x 6]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#> # … with 6 variables: file_name <chr>, cusip <chr>, check_digit <int>,
#> #   cik <int>, company_name <chr>, formats <chr>

Created on 2020-04-22 by the reprex package (v0.3.0)

bdcallen commented 4 years ago

@iangow Yes, I wrote the initial Python code to accept only CUSIP candidates of up to 9 characters, hence the result in the snippet above. Tweaking the regular expressions would not be that difficult, in my opinion: we'd just have to raise the upper limit on the number of characters matched for a 'cusip' to 12, since those longer candidates could correspond to an ISIN. Anything longer than that, however, is clearly a mistake. I tend to agree with you that this would add work in filtering out rubbish. Still, it is possible that we have missed some potential CIK-CUSIP matches because some filers quote the ISIN, which contains the CUSIP when it has been derived from a security that has one; that is why I opened this issue. I'm not sure exactly how many matches we might have lost (I'm guessing it is a rather small number, though I'm not sure it is small enough to be insignificant).
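For reference, the "ISIN contains the CUSIP" point can be made concrete: a US (or CA) ISIN is the 2-letter country code, the 9-character CUSIP, and one ISIN check digit, so the embedded CUSIP is just characters 3-11. A hedged sketch — the function names are illustrative, not part of the repo — that also verifies the embedded string against the standard CUSIP checksum before trusting it:

```python
def cusip_check_digit(base8):
    """Compute the CUSIP check digit from the first 8 characters."""
    total = 0
    for i, c in enumerate(base8):
        if c.isdigit():
            v = int(c)
        elif c.isalpha():
            v = ord(c.upper()) - ord("A") + 10
        else:
            v = {"*": 36, "@": 37, "#": 38}[c]
        if i % 2 == 1:          # double every second character
            v *= 2
        total += v // 10 + v % 10  # sum the digits of each value
    return (10 - total % 10) % 10

def cusip_from_isin(isin):
    """Extract the embedded 9-character CUSIP from a US/CA ISIN, else None."""
    isin = isin.strip().upper()
    if len(isin) != 12 or isin[:2] not in ("US", "CA"):
        return None
    cusip = isin[2:11]
    if cusip_check_digit(cusip[:8]) != int(cusip[8]):
        return None             # embedded string fails the CUSIP checksum
    return cusip

print(cusip_from_isin("US0378331005"))  # Apple -> 037833100
```

The checksum step is what would keep most of the "rubbish" out: a 12-character string that merely looks like an ISIN but fails the embedded CUSIP check would be discarded rather than matched.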

iangow commented 4 years ago

Let's push this down the priority list. Focus instead on tweaks that can be evaluated with data we already have. For example, I think #82 and #77 should be higher on the list.

bdcallen commented 3 years ago

@iangow Is this a priority? I think it would be helpful to do a little cost-benefit analysis. In my opinion, going beyond 9 characters to capture 12-character ISINs, and then deriving 9-character CUSIPs from those ISINs, is not worth it. It seems that doing so would just introduce more noise for little gain. If you agree, I think we should close this.

iangow commented 3 years ago

Just adding the defer label for now.