Evaluate giving primacy to 9-digit valid CUSIPs - Githubissues

mccgr / edgar

Code to manage data related to SEC EDGAR

31 stars 15 forks source link

Evaluate giving primacy to 9-digit valid CUSIPs #82

Closed iangow closed 4 years ago

iangow commented 4 years ago

By "giving primacy", I mean dropping everything else from any filing that has a valid 9-digit CUSIP.

Below is some starter code. It seems that we don't have any filings with multiple valid 9-digit CUSIPs, so this might help to solve the "multiple CUSIPs" issue (see #77).

library(dplyr, warn.conflicts = FALSE)
library(DBI)

Sys.setenv(PGHOST = "10.101.13.99", PGDATABASE = "crsp")

pg <- dbConnect(RPostgres::Postgres())
rs <- dbExecute(pg, "SET search_path TO edgar")

cusip_cik <- tbl(pg, "cusip_cik")

valid_cusip9s <-
    cusip_cik %>%
    filter(nchar(cusip) == 9) %>%
    filter(substr(cusip, 9, 9) == as.character(check_digit)) %>%
    compute()

valid_cusip9s %>%
    count()
#> # Source:   lazy query [?? x 1]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#>   n      
#>   <int64>
#> 1 1259891

valid_cusip9s %>%
    distinct(cusip, cik) %>%
    count()
#> # Source:   lazy query [?? x 1]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#>   n      
#>   <int64>
#> 1 45056

# Now what would we drop if we ditched all CUSIP-CIK combinations
# found in filings with at least one valid nine-digit CUSIP?
to_drop <-
    cusip_cik %>%
    semi_join(valid_cusip9s, by="file_name") %>%
    anti_join(valid_cusip9s, by=c("file_name", "cusip"))

# And would there be any valid nine-digit CUSIPs in those dropped?
to_drop %>%
    filter(nchar(cusip) == 9) %>%
    filter(substr(cusip, 9, 9) == as.character(check_digit)) 
#> # Source:   lazy query [?? x 6]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#> # … with 6 variables: file_name <chr>, cusip <chr>, check_digit <int>,
#> #   cik <int>, company_name <chr>, formats <chr>

# This would be the new table
cusip_cik_new <-
    cusip_cik %>%
    anti_join(to_drop) %>%
    compute()
#> Joining, by = c("file_name", "cusip", "check_digit", "cik", "company_name", "formats")

# Seems to be some junk dropped here.
to_drop %>% count(cusip) %>% arrange(desc(n))
#> # Source:     lazy query [?? x 2]
#> # Database:   postgres [igow@10.101.13.99:5432/crsp]
#> # Ordered by: desc(n)
#>    cusip     n      
#>    <chr>     <int64>
#>  1 SEECOVER  446    
#>  2 48625010  326    
#>  3 SCHEDULE  298    
#>  4 MMONSTOCK 198    
#>  5 074223065  74    
#>  6 13152106   58    
#>  7 82454310   54    
#>  8 760281     42    
#>  9 0PARVALUE  40    
#> 10 775728101  38    
#> # … with more rows

# Check the number of rows dropped.
to_drop %>% count()
#> # Source:   lazy query [?? x 1]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#>   n      
#>   <int64>
#> 1 17018
cusip_cik %>% count() %>% pull() - (cusip_cik_new %>% count() %>% pull())
#> integer64
#> [1] 17018

^{Created on 2020-04-22 by the reprex package (v0.3.0)}

bdcallen commented 4 years ago

@iangow

> cusip_cik %>% distinct(cik) %>% count()
# A tibble: 1 x 1
      n
  <int>
1 26898
> cusip_cik %>% filter(cusip_length == 9) %>% distinct(cik) %>% count()
# A tibble: 1 x 1
      n
  <int>
1 25033

On a simple level, we lose matchings to around 1800 ciks. Of course this doesn't take into account the frequency yet.

bdcallen commented 4 years ago

@iangow The discrepancy is reduced once we take into account the cases with a multiplicity more than or equal to 10

> cusip_cik %>% filter(multiplicity >= 10) %>% distinct(cik) %>% count()
# A tibble: 1 x 1
      n
  <int>
1 17618
> cusip_cik %>% filter(multiplicity >= 10) %>% filter(cusip_length == 9) %>% distinct(cik) %>% count()
# A tibble: 1 x 1
      n
  <int>
1 16983

So around 635 are lost, if we restrict to 9 digits.

bdcallen commented 4 years ago

@iangow Here's the analogous work with cusips

> cusip_cik %>% filter(multiplicity >= 10) %>% filter(cusip_length == 9) %>% distinct(cusip) %>% count()
# A tibble: 1 x 1
      n
  <int>
1 22222
> cusip_cik %>% filter(multiplicity >= 10) %>% distinct(cusip) %>% count()
# A tibble: 1 x 1
      n
  <int>
1 23334

bdcallen commented 4 years ago

@iangow Here's what we lose for cik-cusip pairs with multiplicities more than 10

> cusip_cik %>% filter(multiplicity >= 10) %>% distinct(cusip, cik) %>% count()
# A tibble: 1 x 1
      n
  <int>
1 26548
> cusip_cik %>% filter(multiplicity >= 10) %>% filter(cusip_length == 9) %>% distinct(cusip, cik) %>% count()
# A tibble: 1 x 1
      n
  <int>
1 22869

bdcallen commented 4 years ago

@iangow Here's the distinct cik-cusip6 pair analysis

> cusip_cik %>% filter(multiplicity >= 10) %>% filter(cusip_length == 9) %>% distinct(cusip6, cik) %>% count()
# A tibble: 1 x 1
      n
  <int>
1 20471
> cusip_cik %>% filter(multiplicity >= 10) %>% distinct(cusip6, cik) %>% count()
# A tibble: 1 x 1
      n
  <int>
1 23488

Going by cusip6, we reduce what's lost to around 3000 from around 3700 for looking at the full cusips.

bdcallen commented 4 years ago

@iangow

> dist_cus6_w9 <- cusip_cik %>% filter(multiplicity >= 10) %>% filter(cusip_length == 9) %>% distinct(cusip6, cik)
> dist_cus6 <- cusip_cik %>% filter(multiplicity >= 10) %>% distinct(cusip6, cik)
> dist_cus6 %>% anti_join(dist_cus6_w9)
Joining, by = c("cik", "cusip6")
# A tibble: 3,017 x 2
     cik cusip6
   <int> <chr> 
 1    20 NA    
 2  2034 NA    
 3  2135 NA    
 4  2488 NA    
 5  2491 364654
 6  2491 NA    
 7  2809 NA    
 8  3000 NA    
 9  3327 NA    
10  3449 NA    
# … with 3,007 more rows
> dist_cus6 %>% anti_join(dist_cus6_w9) %>% count()
Joining, by = c("cik", "cusip6")
# A tibble: 1 x 1
      n
  <int>
1  3017
> dist_cus6 %>% anti_join(dist_cus6_w9) %>% filter(is.na(cusip6)) %>% count()
Joining, by = c("cik", "cusip6")
# A tibble: 1 x 1
      n
  <int>
1  2544

So the vast majority of the cusip6's eliminated by filtering for cases which have a cusip9 with a multiplicity more than 10 are cases where cusip6 is NA.

bdcallen commented 4 years ago

@iangow

> dist_cus6 %>% anti_join(dist_cus6_w9) %>% filter(!is.na(cusip6)) %>% inner_join(ciks) %>% inner_join(issuers)
Joining, by = c("cik", "cusip6")
Joining, by = "cik"
Joining, by = "cusip6"
# A tibble: 341 x 25
     cik cusip6 company_name issuer_check issuer_name_1 issuer_name_2 issuer_name_3 issuer_adl_1 issuer_adl_2 issuer_adl_3 issuer_adl_4 issuer_sort_key issuer_type
   <int> <chr>  <chr>        <chr>        <chr>         <chr>         <chr>         <chr>        <chr>        <chr>        <chr>        <chr>           <chr>      
 1  2491 364654 ALLIANCE GA… 4            GAMING & TEC… NA            NA            NAME CHANGE… INC 12/08/1… NA           NA           GAMING & TECHN… C          
 2  2491 364654 BALLY TECHN… 4            GAMING & TEC… NA            NA            NAME CHANGE… INC 12/08/1… NA           NA           GAMING & TECHN… C          
 3  6814 544118 COMFORCE CO… 3            LORI CORP     NA            NA            REORGANIZED… 12/01/1995   NA           NA           LORI CORP     … C          
 4  6814 544118 LORI CORP    3            LORI CORP     NA            NA            REORGANIZED… 12/01/1995   NA           NA           LORI CORP     … C          
 5  8038 046352 ASTREX INC   1            ASTRALIS LTD  NA            NA            FORMERLY AS… PHARMACEUTI… NA           NA           ASTRALIS LTD  … C          
 6 16496 131521 CALPROP CORP 7            CALVERT CASH… NA            NA            NAME CHANGE… PLUS 09/10/… NA           NA           CALVERT CASH R… C          
 7 26537 233712 ARGON ST, I… 9            DAEDALUS ENT… NA            NA            NAME CHANGE… TECHNOLOGIE… SEE 81726S   NA           DAEDALUS ENTER… C          
 8 26537 233712 DAEDALUS EN… 9            DAEDALUS ENT… NA            NA            NAME CHANGE… TECHNOLOGIE… SEE 81726S   NA           DAEDALUS ENTER… C          
 9 26537 233712 SENSYS TECH… 9            DAEDALUS ENT… NA            NA            NAME CHANGE… TECHNOLOGIE… SEE 81726S   NA           DAEDALUS ENTER… C          
10 26537 233712 SENSYTECH I… 9            DAEDALUS ENT… NA            NA            NAME CHANGE… TECHNOLOGIE… SEE 81726S   NA           DAEDALUS ENTER… C          
# … with 331 more rows, and 12 more variables: issuer_status <chr>, issuer_del_date <date>, issuer_transaction <chr>, issuer_state_code <chr>,
#   issuer_update_date <date>, cabre_id <chr>, cabre_status <chr>, lei_cici <chr>, legal_entity_name <chr>, previous_name <chr>, entry_date <date>,
#   cp_institution_type <chr>
> dist_cus6 %>% anti_join(dist_cus6_w9) %>% filter(!is.na(cusip6)) %>% inner_join(ciks) %>% inner_join(issuers) %>% distinct(cik, cusip6)
Joining, by = c("cik", "cusip6")
Joining, by = "cik"
Joining, by = "cusip6"
# A tibble: 188 x 2
     cik cusip6
   <int> <chr> 
 1  2491 364654
 2  6814 544118
 3  8038 046352
 4 16496 131521
 5 26537 233712
 6 34408 307001
 7 59963 254745
 8 70415 637130
 9 72843 665262
10 74154 682678
# … with 178 more rows

So around a third of the remaining cases join onto the issuers table by cusip6. Next we want to check name matching.

bdcallen commented 4 years ago

@iangow Just a reminder, in my work on this, the issuers dataframe is essentially cusipm.issuers, with the field issuer_num renamed to cusip6 for the convenience of joining tables of interest easily.

iangow commented 4 years ago

@bdcallen We don't seem to be making progress here (on this whole CUSIP-CIK project). So please put it aside for now and I will take a look at it when I get a chance. (I don't follow how your code is building on the code I posted above. This was [deliberately] meant to be a very focused issue asking a narrow question.)

iangow commented 4 years ago

OK. So it seems that in 9 of 10 cases examined, throwing out the eight-digit CUSIPs from a filing where there is a valid nine-digit CUSIP is either harmless (because the eight-digit CUSIP is the same except for the check digit) or helpful (because the eight-digit CUSIP is wrong).

The one exception from the sample is the CUSIP associated with Dimark Inc below. The nine-digit CUSIP recovered there appears to be another case where the filing entity mistakenly supplied its own CUSIP. There is no way to fix this through parsing (we have correctly parsed a bad filing).

So on the basis of the above, we should delete all other CUSIPs in filings where there is a valid nine-digit CUSIP. We could do that in the Python code (once we get back to it).

library(dplyr, warn.conflicts = FALSE)
library(DBI)

pg <- dbConnect(RPostgres::Postgres())
rs <- dbExecute(pg, "SET search_path TO edgar")

cusip_cik <- tbl(pg, "cusip_cik")

valid_cusip9s <-
    cusip_cik %>%
    filter(nchar(cusip) == 9) %>%
    filter(substr(cusip, 9, 9) == as.character(check_digit)) %>%
    compute()

other_cusips_8 <-
    cusip_cik %>%
    semi_join(valid_cusip9s, by = "file_name") %>%
    anti_join(valid_cusip9s, by = c("file_name", "cusip")) %>%
    filter(nchar(cusip) == 8L) %>%
    compute()

other_cusips_8 %>%
    select(file_name)
#> # Source:   lazy query [?? x 1]
#> # Database: postgres [igow@192.168.1.192:5434/crsp]
#>    file_name                                  
#>    <chr>                                      
#>  1 edgar/data/752714/0001089355-00-000190.txt 
#>  2 edgar/data/894253/0000950134-00-001129.txt 
#>  3 edgar/data/932290/0000922423-99-000556.txt 
#>  4 edgar/data/110471/0000922907-00-000188.txt 
#>  5 edgar/data/1197708/0000950152-02-007550.txt
#>  6 edgar/data/883899/0000950134-96-000458.txt 
#>  7 edgar/data/1052752/0001144204-19-007744.txt
#>  8 edgar/data/1345523/0001019687-13-004045.txt
#>  9 edgar/data/754737/0000754737-04-000035.txt 
#> 10 edgar/data/1308161/0001193125-04-195185.txt
#> # … with more rows

cusip_cik %>% filter(file_name == "edgar/data/752714/0001089355-00-000190.txt")
#> # Source:   lazy query [?? x 6]
#> # Database: postgres [igow@192.168.1.192:5434/crsp]
#>   file_name                    cusip   check_digit    cik company_name   formats
#>   <chr>                        <chr>         <int>  <int> <chr>          <chr>  
#> 1 edgar/data/752714/000108935… 580589…           9 752714 MCGRATH RENTC… AB     
#> 2 edgar/data/752714/000108935… 505891…           3 752714 MCGRATH RENTC… C
cusip_cik %>% filter(file_name == "edgar/data/894253/0000950134-00-001129.txt")
#> # Source:   lazy query [?? x 6]
#> # Database: postgres [igow@192.168.1.192:5434/crsp]
#>   file_name                   cusip   check_digit    cik company_name    formats
#>   <chr>                       <chr>         <int>  <int> <chr>           <chr>  
#> 1 edgar/data/894253/00009501… 714265…           5 894253 PEROT SYSTEMS … ABD    
#> 2 edgar/data/894253/00009501… 142651…           8 894253 PEROT SYSTEMS … C
cusip_cik %>% filter(file_name == "edgar/data/932290/0000922423-99-000556.txt")
#> # Source:   lazy query [?? x 6]
#> # Database: postgres [igow@192.168.1.192:5434/crsp]
#>   file_name                    cusip   check_digit    cik company_name   formats
#>   <chr>                        <chr>         <int>  <int> <chr>          <chr>  
#> 1 edgar/data/932290/000092242… 886027…           1 932290 THRUSTMASTER … A      
#> 2 edgar/data/932290/000092242… 886027…           1 932290 THRUSTMASTER … C
cusip_cik %>% filter(file_name == "edgar/data/110471/0000922907-00-000188.txt")
#> # Source:   lazy query [?? x 6]
#> # Database: postgres [igow@192.168.1.192:5434/crsp]
#>   file_name                cusip  check_digit    cik company_name        formats
#>   <chr>                    <chr>        <int>  <int> <chr>               <chr>  
#> 1 edgar/data/110471/00009… 97809…           3 110471 WOLVERINE WORLD WI… ABC    
#> 2 edgar/data/110471/00009… 97809…           1 110471 WOLVERINE WORLD WI… D
cusip_cik %>% filter(file_name == "edgar/data/1197708/0000950152-02-007550.txt")
#> # Source:   lazy query [?? x 6]
#> # Database: postgres [igow@192.168.1.192:5434/crsp]
#>   file_name                  cusip   check_digit    cik company_name     formats
#>   <chr>                      <chr>         <int>  <int> <chr>            <chr>  
#> 1 edgar/data/1197708/000095… 892081…           0 906110 TOWN & COUNTRY … A      
#> 2 edgar/data/1197708/000095… 892081…           1 906110 TOWN & COUNTRY … C
cusip_cik %>% filter(file_name == "edgar/data/883899/0000950134-96-000458.txt")
#> # Source:   lazy query [?? x 6]
#> # Database: postgres [igow@192.168.1.192:5434/crsp]
#>   file_name                      cusip   check_digit    cik company_name formats
#>   <chr>                          <chr>         <int>  <int> <chr>        <chr>  
#> 1 edgar/data/883899/0000950134-… 571660…           9 883899 DIMARK INC   A      
#> 2 edgar/data/883899/0000950134-… 416196…           3 883899 DIMARK INC   C
stocknames <- tbl(pg, sql("SELECT * FROM crsp.stocknames"))
stocknames %>% filter(ncusip %in% c("57166010", "41619610"))
#> # Source:   lazy query [?? x 16]
#> # Database: postgres [igow@192.168.1.192:5434/crsp]
#>   permno permco namedt     nameenddt  cusip ncusip ticker comnam hexcd exchcd
#>    <int>  <int> <date>     <date>     <chr> <chr>  <chr>  <chr>  <dbl>  <dbl>
#> 1  70148   8483 1986-08-14 1986-08-18 2542… 57166… MGSI   MARS …     2      3
#> 2  70148   8483 1986-08-19 1992-04-26 2542… 57166… WMD    MARS …     2      2
#> 3  79903  30030 1993-11-04 1998-05-05 4161… 41619… HHS    HARTE…     1      1
#> 4  79903  30030 1998-05-06 2018-01-31 4161… 41619… HHS    HARTE…     1      1
#> # … with 6 more variables: siccd <int64>, shrcd <int64>, shrcls <chr>,
#> #   st_date <date>, end_date <date>, namedum <dbl>
filings <- tbl(pg, "filings")
filings %>% filter(cik == 883899) %>% select(company_name) %>% distinct()
#> # Source:   lazy query [?? x 1]
#> # Database: postgres [igow@192.168.1.192:5434/crsp]
#>   company_name
#>   <chr>       
#> 1 DIMARK INC
cusip_cik %>% filter(file_name == "edgar/data/1052752/0001144204-19-007744.txt")
#> # Source:   lazy query [?? x 6]
#> # Database: postgres [igow@192.168.1.192:5434/crsp]
#>   file_name                 cusip   check_digit     cik company_name     formats
#>   <chr>                     <chr>         <int>   <int> <chr>            <chr>  
#> 1 edgar/data/1052752/00011… 374297…           9 1052752 GETTY REALTY CO… C      
#> 2 edgar/data/1052752/00011… 374297…           9 1052752 GETTY REALTY CO… ABD
cusip_cik %>% filter(file_name == "edgar/data/1345523/0001019687-13-004045.txt")
#> # Source:   lazy query [?? x 6]
#> # Database: postgres [igow@192.168.1.192:5434/crsp]
#>   file_name                 cusip   check_digit     cik company_name     formats
#>   <chr>                     <chr>         <int>   <int> <chr>            <chr>  
#> 1 edgar/data/1345523/00010… 636375…           7 1023844 NATIONAL HOLDIN… ABC    
#> 2 edgar/data/1345523/00010… 363751…           4 1023844 NATIONAL HOLDIN… D
cusip_cik %>% filter(file_name == "edgar/data/754737/0000754737-04-000035.txt")
#> # Source:   lazy query [?? x 6]
#> # Database: postgres [igow@192.168.1.192:5434/crsp]
#>   file_name                      cusip   check_digit    cik company_name formats
#>   <chr>                          <chr>         <int>  <int> <chr>        <chr>  
#> 1 edgar/data/754737/0000754737-… 499183…           4 754737 SCANA CORP   ACD    
#> 2 edgar/data/754737/0000754737-… 499183…           4 754737 SCANA CORP   C
cusip_cik %>% filter(file_name == "edgar/data/1308161/0001193125-04-195185.txt")
#> # Source:   lazy query [?? x 6]
#> # Database: postgres [igow@192.168.1.192:5434/crsp]
#>   file_name                cusip   check_digit    cik company_name       formats
#>   <chr>                    <chr>         <int>  <int> <chr>              <chr>  
#> 1 edgar/data/1308161/0001… 35138T…           7 1.07e6 FOX ENTERTAINMENT… ABD    
#> 2 edgar/data/1308161/0001… 3513T1…           1 1.07e6 FOX ENTERTAINMENT… C

^{Created on 2020-06-19 by the reprex package (v0.3.0)}

iangow commented 4 years ago

For now, we should make a "new" version of cusip_cik (say cusip_cik_test) that implements this and evaluate remaining issues (e.g., #90, #86, #87) against that version.