mccgr / edgar

Code to manage data related to SEC EDGAR
31 stars 15 forks source link

Investigate lost CUSIP-CIK matches #76

Closed iangow closed 4 years ago

iangow commented 4 years ago

@bdcallen As discussed in #63, we really should've nailed down a rough replication of the existing technology before going into other issues.

My concern is that we have code that doesn't work as well as the old code along certain dimensions and that code might be costly (in terms of time) to fix as it's got add bells and whistles now.

Below I have put some data on CUSIP-CIK matches found in the old table, but lost in the new table. I think it would be good to see if you can quickly fix the new code to recover these (assuming they are good). You should be able to focus on a handful of them and then zero in on the specific filings where the old code recovered them, etc.

Please put this at a high priority, as I have a number of things that rely on this CUSIP-CIK code and things are kind of in limbo with that right now.

Sys.setenv(PGHOST = "10.101.13.99", PGDATABASE = "crsp")

library(dplyr, warn.conflicts = FALSE)
library(DBI)

pg <- dbConnect(RPostgres::Postgres())

rs <- dbExecute(pg, "SET search_path TO edgar")

filings <- tbl(pg, "filings")
cusip_cik <- tbl(pg, "cusip_cik")
cusip_cik_old <- tbl(pg, "cusip_cik_old")

old_matches_lost <-
  cusip_cik_old %>%
  semi_join(filings, by="file_name") %>%
  filter(!is.na(cusip), !cusip %in% c("", "__", "PAGE"),
         !cusip %~% "_") %>%
  count(cik, cusip) %>%
  filter(n > 10) %>%
  ungroup() %>%
  anti_join(cusip_cik, by = c("cusip", "cik")) %>%
  compute()

old_matches_lost %>%
  arrange(desc(n)) %>%
  print(n=30)
#> # Source:     table<dbplyr_005> [?? x 3]
#> # Database:   postgres [igow@10.101.13.99:5432/crsp]
#> # Ordered by: desc(n)
#>        cik cusip     n      
#>      <int> <chr>     <int64>
#>  1  860510 29874m103 84     
#>  2 1031235 81631y102 78     
#>  3   14693 01156371  61     
#>  4  889430 09247k109 60     
#>  5  737243 004907101 60     
#>  6 1001258 046222010 58     
#>  7  789933 629579200 58     
#>  8  914478 79604v105 56     
#>  9 1267482 81663N20  55     
#> 10 1000753 7094105   52     
#> 11  879354 2942591   52     
#> 12 1058057 G5876H1   50     
#> 13  913756 068750V10 48     
#> 14  880460 71376c100 48     
#> 15  782149 Not       46     
#> 16  811419 23559     46     
#> 17   60128 54141010  46     
#> 18 1090159 M565951   46     
#> 19 1172494 000002255 45     
#> 20 1026734 261384101 44     
#> 21 1448397 Y737601   43     
#> 22  723574 45244     42     
#> 23   59593 5351711   40     
#> 24  933955 87305U10  40     
#> 25  796577 7488021   38     
#> 26   96223 5272885   37     
#> 27  810117 Not       36     
#> 28  924645 58439     36     
#> 29  880643 372       36     
#> 30   41296 3745031   36     
#> # … with more rows

old_matches_lost %>%
  count()
#> # Source:   lazy query [?? x 1]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#>   n      
#>   <int64>
#> 1 379

Created on 2020-04-22 by the reprex package (v0.3.0)

iangow commented 4 years ago

@bdcallen

So I looked into the first one. It seems that the first eight characters provide a valid CIK-CUSIP match that should be on the new table.

After digging into this one, I changed the code above to focus on common CIK-CUSIP pairs not found in the new table. The one below is not there.

(Note that the earlier version of the code, which you can get by looking at history in the comment above, probably does identify problems, as these valid CUSIP-CIK pairs that we're missing for certain filings. But for now, I think we should focus on cases where we're losing pairs.)

library(dplyr, warn.conflicts = FALSE)
library(DBI)

pg <- dbConnect(RPostgres::Postgres())

rs <- dbExecute(pg, "SET search_path TO edgar")

filings <- tbl(pg, "filings")
stocknames <- tbl(pg, sql("SELECT * FROM crsp.stocknames"))

stocknames %>% 
  filter(ncusip == "48273010") %>% 
  select(comnam)
#> # Source:   lazy query [?? x 1]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#>   comnam         
#>   <chr>          
#> 1 K TRON INTL INC

filings %>% 
  filter(cik==20) %>% 
  count(company_name)
#> # Source:   lazy query [?? x 2]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#>   company_name             n      
#>   <chr>                    <int64>
#> 1 K TRON INTERNATIONAL INC 401

Created on 2020-04-12 by the reprex package (v0.3.0)

iangow commented 4 years ago

Looking at the list in the first column, it seems that the first two are valid except for the issue of case (the CUSIPS should be 29874M103 and 81631Y102 respectively). Perhaps the new code picks these up and fixes the case? If so, fine.

The next three appear to be duds. But the one after that seems to be a valid CUSIP-CIK combination. Not sure why the new code isn't picking it up (though, since it's there in nine-digit form, perhaps this isn't too great a concern).

library(dplyr, warn.conflicts = FALSE)
library(DBI)

pg <- dbConnect(RPostgres::Postgres())

rs <- dbExecute(pg, "SET search_path TO edgar")

filings <- tbl(pg, "filings")
stocknames <- tbl(pg, sql("SELECT * FROM crsp.stocknames"))
cusip_cik <- tbl(pg, "cusip_cik")

stocknames %>% 
  filter(ncusip == "00490710") %>% 
  select(comnam)
#> # Source:   lazy query [?? x 1]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#>   comnam                    
#>   <chr>                     
#> 1 ACORN VENTURE CAPITAL CORP

filings %>% 
  filter(cik==737243) %>% 
  count(company_name)
#> # Source:   lazy query [?? x 2]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#>   company_name               n      
#>   <chr>                      <int64>
#> 1 ACORN HOLDING CORP         87     
#> 2 ACORN VENTURE CAPITAL CORP 58     
#> 3 Valentec Systems, Inc      38

cusip_cik %>% 
  filter(cik==737243) %>% 
  count(cusip)
#> # Source:   lazy query [?? x 2]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#>   cusip     n      
#>   <chr>     <int64>
#> 1 004853107 10     
#> 2 <NA>      64

Created on 2020-04-12 by the reprex package (v0.3.0)

iangow commented 4 years ago

@bdcallen Note that ncusip on crsp.stocknames is only the first eight digits of the CUSIP (i.e., no check digit).

bdcallen commented 4 years ago

@iangow

> case_matches <- old_matches_lost %>% mutate(cusip = toupper(cusip)) %>% inner_join(cusip_cik) %>% collect()
> case_matches %>% distinct(cik, cusip)
# A tibble: 57 x 2
       cik cusip    
     <int> <chr>    
 1 1015483 26784F103
 2 1095330 87155S108
 3  919794 92672P108
 4  914478 79604V105
 5 1054836 M47095100
 6  920990 59163F105
 7 1016838 M81865103
 8  913293 01877H100
 9 1057083 69325Q105
10  906551 74726X105
# … with 47 more rows

So 57 of the 415 cases are cases where the new code has missed the lower case letters but retained the upper case letters. I seem to have found the offending line

cusip_fmt = '((?:[0-9A-Z]{1}[ -]{0,3}){6,9})'

So my program only looked for cases where the letters where in upper case. I could fix this by either looking for lower case letters as well, or just converting the text searched to upper case.

bdcallen commented 4 years ago

@iangow It seems that some old filings may be missing. I just did this for one of the cik, cusip pairs on the list

crsp=# SELECT * FROM edgar.cusip_cik_old
crsp-# WHERE cik = 317438;
                  file_name                  |   cusip   |  cik   |  company_name  | format
---------------------------------------------+-----------+--------+----------------+--------
 edgar/data/1078649/0001050502-02-000601.txt |           | 317438 | CLX ENERGY INC |
 edgar/data/1079197/0001076636-99-000032.txt | 125649202 | 317438 | CLX ENERGY INC | C
 edgar/data/1079199/0001076636-99-000027.txt | 125649202 | 317438 | CLX ENERGY INC | C
 edgar/data/317438/0001076636-99-000019.txt  | 125649202 | 317438 | CLX ENERGY INC | C
 edgar/data/317438/0001076636-99-000020.txt  | 125649202 | 317438 | CLX ENERGY INC | C
 edgar/data/317438/0001076636-99-000021.txt  | 125649202 | 317438 | CLX ENERGY INC | C
 edgar/data/317438/0001076636-99-000022.txt  | 125649202 | 317438 | CLX ENERGY INC | C
 edgar/data/317438/0001076636-99-000023.txt  | 125649202 | 317438 | CLX ENERGY INC | C
 edgar/data/317438/0001076636-99-000024.txt  | 125649202 | 317438 | CLX ENERGY INC | C
 edgar/data/317438/0001076636-99-000026.txt  | 125649202 | 317438 | CLX ENERGY INC | C
 edgar/data/317438/0001076636-99-000027.txt  | 125649202 | 317438 | CLX ENERGY INC | C
 edgar/data/317438/0001076636-99-000032.txt  | 125649202 | 317438 | CLX ENERGY INC | C
 edgar/data/317438/0001076636-99-000033.txt  | 125649202 | 317438 | CLX ENERGY INC | C
 edgar/data/1079188/0001076636-99-000026.txt | 125649202 | 317438 | CLX ENERGY INC | C
 edgar/data/317438/0001050502-02-000601.txt  |           | 317438 | CLX ENERGY INC |
 edgar/data/1079186/0001076636-99-000024.txt | 125649202 | 317438 | CLX ENERGY INC | C
 edgar/data/1079198/0001076636-99-000020.txt | 125649202 | 317438 | CLX ENERGY INC | C
 edgar/data/1079190/0001076636-99-000023.txt | 125649202 | 317438 | CLX ENERGY INC | C
 edgar/data/1079193/0001076636-99-000033.txt | 125649202 | 317438 | CLX ENERGY INC | C
 edgar/data/1079196/0001076636-99-000019.txt | 125649202 | 317438 | CLX ENERGY INC | C
 edgar/data/1079195/0001076636-99-000021.txt | 125649202 | 317438 | CLX ENERGY INC | C
 edgar/data/1079187/0001076636-99-000022.txt | 125649202 | 317438 | CLX ENERGY INC | C
 edgar/data/317438/0001050502-02-000601.txt  |           | 317438 | CLX ENERGY INC | C
 edgar/data/1078649/0001050502-02-000601.txt |           | 317438 | CLX ENERGY INC | C
(24 rows)

crsp=# ;
crsp=# SELECT * FROM edgar.cusip_cik
crsp-# WHERE file_name = 'edgar/data/317438/0001076636-99-000021.txt';
 file_name | cusip | check_digit | cik | company_name | formats
-----------+-------+-------------+-----+--------------+---------
(0 rows)

crsp=# SELECT * FROM edgar.filings
WHERE file_name = 'edgar/data/317438/0001076636-99-000021.txt';
 company_name | form_type | cik | date_filed | file_name
--------------+-----------+-----+------------+-----------
(0 rows)

So where have the filings gone, if their no longer in the bulk extract?

bdcallen commented 4 years ago

@iangow

> old_filings_lost <- cusip_cik_old %>% anti_join(cusip_cik, by = 'file_name')
> old_filings_lost %>% distinct(cik, cusip) %>% inner_join(old_matches_lost)
Joining, by = c("cik", "cusip")
# Source:   lazy query [?? x 3]
# Database: postgres 11.0.5 [bdcallen@/var/run/postgresql:5432/crsp]
      cik cusip         n
    <int> <chr>     <dbl>
 1  20639 023164710    18
 2  25890 228          16
 3  57201 22406P10     24
 4  59593 5351711      42
 5  81061 None         22
 6  89140 817732321    34
 7 105936 95751D310    12
 8 108516 9818111      21
 9 278128 NONE         14
10 313499 NONE         12
# … with more rows
> old_filings_lost %>% distinct(cik, cusip) %>% inner_join(old_matches_lost) %>% count()
Joining, by = c("cik", "cusip")
# Source:   lazy query [?? x 1]
# Database: postgres 11.0.5 [bdcallen@/var/run/postgresql:5432/crsp]
      n
  <dbl>
1    93
> filings <- tbl(pg, sql("SELECT * FROM edgar.filings"))
> old_filings_lost %>% inner_join(filings, by = 'file_name') 
# Source:   lazy query [?? x 9]
# Database: postgres 11.0.5 [bdcallen@/var/run/postgresql:5432/crsp]
  file_name                                   cusip       cik.x company_name.x            format company_name.y              form_type   cik.y date_filed
  <chr>                                       <chr>       <int> <chr>                     <chr>  <chr>                       <chr>       <int> <date>    
1 edgar/data/1332905/0001172661-17-000448.txt 56064Y100 1094742 MAINSTREET BANKSHARES INC C      RMB Capital Management, LLC SC 13G    1332905 2017-02-13
2 edgar/data/1094742/0001172661-17-000448.txt 56064Y100 1094742 MAINSTREET BANKSHARES INC C      MAINSTREET BANKSHARES INC   SC 13G    1094742 2017-02-13
3 edgar/data/1332905/0001172661-18-000468.txt 56064Y100 1094742 MAINSTREET BANKSHARES INC C      RMB Capital Management, LLC SC 13G/A  1332905 2018-02-13
4 edgar/data/1094742/0001172661-18-000468.txt 56064Y100 1094742 MAINSTREET BANKSHARES INC C      MAINSTREET BANKSHARES INC   SC 13G/A  1094742 2018-02-13
> old_filings_lost %>% inner_join(filings, by = 'file_name') %>% count()
# Source:   lazy query [?? x 1]
# Database: postgres 11.0.5 [bdcallen@/var/run/postgresql:5432/crsp]
      n
  <dbl>
1     4
> old_filings_lost %>% count()
# Source:   lazy query [?? x 1]
# Database: postgres 11.0.5 [bdcallen@/var/run/postgresql:5432/crsp]
      n
  <dbl>
1 39984

So the vast majority of filings in cusip_cik_old which are missing in cusip_cik are in fact no longer in filings either.

bdcallen commented 4 years ago

@iangow So the only four filings still in filings account for one cusip, cik pair (cik.x is the cik in cusip_cik_old). So 92 of the 415 pairs are from missing filings.

bdcallen commented 4 years ago

@iangow

> old_matches_lost %>% filter(nchar(cusip) < 6 | grepl('(not|none|com|december|common)', tolower(cusip)))  %>% anti_join(old_filings_lost %>% distinct(cik, cusip)) %>% count()
Joining, by = c("cik", "cusip")
# Source:   lazy query [?? x 1]
# Database: postgres 11.0.5 [bdcallen@/var/run/postgresql:5432/crsp]
      n
  <dbl>
1    82

There seem to be a substantion number of cases with stopwords, or useless cusips (minimum number of characters should be 6, because that is the minimum needed from a cusip to specify an issuer).

bdcallen commented 4 years ago

@iangow matches by case adds another 42

> case_matches <- old_matches_lost %>% mutate(cusip = toupper(cusip)) %>% inner_join(cusip_cik) 
Joining, by = c("cik", "cusip")
> case_matches %>% anti_join(old_filings_lost %>% distinct(cik, cusip)) %>% filter(!(nchar(cusip) < 6 | grepl('(not|none|com|december|common)', tolower(cusip)))) %>% count()
Joining, by = c("cik", "cusip")
# Source:   lazy query [?? x 1]
# Database: postgres 11.0.5 [bdcallen@/var/run/postgresql:5432/crsp]
      n
  <dbl>
1  3143
> case_matches %>% anti_join(old_filings_lost %>% distinct(cik, cusip)) %>% filter(!(nchar(cusip) < 6 | grepl('(not|none|com|december|common)', tolower(cusip)))) %>% distinct(cik, cusip) %>% count()
Joining, by = c("cik", "cusip")
# Source:   lazy query [?? x 1]
# Database: postgres 11.0.5 [bdcallen@/var/run/postgresql:5432/crsp]
      n
  <dbl>
1    42

So the problematic cases come to around 93 + 42 + 82 = 217, which is around half the cases in old_matches_lost

bdcallen commented 4 years ago

@iangow

> outstanding_df <- old_matches_lost %>% anti_join(case_matches %>% distinct(cik, cusip)) %>% anti_join(old_filings_lost %>% distinct(cik, cusip)) %>% filter(!(nchar(cusip) < 6 | grepl('(not|none|com|december|common|schedule|inapplicable)', tolower(cusip)))) %>% distinct(cik, cusip)
> outstanding_df %>% inner_join(cusip_cik, by = 'cik') %>% filter(cusip.y %~% cusip.x) %>% distinct(cik, cusip.x) %>% count()
# Source:   lazy query [?? x 1]
# Database: postgres 11.0.5 [bdcallen@/var/run/postgresql:5432/crsp]
      n
  <dbl>
1    69

So for an additional 69 cases, the cusip in the old table is contained inside a cusip of the new table.

bdcallen commented 4 years ago

@iangow

> outstanding_df %>% inner_join(cusip_cik, by = 'cik') %>% filter(cusip.x %~% cusip.y) %>% distinct(cik, cusip.x) %>% count()
# Source:   lazy query [?? x 1]
# Database: postgres 11.0.5 [bdcallen@/var/run/postgresql:5432/crsp]
      n
  <dbl>
1    24
> outstanding_df %>% inner_join(cusip_cik, by = 'cik') %>% filter(cusip.x %~% cusip.y & cusip.y %~% cusip.x) %>% distinct(cik, cusip.x) %>% count()
# Source:   lazy query [?? x 1]
# Database: postgres 11.0.5 [bdcallen@/var/run/postgresql:5432/crsp]
      n
  <dbl>
1     0

and an additional 24 in which the new cusip is containd in the old one. I should check these

bdcallen commented 4 years ago
> outstanding_df %>% inner_join(cusip_cik, by = 'cik') %>% filter(cusip.x %~% cusip.y) %>% distinct(cik, cusip.x, cusip.y) 
# Source:   lazy query [?? x 3]
# Database: postgres 11.0.5 [bdcallen@/var/run/postgresql:5432/crsp]
       cik cusip.x   cusip.y 
     <int> <chr>     <chr>   
 1   65297 05873J101 05873J10
 2  808015 59484E1   59484E  
 3  823546 088861L1  088861L 
 4  831532 450707101 45070710
 5  852952 9196105   919610  
 6 1005757 126341091 12634109
 7 1009356 795435100 79543510
 8 1010418 896411610 89641161
 9 1012477 03384881  338488  
10 1015172 05462R10  05462R1 
# … with more rows
> new_has_old <- outstanding_df %>% inner_join(cusip_cik, by = 'cik') %>% filter(cusip.y %~% cusip.x) %>% distinct(cik, cusip.x) 
> old_has_new <- outstanding_df %>% inner_join(cusip_cik, by = 'cik') %>% filter(cusip.x %~% cusip.y) %>% distinct(cik, cusip.x) 
> old_has_new %>% count()
# Source:   lazy query [?? x 1]
# Database: postgres 11.0.5 [bdcallen@/var/run/postgresql:5432/crsp]
      n
  <dbl>
1    24
> old_has_new %>% inner_join(new_has_old) 
Joining, by = c("cik", "cusip.x")
# Source:   lazy query [?? x 2]
# Database: postgres 11.0.5 [bdcallen@/var/run/postgresql:5432/crsp]
      cik cusip.x 
    <int> <chr>   
1  808015 59484E1 
2 1012477 03384881
3 1015172 05462R10
4 1034669 3377562 
5 1110650 G393001 
6 1119744 M975401 
7 1282977 3952591 
8 1436351 74623L1 
> old_has_new %>% inner_join(new_has_old) %>% count()
Joining, by = c("cik", "cusip.x")
# Source:   lazy query [?? x 1]
# Database: postgres 11.0.5 [bdcallen@/var/run/postgresql:5432/crsp]
      n
  <dbl>
1     8

So a third of the old_has_new cases match other cusips in new_has_old

bdcallen commented 4 years ago

So, I expanded the definition for new_has_old and old_has_new, I noticed I didn't account for the cases where the the old cusip had letters in lowercase. Thus I did the following

> new_has_old <- outstanding_df %>% inner_join(cusip_cik, by = 'cik') %>% filter(cusip.y %~% toupper(cusip.x)) %>% distinct(cik, cusip.x) 
> old_has_new <- outstanding_df %>% inner_join(cusip_cik, by = 'cik') %>% filter(toupper(cusip.x) %~% cusip.y) %>% distinct(cik, cusip.x) 
> old_has_new %>% anti_join(new_has_old)
Joining, by = c("cik", "cusip.x")
# Source:   lazy query [?? x 2]
# Database: postgres 11.0.5 [bdcallen@/var/run/postgresql:5432/crsp]
       cik cusip.x  
     <int> <chr>    
 1   65297 05873J101
 2  823546 088861L1 
 3  831532 450707101
 4  852952 9196105  
 5 1005757 126341091
 6 1009356 795435100
 7 1010418 896411610
 8 1038186 09255C403
 9 1089590 645378H10
10 1113679 873942A10
# … with more rows
> new_has_old %>% count()
# Source:   lazy query [?? x 1]
# Database: postgres 11.0.5 [bdcallen@/var/run/postgresql:5432/crsp]
      n
  <dbl>
1   124
> old_has_new %>% count()
# Source:   lazy query [?? x 1]
# Database: postgres 11.0.5 [bdcallen@/var/run/postgresql:5432/crsp]
      n
  <dbl>
1    77
> outstanding_df %>% count()
# Source:   lazy query [?? x 1]
# Database: postgres 11.0.5 [bdcallen@/var/run/postgresql:5432/crsp]
      n
  <dbl>
1   237
> old_has_new %>% inner_join(new_has_old) %>% count()
Joining, by = c("cik", "cusip.x")
# Source:   lazy query [?? x 1]
# Database: postgres 11.0.5 [bdcallen@/var/run/postgresql:5432/crsp]
      n
  <dbl>
1    63
> old_has_new %>% inner_join(new_has_old) 
Joining, by = c("cik", "cusip.x")
# Source:   lazy query [?? x 2]
# Database: postgres 11.0.5 [bdcallen@/var/run/postgresql:5432/crsp]
      cik cusip.x  
    <int> <chr>    
 1 724606 69840w108
 2 745287 45881k104
 3 749647 15117n305
 4 770944 05873k108
 5 808015 59484E1  
 6 811640 20602t106
 7 812890 58445p105
 8 816761 88076w103
 9 826326 03878k207
10 860510 29874m103
# … with more rows
> old_has_new %>% inner_join(new_has_old) %>% collect()
Joining, by = c("cik", "cusip.x")
# A tibble: 63 x 2
      cik cusip.x  
 *  <int> <chr>    
 1 724606 69840w108
 2 745287 45881k104
 3 749647 15117n305
 4 770944 05873k108
 5 808015 59484E1  
 6 811640 20602t106
 7 812890 58445p105
 8 816761 88076w103
 9 826326 03878k207
10 860510 29874m103
# … with 53 more rows
> old_has_new %>% anti_join(new_has_old) %>% collect()
Joining, by = c("cik", "cusip.x")
# A tibble: 14 x 2
       cik cusip.x  
 *   <int> <chr>    
 1   65297 05873J101
 2  823546 088861L1 
 3  831532 450707101
 4  852952 9196105  
 5 1005757 126341091
 6 1009356 795435100
 7 1010418 896411610
 8 1038186 09255C403
 9 1089590 645378H10
10 1113679 873942A10
11 1125259 143658300
12 1348259 BMG9319H1
13 1488775 1846922  
14 1488775 1846923  

I then looked for the cusip BMG9319H1 as an example

crsp=# SELECT * FROM edgar.cusip_cik
WHERE cusip  = 'BMG9319H1';
 file_name | cusip | check_digit | cik | company_name | formats
-----------+-------+-------------+-----+--------------+---------
(0 rows)

crsp=# SELECT * FROM edgar.cusip_cik_old
WHERE cusip  = 'BMG9319H1';
                  file_name                  |   cusip   |   cik   |     company_name     | format
---------------------------------------------+-----------+---------+----------------------+--------
 edgar/data/70858/0000903423-09-000136.txt   | BMG9319H1 | 1348259 | VALIDUS HOLDINGS LTD | C
 edgar/data/1348259/0000947871-08-000163.txt | BMG9319H1 | 1348259 | VALIDUS HOLDINGS LTD | C
 edgar/data/1348259/0001297402-08-000008.txt | BMG9319H1 | 1348259 | VALIDUS HOLDINGS LTD | C
 edgar/data/65100/0000947871-08-000163.txt   | BMG9319H1 | 1348259 | VALIDUS HOLDINGS LTD | C
 edgar/data/1348259/0000898286-08-000043.txt | BMG9319H1 | 1348259 | VALIDUS HOLDINGS LTD | C
 edgar/data/1348259/0000903423-09-000136.txt | BMG9319H1 | 1348259 | VALIDUS HOLDINGS LTD | C
 edgar/data/1348259/0000903423-10-000083.txt | BMG9319H1 | 1348259 | VALIDUS HOLDINGS LTD | C
 edgar/data/1348259/0000903423-11-000086.txt | BMG9319H1 | 1348259 | VALIDUS HOLDINGS LTD | C
 edgar/data/1348259/0000903423-12-000092.txt | BMG9319H1 | 1348259 | VALIDUS HOLDINGS LTD | C
 edgar/data/70858/0000903423-10-000083.txt   | BMG9319H1 | 1348259 | VALIDUS HOLDINGS LTD | C
 edgar/data/70858/0000903423-11-000086.txt   | BMG9319H1 | 1348259 | VALIDUS HOLDINGS LTD | C
 edgar/data/70858/0000903423-12-000092.txt   | BMG9319H1 | 1348259 | VALIDUS HOLDINGS LTD | C
 edgar/data/898286/0001297402-08-000008.txt  | BMG9319H1 | 1348259 | VALIDUS HOLDINGS LTD | C
 edgar/data/898286/0000898286-08-000043.txt  | BMG9319H1 | 1348259 | VALIDUS HOLDINGS LTD | C
(14 rows)

crsp=# SELECT * FROM edgar.filings
crsp-# WHERE file_name = '
crsp'# crsp=#
crsp'# ;
crsp'# ';
 company_name | form_type | cik | date_filed | file_name
--------------+-----------+-----+------------+-----------
(0 rows)

crsp=# SELECT * FROM edgar.filings
WHERE file_name = 'edgar/data/1348259/0000903423-09-000136.txt';
     company_name     | form_type |   cik   | date_filed |                  file_name
----------------------+-----------+---------+------------+---------------------------------------------
 VALIDUS HOLDINGS LTD | SC 13G/A  | 1348259 | 2009-02-18 | edgar/data/1348259/0000903423-09-000136.txt
(1 row)

crsp=# SELECT * FROM edgar.cusip_cik
WHERE file_name = 'edgar/data/1348259/0000903423-09-000136.txt';
                  file_name                  |   cusip   | check_digit |   cik   |     company_name     | formats
---------------------------------------------+-----------+-------------+---------+----------------------+---------
 edgar/data/1348259/0000903423-09-000136.txt | 9319H1025 |           1 | 1348259 | VALIDUS HOLDINGS LTD | AB
(1 row)

Here's the filing at the end . So this is an interesting case where a filer has put in a cusip with 12 digits! The old table seems to have caught the first 9 digts, and the new table the last 9.

bdcallen commented 4 years ago

@iangow Here's an example for the first row of the 14 cases from old_has_new which are not in new_has_old, for the pair cik = 65297, cusip = 05873J101.

crsp=# SELECT * FROM edgar.cusip_cik
crsp-# WHERE cusip = '05873J101';
 file_name | cusip | check_digit | cik | company_name | formats
-----------+-------+-------------+-----+--------------+---------
(0 rows)

crsp=# SELECT * FROM edgar.cusip_cik_old
WHERE cusip = '05873J101';
                 file_name                  |   cusip   |  cik  |     company_name      | format
--------------------------------------------+-----------+-------+-----------------------+--------
 edgar/data/65297/0000315066-94-001914.txt  | 05873J101 | 65297 | BALLYS GRAND INC /DE/ | C
 edgar/data/315066/0000315066-94-001509.txt | 05873J101 | 65297 | BALLYS GRAND INC /DE/ | C
 edgar/data/65297/0000315066-94-001509.txt  | 05873J101 | 65297 | BALLYS GRAND INC /DE/ | C
 edgar/data/315066/0000315066-94-001914.txt | 05873J101 | 65297 | BALLYS GRAND INC /DE/ | C
 edgar/data/65297/0000315066-95-002519.txt  | 05873J101 | 65297 | BALLYS GRAND INC /DE/ | C
 edgar/data/65297/0000315066-95-001152.txt  | 05873J101 | 65297 | BALLYS GRAND INC /DE/ | C
 edgar/data/65297/0000315066-95-002396.txt  | 05873J101 | 65297 | BALLYS GRAND INC /DE/ | C
 edgar/data/65297/0000315066-95-002399.txt  | 05873J101 | 65297 | BALLYS GRAND INC /DE/ | C
 edgar/data/315066/0000315066-95-002399.txt | 05873J101 | 65297 | BALLYS GRAND INC /DE/ | C
 edgar/data/315066/0000315066-95-001152.txt | 05873J101 | 65297 | BALLYS GRAND INC /DE/ | C
 edgar/data/315066/0000315066-95-002396.txt | 05873J101 | 65297 | BALLYS GRAND INC /DE/ | C
 edgar/data/315066/0000315066-96-000094.txt | 05873J101 | 65297 | BALLYS GRAND INC /DE/ | C
 edgar/data/65297/0000315066-96-000094.txt  | 05873J101 | 65297 | BALLYS GRAND INC /DE/ | C
 edgar/data/65297/0000315066-96-001295.txt  | 05873J101 | 65297 | BALLYS GRAND INC /DE/ | C
 edgar/data/65297/0000315066-96-001305.txt  | 05873J101 | 65297 | BALLYS GRAND INC /DE/ | C
 edgar/data/315066/0000315066-95-002519.txt | 05873J101 | 65297 | BALLYS GRAND INC /DE/ | C
 edgar/data/315066/0000315066-96-001295.txt | 05873J101 | 65297 | BALLYS GRAND INC /DE/ | C
 edgar/data/315066/0000315066-96-001305.txt | 05873J101 | 65297 | BALLYS GRAND INC /DE/ | C
(18 rows)

crsp=# SELECT * FROM edgar.cusip_cik
WHERE file_name = 'edgar/data/65297/0000315066-95-002399.txt';
                 file_name                 | cusip | check_digit |  cik  |     company_name      | formats
-------------------------------------------+-------+-------------+-------+-----------------------+---------
 edgar/data/65297/0000315066-95-002399.txt |       |             | 65297 | BALLYS GRAND INC /DE/ |
(1 row)

For this case, my python code seems to have missed it because the relevant line has Cusip instead of CUSIP, since cusip_hdr in my code is

cusip_hdr = r'CUSIP\s+(?:No\.|NO\.|#|Number|NUMBER):?'

However, cusip_hdr is not that different in the original Perl code

# Regular expressions
$cusip_hdr = 'CUSIP\s+(?:No\.|#|Number):?';

Is string matching via regular expressions case insensitive in Perl? Or have I missed a line where the text was converted to upper or lower case?

bdcallen commented 4 years ago

@iangow Here's another example, this time for cik = 831532, cusip = 450707101.

crsp=# SELECT * FROM edgar.cusip_cik
WHERE cusip = '450707101';
 file_name | cusip | check_digit | cik | company_name | formats
-----------+-------+-------------+-----+--------------+---------
(0 rows)

crsp=# SELECT * FROM edgar.cusip_cik
WHERE cusip ~ '450707101';
 file_name | cusip | check_digit | cik | company_name | formats
-----------+-------+-------------+-----+--------------+---------
(0 rows)

crsp=# SELECT * FROM edgar.cusip_cik_old
WHERE cusip = '450707101';
                  file_name                  |   cusip   |  cik   |      company_name       | format
---------------------------------------------+-----------+--------+-------------------------+--------
 edgar/data/831532/0000950124-97-005515.txt  | 450707101 | 831532 | ATRIX INTERNATIONAL INC | C
 edgar/data/831532/0000950124-97-006027.txt  | 450707101 | 831532 | ATRIX INTERNATIONAL INC | C
 edgar/data/831532/0000950124-97-006224.txt  | 450707101 | 831532 | ATRIX INTERNATIONAL INC | C
 edgar/data/831532/0000950124-97-006313.txt  | 450707101 | 831532 | ATRIX INTERNATIONAL INC | C
 edgar/data/1037148/0000950124-97-004799.txt | 450707101 | 831532 | ATRIX INTERNATIONAL INC | C
 edgar/data/831532/0000950124-97-004799.txt  | 450707101 | 831532 | ATRIX INTERNATIONAL INC | C
 edgar/data/831532/0000950124-99-004018.txt  | 450707101 | 831532 | ATRIX INTERNATIONAL INC | C
 edgar/data/831532/0000950124-97-002109.txt  | 450707101 | 831532 | ATRIX INTERNATIONAL INC | C
 edgar/data/1037148/0000950124-97-002109.txt | 450707101 | 831532 | ATRIX INTERNATIONAL INC | C
 edgar/data/1037148/0000950124-97-005515.txt | 450707101 | 831532 | ATRIX INTERNATIONAL INC | C
 edgar/data/1037148/0000950124-97-006027.txt | 450707101 | 831532 | ATRIX INTERNATIONAL INC | C
 edgar/data/1037148/0000950124-97-006224.txt | 450707101 | 831532 | ATRIX INTERNATIONAL INC | C
 edgar/data/1037148/0000950124-97-006313.txt | 450707101 | 831532 | ATRIX INTERNATIONAL INC | C
 edgar/data/831532/0000950124-98-007145.txt  | 450707101 | 831532 | ATRIX INTERNATIONAL INC | C
 edgar/data/831532/0000950124-98-002563.txt  | 450707101 | 831532 | ATRIX INTERNATIONAL INC | C
 edgar/data/831532/0000950124-98-003097.txt  | 450707101 | 831532 | ATRIX INTERNATIONAL INC | C
 edgar/data/1037148/0000950124-98-002563.txt | 450707101 | 831532 | ATRIX INTERNATIONAL INC | C
 edgar/data/1037148/0000950124-98-003097.txt | 450707101 | 831532 | ATRIX INTERNATIONAL INC | C
 edgar/data/1037148/0000950124-99-004018.txt | 450707101 | 831532 | ATRIX INTERNATIONAL INC | C
 edgar/data/1037148/0000950124-98-007145.txt | 450707101 | 831532 | ATRIX INTERNATIONAL INC | C
(20 rows)

crsp=# SELECT * FROM edgar.cusip_cik
WHERE file_name = 'edgar/data/1037148/0000950124-97-006313.txt';
                  file_name                  |   cusip   | check_digit |  cik   |      company_name       | formats
---------------------------------------------+-----------+-------------+--------+-------------------------+---------
 edgar/data/1037148/0000950124-97-006313.txt | 04962P102 |           2 | 831532 | ATRIX INTERNATIONAL INC | A
 edgar/data/1037148/0000950124-97-006313.txt | 45070710  |           4 | 831532 | ATRIX INTERNATIONAL INC | C
(2 rows)

So this example seems to have two cusips, with the latter one being the one of interest with regards to the matching. The latter cusip appears at the start of the cover page and it seems to be badly formatted, with the issue number (the 7th and 8th digit, 10) being duplicated. Hence, the '1' at the end of the old cusip. The new table contains the first 8 digits of the latter cusip, and with the check_digit column, it can be inferred that the proper candidate is 450707104.

Both the cusips found for this filing appear in stocknames, so as an aside it's another example of multiple cusips with one of them not being the proper cusip for the stock in question

> stocknames %>% filter(ncusip == "45070710")
# Source:   lazy query [?? x 16]
# Database: postgres 11.0.5 [bdcallen@/var/run/postgresql:5432/crsp]
  permno permco namedt     nameenddt  cusip    ncusip   ticker comnam               hexcd exchcd siccd shrcd shrcls st_date    end_date   namedum
   <int>  <int> <date>     <date>     <chr>    <chr>    <chr>  <chr>                <dbl>  <dbl> <dbl> <dbl> <chr>  <date>     <date>       <dbl>
1  79706  12460 1993-10-06 1998-06-16 68272W10 45070710 IVIP   I V I PUBLISHING INC     3      3  7370    11 NA     1993-10-29 2000-09-29       2
> stocknames %>% filter(ncusip == "04962P10")
# Source:   lazy query [?? x 16]
# Database: postgres 11.0.5 [bdcallen@/var/run/postgresql:5432/crsp]
  permno permco namedt     nameenddt  cusip    ncusip   ticker comnam                  hexcd exchcd siccd shrcd shrcls st_date    end_date   namedum
   <int>  <int> <date>     <date>     <chr>    <chr>    <chr>  <chr>                   <dbl>  <dbl> <dbl> <dbl> <chr>  <date>     <date>       <dbl>
1  76883  10884 1991-04-10 1997-12-15 04962P30 04962P10 ATXI   ATRIX INTERNATIONAL INC     3      3  3420    11 NA     1991-04-30 1999-06-30       2
bdcallen commented 4 years ago

I then looked for the cusip BMG9319H1 as an example

This one appears to have come from an ISIN

Screenshot (36)

Futhermore, the ISIN is derived from the cusip, with the first two digits standing for the country, the next 9 being the actual cusip, then the last being an additional check digit. Hence I did

> stocknames %>% filter(cusip == "G9319H10")
# Source:   lazy query [?? x 16]
# Database: postgres 11.0.5 [bdcallen@/var/run/postgresql:5432/crsp]
  permno permco namedt     nameenddt  cusip    ncusip   ticker comnam               hexcd exchcd siccd shrcd shrcls st_date    end_date   namedum
   <int>  <int> <date>     <date>     <chr>    <chr>    <chr>  <chr>                <dbl>  <dbl> <dbl> <dbl> <chr>  <date>     <date>       <dbl>
1  92191  52537 2007-07-25 2018-07-17 G9319H10 G9319H10 VR     VALIDUS HOLDINGS LTD     1      1  6331    12 NA     2007-07-31 2018-07-31       2

So it appears the correct cusip for this case is in fact G9319H102. In fact this cusip appears under the appropriate cik (1348259) in cusip_cik

crsp=# SELECT COUNT(*) FROM edgar.cusip_cik
WHERE cusip = 'G9319H102'
crsp-# AND cik = 1348259;
 count
-------
    74
(1 row)
iangow commented 4 years ago

OK. Here's a summary of what I think you have discovered. Let me know if I missed anything. It may make sense to make a new issue for each of the following and to implement them in some order (e.g., "easy to do and important to do" would come first); see below for some more ideas.

  1. We've lost cases with Cusip instead of CUSIP. I would just add Cusip back as an explicit alternative. Perl regular expressions are (of course) case-sensitive, but there may be a point in the code where I turned this off. (In fact, I think the i flag at the end of ($lines =~ /($cusip_fmt)\s+(?:[_-]{9,})?\s*\(CUSIP Number\)/si) has this effect. It's been a while, so I don't remember. We may want to be more judicious than simply eliminating case-sensitivity, though there's no doubt that CuSiP NUmbER can only refer to a CUSIP!)
  2. We've lost cases where the CUSIP included characters in [a-z]. I think we should do what you suggest and look for those, but I think we should convert them to upper case before storing them if that's what CUSIPs should be.
  3. We've lost cases where there is no longer a listing in edgar.filings. These surprise me, as I thought entries were not deleted from the underlying data. In this case, we have nothing in edgar.filings even though the filing is still there. I think there's not much to be done about these for this issue. (It may warrant separate investigation just to understand the scale of the issue, however.)
  4. We have different results when the CUSIPs over 9 digits. I think the solution here is to write the regular expressions to capture the full CUSIP and to ditch CUSIPs that are too long to be valid. Capturing part of a twelve-digit CUSIP is just a recipe for bad matches in my view.
  5. We have issues with filings with multiple CUSIPs. This is addressed in another issue (#77); let's keep that there.
  6. We have some CUSIPs that are actually ISINs (these will appear as "bad CUSIPs" if we apply the idea in (4) above). I'm not sure we want to bother with these. Only if we had a number of valid ISINs that yielded CUSIP-CIK matches that we wouldn't otherwise get would be get into the business of detecting ISINs and converting them to CUSIPs.

I think it may be helpful to have a table that contains a sample of filings for testing. This need not be a random sample (e.g., you could oversample cases where we seem to have issues), but it might be good to have some randomly selected filings in there. I think 1000 or 10000 fillings would probably be enough. The idea would be that you could change the code, then run it on the test sample and compare with the results of the prior run of code before committing the code.

My assumption is that when we change the code we really need to go run it on the whole set of filings again so that we have some degree of consistency. My assumption is that doing so is expensive, so we want to accumulate several changes to the code before we go do that.

iangow commented 4 years ago

@bdcallen If we have captured all the issues raised here elsewhere (even if just by making a new issue), then we should close this one. If I have correctly summarized the above here and we have issues for each item I raised there, then I think this can be closed. (I'll close this assume that this is true.)