mccgr / edgar

Code to manage data related to SEC EDGAR
31 stars 15 forks source link

Examine CUSIPs that match to multiple CIKs #90

Closed iangow closed 4 years ago

iangow commented 4 years ago

@bdcallen This builds on one aspect of #89. My question is: Are any of the matches in problems valid? (Note that I ran this code on a local copy of RStudio using reprex::reprex() so that I can paste code that can be run. I also ran it on the Z640 machine by first copying from here, then running reprex::reprex_clean to clean it up, then pasting in an RStudio Server source window, highlighting, then selecting Reprex selection from the Addins menu in RStudio Server.)

Sys.setenv(PGHOST="10.101.13.99", PGDATABASE="crsp", PGPORT=5432L)

library(dplyr, warn.conflicts = FALSE)
library(DBI)

pg <- dbConnect(RPostgres::Postgres())
rs <- dbExecute(pg, "SET search_path TO edgar")
rs <- dbExecute(pg, "SET work_mem TO '8GB'")

cusip_cik <- tbl(pg, "cusip_cik")
cusip_cik
#> # Source:   table<cusip_cik> [?? x 6]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#>    file_name              cusip   check_digit    cik company_name        formats
#>    <chr>                  <chr>         <int>  <int> <chr>               <chr>  
#>  1 edgar/data/828750/000… 37937A…           7 8.29e5 GLOBAL SPORTS INC   AC     
#>  2 edgar/data/828750/000… 783756…           9 8.29e5 GLOBAL SPORTS INC   C      
#>  3 edgar/data/946392/000… 37936U…           4 9.46e5 GLOBAL TELESYSTEMS… ABC    
#>  4 edgar/data/946392/000… <NA>             NA 9.46e5 GLOBAL TELESYSTEMS… <NA>   
#>  5 edgar/data/946392/000… 37936U…           4 9.46e5 GLOBAL TELESYSTEMS… C      
#>  6 edgar/data/946392/000… 37936U…           4 9.46e5 GLOBAL TELESYSTEMS… C      
#>  7 edgar/data/832324/000… 37937R            1 8.32e5 GLOBALNET FINANCIA… ABC    
#>  8 edgar/data/772129/000… 931210…           3 9.13e5 WALDEN RESIDENTIAL… ACD    
#>  9 edgar/data/1029553/00… 383486…           7 1.03e6 GO2NET INC          ABCD   
#> 10 edgar/data/1030949/00… 37936U…           4 9.46e5 GLOBAL TELESYSTEMS… ABC    
#> # … with more rows

cusip_9s <- 
    cusip_cik %>%
    filter(nchar(cusip)==9L) %>%
    filter(right(cusip, 1L)==as.character(check_digit)) %>%
    compute()

cusip_9s_mult <-
    cusip_9s %>%
    group_by(cusip, cik) %>%
    summarize(count = n()) %>%
    ungroup() %>%
    filter(count >= 10) %>%
    compute()

problems <- 
    cusip_9s_mult %>% 
    group_by(cusip) %>% 
    filter(n() > 1) %>% 
    ungroup() %>%
    compute()
#> Warning: `lang_name()` is deprecated as of rlang 0.2.0.
#> Please use `call_name()` instead.
#> This warning is displayed once per session.
#> Warning: `lang()` is deprecated as of rlang 0.2.0.
#> Please use `call2()` instead.
#> This warning is displayed once per session.

problems
#> # Source:   table<dbplyr_007> [?? x 3]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#>    cusip         cik count  
#>    <chr>       <int> <int64>
#>  1 000000000 1055160  16    
#>  2 000000000 1043219  10    
#>  3 000000000  730464  10    
#>  4 000000000 1396446  12    
#>  5 000000000 1026980  10    
#>  6 001282102 1413898 162    
#>  7 001282102  356080  16    
#>  8 001547108  918192  16    
#>  9 001547108  918160 226    
#> 10 001693100 1048740  10    
#> # … with more rows

problems %>%
    count()
#> # Source:   lazy query [?? x 1]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#>   n      
#>   <int64>
#> 1 1151

problems %>% 
    filter(substr(cusip, 1L, 6L) == "464287")
#> # Source:   lazy query [?? x 3]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#>   cusip         cik count  
#>   <chr>       <int> <int64>
#> 1 464287176 1006249 10     
#> 2 464287176 1100663 10     
#> 3 464287432 1159508 38     
#> 4 464287432 1100663 12     
#> 5 464287481 1006249 10     
#> 6 464287481 1100663 16

Created on 2020-05-13 by the reprex package (v0.3.0)

bdcallen commented 4 years ago

@iangow Looks like there are a bunch of repeat offenders, which file their own cusips, not those of the security in which they're claiming beneficial ownership, with Wachovia Corp, the example I gave in another issue, right at the top

View(cusip_cik %>% filter(cusip_length == 9) %>% 
              filter(substr(cusip, 9, 9) == as.character(check_digit)) %>% 
              distinct(cik, cusip) %>% 
              group_by(cusip) %>% summarise(num_ciks = n()) %>% 
              arrange(desc(num_ciks)) %>% mutate(ncusip = substr(cusip, 1, 8)) %>%
              inner_join(stocknames, by = 'ncusip') %>% 
              distinct(cusip.x, num_ciks, comnam) %>% rename(cusip = cusip.x) %>% slice(1:20)
cusip num_ciks comnam
1 929903102 251 WACHOVIA CORP 2ND NEW
2 337358105 151 FIRST UNION CORP
3 769667106 118 RIVUS BOND FUND
4 053469102 82 AVALON PROPERTIES INC
5 94856P102 44 WEEKS CORP
6 253798102 39 DIGI INTERNATIONAL INC
7 232109108 26 CUTERA INC
8 871607107 23 SYNOPSYS INC
9 320228109 22 FIRST FINANCIAL FUND INC
10 693315103 22 P H C INC MA
11 358433100 21 FRIEDMAN BILLINGS RAMSEY GRP IN
12 521863308 20 LEAP WIRELESS INTL INC
13 03232P405 18 AMSURG CORP
14 75966V105 17 RENAISSANCE CAP GR & INC FD III
15 957070105 17 WESTAFF INC
16 749651105 16 R O C TAIWAN FUND
17 358434108 15 FRIEDMAN BILLINGS RAMSEY GRP NE
18 02043Q107 13 ALNYLAM PHARMACEUTICALS INC
19 435763107 13 HOLLY ENERGY PARTNERS L P
20 775371107 13 ROHM & HAAS CO
bdcallen commented 4 years ago

@iangow

cusip_cik %>% filter(cusip6 == '73935X' & company_name == 'POWERSHARES EXCHANGE TRADED FUND TRUST') %>% 
              distinct(file_name) %>% inner_join(cusip_cik) %>% 
              select(file_name, cik, cusip, cusip6, multiplicity, check_digit, company_name, formats) %>% View()

Looking at the first few from this view we get

file_name cik cusip cusip6 multiplicity check_digit company_name formats
1 edgar/data/1209466/0000072971-09-000028.txt 1209466 73935X146 73935X 6 6 POWERSHARES EXCHANGE TRADED FUND TRUST ABC
2 edgar/data/1209466/0000072971-09-000575.txt 1209466 73935X146 73935X 6 6 POWERSHARES EXCHANGE TRADED FUND TRUST ABC
3 edgar/data/1209466/0000072971-10-000010.txt 1209466 73935X146 73935X 6 6 POWERSHARES EXCHANGE TRADED FUND TRUST ABC
4 edgar/data/72971/0000072971-09-000028.txt 1209466 73935X146 73935X 6 6 POWERSHARES EXCHANGE TRADED FUND TRUST ABC
5 edgar/data/72971/0000072971-10-000010.txt 1209466 73935X146 73935X 6 6 POWERSHARES EXCHANGE TRADED FUND TRUST ABC
6 edgar/data/72971/0000072971-09-000575.txt 1209466 73935X146 73935X 6 6 POWERSHARES EXCHANGE TRADED FUND TRUST ABC
7 edgar/data/1209466/0001074683-08-000118.txt 1209466 73935X153 73935X 8 3 POWERSHARES EXCHANGE TRADED FUND TRUST AB
8 edgar/data/1209466/0001074683-08-000118.txt 1209466 769667106 769667 256 6 POWERSHARES EXCHANGE TRADED FUND TRUST C
9 edgar/data/1209466/0001074683-08-000118.txt 1209466 929903102 929903 324 2 POWERSHARES EXCHANGE TRADED FUND TRUST D
10 edgar/data/36995/0001074683-08-000118.txt 1209466 73935X153 73935X 8 3 POWERSHARES EXCHANGE TRADED FUND TRUST AB
11 edgar/data/36995/0001074683-08-000118.txt 1209466 769667106 769667 256 6 POWERSHARES EXCHANGE TRADED FUND TRUST C
12 edgar/data/36995/0001074683-08-000118.txt 1209466 929903102 929903 324 2 POWERSHARES EXCHANGE TRADED FUND TRUST D
13 edgar/data/1209466/0001074683-08-000447.txt 1209466 73935X153 73935X 8 3 POWERSHARES EXCHANGE TRADED FUND TRUST AB
14 edgar/data/1209466/0001074683-08-000447.txt 1209466 769667106 769667 256 6 POWERSHARES EXCHANGE TRADED FUND TRUST C
15 edgar/data/1209466/0001074683-08-000447.txt 1209466 929903102 929903 324 2 POWERSHARES EXCHANGE TRADED FUND TRUST D
16 edgar/data/36995/0001074683-08-000447.txt 1209466 73935X153 73935X 8 3 POWERSHARES EXCHANGE TRADED FUND TRUST AB
17 edgar/data/36995/0001074683-08-000447.txt 1209466 769667106 769667 256 6 POWERSHARES EXCHANGE TRADED FUND TRUST C
18 edgar/data/36995/0001074683-08-000447.txt 1209466 929903102 929903 324 2 POWERSHARES EXCHANGE TRADED FUND TRUST D
19 edgar/data/36995/0001074683-08-000556.txt 1209466 73935X153 73935X 8 3 POWERSHARES EXCHANGE TRADED FUND TRUST AB
20 edgar/data/36995/0001074683-08-000556.txt 1209466 769667106 769667 256 6 POWERSHARES EXCHANGE TRADED FUND TRUST C
21 edgar/data/36995/0001074683-08-000556.txt 1209466 929903102 929903 324 2 POWERSHARES EXCHANGE TRADED FUND TRUST D

We can see one of the filings, edgar/data/1209466/0001074683-08-000118.txt has three cusips, 73935X153, 769667106 and the cusip for Wachovia Corp, 929903102. 73935X153 is the correct one

> stocknames %>% filter(ncusip == '73935X15')
# A tibble: 3 x 16
  permno permco namedt     nameenddt  cusip    ncusip   ticker comnam                  hexcd exchcd siccd shrcd shrcls st_date    end_date   namedum
   <int>  <int> <date>     <date>     <chr>    <chr>    <chr>  <chr>                   <dbl>  <dbl> <dbl> <dbl> <chr>  <date>     <date>       <dbl>
1  91876  44072 2007-03-01 2007-12-30 46137V83 73935X15 PDP    POWERSHARES E T F TRUST     3      1  6726    73 NA     2007-03-30 2019-12-31       2
2  91876  44072 2007-12-31 2016-12-06 46137V83 73935X15 PDP    POWERSHARES E T F TRUST     3      4  6726    73 NA     2007-03-30 2019-12-31       2
3  91876  44072 2016-12-07 2018-06-03 46137V83 73935X15 PDP    POWERSHARES E T F TRUST     3      3  6726    73 NA     2007-03-30 2019-12-31       2

> issuers %>% filter(cusip6 == '73935X') %>% select(cusip6, issuer_name_1, issuer_name_2)
# A tibble: 1 x 3
  cusip6 issuer_name_1                  issuer_name_2
  <chr>  <chr>                          <chr>        
1 73935X POWERSHARES EXCHANGE-TRADED FD TR

and the other cusip, 769667106, is for a different company, RIVUS BOND FUND.

> stocknames %>% filter(ncusip == '76966710')
# A tibble: 1 x 16
  permno permco namedt     nameenddt  cusip    ncusip   ticker comnam          hexcd exchcd siccd shrcd shrcls st_date    end_date   namedum
   <int>  <int> <date>     <date>     <chr>    <chr>    <chr>  <chr>           <dbl>  <dbl> <dbl> <dbl> <chr>  <date>     <date>       <dbl>
1  54228  20601 2006-10-10 2011-12-08 45781W10 76966710 BDF    RIVUS BOND FUND     1      1  6726    14 NA     1972-02-29 2019-12-31       2

The reason why the cusip for Rivus Bond Fund would appear is still somewhat mysterious to me. It clearly doesn't share the same cik as for Powershare though

> ciks %>% filter(company_name == 'RIVUS BOND FUND')
# A tibble: 1 x 2
    cik company_name   
  <int> <chr>          
1 30125 RIVUS BOND FUND
> ciks %>% filter(company_name == 'POWERSHARES EXCHANGE TRADED FUND TRUST')
# A tibble: 1 x 2
      cik company_name                          
    <int> <chr>                                 
1 1209466 POWERSHARES EXCHANGE TRADED FUND TRUST
bdcallen commented 4 years ago

@iangow It's seem multiplicity doesn't help with with Rivus, either by looking at ciks for the one cusip,

cusip_cik %>% distinct(cusip, cik, multiplicity) %>% filter(cusip == '769667106') %>% left_join(ciks) %>% arrange(desc(multiplicity)) %>% View()
cusip cik multiplicity company_name
1 769667106 1209466 256 INVESCO EXCHANGE-TRADED FUND TRUST
2 769667106 1209466 256 POWERSHARES EXCHANGE TRADED FUND TRUST
3 769667106 1100663 168 ISHARES TRUST
4 769667106 1064642 92 INDEX EXCHANGE LISTED SECURITIES TRUST
5 769667106 1064642 92 SPDR SERIES TRUST
6 769667106 1064642 92 STREETTRACKS SERIES TRUST
7 769667106 52848 76 VANGUARD WORLD FUND INC
8 769667106 52848 76 VANGUARD WORLD FUND
9 769667106 52848 76 VANGUARD WORLD FUNDS
10 769667106 1350487 70 WISDOMTREE TRUST
11 769667106 1064641 46 SECTOR SPDR TRUST
12 769667106 1064641 46 SELECT SECTOR SPDR TRUST
13 769667106 930667 42 FOREIGN FUND INC
14 769667106 930667 42 ISHARES INC
15 769667106 930667 42 ISHARES, INC.
16 769667106 930667 42 WEBS INDEX FUND INC
17 769667106 930667 42 WFBS INDEX FUND INC
18 769667106 1378872 40 INVESCO EXCHANGE-TRADED FUND TRUST II
19 769667106 1378872 40 POWERSHARES EXCHANGE-TRADED FUND TRUST II
20 769667106 1378872 40 POWERSHARES GLOBAL EXCHANGE-TRADED FUND TRUST
21 769667106 1125816 38 FIRST TRUST ADVISORS LP
22 769667106 893818 34 BARCLAYS GLOBAL INVESTORS FUNDS INC
23 769667106 893818 34 BARCLAYS GLOBAL INVESTORS FUNDS
24 769667106 893818 34 BLACKROCK FUNDS III
25 769667106 893818 34 MASTERWORKS FUNDS INC
26 769667106 893818 34 STAGECOACH INC
27 769667106 30125 30 1838 BOND DEBENTURE TRADING FUND
28 769667106 30125 30 CUTWATER SELECT INCOME FUND
29 769667106 30125 30 INSIGHT SELECT INCOME FUND
30 769667106 30125 30 RIVUS BOND FUND

or cusips for the one cik

> cusip_cik %>% distinct(cusip, cik, multiplicity) %>% filter(cik == 30125) %>% arrange(desc(multiplicity))
# A tibble: 5 x 3
  cusip       cik multiplicity
  <chr>     <int>        <int>
1 232229104 30125           34
2 769667106 30125           30
3 45781W109 30125           16
4 282520105 30125           14
5 929903102 30125           10
bdcallen commented 4 years ago

@iangow Out of the 5 cusips in the latter part of the previous post, 4 join onto issuers, and three of them are related to Rivus by name changes. The remaining one, 45781W109, joins onto stocknames, and seems to be an older cusip

> cusip_cik %>% distinct(cusip, cik, multiplicity) %>% 
      filter(cik == 30125) %>% arrange(desc(multiplicity)) %>% 
      mutate(cusip6 = substr(cusip, 1, 6)) %>% inner_join(issuers) %>% 
      select(cusip, cik, multiplicity, issuer_name_1, issuer_adl_1, issuer_adl_2, issuer_adl_3)
Joining, by = "cusip6"
# A tibble: 4 x 7
  cusip       cik multiplicity issuer_name_1             issuer_adl_1                 issuer_adl_2                   issuer_adl_3
  <chr>     <int>        <int> <chr>                     <chr>                        <chr>                          <chr>       
1 232229104 30125           34 CUTWATER SELECT INCOME FD NA                           NA                             NA          
2 769667106 30125           30 RIVUS BD FD               NAME CHANGED TO CUTWATER     SELECT INCOME FD 12/09/2011    SEE 232229  
3 282520105 30125           14 1838 BD-DEB TRADING FD    NAME CHANGED TO RIVUS BD FD  07/07/2006          SEE 769667 NA          
4 929903102 30125           10 WACHOVIA CORP NEW         MERGED INTO WELLS FARGO & CO NEW 01/01/2009                 NA    

> stocknames %>% filter(ncusip == '45781W10')
# A tibble: 1 x 16
  permno permco namedt     nameenddt  cusip    ncusip   ticker comnam                   hexcd exchcd siccd shrcd shrcls st_date    end_date   namedum
   <int>  <int> <date>     <date>     <chr>    <chr>    <chr>  <chr>                    <dbl>  <dbl> <dbl> <dbl> <chr>  <date>     <date>       <dbl>
1  54228  20601 2016-12-29 2019-12-31 45781W10 45781W10 INSI   INSIGHT SELECT INCOME FD     1      1  6726    14 NA     1972-02-29 2019-12-31       2
> stocknames %>% filter(ncusip == '45781W10' | cusip == '45781W10')
# A tibble: 5 x 16
  permno permco namedt     nameenddt  cusip    ncusip   ticker comnam                          hexcd exchcd siccd shrcd shrcls st_date    end_date   namedum
   <int>  <int> <date>     <date>     <chr>    <chr>    <chr>  <chr>                           <dbl>  <dbl> <dbl> <dbl> <chr>  <date>     <date>       <dbl>
1  54228  20601 1972-02-22 1988-10-06 45781W10 26175510 DBF    DREXEL BOND DEBENTURE TRADING F     1      1  6723    14 NA     1972-02-29 2019-12-31       2
2  54228  20601 1988-10-07 2006-10-09 45781W10 28252010 BDF    1838 BOND DEBENTURE TRADING FUN     1      1  6726    14 NA     1972-02-29 2019-12-31       2
3  54228  20601 2006-10-10 2011-12-08 45781W10 76966710 BDF    RIVUS BOND FUND                     1      1  6726    14 NA     1972-02-29 2019-12-31       2
4  54228  20601 2011-12-09 2016-12-28 45781W10 23222910 CSI    CUTWATER SELECT INCOME FD           1      1  6726    14 NA     1972-02-29 2019-12-31       2
5  54228  20601 2016-12-29 2019-12-31 45781W10 45781W10 INSI   INSIGHT SELECT INCOME FD            1      1  6726    14 NA     1972-02-29 2019-12-31       2

So there is a part to this issue that some companies change their cusip6 identifier over time.

bdcallen commented 4 years ago

Here's further confirmation that the other four names are valid historical names for the given cik

> ciks %>% filter(cik == 30125)
# A tibble: 4 x 2
    cik company_name                    
  <int> <chr>                           
1 30125 1838 BOND DEBENTURE TRADING FUND
2 30125 CUTWATER SELECT INCOME FUND     
3 30125 INSIGHT SELECT INCOME FUND      
4 30125 RIVUS BOND FUND   
bdcallen commented 4 years ago

Noting that the sum of the multiplicities for the valid cusips associated with the valid names is 34 + 30 + 16 + 14 = 94, versus 10 for Wachovia Corp, perhaps multiplicity should help us for less complicated cases.

bdcallen commented 4 years ago

Looks like with simpler cases (in this example, num_ciks = 2), multiplicity is handy. There does seem to be the odd cases where the order of magnitude of the difference in multiplicity between ciks is not that great (eg. cusip = 000573105)

> cusip_cik %>% filter(cusip_length == 9) %>% 
                         filter(substr(cusip, 9, 9) == as.character(check_digit)) %>% 
                         distinct(cik, cusip) %>% 
                        group_by(cusip) %>% summarise(num_ciks = n()) %>% 
                        arrange(desc(num_ciks)) %>% mutate(cusip6 = substr(cusip, 1, 6)) %>%
                       inner_join(issuers) %>% filter(num_ciks == 2) %>% 
                       inner_join(cusip_cik) %>% left_join(ciks) %>% 
                      distinct(cusip, cik, num_ciks, multiplicity, company_name,  
                                   issuer_name_1, issuer_name_2, issuer_name_3) %>% 
                      arrange(cusip) %>% head(20)
Joining, by = "cusip6"
Joining, by = c("cusip", "cusip6")
Joining, by = c("cik", "company_name")
# A tibble: 20 x 8
   cusip         cik num_ciks multiplicity company_name                                  issuer_name_1               issuer_name_2 issuer_name_3
   <chr>       <int>    <int>        <int> <chr>                                         <chr>                       <chr>         <chr>        
 1 000361105    1750        2          320 AAR CORP                                      AAR CORP                    NA            NA           
 2 000361105 1141107        2            2 ARRIS GROUP INC                               AAR CORP                    NA            NA           
 3 000573105  704051        2            3 LEGG MASON INC                                ABC DISPENSING TECHNOLOGIES INC           NA           
 4 000573105  748103        2            6 ABC DISPENSING TECHNOLOGIES INC               ABC DISPENSING TECHNOLOGIES INC           NA           
 5 000752105  913364        2           68 ABC NACO INC                                  ABC RAIL PRODS CORP         NA            NA           
 6 000752105  913364        2           68 ABC RAIL PRODUCTS CORP                        ABC RAIL PRODS CORP         NA            NA           
 7 000752105  923364        2            2 H&N MANAGEMENT CO INC                         ABC RAIL PRODS CORP         NA            NA           
 8 000781104  313368        2           10 ABS INDUSTRIES INC /DE/                       ABS INDS INC                NA            NA           
 9 000781104  806063        2            2 ABS LIFESCIENCES INC                          ABS INDS INC                NA            NA           
10 00080S101  845289        2            2 HARVEST NATURAL RESOURCES, INC.               ABX HLDGS INC               NA            NA           
11 00080S101  894081        2           56 ABX AIR INC                                   ABX HLDGS INC               NA            NA           
12 00080S101  894081        2           56 ABX Holdings, Inc.                            ABX HLDGS INC               NA            NA           
13 000881102 1040792        2           56 A CONSULTING TEAM INC                         A CONSULTING TEAM INC       NA            NA           
14 000881102 1363448        2            2 Helios & Matheson Information Technology Ltd. A CONSULTING TEAM INC       NA            NA           
15 000955104  215155        2            2 ASTROCOM CORP                                 ACT TELECONFERENCING INC    NA            NA           
16 000955104  918709        2           81 ACT TELECONFERENCING INC                      ACT TELECONFERENCING INC    NA            NA           
17 000957100  356037        2            2 CSP INC /MA/                                  ABM INDS INC                NA            NA           
18 000957100  771497        2          226 ABM INDUSTRIES INC /DE/                       ABM INDS INC                NA            NA           
19 000975102  316236        2            2 GRAND UNION CO /DE/                           ACT NETWORKS INC            NA            NA           
20 000975102  942132        2           32 ACT NETWORKS INC                              ACT NETWORKS INC            NA            NA           
bdcallen commented 4 years ago

Here's the same analysis for num_ciks = 3

> cusip_cik %>% filter(cusip_length == 9) %>% 
+                         filter(substr(cusip, 9, 9) == as.character(check_digit)) %>% 
+                         distinct(cik, cusip) %>% 
+                        group_by(cusip) %>% summarise(num_ciks = n()) %>% 
+                        arrange(desc(num_ciks)) %>% mutate(cusip6 = substr(cusip, 1, 6)) %>%
+                       inner_join(issuers) %>% filter(num_ciks == 3) %>% inner_join(cusip_cik) %>% left_join(ciks) %>% distinct(cusip, cik, num_ciks, multiplicity, company_name,  issuer_name_1, issuer_name_2, issuer_name_3) %>% arrange(cusip) %>% head(20)
Joining, by = "cusip6"
Joining, by = c("cusip", "cusip6")
Joining, by = c("cik", "company_name")
# A tibble: 20 x 8
   cusip         cik num_ciks multiplicity company_name                    issuer_name_1               issuer_name_2 issuer_name_3
   <chr>       <int>    <int>        <int> <chr>                           <chr>                       <chr>         <chr>        
 1 000360206  105634        3            2 EMCOR GROUP INC                 AAON INC                    NA            NA           
 2 000360206  706688        3            2 AARON'S INC                     AAON INC                    NA            NA           
 3 000360206  824142        3          142 AAON INC                        AAON INC                    NA            NA           
 4 00086T103  787030        3            2 VISTACARE, INC.                 A C MOORE ARTS & CRAFTS INC NA            NA           
 5 00086T103 1042809        3          196 A C MOORE ARTS & CRAFTS INC     A C MOORE ARTS & CRAFTS INC NA            NA           
 6 00086T103 1042809        3          196 A.C. Moore Arts & Crafts, Inc.  A C MOORE ARTS & CRAFTS INC NA            NA           
 7 00086T103 1385534        3            1 TIGER CONSUMER MANAGEMENT, LLC  A C MOORE ARTS & CRAFTS INC NA            NA           
 8 00104Q107 1024441        3            2 FACTORY CARD OUTLET CORP        AFC ENTERPRISES INC         NA            NA           
 9 00104Q107 1041379        3          226 AFC ENTERPRISES INC             AFC ENTERPRISES INC         NA            NA           
10 00104Q107 1041379        3          226 POPEYES LOUISIANA KITCHEN, INC. AFC ENTERPRISES INC         NA            NA           
11 00104Q107 1491501        3            2 TMS International Corp.         AFC ENTERPRISES INC         NA            NA           
12 001250109  875354        3           24 AG SERVICES OF AMERICA INC      AG SVCS AMER INC            NA            NA           
13 001250109 1077552        3            2 GENTEK INC                      AG SVCS AMER INC            NA            NA           
14 001250109 1099517        3            2 PORT FINANCIAL CORP             AG SVCS AMER INC            NA            NA           
15 001282102  356080        3           16 BELO CORP                       A H BELO CORP               NA            NA           
16 001282102 1413898        3          162 A. H. Belo CORP                 A H BELO CORP               NA            NA           
17 001282102 1413898        3          162 A. H. Belo Corp                 A H BELO CORP               NA            NA           
18 001282102 1423296        3            2 Zuckerman Investment Group, LLC A H BELO CORP               NA            NA           
19 001296102  812900        3            2 SAKS INC                        AHL SVCS INC                NA            NA           
20 001296102  878720        3            4 FAO INC                         AHL SVCS INC                NA            NA           

and num_ciks = 4

> cusip_cik %>% filter(cusip_length == 9) %>% 
+                         filter(substr(cusip, 9, 9) == as.character(check_digit)) %>% 
+                         distinct(cik, cusip) %>% 
+                        group_by(cusip) %>% summarise(num_ciks = n()) %>% 
+                        arrange(desc(num_ciks)) %>% mutate(cusip6 = substr(cusip, 1, 6)) %>%
+                       inner_join(issuers) %>% filter(num_ciks == 4) %>% inner_join(cusip_cik) %>% left_join(ciks) %>% distinct(cusip, cik, num_ciks, multiplicity, company_name,  issuer_name_1, issuer_name_2, issuer_name_3) %>% arrange(cusip) %>% head(20)
Joining, by = "cusip6"
Joining, by = c("cusip", "cusip6")
Joining, by = c("cik", "company_name")
# A tibble: 20 x 8
   cusip         cik num_ciks multiplicity company_name                             issuer_name_1    issuer_name_2 issuer_name_3
   <chr>       <int>    <int>        <int> <chr>                                    <chr>            <chr>         <chr>        
 1 001947100   49899        4            4 IMPERIAL BANCORP                         AT PLASTICS INC  NA            NA           
 2 001947100  225628        4            4 MEGADATA CORP                            AT PLASTICS INC  NA            NA           
 3 001947100 1011694        4           16 AT PLASTICS INC                          AT PLASTICS INC  NA            NA           
 4 001947100 1012616        4            2 BANK PLUS CORP                           AT PLASTICS INC  NA            NA           
 5 003881307   93456        4            2 STANDARD REGISTER CO                     ACACIA RESH CORP NA            NA           
 6 003881307   93459        4            2 STANDARD RESOURCES CORP                  ACACIA RESH CORP NA            NA           
 7 003881307  934549        4          164 ACACIA RESEARCH CORP                     ACACIA RESH CORP NA            NA           
 8 003881307 1500866        4            2 InterXion Holding N.V.                   ACACIA RESH CORP NA            NA           
 9 004816104    2098        4          130 ACME UNITED CORP                         ACME UTD CORP    NA            NA           
10 004816104  732847        4            2 FIRST WILSHIRE SECURITIES MANAGEMENT INC ACME UTD CORP    NA            NA           
11 004816104 1071625        4            4 DOVER SADDLERY INC                       ACME UTD CORP    NA            NA           
12 004816104 1342857        4            1 North Star Investment Management Corp.   ACME UTD CORP    NA            NA           
13 004931101  350846        4            2 SUPREME INDUSTRIES INC                   ACTRADE INTL LTD NA            NA           
14 004931101  704508        4            2 MOTO PHOTO INC                           ACTRADE INTL LTD NA            NA           
15 004931101  819255        4           23 ACTRADE INTERNATIONAL LTD                ACTRADE INTL LTD NA            NA           
16 004931101  819255        4           23 ACTRADE FINANCIAL TECHNOLOGIES LTD       ACTRADE INTL LTD NA            NA           
17 004931101  938206        4            1 DRIEHAUS CAPITAL MANAGEMENT INC          ACTRADE INTL LTD NA            NA           
18 00506J107 1047188        4            2 PENWEST PHARMACEUTICALS CO               ACTIVCARD CORP   NA            NA           
19 00506J107 1183941        4           14 ACTIVCARD CORP                           ACTIVCARD CORP   NA            NA           
20 00506J107 1202157        4            2 BROOKFIELD HOMES CORP                    ACTIVCARD CORP   NA            NA   
bdcallen commented 4 years ago

So choosing only the rows for num_ciks = 5 where the multiplicity is equal to the maximum grouped by cusip, I got

> cusip_cik %>% filter(cusip_length == 9) %>% 
+                         filter(substr(cusip, 9, 9) == as.character(check_digit)) %>% 
+                         distinct(cik, cusip) %>% 
+                        group_by(cusip) %>% summarise(num_ciks = n()) %>% 
+                        arrange(desc(num_ciks)) %>% mutate(cusip6 = substr(cusip, 1, 6)) %>%
+                       inner_join(issuers) %>% filter(num_ciks == 5) %>% inner_join(cusip_cik) %>% left_join(ciks) %>% distinct(cusip, cik, num_ciks, multiplicity, company_name,  issuer_name_1, issuer_name_2, issuer_name_3) %>% arrange(cusip) %>% group_by(cusip) %>% filter(multiplicity == max(multiplicity)) %>% ungroup() %>% head(20)
Joining, by = "cusip6"
Joining, by = c("cusip", "cusip6")
Joining, by = c("cik", "company_name")
# A tibble: 20 x 8
   cusip         cik num_ciks multiplicity company_name                       issuer_name_1                 issuer_name_2 issuer_name_3
   <chr>       <int>    <int>        <int> <chr>                              <chr>                         <chr>         <chr>        
 1 001084102  880266        5          238 AGCO CORP /DE                      AGCO CORP DEL                 NA            NA           
 2 00724F101  796343        5          212 ADOBE SYSTEMS INC                  ADOBE SYS INC                 NA            NA           
 3 00724F101  796343        5          212 ADOBE INC.                         ADOBE SYS INC                 NA            NA           
 4 00763M108 1168335        5          132 ADVANCED MEDICAL OPTICS INC        ADVANCED MED OPTICS INC       NA            NA           
 5 00763M108 1168335        5          132 ABBOTT MEDICAL OPTICS INC          ADVANCED MED OPTICS INC       NA            NA           
 6 034918102 1096481        5           94 ANGIOTECH PHARMACEUTICALS INC      ANGIOTECH PHARMACEUTICALS INC NA            NA           
 7 039381108  915390        5           54 ARCH COMMUNICATIONS GROUP INC /DE/ ARCH COMMUNICATIONS GROUP INC NA            NA           
 8 039381108  915390        5           54 ARCH WIRELESS INC                  ARCH COMMUNICATIONS GROUP INC NA            NA           
 9 048426100 1404296        5           92 Atlantic Coast Financial CORP      ATLANTIC COAST FINL CORP      NA            NA           
10 048798102  948975        5          104 ATLANTIC REALTY TRUST              ATLANTIC RLTY TR              NA            NA           
11 065908501  921768        5           62 BANKATLANTIC BANCORP INC           BANKATLANTIC BANCORP INC      NA            NA           
12 105368104  790816        5           30 BRANDYWINE REALTY TRUST            BRANDYWINE RLTY TR            NA            NA           
13 109043109   14195        5          234 BRIGGS & STRATTON CORP             BRIGGS & STRATTON CORP        NA            NA           
14 120113105 1046356        5          144 BUILDING MATERIALS HOLDING CORP    BUILDING MATLS HLDG CORP      NA            NA           
15 120113105 1046356        5          144 BUILDING MATERIALS HOLDING Corp    BUILDING MATLS HLDG CORP      NA            NA           
16 125919308  889348        5          152 CPI AEROSTRUCTURES INC             CPI AEROSTRUCTURES INC        NA            NA           
17 127537207  727273        5          220 CADIZ INC                          CADIZ INC                     NA            NA           
18 13123X102  928022        5          230 CALLON PETROLEUM CO                CALLON PETE CO DEL            NA            NA           
19 13123X102  928022        5          230 Callon Petroleum Co                CALLON PETE CO DEL            NA            NA           
20 141665109 1046568        5          332 CAREER EDUCATION CORP              CAREER ED CORP                NA            NA

> cusip_cik %>% filter(cusip_length == 9) %>% 
+                         filter(substr(cusip, 9, 9) == as.character(check_digit)) %>% 
+                         distinct(cik, cusip) %>% 
+                        group_by(cusip) %>% summarise(num_ciks = n()) %>% 
+                        arrange(desc(num_ciks)) %>% mutate(cusip6 = substr(cusip, 1, 6)) %>%
+                       inner_join(issuers) %>% filter(num_ciks == 5) %>% inner_join(cusip_cik) %>% left_join(ciks) %>% distinct(cusip, cik, num_ciks, multiplicity, company_name,  issuer_name_1, issuer_name_2, issuer_name_3) %>% arrange(cusip) %>% group_by(cusip) %>% filter(multiplicity == max(multiplicity)) %>% ungroup() %>% distinct(cusip, cik) %>% count()
Joining, by = "cusip6"
Joining, by = c("cusip", "cusip6")
Joining, by = c("cik", "company_name")
# A tibble: 1 x 1
      n
  <int>
1    76
> cusip_cik %>% filter(cusip_length == 9) %>% 
+                         filter(substr(cusip, 9, 9) == as.character(check_digit)) %>% 
+                         distinct(cik, cusip) %>% 
+                        group_by(cusip) %>% summarise(num_ciks = n()) %>% 
+                        arrange(desc(num_ciks)) %>% mutate(cusip6 = substr(cusip, 1, 6)) %>%
+                       inner_join(issuers) %>% filter(num_ciks == 5) %>% inner_join(cusip_cik) %>% left_join(ciks) %>% distinct(cusip, cik, num_ciks, multiplicity, company_name,  issuer_name_1, issuer_name_2, issuer_name_3) %>% arrange(cusip) %>% group_by(cusip) %>% filter(multiplicity == max(multiplicity)) %>% ungroup() %>% distinct(cik) %>% count()
Joining, by = "cusip6"
Joining, by = c("cusip", "cusip6")
Joining, by = c("cik", "company_name")
# A tibble: 1 x 1
      n
  <int>
1    76
> cusip_cik %>% filter(cusip_length == 9) %>% 
+                         filter(substr(cusip, 9, 9) == as.character(check_digit)) %>% 
+                         distinct(cik, cusip) %>% 
+                        group_by(cusip) %>% summarise(num_ciks = n()) %>% 
+                        arrange(desc(num_ciks)) %>% mutate(cusip6 = substr(cusip, 1, 6)) %>%
+                       inner_join(issuers) %>% filter(num_ciks == 5) %>% inner_join(cusip_cik) %>% left_join(ciks) %>% distinct(cusip, cik, num_ciks, multiplicity, company_name,  issuer_name_1, issuer_name_2, issuer_name_3) %>% arrange(cusip) %>% group_by(cusip) %>% filter(multiplicity == max(multiplicity)) %>% ungroup() %>% distinct(cusip) %>% count()
Joining, by = "cusip6"
Joining, by = c("cusip", "cusip6")
Joining, by = c("cik", "company_name")
# A tibble: 1 x 1
      n
  <int>
1    75

The numbers at the end tell us that there is a single cusip in this case which still has two ciks mapped to it. Viewing the entire dataframe, I found it to be 68274T102

> cusip_cik %>% filter(cusip_length == 9) %>% 
+                         filter(substr(cusip, 9, 9) == as.character(check_digit)) %>% 
+                         distinct(cik, cusip) %>% 
+                        group_by(cusip) %>% summarise(num_ciks = n()) %>% 
+                        arrange(desc(num_ciks)) %>% mutate(cusip6 = substr(cusip, 1, 6)) %>%
+                       inner_join(issuers) %>% filter(num_ciks == 5) %>% inner_join(cusip_cik) %>% left_join(ciks) %>% distinct(cusip, cik, num_ciks, multiplicity, company_name,  issuer_name_1, issuer_name_2, issuer_name_3) %>% arrange(cusip) %>% group_by(cusip) %>% filter(multiplicity == max(multiplicity)) %>% ungroup() %>% filter(cusip == '68274T102')
Joining, by = "cusip6"
Joining, by = c("cusip", "cusip6")
Joining, by = c("cik", "company_name")
# A tibble: 2 x 8
  cusip         cik num_ciks multiplicity company_name          issuer_name_1    issuer_name_2 issuer_name_3
  <chr>       <int>    <int>        <int> <chr>                 <chr>            <chr>         <chr>        
1 68274T102 1158694        5            4 ONLINE HOLDINGS INC   ONLINE HLDGS INC NA            NA           
2 68274T102 1388320        5            4 Cactus Ventures, Inc. ONLINE HLDGS INC NA            NA     

> ciks %>% filter(cik %in% c(1158694, 1388320))
# A tibble: 5 x 2
      cik company_name                  
    <int> <chr>                         
1 1388320 ACTINIUM PHARMACEUTICALS, INC.
2 1388320 CACTUS VENTURES, INC.         
3 1158694 EFACTOR GROUP CORP.           
4 1158694 ONLINE HOLDINGS INC           
5 1158694 STANDARD DRILLING, INC.       

Note that the multiplicity for this case for both rows is very low (imposing a multiplicity of 10 would wipe these out). One of them still looks to be the correct one, however, looking at the names.

I think this case suggests that if the multiplicity is low (in which case choosing multiplicity >= 10 helps alleviate), or if the order of magnitude of multiplicities is similar (in which case choosing multiplicity >= 10 may not necessarily help, as seemed to be the case with Rivus Bond Fund), that choosing the maximum multiplicity row or rows may not always single out the correct cik.

bdcallen commented 4 years ago

Repeating for num_ciks = 2

> cusip_cik %>% filter(cusip_length == 9) %>% 
+                         filter(substr(cusip, 9, 9) == as.character(check_digit)) %>% 
+                         distinct(cik, cusip) %>% 
+                        group_by(cusip) %>% summarise(num_ciks = n()) %>% 
+                        arrange(desc(num_ciks)) %>% mutate(cusip6 = substr(cusip, 1, 6)) %>%
+                       inner_join(issuers) %>% filter(num_ciks == 2) %>% inner_join(cusip_cik) %>% left_join(ciks) %>% distinct(cusip, cik, num_ciks, multiplicity, company_name,  issuer_name_1, issuer_name_2, issuer_name_3) %>% arrange(cusip) %>% group_by(cusip) %>% filter(multiplicity == max(multiplicity)) %>% ungroup() %>% head(20)
Joining, by = "cusip6"
Joining, by = c("cusip", "cusip6")
Joining, by = c("cik", "company_name")
# A tibble: 20 x 8
   cusip         cik num_ciks multiplicity company_name                       issuer_name_1               issuer_name_2 issuer_name_3
   <chr>       <int>    <int>        <int> <chr>                              <chr>                       <chr>         <chr>        
 1 000361105    1750        2          320 AAR CORP                           AAR CORP                    NA            NA           
 2 000573105  748103        2            6 ABC DISPENSING TECHNOLOGIES INC    ABC DISPENSING TECHNOLOGIES INC           NA           
 3 000752105  913364        2           68 ABC NACO INC                       ABC RAIL PRODS CORP         NA            NA           
 4 000752105  913364        2           68 ABC RAIL PRODUCTS CORP             ABC RAIL PRODS CORP         NA            NA           
 5 000781104  313368        2           10 ABS INDUSTRIES INC /DE/            ABS INDS INC                NA            NA           
 6 00080S101  894081        2           56 ABX AIR INC                        ABX HLDGS INC               NA            NA           
 7 00080S101  894081        2           56 ABX Holdings, Inc.                 ABX HLDGS INC               NA            NA           
 8 000881102 1040792        2           56 A CONSULTING TEAM INC              A CONSULTING TEAM INC       NA            NA           
 9 000955104  918709        2           81 ACT TELECONFERENCING INC           ACT TELECONFERENCING INC    NA            NA           
10 000957100  771497        2          226 ABM INDUSTRIES INC /DE/            ABM INDS INC                NA            NA           
11 000975102  942132        2           32 ACT NETWORKS INC                   ACT NETWORKS INC            NA            NA           
12 001077106  842289        2           34 AG BAG INTERNATIONAL LTD           AG-BAG INTL LTD             NA            NA           
13 001077106  842289        2           34 AB HOLDING GROUP, INC.             AG-BAG INTL LTD             NA            NA           
14 001228105 1514281        2           56 AG Mortgage Investment Trust, Inc. AG MTG INVT TR INC          NA            NA           
15 001303106  837038        2            6 AHPC Holdings, Inc.                AHPC HLDGS INC              NA            NA           
16 00164V103 1514991        2          150 AMC Networks Inc.                  AMC NETWORKS INC            NA            NA           
17 001669100  722077        2          112 AMC ENTERTAINMENT INC              AMC ENTMT INC               NA            NA           
18 001669100  722077        2          112 AMC ENTERTAINMENT INC ET AL        AMC ENTMT INC               NA            NA           
19 001693100  894972        2           18 AMFM INC                           AMFM INC                    NA            NA           
20 001765106    6201        2          244 AMR CORP                           AMR CORP DEL                NA            NA           
> cusip_cik %>% filter(cusip_length == 9) %>% 
+                         filter(substr(cusip, 9, 9) == as.character(check_digit)) %>% 
+                         distinct(cik, cusip) %>% 
+                        group_by(cusip) %>% summarise(num_ciks = n()) %>% 
+                        arrange(desc(num_ciks)) %>% mutate(cusip6 = substr(cusip, 1, 6)) %>%
+                       inner_join(issuers) %>% filter(num_ciks == 2) %>% inner_join(cusip_cik) %>% left_join(ciks) %>% distinct(cusip, cik, num_ciks, multiplicity, company_name,  issuer_name_1, issuer_name_2, issuer_name_3) %>% arrange(cusip) %>% group_by(cusip) %>% filter(multiplicity == max(multiplicity)) %>% ungroup() %>% distinct(cusip, cik) %>% count()
Joining, by = "cusip6"
Joining, by = c("cusip", "cusip6")
Joining, by = c("cik", "company_name")
# A tibble: 1 x 1
      n
  <int>
1  4418
> cusip_cik %>% filter(cusip_length == 9) %>% 
+                         filter(substr(cusip, 9, 9) == as.character(check_digit)) %>% 
+                         distinct(cik, cusip) %>% 
+                        group_by(cusip) %>% summarise(num_ciks = n()) %>% 
+                        arrange(desc(num_ciks)) %>% mutate(cusip6 = substr(cusip, 1, 6)) %>%
+                       inner_join(issuers) %>% filter(num_ciks == 2) %>% inner_join(cusip_cik) %>% left_join(ciks) %>% distinct(cusip, cik, num_ciks, multiplicity, company_name,  issuer_name_1, issuer_name_2, issuer_name_3) %>% arrange(cusip) %>% group_by(cusip) %>% filter(multiplicity == max(multiplicity)) %>% ungroup() %>% distinct(cik) %>% count()
Joining, by = "cusip6"
Joining, by = c("cusip", "cusip6")
Joining, by = c("cik", "company_name")
# A tibble: 1 x 1
      n
  <int>
1  4046
> cusip_cik %>% filter(cusip_length == 9) %>% 
+                         filter(substr(cusip, 9, 9) == as.character(check_digit)) %>% 
+                         distinct(cik, cusip) %>% 
+                        group_by(cusip) %>% summarise(num_ciks = n()) %>% 
+                        arrange(desc(num_ciks)) %>% mutate(cusip6 = substr(cusip, 1, 6)) %>%
+                       inner_join(issuers) %>% filter(num_ciks == 2) %>% inner_join(cusip_cik) %>% left_join(ciks) %>% distinct(cusip, cik, num_ciks, multiplicity, company_name,  issuer_name_1, issuer_name_2, issuer_name_3) %>% arrange(cusip) %>% group_by(cusip) %>% filter(multiplicity == max(multiplicity)) %>% ungroup() %>% distinct(cusip) %>% count()
Joining, by = "cusip6"
Joining, by = c("cusip", "cusip6")
Joining, by = c("cik", "company_name")
# A tibble: 1 x 1
      n
  <int>
1  4269

> cusip_cik %>% filter(cusip_length == 9) %>% 
+                         filter(substr(cusip, 9, 9) == as.character(check_digit)) %>% 
+                         distinct(cik, cusip) %>% 
+                        group_by(cusip) %>% summarise(num_ciks = n()) %>% 
+                        arrange(desc(num_ciks)) %>% mutate(cusip6 = substr(cusip, 1, 6)) %>%
+                       inner_join(issuers) %>% filter(num_ciks == 2) %>% inner_join(cusip_cik) %>% left_join(ciks) %>% distinct(cusip, cik, num_ciks, multiplicity, company_name,  issuer_name_1, issuer_name_2, issuer_name_3) %>% arrange(cusip) %>% group_by(cusip) %>% filter(multiplicity == max(multiplicity)) %>% ungroup() %>% group_by(cusip) %>% filter(length(unique(cik)) > 1) %>% ungroup() %>% group_by(multiplicity) %>% summarise(num = n()) %>% arrange(desc(num))
Joining, by = "cusip6"
Joining, by = c("cusip", "cusip6")
Joining, by = c("cik", "company_name")
# A tibble: 10 x 2
   multiplicity   num
          <int> <int>
 1            2   198
 2            4    55
 3            6    20
 4            8     7
 5           10     7
 6           16     7
 7           12     5
 8            1     2
 9           20     2
10           30     2
bdcallen commented 4 years ago

Sometimes companies change ciks over time as well, as this example demonstrates. That can also cause multiple ciks per cusip

> cusip_cik %>% filter(cusip_length == 9) %>% 
                         filter(substr(cusip, 9, 9) == as.character(check_digit)) %>% 
                         distinct(cik, cusip) %>% 
                        group_by(cusip) %>% summarise(num_ciks = n()) %>% 
                        arrange(desc(num_ciks)) %>% mutate(cusip6 = substr(cusip, 1, 6)) %>%
                       inner_join(issuers) %>% filter(num_ciks == 2) %>% 
                       inner_join(cusip_cik) %>% left_join(ciks) %>% 
            distinct(cusip, cik, num_ciks, multiplicity, company_name,  
                      issuer_name_1, issuer_name_2, issuer_name_3) %>% 
            arrange(cusip) %>% group_by(cusip) %>% 
           filter(multiplicity == max(multiplicity)) %>% ungroup() %>% 
           group_by(cusip) %>% filter(length(unique(cik)) > 1) %>% 
           ungroup() %>% filter(multiplicity == 30)
Joining, by = "cusip6"
Joining, by = c("cusip", "cusip6")
Joining, by = c("cik", "company_name")
# A tibble: 2 x 8
  cusip        cik num_ciks multiplicity company_name                         issuer_name_1                 issuer_name_2 issuer_name_3
  <chr>      <int>    <int>        <int> <chr>                                <chr>                         <chr>         <chr>        
1 31410H101  34945        2           30 FEDERATED DEPARTMENT STORES INC      FEDERATED DEPT STORES INC DEL NA            NA           
2 31410H101 794367        2           30 FEDERATED DEPARTMENT STORES INC /DE/ FEDERATED DEPT STORES INC DEL NA            NA

> ciks %>% filter(cik == 34945)
# A tibble: 1 x 2
    cik company_name                   
  <int> <chr>                          
1 34945 FEDERATED DEPARTMENT STORES INC
> ciks %>% filter(cik == 794367)
# A tibble: 2 x 2
     cik company_name                        
   <int> <chr>                               
1 794367 FEDERATED DEPARTMENT STORES INC /DE/
2 794367 MACY'S, INC.    
iangow commented 4 years ago

@bdcallen

Two issues:

  1. You are still mostly posting code that isn't in the form of "reproducible examples" (reprex). This is OK here and there, but if you can switch to the other form, it would make these code snippets much more useful.
  2. You are not synthesizing the information into amounts that I can process (even when you mention me with @iangow).

Anyway, my synthesis is that there are (at least) two cases here:

  1. Cases where the filer incorrectly used its CUSIP (and in these cases, it seems be a habit of some filers).
  2. Cases where there are legitimately more than one CIK for a given CUSIP.

I think we may want to handle Case 1 first. I think the simplest approach would be to simply make a table that records valid CUSIP-CIK matches for these CUSIPs. I assume that there aren't too many of these, so even manually looking up CIKs should be quite feasible. I have created a Google Sheets document here for you to enter valid CIKs. (Note that this issue is definitely not one to be addressed in parsing. The Python code is correctly parsing these filings, but they contain bad information.) Google Sheets are easy to maintain and it's easy to incorporate the data from them.

If this approach to Case 1 is feasible, then we might not have to worry about Case 2, as these are valid matches.

iangow commented 4 years ago

Based on my analysis from 14 May, I think neither issue is one of parsing (second one is not even) an issue. So I'm closing this one with "no action" for the code.