mccgr / edgar

Code to manage data related to SEC EDGAR
31 stars 15 forks source link

Make edgar.ciks table #72

Closed bdcallen closed 3 years ago

bdcallen commented 4 years ago

@iangow There seems to be a bulk extract file on EDGAR for looking up ciks here. Furthermore, the formatting of the file is extremely simple and thus highly scrapable. From the file, one can see that a table can easily be formed which links cik with company_name. This issue is for making such a table from this file.

bdcallen commented 4 years ago

@iangow I have written a program get_ciks.py which does the job, and I have run it successfully. Also, the ownership has been set to edgar and the access to edgar_access. I'm going to add a line for this program in the main edgar cronjob.

bdcallen commented 4 years ago

@iangow Before this gets closed, it might be a good idea to: (1) check whether the table grows daily with the cronjob and (2) compare company names in the table to those found in edgar.filings.

iangow commented 4 years ago

Actually, we don't need this table, as it can be generated from edgar.filings. In fact this table will only have the latest name, while our table will have past names too.

iangow commented 4 years ago

@bdcallen I think the way to "close" this issue might be to compare what we get from the ciks table with what we'd get from SELECT DISTINCT cik, company_name FROM edgar.filings.

If the former is a subset of the latter, than I don't think we need the former (and we could delete the table, remove the code, and close the issue).

If it isn't then it might be useful to understand where the differences are coming from and how the incremental data might be useful.

bdcallen commented 4 years ago

@iangow So I did, with ciks the table generated from the R code in issue #74,

> ciks_from_filings <- tbl(pg, sql("SELECT DISTINCT cik, company_name FROM edgar.filings")) %>% collect()
> ciks
# A tibble: 756,434 x 2
       cik company_name                      
     <int> <chr>                             
 1 1438823 !J INC                            
 2 1509607 #1 A LIFESAFER HOLDINGS, INC.     
 3 1457512 #1 ARIZONA DISCOUNT PROPERTIES LLC
 4 1433777 #1 PAINTBALL CORP                 
 5 1427189 $ LLC                             
 6 1655250 $AVY, INC.                        
 7 1447162 & S MEDIA GROUP LLC               
 8 1479357 &TV COMMUNICATIONS INC.           
 9 1802417 &VEST DOMESTIC FUND II KPIV, L.P. 
10 1800903 &VEST DOMESTIC FUND II LP         
# … with 756,424 more rows
> ciks_from_filings
# A tibble: 755,708 x 2
     cik company_name                                              
 * <int> <chr>                                                     
 1     3 DEFINED ASSET FUNDS MUNICIPAL INVT TR FD NEW YORK SER 33  
 2    14 DEFINED ASSET FUNDS MUNICIPAL INVT TR FD MON PYMT SER 155 
 3    17 DEFINED ASSET FUNDS MUNICIPAL INVT TR FD MON PYMT SER 156 
 4    18 NUVEEN TAX EXEMPT UNIT TRUST SERIES 169 NATIONAL TRUST 169
 5    20 K TRON INTERNATIONAL INC                                  
 6    49 NEW YORK MUNICIPAL TRUST SERIES 15                        
 7    51 UC CORP                                                   
 8  1750 AAR CORP                                                  
 9  1761 TRANZONIC COMPANIES                                       
10  1800 ABBOTT LABORATORIES                                       
# … with 755,698 more rows

So the table generated from filings has 726 less rows, so it must be a subset of ciks. Secondly, I just did

> ciks_from_filings %>% inner_join(ciks)
Joining, by = c("cik", "company_name")
# A tibble: 367,519 x 2
     cik company_name                                              
   <int> <chr>                                                     
 1     3 DEFINED ASSET FUNDS MUNICIPAL INVT TR FD NEW YORK SER 33  
 2    14 DEFINED ASSET FUNDS MUNICIPAL INVT TR FD MON PYMT SER 155 
 3    17 DEFINED ASSET FUNDS MUNICIPAL INVT TR FD MON PYMT SER 156 
 4    18 NUVEEN TAX EXEMPT UNIT TRUST SERIES 169 NATIONAL TRUST 169
 5    20 K TRON INTERNATIONAL INC                                  
 6    49 NEW YORK MUNICIPAL TRUST SERIES 15                        
 7    51 UC CORP                                                   
 8  1750 AAR CORP                                                  
 9  1761 TRANZONIC COMPANIES                                       
10  1800 ABBOTT LABORATORIES                                       
# … with 367,509 more rows

so only around half of the rows in the table generated from filings completely match rows in ciks. The ciks in both tables are integers, and any cik should be in ciks, so it should be the case that it is mainly the company names that have differences.

> ciks_from_filings %>% anti_join(ciks) %>% arrange(cik, company_name)
Joining, by = c("cik", "company_name")
# A tibble: 388,189 x 2
     cik company_name                    
   <int> <chr>                           
 1  1853 MotivNation, Inc.               
 2  1961 Worlds.com, Inc.                
 3  2186 BK Technologies Corp            
 4  2186 BK Technologies, Inc.           
 5  2646 Voya INTERMEDIATE BOND PORTFOLIO
 6  2663 Voya MONEY MARKET PORTFOLIO     
 7  2664 Voya VARIABLE FUNDS             
 8  3327 New Alberto-Culver LLC          
 9  3453 Matson, Inc.                    
10  3570 Cheniere Energy, Inc.           
# … with 388,179 more rows
> ciks %>% anti_join(ciks_from_filings) %>% arrange(cik, company_name)
Joining, by = c("cik", "company_name")
# A tibble: 388,915 x 2
     cik company_name                                               
   <int> <chr>                                                      
 1    13 CORPORATE INCOME FUND SEVENTY NINTH SHORT TERM SERIES      
 2  1853 MOTIVNATION, INC.                                          
 3  1961 WORLDS.COM, INC.                                           
 4  2071 ACME ELECTRIC CORP EMPLOYEE STOCK PURCHASE PLAN            
 5  2186 BK TECHNOLOGIES CORP                                       
 6  2186 BK TECHNOLOGIES, INC.                                      
 7  2646 VOYA INTERMEDIATE BOND PORTFOLIO                           
 8  2663 VOYA MONEY MARKET PORTFOLIO                                
 9  2664 VOYA VARIABLE FUNDS                                        
10  2903 AMVESTORS INVESTMENT GROUP INC                          /BD
# … with 388,905 more rows

Capitalization removes the vast majority of these cases

> ciks_from_filings$company_name <- stringr::str_to_upper(ciks_from_filings$company_name)
> ciks_from_filings %>% inner_join(ciks)
Joining, by = c("cik", "company_name")
# A tibble: 749,818 x 2
     cik company_name                                              
   <int> <chr>                                                     
 1     3 DEFINED ASSET FUNDS MUNICIPAL INVT TR FD NEW YORK SER 33  
 2    14 DEFINED ASSET FUNDS MUNICIPAL INVT TR FD MON PYMT SER 155 
 3    17 DEFINED ASSET FUNDS MUNICIPAL INVT TR FD MON PYMT SER 156 
 4    18 NUVEEN TAX EXEMPT UNIT TRUST SERIES 169 NATIONAL TRUST 169
 5    20 K TRON INTERNATIONAL INC                                  
 6    49 NEW YORK MUNICIPAL TRUST SERIES 15                        
 7    51 UC CORP                                                   
 8  1750 AAR CORP                                                  
 9  1761 TRANZONIC COMPANIES                                       
10  1800 ABBOTT LABORATORIES                                       
# … with 749,808 more rows

Then, I did the following anti_joins, mostly by cik

> ciks %>% anti_join(ciks_from_filings, by = c('cik')) %>% arrange(cik, company_name)
# A tibble: 2,274 x 2
     cik company_name                                                
   <int> <chr>                                                       
 1    13 CORPORATE INCOME FUND SEVENTY NINTH SHORT TERM SERIES       
 2  2071 ACME ELECTRIC CORP EMPLOYEE STOCK PURCHASE PLAN             
 3  2903 AMVESTORS INVESTMENT GROUP INC                          /BD 
 4  2915 AINSBROOKE CORP                                             
 5  3656 ALLECO INC                                                  
 6  3711 ALLEN ETHAN INC                                             
 7  3878 ALLIED FARM EQUIPMENT INC 1975 QUALIFIED STOCK OPTION PLAN  
 8  4261 ALTOONA & LOGAN VALLEY ELECTRIC RAILWAY                     
 9  4280 ALUMINUM & CHEMICAL CORP                                    
10  4444 AMERACE CORP 1967 QUAL STK OPT PL & 1972 QUAL-NON-QUAL STK O
# … with 2,264 more rows
> ciks %>% anti_join(ciks_from_filings, by = c('cik')) %>% anti_join(ciks_from_filings, by = c('company_name')) %>% arrange(cik, company_name)
# A tibble: 1,845 x 2
     cik company_name                                                
   <int> <chr>                                                       
 1    13 CORPORATE INCOME FUND SEVENTY NINTH SHORT TERM SERIES       
 2  2071 ACME ELECTRIC CORP EMPLOYEE STOCK PURCHASE PLAN             
 3  2903 AMVESTORS INVESTMENT GROUP INC                          /BD 
 4  2915 AINSBROOKE CORP                                             
 5  3656 ALLECO INC                                                  
 6  3711 ALLEN ETHAN INC                                             
 7  3878 ALLIED FARM EQUIPMENT INC 1975 QUALIFIED STOCK OPTION PLAN  
 8  4261 ALTOONA & LOGAN VALLEY ELECTRIC RAILWAY                     
 9  4280 ALUMINUM & CHEMICAL CORP                                    
10  4444 AMERACE CORP 1967 QUAL STK OPT PL & 1972 QUAL-NON-QUAL STK O
# … with 1,835 more rows
> ciks %>% anti_join(ciks_from_filings, by = c('cik')) %>% inner_join(ciks_from_filings, by = c('company_name')) %>% arrange(cik.x, company_name)
# A tibble: 457 x 3
     cik.x company_name                                cik.y
     <int> <chr>                                       <int>
 1    4926 AMERICAN ENTERPRISE DEVELOPMENT CORP      1136725
 2  859087 LA JOLLA PHARMACEUTICAL CO                 920465
 3 1001554 IMI INTERNATIONAL MEDICAL INNOVATIONS INC 1179083
 4 1055295 FROBLICH PETER                            1005782
 5 1059124 WHITE WILLIAM H                           1132020
 6 1059329 WOOD CHESTER A JR                         1390976
 7 1068124 COMPANY NAME                              1068130
 8 1077027 WALKER CHRISTOPHER                        1522583
 9 1077421 BREEDEN RICHARD C                         1300622
10 1077995 STONE JAMES                               1515176
# … with 447 more rows
> ciks_from_filings %>% anti_join(ciks, by = c('cik')) %>% arrange(cik, company_name)
# A tibble: 2 x 2
      cik company_name                     
    <int> <chr>                            
1 1145141 JFL INTERNATIONAL EQUITY FUND LLC
2 1363814 AFFLECK-GRAVES JOHN
> ciks_from_filings %>% anti_join(ciks, by = c('cik')) %>% inner_join(ciks, by = 'company_name') 
# A tibble: 2 x 3
    cik.x company_name                        cik.y
    <int> <chr>                               <int>
1 1145141 JFL INTERNATIONAL EQUITY FUND LLC  877248
2 1363814 AFFLECK-GRAVES JOHN               1297755
> ciks_from_filings %>% anti_join(ciks, by = c('cik')) %>% anti_join(ciks, by = 'company_name') %>% arrange(cik, company_name)
# A tibble: 0 x 2
# … with 2 variables: cik <int>, company_name <chr>

So it seems that that table generated from filings

# A tibble: 6,237 x 2
     cik company_name                                                          
   <int> <chr>                                                                 
 1  8775 "SEPARATE ACCOUNT VA B OF COMMONWEALTH ANNUITY & LIFE INSURANCE CO"   
 2  8776 "SEPARATE ACCOUNT VA-C OF COMMONWEALTH ANNUITY& LIFE INSURANCE CO"    
 3 13463 "BOTZUM C A & CO\\\\CA                                      /BD"      
 4 23055 "GROUP ONE SOFTWARE INC"                                              
 5 31079 "WORKSAFE INDUSTRIAL INC"                                             
 6 32859 "ENDOWMENTS GROWTH & INCOME PORTFOLIO"                                
 7 36104 "US BANCORP \\\\DE\\\\"                                               
 8 38395 "FOUNDERS ASSET MANAGEMENT INC \\\\CO\\\\                      /BD"   
 9 39561 "ING BARING FURMAN SELZ LLC //NY"                                     
10 41827 "GUARDIAN INVESTOR SERVICES LLC\\\\\\\\\\\\NY                     /BD"
# … with 6,227 more rows
> ciks_from_filings %>% anti_join(ciks) %>% arrange(cik, company_name)
Joining, by = c("cik", "company_name")
# A tibble: 5,890 x 2
      cik company_name                                                    
    <int> <chr>                                                           
 1   8775 "SEPARATE ACCOUNT VA B OF COMMONWEALTH ANNUITY & LIFE INSURAN"  
 2   8776 "SEPARATE ACCOUNT VA-C OF COMMONWEALTH ANNUITY& LIFE INSURANC"  
 3  13463 "BOTZUM C A & CO\\CA                                      /BD"  
 4  36104 "US BANCORP \\DE\\"                                             
 5  38395 "FOUNDERS ASSET MANAGEMENT INC \\CO\\                      /BD" 
 6  41827 "GUARDIAN INVESTOR SERVICES LLC\\\\\\NY                     /BD"
 7  49281 "HUSS SERVICES INC\\\\NY                                   /BD" 
 8  61546 "MAHLER & EMERSON INC\\\\NY                                /BD" 
 9  78613 "D A PINCUS & CO INC\\                                    /BD"  
10 103005 "VARIABLE ANNUITY ACCOUNT B OF ING LIFE INSURANCE & ANNUITY C"  
# … with 5,880 more rows

Also, comparing the first few rows above, it seems filings does not have the full name, particularly if it's long, and that there seems to be some escaping and encoding issues.

iangow commented 4 years ago

@bdcallen You never really came to a conclusion here. Here's my summary:

  1. Neither edgar.filings nor edgar.ciks is a subset of the other.
  2. The variable company_name in edgar.ciks may be preferable for some purposes due to abbreviations and encoding issues.

I think the actions are:

  1. Keep edgar.ciks, but perhaps make a table that has "everything" in it.
  2. Put all code on this issue into a ciks directory. (@dingtq: This is purely edgar-related code, so doesn't belong in a directEDGAR repository.)
  3. Examine the encoding issues with edgar.filings. This would involve identifying an index file with such an issue and tweaking the getSECIndexFile function to try to get the encoding correct.
  4. Make GitHub issues for each of the above immediately, but don't do anything on these issues just yet (other issues are probably more pressing).
  5. Close this issue.
Sys.setenv(PGHOST = "10.101.13.99", PGDATABASE="crsp")

library(dplyr, warn.conflicts = FALSE)
library(DBI)
pg <- dbConnect(RPostgres::Postgres())

rs <- dbExecute(pg, "SET search_path TO edgar")

filings <- tbl(pg, "filings")
ciks <- tbl(pg, "ciks")

ciks_from_filings <- 
    filings %>%
    distinct(cik, company_name) %>%
    compute()

# CIKs in edgar.ciks, but not edgar.filings 
ciks %>%
    anti_join(ciks_from_filings, by="cik") %>%
    count() %>%
    pull()
#> integer64
#> [1] 2166

ciks_from_filings %>%
    anti_join(ciks, by="cik") %>%
    count() %>%
    pull()
#> integer64
#> [1] 2

ciks_from_filings %>%
    anti_join(ciks, by="cik") 
#> # Source:   lazy query [?? x 2]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#>       cik company_name                     
#>     <int> <chr>                            
#> 1 1363814 Affleck-Graves John              
#> 2 1145141 JFL INTERNATIONAL EQUITY FUND LLC

ciks_from_filings %>% 
    mutate(company_name = toupper(company_name)) %>%
    anti_join(
        ciks %>% 
        mutate(company_name = toupper(company_name))) %>% 
    arrange(cik, company_name) %>%
    count()
#> Joining, by = c("cik", "company_name")
#> # Source:   lazy query [?? x 1]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#>   n      
#>   <int64>
#> 1 5892

ciks %>% 
    mutate(company_name = toupper(company_name)) %>%
    anti_join(
        ciks_from_filings %>% 
        mutate(company_name = toupper(company_name))) %>% 
    arrange(cik, company_name) %>%
    count()
#> Joining, by = c("cik", "company_name")
#> # Source:   lazy query [?? x 1]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#>   n      
#>   <int64>
#> 1 8403

Created on 2020-04-09 by the reprex package (v0.3.0)