Closed bdcallen closed 3 years ago
@iangow I have written a program get_ciks.py
which does the job, and I have run it successfully. Also, the ownership has been set to edgar
and the access to edgar_access
. I'm going to add a line for this program in the main edgar cronjob.
@iangow Before this gets closed, it might be a good idea to: (1) check whether the table grows daily with the cronjob and (2) compare company names in the table to those found in edgar.filings
.
Actually, we don't need this table, as it can be generated from edgar.filings
. In fact this table will only have the latest name, while our table will have past names too.
@bdcallen I think the way to "close" this issue might be to compare what we get from the ciks
table with what we'd get from SELECT DISTINCT cik, company_name FROM edgar.filings
.
If the former is a subset of the latter, than I don't think we need the former (and we could delete the table, remove the code, and close the issue).
If it isn't then it might be useful to understand where the differences are coming from and how the incremental data might be useful.
@iangow So I did, with ciks the table generated from the R code in issue #74,
> ciks_from_filings <- tbl(pg, sql("SELECT DISTINCT cik, company_name FROM edgar.filings")) %>% collect()
> ciks
# A tibble: 756,434 x 2
cik company_name
<int> <chr>
1 1438823 !J INC
2 1509607 #1 A LIFESAFER HOLDINGS, INC.
3 1457512 #1 ARIZONA DISCOUNT PROPERTIES LLC
4 1433777 #1 PAINTBALL CORP
5 1427189 $ LLC
6 1655250 $AVY, INC.
7 1447162 & S MEDIA GROUP LLC
8 1479357 &TV COMMUNICATIONS INC.
9 1802417 &VEST DOMESTIC FUND II KPIV, L.P.
10 1800903 &VEST DOMESTIC FUND II LP
# … with 756,424 more rows
> ciks_from_filings
# A tibble: 755,708 x 2
cik company_name
* <int> <chr>
1 3 DEFINED ASSET FUNDS MUNICIPAL INVT TR FD NEW YORK SER 33
2 14 DEFINED ASSET FUNDS MUNICIPAL INVT TR FD MON PYMT SER 155
3 17 DEFINED ASSET FUNDS MUNICIPAL INVT TR FD MON PYMT SER 156
4 18 NUVEEN TAX EXEMPT UNIT TRUST SERIES 169 NATIONAL TRUST 169
5 20 K TRON INTERNATIONAL INC
6 49 NEW YORK MUNICIPAL TRUST SERIES 15
7 51 UC CORP
8 1750 AAR CORP
9 1761 TRANZONIC COMPANIES
10 1800 ABBOTT LABORATORIES
# … with 755,698 more rows
So the table generated from filings has 726 less rows, so it must be a subset of ciks. Secondly, I just did
> ciks_from_filings %>% inner_join(ciks)
Joining, by = c("cik", "company_name")
# A tibble: 367,519 x 2
cik company_name
<int> <chr>
1 3 DEFINED ASSET FUNDS MUNICIPAL INVT TR FD NEW YORK SER 33
2 14 DEFINED ASSET FUNDS MUNICIPAL INVT TR FD MON PYMT SER 155
3 17 DEFINED ASSET FUNDS MUNICIPAL INVT TR FD MON PYMT SER 156
4 18 NUVEEN TAX EXEMPT UNIT TRUST SERIES 169 NATIONAL TRUST 169
5 20 K TRON INTERNATIONAL INC
6 49 NEW YORK MUNICIPAL TRUST SERIES 15
7 51 UC CORP
8 1750 AAR CORP
9 1761 TRANZONIC COMPANIES
10 1800 ABBOTT LABORATORIES
# … with 367,509 more rows
so only around half of the rows in the table generated from filings completely match rows in ciks. The ciks in both tables are integers, and any cik should be in ciks
, so it should be the case that it is mainly the company names that have differences.
> ciks_from_filings %>% anti_join(ciks) %>% arrange(cik, company_name)
Joining, by = c("cik", "company_name")
# A tibble: 388,189 x 2
cik company_name
<int> <chr>
1 1853 MotivNation, Inc.
2 1961 Worlds.com, Inc.
3 2186 BK Technologies Corp
4 2186 BK Technologies, Inc.
5 2646 Voya INTERMEDIATE BOND PORTFOLIO
6 2663 Voya MONEY MARKET PORTFOLIO
7 2664 Voya VARIABLE FUNDS
8 3327 New Alberto-Culver LLC
9 3453 Matson, Inc.
10 3570 Cheniere Energy, Inc.
# … with 388,179 more rows
> ciks %>% anti_join(ciks_from_filings) %>% arrange(cik, company_name)
Joining, by = c("cik", "company_name")
# A tibble: 388,915 x 2
cik company_name
<int> <chr>
1 13 CORPORATE INCOME FUND SEVENTY NINTH SHORT TERM SERIES
2 1853 MOTIVNATION, INC.
3 1961 WORLDS.COM, INC.
4 2071 ACME ELECTRIC CORP EMPLOYEE STOCK PURCHASE PLAN
5 2186 BK TECHNOLOGIES CORP
6 2186 BK TECHNOLOGIES, INC.
7 2646 VOYA INTERMEDIATE BOND PORTFOLIO
8 2663 VOYA MONEY MARKET PORTFOLIO
9 2664 VOYA VARIABLE FUNDS
10 2903 AMVESTORS INVESTMENT GROUP INC /BD
# … with 388,905 more rows
Capitalization removes the vast majority of these cases
> ciks_from_filings$company_name <- stringr::str_to_upper(ciks_from_filings$company_name)
> ciks_from_filings %>% inner_join(ciks)
Joining, by = c("cik", "company_name")
# A tibble: 749,818 x 2
cik company_name
<int> <chr>
1 3 DEFINED ASSET FUNDS MUNICIPAL INVT TR FD NEW YORK SER 33
2 14 DEFINED ASSET FUNDS MUNICIPAL INVT TR FD MON PYMT SER 155
3 17 DEFINED ASSET FUNDS MUNICIPAL INVT TR FD MON PYMT SER 156
4 18 NUVEEN TAX EXEMPT UNIT TRUST SERIES 169 NATIONAL TRUST 169
5 20 K TRON INTERNATIONAL INC
6 49 NEW YORK MUNICIPAL TRUST SERIES 15
7 51 UC CORP
8 1750 AAR CORP
9 1761 TRANZONIC COMPANIES
10 1800 ABBOTT LABORATORIES
# … with 749,808 more rows
Then, I did the following anti_joins, mostly by cik
> ciks %>% anti_join(ciks_from_filings, by = c('cik')) %>% arrange(cik, company_name)
# A tibble: 2,274 x 2
cik company_name
<int> <chr>
1 13 CORPORATE INCOME FUND SEVENTY NINTH SHORT TERM SERIES
2 2071 ACME ELECTRIC CORP EMPLOYEE STOCK PURCHASE PLAN
3 2903 AMVESTORS INVESTMENT GROUP INC /BD
4 2915 AINSBROOKE CORP
5 3656 ALLECO INC
6 3711 ALLEN ETHAN INC
7 3878 ALLIED FARM EQUIPMENT INC 1975 QUALIFIED STOCK OPTION PLAN
8 4261 ALTOONA & LOGAN VALLEY ELECTRIC RAILWAY
9 4280 ALUMINUM & CHEMICAL CORP
10 4444 AMERACE CORP 1967 QUAL STK OPT PL & 1972 QUAL-NON-QUAL STK O
# … with 2,264 more rows
> ciks %>% anti_join(ciks_from_filings, by = c('cik')) %>% anti_join(ciks_from_filings, by = c('company_name')) %>% arrange(cik, company_name)
# A tibble: 1,845 x 2
cik company_name
<int> <chr>
1 13 CORPORATE INCOME FUND SEVENTY NINTH SHORT TERM SERIES
2 2071 ACME ELECTRIC CORP EMPLOYEE STOCK PURCHASE PLAN
3 2903 AMVESTORS INVESTMENT GROUP INC /BD
4 2915 AINSBROOKE CORP
5 3656 ALLECO INC
6 3711 ALLEN ETHAN INC
7 3878 ALLIED FARM EQUIPMENT INC 1975 QUALIFIED STOCK OPTION PLAN
8 4261 ALTOONA & LOGAN VALLEY ELECTRIC RAILWAY
9 4280 ALUMINUM & CHEMICAL CORP
10 4444 AMERACE CORP 1967 QUAL STK OPT PL & 1972 QUAL-NON-QUAL STK O
# … with 1,835 more rows
> ciks %>% anti_join(ciks_from_filings, by = c('cik')) %>% inner_join(ciks_from_filings, by = c('company_name')) %>% arrange(cik.x, company_name)
# A tibble: 457 x 3
cik.x company_name cik.y
<int> <chr> <int>
1 4926 AMERICAN ENTERPRISE DEVELOPMENT CORP 1136725
2 859087 LA JOLLA PHARMACEUTICAL CO 920465
3 1001554 IMI INTERNATIONAL MEDICAL INNOVATIONS INC 1179083
4 1055295 FROBLICH PETER 1005782
5 1059124 WHITE WILLIAM H 1132020
6 1059329 WOOD CHESTER A JR 1390976
7 1068124 COMPANY NAME 1068130
8 1077027 WALKER CHRISTOPHER 1522583
9 1077421 BREEDEN RICHARD C 1300622
10 1077995 STONE JAMES 1515176
# … with 447 more rows
> ciks_from_filings %>% anti_join(ciks, by = c('cik')) %>% arrange(cik, company_name)
# A tibble: 2 x 2
cik company_name
<int> <chr>
1 1145141 JFL INTERNATIONAL EQUITY FUND LLC
2 1363814 AFFLECK-GRAVES JOHN
> ciks_from_filings %>% anti_join(ciks, by = c('cik')) %>% inner_join(ciks, by = 'company_name')
# A tibble: 2 x 3
cik.x company_name cik.y
<int> <chr> <int>
1 1145141 JFL INTERNATIONAL EQUITY FUND LLC 877248
2 1363814 AFFLECK-GRAVES JOHN 1297755
> ciks_from_filings %>% anti_join(ciks, by = c('cik')) %>% anti_join(ciks, by = 'company_name') %>% arrange(cik, company_name)
# A tibble: 0 x 2
# … with 2 variables: cik <int>, company_name <chr>
So it seems that that table generated from filings
# A tibble: 6,237 x 2
cik company_name
<int> <chr>
1 8775 "SEPARATE ACCOUNT VA B OF COMMONWEALTH ANNUITY & LIFE INSURANCE CO"
2 8776 "SEPARATE ACCOUNT VA-C OF COMMONWEALTH ANNUITY& LIFE INSURANCE CO"
3 13463 "BOTZUM C A & CO\\\\CA /BD"
4 23055 "GROUP ONE SOFTWARE INC"
5 31079 "WORKSAFE INDUSTRIAL INC"
6 32859 "ENDOWMENTS GROWTH & INCOME PORTFOLIO"
7 36104 "US BANCORP \\\\DE\\\\"
8 38395 "FOUNDERS ASSET MANAGEMENT INC \\\\CO\\\\ /BD"
9 39561 "ING BARING FURMAN SELZ LLC //NY"
10 41827 "GUARDIAN INVESTOR SERVICES LLC\\\\\\\\\\\\NY /BD"
# … with 6,227 more rows
> ciks_from_filings %>% anti_join(ciks) %>% arrange(cik, company_name)
Joining, by = c("cik", "company_name")
# A tibble: 5,890 x 2
cik company_name
<int> <chr>
1 8775 "SEPARATE ACCOUNT VA B OF COMMONWEALTH ANNUITY & LIFE INSURAN"
2 8776 "SEPARATE ACCOUNT VA-C OF COMMONWEALTH ANNUITY& LIFE INSURANC"
3 13463 "BOTZUM C A & CO\\CA /BD"
4 36104 "US BANCORP \\DE\\"
5 38395 "FOUNDERS ASSET MANAGEMENT INC \\CO\\ /BD"
6 41827 "GUARDIAN INVESTOR SERVICES LLC\\\\\\NY /BD"
7 49281 "HUSS SERVICES INC\\\\NY /BD"
8 61546 "MAHLER & EMERSON INC\\\\NY /BD"
9 78613 "D A PINCUS & CO INC\\ /BD"
10 103005 "VARIABLE ANNUITY ACCOUNT B OF ING LIFE INSURANCE & ANNUITY C"
# … with 5,880 more rows
Also, comparing the first few rows above, it seems filings
does not have the full name, particularly if it's long, and that there seems to be some escaping and encoding issues.
@bdcallen You never really came to a conclusion here. Here's my summary:
edgar.filings
nor edgar.ciks
is a subset of the other.company_name
in edgar.ciks
may be preferable for some purposes due to abbreviations and encoding issues.I think the actions are:
edgar.ciks
, but perhaps make a table that has "everything" in it.ciks
directory. (@dingtq: This is purely edgar
-related code, so doesn't belong in a directEDGAR repository.)edgar.filings
. This would involve identifying an index file with such an issue and tweaking the getSECIndexFile
function to try to get the encoding correct.Sys.setenv(PGHOST = "10.101.13.99", PGDATABASE="crsp")
library(dplyr, warn.conflicts = FALSE)
library(DBI)
pg <- dbConnect(RPostgres::Postgres())
rs <- dbExecute(pg, "SET search_path TO edgar")
filings <- tbl(pg, "filings")
ciks <- tbl(pg, "ciks")
ciks_from_filings <-
filings %>%
distinct(cik, company_name) %>%
compute()
# CIKs in edgar.ciks, but not edgar.filings
ciks %>%
anti_join(ciks_from_filings, by="cik") %>%
count() %>%
pull()
#> integer64
#> [1] 2166
ciks_from_filings %>%
anti_join(ciks, by="cik") %>%
count() %>%
pull()
#> integer64
#> [1] 2
ciks_from_filings %>%
anti_join(ciks, by="cik")
#> # Source: lazy query [?? x 2]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#> cik company_name
#> <int> <chr>
#> 1 1363814 Affleck-Graves John
#> 2 1145141 JFL INTERNATIONAL EQUITY FUND LLC
ciks_from_filings %>%
mutate(company_name = toupper(company_name)) %>%
anti_join(
ciks %>%
mutate(company_name = toupper(company_name))) %>%
arrange(cik, company_name) %>%
count()
#> Joining, by = c("cik", "company_name")
#> # Source: lazy query [?? x 1]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#> n
#> <int64>
#> 1 5892
ciks %>%
mutate(company_name = toupper(company_name)) %>%
anti_join(
ciks_from_filings %>%
mutate(company_name = toupper(company_name))) %>%
arrange(cik, company_name) %>%
count()
#> Joining, by = c("cik", "company_name")
#> # Source: lazy query [?? x 1]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#> n
#> <int64>
#> 1 8403
Created on 2020-04-09 by the reprex package (v0.3.0)
@iangow There seems to be a bulk extract file on EDGAR for looking up
cik
s here. Furthermore, the formatting of the file is extremely simple and thus highly scrapable. From the file, one can see that a table can easily be formed which linkscik
withcompany_name
. This issue is for making such a table from this file.