Closed: iangow closed this issue 4 years ago.
I requested this because the existing script seemed slow when I went to update my own version of EDGAR. This seems much faster:
```r
library(dplyr, warn.conflicts = FALSE)
library(stringr)

parse_text <- function(lines) {
  # Split on ":" only when it is followed by a 10-digit CIK or the end of
  # the line, so colons inside company names are left untouched.
  mat <- str_split_fixed(lines, ":(?=(\\d{10}|$))", 3)
  colnames(mat) <- c("company_name", "cik", "nothing")
  as_tibble(mat) %>% select(1:2)
}

url <- "https://www.sec.gov/Archives/edgar/cik-lookup-data.txt"
ciks <- parse_text(readLines(url))
```
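The splitting logic above can be sketched in Python as well. This is a minimal illustration of the same regex idea, using made-up sample lines in the `NAME:CIK:` layout of the lookup file (the zero-padded 10-digit CIK format is assumed from that file):

```python
import re

def parse_line(line):
    # Split on ":" only when it is followed by a 10-digit CIK or the end
    # of the line, so colons inside company names are left untouched.
    parts = re.split(r":(?=\d{10}|$)", line, maxsplit=2)
    return parts[0], parts[1]

# Hypothetical lines in the cik-lookup-data.txt format: NAME:CIK:
print(parse_line("!J INC:0001438823:"))        # → ('!J INC', '0001438823')
print(parse_line("ACME: HOLDINGS:0001234567:"))  # colon in the name survives
```

The lookahead is what makes this robust: a plain split on `":"` would break any company name containing a colon.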
@iangow I just tried the above R code versus get_ciks.py. Yes, the R code seems faster by a factor of around 2 or 3, probably because it is vectorized. I wanted to compare the two tables before finishing this off, so I did:
```r
> ciks$cik <- as.integer(ciks$cik)
> ciks
# A tibble: 756,434 x 2
   company_name                          cik
   <chr>                               <int>
 1 !J INC                            1438823
 2 #1 A LIFESAFER HOLDINGS, INC.     1509607
 3 #1 ARIZONA DISCOUNT PROPERTIES LLC 1457512
 4 #1 PAINTBALL CORP                 1433777
 5 $ LLC                             1427189
 6 $AVY, INC.                        1655250
 7 & S MEDIA GROUP LLC               1447162
 8 &TV COMMUNICATIONS INC.           1479357
 9 &VEST DOMESTIC FUND II KPIV, L.P. 1802417
10 &VEST DOMESTIC FUND II LP         1800903
# … with 756,424 more rows
> ciks_table <- tbl(pg, sql("SELECT * FROM edgar.ciks")) %>% collect()
> ciks_table
# A tibble: 756,434 x 2
       cik company_name
 *   <dbl> <chr>
 1 1438823 !J INC
 2 1509607 #1 A LIFESAFER HOLDINGS, INC.
 3 1457512 #1 ARIZONA DISCOUNT PROPERTIES LLC
 4 1433777 #1 PAINTBALL CORP
 5 1427189 $ LLC
 6 1655250 $AVY, INC.
 7 1447162 & S MEDIA GROUP LLC
 8 1479357 &TV COMMUNICATIONS INC.
 9 1802417 &VEST DOMESTIC FUND II KPIV, L.P.
10 1800903 &VEST DOMESTIC FUND II LP
# … with 756,424 more rows
> ciks_table %>% inner_join(ciks)
Joining, by = c("cik", "company_name")
# A tibble: 756,432 x 2
       cik company_name
     <dbl> <chr>
 1 1438823 !J INC
 2 1509607 #1 A LIFESAFER HOLDINGS, INC.
 3 1457512 #1 ARIZONA DISCOUNT PROPERTIES LLC
 4 1433777 #1 PAINTBALL CORP
 5 1427189 $ LLC
 6 1655250 $AVY, INC.
 7 1447162 & S MEDIA GROUP LLC
 8 1479357 &TV COMMUNICATIONS INC.
 9 1802417 &VEST DOMESTIC FUND II KPIV, L.P.
10 1800903 &VEST DOMESTIC FUND II LP
# … with 756,422 more rows
> ciks_table %>% anti_join(ciks)
Joining, by = c("cik", "company_name")
# A tibble: 2 x 2
      cik company_name
    <dbl> <chr>
1 1538467 EXPLORA INVESTIMENTOS GESTÃO DE RECURSOS LTDA
2 1555512 PRAGMA GESTÃO DE PATRIMÔNIO LTDA
> ciks %>% anti_join(ciks_table)
Joining, by = c("company_name", "cik")
# A tibble: 2 x 2
  company_name                                           cik
  <chr>                                                <int>
1 "EXPLORA INVESTIMENTOS GEST\xc3O DE RECURSOS LTDA" 1538467
2 "PRAGMA GEST\xc3O DE PATRIM\xd4NIO LTDA"           1555512
```
So all the rows in each table match, except for two rows in which the character encoding differs in the `company_name`s. Thus, I will put the R code in a file `get_ciks.R` and get rid of the Python program. I still would like to keep the `ciks` table, as from memory there are some differences from the analogous table generated from `filings` (I will make some comments in the related issue).
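For readers less familiar with dplyr, `anti_join(x, y)` keeps the rows of `x` that have no matching row in `y` on the join columns. A rough Python sketch of that semantics, using made-up rows rather than the real EDGAR data:

```python
def anti_join(x, y):
    # Keep rows of x that have no matching row in y
    # (joining on all columns, as dplyr does by default).
    y_rows = set(y)
    return [row for row in x if row not in y_rows]

x = [("!J INC", 1438823), ("PRAGMA GESTÃO DE PATRIMÔNIO LTDA", 1555512)]
y = [("!J INC", 1438823)]
print(anti_join(x, y))  # only the unmatched row remains
```

This is why two rows whose names differ only in encoding show up in both `anti_join` directions: the byte-level difference in `company_name` prevents a match.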
I can't follow your code (which is the R-generated table?). But if the encoding is correct in the Python-generated one, but not the R-generated one, then we need to fix the R code. This shouldn't be too difficult, as it should be possible to assume the same encoding as the Python code uses. (Make a new issue for this and then, I think, close this one.)
@iangow
```r
> ciks <- parse_text(readLines(url, encoding = 'latin1'))
> ciks$cik <- as.integer(ciks$cik)
> ciks <- ciks[, c('cik', 'company_name')]
> ciks %>% filter(cik == 1555512)
# A tibble: 2 x 2
      cik company_name
    <int> <chr>
1 1555512 PRAGMA GESTAO DE PATRIMONIO LTD
2 1555512 PRAGMA GESTÃO DE PATRIMÔNIO LTDA
> ciks %>% filter(cik == 1538467)
# A tibble: 1 x 2
      cik company_name
    <int> <chr>
1 1538467 EXPLORA INVESTIMENTOS GESTÃO DE RECURSOS LTDA
> ciks %>% anti_join(ciks_table)
Joining, by = c("cik", "company_name")
# A tibble: 0 x 2
# … with 2 variables: cik <int>, company_name <chr>
> ciks_table %>% anti_join(ciks)
Joining, by = c("cik", "company_name")
# A tibble: 0 x 2
# … with 2 variables: cik <dbl>, company_name <chr>
```
So changing the encoding to 'latin1' fixes the discrepancy. I was trying to get UTF-8 to work, but R seems to handle UTF-8 conversions in odd ways. Anyway, we have a workable solution, so I will do the next trivial steps and close this issue.
> So changing the encoding to 'latin1' fixes the discrepancy. I was trying to get utf-8 to work, but R handles utf-8 conversions really weirdly it seems.
You have to use whatever encoding is used in the file you're reading. Once the data are in R (or Python 3 or PostgreSQL [using default settings]), they will be stored using UTF-8. It is possible that the encoding changes from file to file (e.g., if the SEC changes it), but if it's valid text, there's just one encoding per file.
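The point about per-file encodings can be seen directly with the bytes from the two mismatched rows above: decoded as Latin-1 they give the accented names, while the same bytes are not valid UTF-8. A small Python illustration (the byte string is taken from the `\xc3`/`\xd4` escapes in the output above):

```python
raw = b"PRAGMA GEST\xc3O DE PATRIM\xd4NIO LTDA"

# In Latin-1, 0xC3 is "Ã" and 0xD4 is "Ô", so the name decodes cleanly.
print(raw.decode("latin-1"))  # → PRAGMA GESTÃO DE PATRIMÔNIO LTDA

# The same bytes are not valid UTF-8: 0xC3 is a lead byte that must be
# followed by a continuation byte, and "O" (0x4F) is not one.
try:
    raw.decode("utf-8")
except UnicodeDecodeError:
    print("not valid UTF-8")
```

This is why reading the file without declaring `latin1` left raw `\xc3` bytes in the R strings instead of the accented characters.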
@bdcallen @dingtq I don't think this code is necessary at all.