mccgr / edgar

Code to manage data related to SEC EDGAR
31 stars 15 forks source link

Remove get_ciks.py #74

Closed iangow closed 4 years ago

iangow commented 4 years ago

@bdcallen @dingtq I don't think this code is necessary at all.

iangow commented 4 years ago

I requested this because it seems slow when I went to update my own version of EDGAR.

This seems much faster:

library(dplyr, warn.conflicts = FALSE)
library(stringr)

parse_text <- function(lines) {
    mat <- str_split_fixed(lines, ":(?=(\\d{10}|$))", 3)
    colnames(mat) <- c("company_name", "cik", "nothing")
    as_tibble(mat) %>% select(1:2)
}

url <- "https://www.sec.gov/Archives/edgar/cik-lookup-data.txt"

ciks <- parse_text(readLines(url))
bdcallen commented 4 years ago

@iangow I just tried the above R code versus get_ciks.py. Yes, the R code seems faster by a factor of around 2 or 3, probably because of vectorization in the code. I wanted to compare the two tables before finishing this off, so I did

> ciks$cik <- as.integer(ciks$cik)
> ciks
# A tibble: 756,434 x 2
   company_name                           cik
   <chr>                                <int>
 1 !J INC                             1438823
 2 #1 A LIFESAFER HOLDINGS, INC.      1509607
 3 #1 ARIZONA DISCOUNT PROPERTIES LLC 1457512
 4 #1 PAINTBALL CORP                  1433777
 5 $ LLC                              1427189
 6 $AVY, INC.                         1655250
 7 & S MEDIA GROUP LLC                1447162
 8 &TV COMMUNICATIONS INC.            1479357
 9 &VEST DOMESTIC FUND II KPIV, L.P.  1802417
10 &VEST DOMESTIC FUND II LP          1800903
# … with 756,424 more rows
> ciks_table <- tbl(pg, sql("SELECT * FROM edgar.ciks")) %>% collect()
> ciks_table
# A tibble: 756,434 x 2
       cik company_name                      
 *   <dbl> <chr>                             
 1 1438823 !J INC                            
 2 1509607 #1 A LIFESAFER HOLDINGS, INC.     
 3 1457512 #1 ARIZONA DISCOUNT PROPERTIES LLC
 4 1433777 #1 PAINTBALL CORP                 
 5 1427189 $ LLC                             
 6 1655250 $AVY, INC.                        
 7 1447162 & S MEDIA GROUP LLC               
 8 1479357 &TV COMMUNICATIONS INC.           
 9 1802417 &VEST DOMESTIC FUND II KPIV, L.P. 
10 1800903 &VEST DOMESTIC FUND II LP         
# … with 756,424 more rows
> ciks_table %>% inner_join(ciks)
Joining, by = c("cik", "company_name")
# A tibble: 756,432 x 2
       cik company_name                      
     <dbl> <chr>                             
 1 1438823 !J INC                            
 2 1509607 #1 A LIFESAFER HOLDINGS, INC.     
 3 1457512 #1 ARIZONA DISCOUNT PROPERTIES LLC
 4 1433777 #1 PAINTBALL CORP                 
 5 1427189 $ LLC                             
 6 1655250 $AVY, INC.                        
 7 1447162 & S MEDIA GROUP LLC               
 8 1479357 &TV COMMUNICATIONS INC.           
 9 1802417 &VEST DOMESTIC FUND II KPIV, L.P. 
10 1800903 &VEST DOMESTIC FUND II LP         
# … with 756,422 more rows
> ciks_table %>% anti_join(ciks)
Joining, by = c("cik", "company_name")
# A tibble: 2 x 2
      cik company_name                                 
    <dbl> <chr>                                        
1 1538467 EXPLORA INVESTIMENTOS GESTÃO DE RECURSOS LTDA
2 1555512 PRAGMA GESTÃO DE PATRIMÔNIO LTDA             
> ciks %>% anti_join(ciks_table)
Joining, by = c("company_name", "cik")
# A tibble: 2 x 2
  company_name                                           cik
  <chr>                                                <int>
1 "EXPLORA INVESTIMENTOS GEST\xc3O DE RECURSOS LTDA" 1538467
2 "PRAGMA GEST\xc3O DE PATRIM\xd4NIO LTDA"           1555512

So all the rows in each table match, except for two rows in which the character encoding differs in the company_names.

Thus, I will put the R code in a file get_ciks.R and get rid of the python program. I still would like to keep the ciks table, as from memory there are some differences from the analogous table generated from filings (I will make some comments in the related issue).

iangow commented 4 years ago

Joining, by = c("cik", "company_name")

A tibble: 2 x 2

  cik company_name                                 
<dbl> <chr>                                        

1 1538467 EXPLORA INVESTIMENTOS GESTÃO DE RECURSOS LTDA 2 1555512 PRAGMA GESTÃO DE PATRIMÔNIO LTDA

ciks %>% anti_join(ciks_table) Joining, by = c("company_name", "cik")

A tibble: 2 x 2

company_name cik

1 "EXPLORA INVESTIMENTOS GEST\xc3O DE RECURSOS LTDA" 1538467 2 "PRAGMA GEST\xc3O DE PATRIM\xd4NIO LTDA" 1555512 ```

So all the rows in each table match, except for two rows in which the character encoding differs in the company_names.

Thus, I will put the R code in a file get_ciks.R and get rid of the python program. I still would like to keep the ciks table, as from memory there are some differences from the analogous table generated from filings (I will make some comments in the related issue).

I can't follow your code (which is the R-generated table?). But if the encoding is correct in the Python-generated one, but not the R-generated one, then we need to fix the R code. This shouldn't be too difficult, as it should be possible to assume the same encoding as the Python code uses. (Make a new issue for this and then, I think, close this one.)

bdcallen commented 4 years ago

@iangow

> ciks <- parse_text(readLines(url, encoding = 'latin1'))
> ciks$cik <- as.integer(ciks$cik)
> ciks <- ciks[, c('cik', 'company_name')]
> ciks %>% filter(cik ==  1555512)
# A tibble: 2 x 2
      cik company_name                    
    <int> <chr>                           
1 1555512 PRAGMA GESTAO DE PATRIMONIO LTD 
2 1555512 PRAGMA GESTÃO DE PATRIMÔNIO LTDA
> ciks %>% filter(cik == 1538467)
# A tibble: 1 x 2
      cik company_name                                 
    <int> <chr>                                        
1 1538467 EXPLORA INVESTIMENTOS GESTÃO DE RECURSOS LTDA
> ciks %>% anti_join(ciks_table)
Joining, by = c("cik", "company_name")
# A tibble: 0 x 2
# … with 2 variables: cik <int>, company_name <chr>
> ciks_table %>% anti_join(ciks)
Joining, by = c("cik", "company_name")
# A tibble: 0 x 2
# … with 2 variables: cik <dbl>, company_name <chr>

So changing the encoding to 'latin1' fixes the discrepancy. I was trying to get utf-8 to work, but R handles utf-8 conversions really weirdly it seems. Anyway, we have workable solution, so I will do the next trivial steps and close this issue.

iangow commented 4 years ago

So changing the encoding to 'latin1' fixes the discrepancy. I was trying to get utf-8 to work, but R handles utf-8 conversions really weirdly it seems.

You have to use whatever encoding is used in the file you're reading. Once the data are in R (or Python 3 or PostgresSQL [using default settings]), the data will be stored using UTF-8. It is possible that the encoding changes from file to file (e.g., if SEC changes it), but if it's valid text, there's just one encoding per file.