mccgr / edgar

Code to manage data related to SEC EDGAR

Check how much memory is required to download all edgar filing documents #54

Closed bdcallen closed 5 years ago

bdcallen commented 5 years ago

@iangow As discussed, this issue is to check how much disk space is needed to store all the documents associated with filing_docs.

bdcallen commented 5 years ago

@iangow I just ran this:

SELECT SUM(size) FROM edgar.filing_docs WHERE size IS NOT NULL;

and got 3635594889026, which corresponds to 3.64 terabytes (size in edgar.filing_docs is in bytes). This obviously does not cover the full set, which is being downloaded in Boston, so the storage needed could be significantly larger than this.
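
For reference, the unit conversion can be checked with a few lines of R. This is only a sketch of the arithmetic; the variable names are made up for illustration, and the only input is the byte total reported by the query.

```r
# Total reported by the SUM(size) query, in bytes.
total_bytes <- 3635594889026

# Decimal terabytes (10^12 bytes), the unit hard drives are usually sold in.
tb <- total_bytes / 1e12

# Binary tebibytes (2^40 bytes), the unit many operating systems report.
tib <- total_bytes / 2^40

round(c(TB = tb, TiB = tib), 2)
```

That gives about 3.64 TB, or roughly 3.31 TiB, so the figure quoted is in decimal units.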

iangow commented 5 years ago

OK. Perhaps also good to do it by type:

library(DBI)
library(dplyr, warn.conflicts = FALSE)

pg <- dbConnect(RPostgreSQL::PostgreSQL()) #  bigint = "integer")
rs <- dbExecute(pg, "SET search_path TO edgar, public")
rs <- dbExecute(pg, "SET work_mem = '2GB'")

filings <- tbl(pg, "filings")
filing_docs <- tbl(pg, "filing_docs")

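# Total document size (in bytes) per form type, computed inside the database
size_by_type <-
    filings %>%
    inner_join(filing_docs, by = "file_name") %>%
    group_by(form_type) %>%
    summarize(size = sum(size, na.rm = TRUE)) %>%
    compute()

# Convert bytes to gigabytes and list the largest form types first
size_by_type %>%
    filter(!is.na(size)) %>%
    mutate(size = size / 1e9) %>%
    arrange(desc(size))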
#> # Source:     lazy query [?? x 2]
#> # Database:   postgres 9.6.11 [igow@10.101.13.99:5432/crsp]
#> # Ordered by: desc(size)
#>    form_type   size
#>    <chr>      <dbl>
#>  1 8-K       1317. 
#>  2 10-K      1203. 
#>  3 6-K        429. 
#>  4 DEF 14A    208. 
#>  5 4          128. 
#>  6 10-K/A      77.5
#>  7 8-K/A       60.5
#>  8 13F-HR      47.0
#>  9 10-K405     15.8
#> 10 3           11.4
#> # … with more rows

Created on 2019-01-22 by the reprex package (v0.2.1)
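
To put the by-type numbers in context, here is a rough sketch that compares the ten form types printed above with the overall SUM(size) total from earlier in the thread. The data frame is typed in by hand from the rounded output, and the two totals come from different queries (the join on file_name may not match row for row), so the shares are approximate at best.

```r
# Top ten form types, copied by hand from the printed (rounded) output, in GB.
top_types <- data.frame(
    form_type = c("8-K", "10-K", "6-K", "DEF 14A", "4",
                  "10-K/A", "8-K/A", "13F-HR", "10-K405", "3"),
    size_gb = c(1317, 1203, 429, 208, 128, 77.5, 60.5, 47.0, 15.8, 11.4),
    stringsAsFactors = FALSE
)

# Overall total from the earlier SUM(size) query, converted from bytes to GB.
total_gb <- 3635594889026 / 1e9

# Approximate share of the overall total for each form type.
top_types$share <- top_types$size_gb / total_gb

# Combined size and share of the ten listed form types.
top10_gb <- sum(top_types$size_gb)
top10_share <- top10_gb / total_gb

round(c(top10_gb = top10_gb, top10_share = top10_share), 2)
```

On these rounded figures the ten listed types add up to about 3,497 GB, roughly 96% of the overall total, with 8-K and 10-K alone making up about two thirds.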

iangow commented 5 years ago

Maybe I should organize a 6TB hard drive and then we should just download it.
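
A quick sketch of the headroom a 6TB drive would leave, assuming the drive holds its nominal 6 x 10^12 bytes (usable capacity after formatting will be somewhat lower) and taking the current SUM(size) total as the starting point. The variable names are illustrative only.

```r
# Current total from the SUM(size) query, in bytes.
total_bytes <- 3635594889026

# Nominal capacity of a 6 TB drive, in bytes (assumption: decimal TB).
drive_bytes <- 6e12

# Fraction of the drive the current total would occupy.
used_fraction <- total_bytes / drive_bytes

# How many times larger the full set could be before the drive fills up.
growth_factor <- drive_bytes / total_bytes

round(c(used_fraction = used_fraction, growth_factor = growth_factor), 2)
```

On these assumptions the current total would fill about 61% of the drive, so the full set could be around 1.65 times the current total before space runs out.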