Closed by bdcallen 5 years ago
@iangow I just ran

SELECT SUM(size) FROM edgar.filing_docs WHERE size IS NOT NULL

and got 3635594889026, which corresponds to 3.64 terabytes (size in edgar.filing_docs is in bytes). This was obviously not run against the full set being downloaded in Boston, so the storage required could be significantly larger than this.
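As a quick sanity check on the unit conversion (using the decimal convention, 1 TB = 10^12 bytes):

```python
# Convert the SUM(size) result above from bytes to decimal terabytes.
total_bytes = 3_635_594_889_026   # result of SUM(size) on edgar.filing_docs
total_tb = total_bytes / 1e12     # 1 TB = 10^12 bytes (decimal convention)
print(round(total_tb, 2))         # 3.64
```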
OK. Perhaps also good to do it by type:
library(DBI)
library(dplyr, warn.conflicts = FALSE)

pg <- dbConnect(RPostgreSQL::PostgreSQL())  # bigint = "integer"

rs <- dbExecute(pg, "SET search_path TO edgar, public")
rs <- dbExecute(pg, "SET work_mem = '2GB'")

filings <- tbl(pg, "filings")
filing_docs <- tbl(pg, "filing_docs")

size_by_type <-
    filings %>%
    inner_join(filing_docs, by = "file_name") %>%
    group_by(form_type) %>%
    summarize(size = sum(size, na.rm = TRUE)) %>%
    compute()

size_by_type %>%
    filter(!is.na(size)) %>%
    mutate(size = size / 1e9) %>%
    arrange(desc(size))
#> # Source: lazy query [?? x 2]
#> # Database: postgres 9.6.11 [igow@10.101.13.99:5432/crsp]
#> # Ordered by: desc(size)
#> form_type size
#> <chr> <dbl>
#> 1 8-K 1317.
#> 2 10-K 1203.
#> 3 6-K 429.
#> 4 DEF 14A 208.
#> 5 4 128.
#> 6 10-K/A 77.5
#> 7 8-K/A 60.5
#> 8 13F-HR 47.0
#> 9 10-K405 15.8
#> 10 3 11.4
#> # … with more rows
Created on 2019-01-22 by the reprex package (v0.2.1)
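For reference, the same by-type aggregation could be expressed directly in SQL. This is a sketch assuming the join key (file_name) and schema shown above; it has not been run against the database:

```sql
-- Hypothetical SQL equivalent of the dplyr pipeline above,
-- reporting total document size per form type in gigabytes.
SELECT f.form_type,
       SUM(fd.size) / 1e9 AS size_gb
FROM edgar.filings AS f
INNER JOIN edgar.filing_docs AS fd USING (file_name)
WHERE fd.size IS NOT NULL
GROUP BY f.form_type
ORDER BY size_gb DESC;
```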
Maybe I should organize a 6TB hard drive and then we should just download it.
@iangow As discussed, this issue is to check how much storage space is needed to hold all the documents associated with filing_docs.