Complete filing_docs

iangow commented 5 years ago

To date, we have populated filing_docs only for certain form types. It seems most filings are now processed in this table, so perhaps it makes sense to process the rest. The first step would be to estimate how much time this will require, both as a one-off update and for regular updates of the full table (e.g., daily or weekly).

library(DBI)
library(dplyr, warn.conflicts = FALSE)
Sys.setenv(PGHOST = "10.101.13.99", PGDATABASE = "crsp")

pg <- dbConnect(RPostgres::Postgres(), bigint = "integer")
rs <- dbExecute(pg, "SET search_path TO edgar, public")
rs <- dbExecute(pg, "SET work_mem = '10GB'")
filings <- tbl(pg, "filings")
filing_docs <- tbl(pg, "filing_docs")

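# Flag each filing that already has at least one row in filing_docs.
processed <-
    filing_docs %>%
    distinct(file_name) %>%
    mutate(processed = TRUE)

# Join the flag onto all filings; filings with no match are unprocessed.
raw_data <-
    filings %>%
    left_join(processed) %>%
    mutate(processed = coalesce(processed, FALSE)) %>%
    compute()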
#> Joining, by = "file_name"

raw_data %>% count(processed)
#> # Source:   lazy query [?? x 2]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#>   processed        n
#>   <lgl>        <int>
#> 1 FALSE      7684233
#> 2 TRUE      10450475

raw_data %>%
    filter(!processed) %>%
    count(form_type) %>%
    arrange(desc(n)) %>%
    print(n=40)
#> # Source:     lazy query [?? x 2]
#> # Database:   postgres [igow@10.101.13.99:5432/crsp]
#> # Ordered by: desc(n)
#>    form_type      n
#>    <chr>      <int>
#>  1 SC 13G/A  634591
#>  2 10-Q      571555
#>  3 497       411439
#>  4 SC 13G    354277
#>  5 424B3     274404
#>  6 424B2     231061
#>  7 SC 13D/A  218746
#>  8 D         208182
#>  9 CORRESP   183725
#> 10 UPLOAD    181266
#> 11 485BPOS   170096
#> 12 497K      166910
#> 13 24F-2NT   160888
#> 14 D/A       142486
#> 15 FWP       128881
#> 16 REGDEX    126758
#> 17 10QSB     119141
#> 18 497J      114486
#> 19 424B5     112150
#> 20 EFFECT    108964
#> 21 SC 13D    104784
#> 22 S-4/A     101171
#> 23 425        94280
#> 24 N-Q        90672
#> 25 DEFA14A    85292
#> 26 S-4        84572
#> 27 X-17A-5    78646
#> 28 S-8        78392
#> 29 N-30D      72924
#> 30 NSAR-A     72885
#> 31 NSAR-B     72187
#> 32 NT 10-Q    68440
#> 33 10-D       63865
#> 34 4          57100
#> 35 SUPPL      55562
#> 36 N-CSR      55343
#> 37 S-1/A      54132
#> 38 S-3        50031
#> 39 40-17G     48646
#> 40 N-CSRS     48223
#> # ... with more rows

Created on 2018-11-23 by the reprex package (v0.2.1)
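
As a rough starting point for the time estimate, the unprocessed count above can be turned into wall-clock hours. This is only a back-of-envelope sketch: `estimate_hours` is a hypothetical helper, and the per-filing time and worker count are assumed values that would need to be measured on a sample of filings first.

```r
# Unprocessed filing count taken from the count(processed) output above.
n_unprocessed <- 7684233

# Hypothetical helper: wall-clock hours needed to clear a backlog, given an
# assumed average processing time per filing and a number of parallel workers.
estimate_hours <- function(n_filings, secs_per_filing, n_workers = 1) {
    n_filings * secs_per_filing / n_workers / 3600
}

# Assumed values for illustration only; replace with measured timings.
one_off_hours <- estimate_hours(n_unprocessed,
                                secs_per_filing = 0.5,
                                n_workers = 8)
print(round(one_off_hours, 1))
```

The same function could be applied to the number of new filings per day or per week to gauge the cost of regular updates.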

iangow commented 5 years ago

Also, we should work out which antecedent tables need updating (perhaps it's just filings).

And I think we need to start backing up this table, as it is costly to produce.
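
One possible way to back up the table is a table-level dump with `pg_dump`. The sketch below only builds and prints the command. The host and database come from the code above, while the output file name is an assumption, and the final call is left commented out.

```r
# Assumed connection details and output path; adjust for the actual server.
args <- c("--host", "10.101.13.99",
          "--dbname", "crsp",
          "--table", "edgar.filing_docs",
          "--format", "custom",              # compressed, pg_restore-compatible
          "--file", "filing_docs.backup")    # hypothetical output file name

# Build the command line for inspection before running anything.
cmd <- paste("pg_dump", paste(args, collapse = " "))
cat(cmd, "\n")

# Uncomment to actually run the dump (requires pg_dump on the PATH).
# system2("pg_dump", args)
```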

bdcallen commented 4 years ago

@iangow Close this?

iangow commented 4 years ago

Not sure. It should be pretty easy to adapt the code above to work out whether there are gaps remaining (basically compare processed with filings and perhaps group_by(form_type) and then count).
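
A minimal sketch of that check, using small in-memory stand-ins rather than the real tables so it runs without a database connection. The toy rows and the `document` column are made up for illustration.

```r
library(dplyr, warn.conflicts = FALSE)

# Toy stand-ins for the filings and filing_docs tables (illustrative data).
filings <- tibble(
    file_name = c("a.txt", "b.txt", "c.txt", "d.txt", "e.txt"),
    form_type = c("10-K", "10-K", "10-Q", "10-Q", "8-K")
)
filing_docs <- tibble(
    file_name = c("a.txt", "a.txt", "b.txt", "c.txt"),
    document = c("a1.htm", "a2.htm", "b1.htm", "c1.htm")
)

# A filing counts as processed if it has at least one row in filing_docs.
processed <-
    filing_docs %>%
    distinct(file_name) %>%
    mutate(processed = TRUE)

# Count filings and missing filings per form type; keep only types with gaps.
gaps <-
    filings %>%
    left_join(processed, by = "file_name") %>%
    mutate(processed = coalesce(processed, FALSE)) %>%
    group_by(form_type) %>%
    summarize(n_filings = n(),
              n_missing = sum(!processed),
              .groups = "drop") %>%
    filter(n_missing > 0) %>%
    arrange(desc(n_missing))

print(gaps)
```

An empty result would mean no gaps remain. Against the database tables through dbplyr, the logical column would probably need a cast before summing, e.g. `sum(as.integer(!processed))`.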

iangow commented 4 years ago

@bdcallen Do you want to do that? Then I think we could close this.

mccgr / edgar

Complete filing_docs #40