Handle data files for filing documents

iangow commented 6 years ago

IF you look here you will see that there are two groups of "filing documents".

As seen here, the code in get_filing_docs.R only extracts the first table. For now, we should examine how easy it is to modify the function get_filing_docs() to handle these cases. (I think it will be possible to identify the cases with this issue so that we don't have to scrape everything again.)

bdcallen commented 6 years ago

@iangow This turns out to be very easy. The package dplyr has a function called bind_rows which can bind two dataframes with the same columns, and html_table() returns a list of dataframes when run on a list of html/xml nodes. So in get_filing_docs(), one simply changes

df <-
            table_nodes %>%
            .[[1]] %>%
            html_table() %>%
            mutate(file_name = file_name)

to

df <- table_nodes %>% html_table() %>% bind_rows() %>% mutate(file_name = file_name)

I executed the latter on the table_nodes produced for the example url you gave me in the first post, and it worked. My only question is, do all the tables have the same columns for the urls for each file name? If not, we will obviously have to do some further updates to handle different cases.

bdcallen commented 6 years ago

Here is code to delete the problem cases identified by mismatches between seq and the number of filings. We have already run this code, so the problem files are now gone.

library(dplyr, warn.conflicts = FALSE)
library(RPostgreSQL, quietly = TRUE)

pg <- dbConnect(PostgreSQL())

rs <- dbExecute(pg, "SET search_path TO edgar")
rs <- dbExecute(pg, "SET work_mem TO '10GB'")

filing_docs <- tbl(pg, "filing_docs")

problems <-
    filing_docs %>%
    group_by(file_name) %>%
    summarise(max_seq = max(seq, na.rm = TRUE),
              n_docs = n()) %>%
    mutate(problem = max_seq > n_docs) %>%
    filter(problem) %>%
    ungroup() %>%
    compute(name = "problems", temporary = FALSE,
            indexes = "file_name")

problems %>% count(problem)

filing_docs <- tbl(pg, "filing_docs")
problems <- tbl(pg, "problems")

problems %>%
    inner_join(filing_docs) %>%
    summarise(n_fs = n_distinct(file_name))

dbGetQuery(pg, "
    BEGIN;
    DELETE FROM edgar.filing_docs 
          WHERE file_name IN (SELECT file_name FROM edgar.problems)")
dbGetQuery(pg, "COMMIT; ")
dbGetQuery(pg, "DROP TABLE IF EXISTS edgar.problems; ")

problems %>%
    inner_join(filing_docs) %>%
    summarise(n_fs = n_distinct(file_name))

dbDisconnect(pg)

mccgr / edgar

Handle data files for filing documents #12