Closed iangow closed 6 years ago
@iangow This turns out to be very easy. The package dplyr has a function called bind_rows which can bind two dataframes with the same columns, and html_table() returns a list of dataframes when run on a list of html/xml nodes. So in get_filing_docs(), one simply changes
df <-
table_nodes %>%
.[[1]] %>%
html_table() %>%
mutate(file_name = file_name)
to
df <- table_nodes %>% html_table() %>% bind_rows() %>% mutate(file_name = file_name)
I executed the latter on the table_nodes produced for the example url you gave me in the first post, and it worked. My only question is, do all the tables have the same columns for the urls for each file name? If not, we will obviously have to do some further updates to handle different cases.
Here is code to delete the problem cases identified by mismatches between seq
and the number of filings. We have already run this code, so the problem files are now gone.
library(dplyr, warn.conflicts = FALSE)
library(RPostgreSQL, quietly = TRUE)
pg <- dbConnect(PostgreSQL())
rs <- dbExecute(pg, "SET search_path TO edgar")
rs <- dbExecute(pg, "SET work_mem TO '10GB'")
filing_docs <- tbl(pg, "filing_docs")
problems <-
filing_docs %>%
group_by(file_name) %>%
summarise(max_seq = max(seq, na.rm = TRUE),
n_docs = n()) %>%
mutate(problem = max_seq > n_docs) %>%
filter(problem) %>%
ungroup() %>%
compute(name = "problems", temporary = FALSE,
indexes = "file_name")
problems %>% count(problem)
filing_docs <- tbl(pg, "filing_docs")
problems <- tbl(pg, "problems")
problems %>%
inner_join(filing_docs) %>%
summarise(n_fs = n_distinct(file_name))
dbGetQuery(pg, "
BEGIN;
DELETE FROM edgar.filing_docs
WHERE file_name IN (SELECT file_name FROM edgar.problems)")
dbGetQuery(pg, "COMMIT; ")
dbGetQuery(pg, "DROP TABLE IF EXISTS edgar.problems; ")
problems %>%
inner_join(filing_docs) %>%
summarise(n_fs = n_distinct(file_name))
dbDisconnect(pg)
IF you look here you will see that there are two groups of "filing documents".
As seen here, the code in get_filing_docs.R only extracts the first table. For now, we should examine how easy it is to modify the function
get_filing_docs()
to handle these cases. (I think it will be possible to identify the cases with this issue so that we don't have to scrape everything again.)