@bdcallen This is good. I think we don't need to download .xml files (the former case), and I'm not sure what we would do with the extracted information for the DEF 14As (the latter case). So I think we can defer this one for now. There might be enough information here to make this easy to pick up later. If you have the code that produced the sample you tested and the code for testing it, it would be good to paste both here if they don't have a natural place in the code repository.
@iangow Here's the code I used to generate a random sample.
library(dplyr)
library(parallel)   # for mclapply()

# filing_docs is edgar.filing_docs and pg an open connection (set up as further down the thread)
problems <- filing_docs %>% group_by(file_name) %>% summarise(n_docs = n()) %>% data.frame()

# Shuffle the filings so that the batches below form a random sample
rand_indices <- sample(1:nrow(problems), nrow(problems), replace = FALSE)
problems_shuffled <- problems[rand_indices, ]
problems <- NULL

# Count the tables on the index pages of the first 10,000 filings, in batches of 1,000
prob_subsets <- list()
for (i in 1:10) {
    prob_subsets[[i]] <- problems_shuffled[((i - 1) * 1000 + 1):(i * 1000), ]
    prob_subsets[[i]]$num_tables <- unlist(mclapply(prob_subsets[[i]]$file_name, get_num_tables, mc.cores = 4))
}
prob_subset <- bind_rows(prob_subsets)

# Attach form types from edgar.filings
files <- tbl(pg, sql("SELECT * FROM edgar.filings"))
prob_subset <- inner_join(prob_subset, files, by = "file_name", copy = TRUE)
where
# Parse the EDGAR index page for a filing and return its <table> nodes
get_filing_docs_table_nodes <- function(file_name) {
    head_url <- get_index_url(file_name)
    table_nodes <-
        read_html(head_url, encoding = "Latin1") %>%
        html_nodes("table")
    return(table_nodes)
}

# Number of tables on the filing's index page
get_num_tables <- function(file_name) {
    table_nodes <- get_filing_docs_table_nodes(file_name)
    num_tables <- length(table_nodes)
    return(num_tables)
}
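For reference, a quick usage sketch (the file_name here is made up purely for illustration, get_index_url is defined further down the thread, and rvest needs to be loaded):

library(rvest)

fname <- "edgar/data/1234567/0001234567-18-000123.txt"   # hypothetical file_name
get_index_url(fname)    # index page that the tables are scraped from
get_num_tables(fname)   # 1 for most filings, > 1 for the problem cases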
Funnily enough, the number of rows went down from 10000 to 9925 after the inner join, so some filings in edgar.filing_docs seem to be missing from edgar.filings (though the number of filings with more than one table did not change, so the missing ones are apparently not among the problem cases). The sample contained 219 filings with more than one table, or 2.19% of the full sample of 10000.
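One way to see which filings fall out of the join (just a sketch, run after files is defined but before the inner_join overwrites prob_subset) is an anti_join against edgar.filings:

# Filings in the sample with no match in edgar.filings
missing <- anti_join(prob_subset,
                     files %>% select(file_name) %>% collect(),
                     by = "file_name")
nrow(missing)   # 10000 - 9925 = 75 for the sample above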
After this, I ran
mult_table_subset <- prob_subset[prob_subset$num_tables > 1, ]
100.00 * table(mult_table_subset$form_type)/nrow(mult_table_subset)
to obtain the percentage breakdown by form type of the filings with more than one table, and got the result
     10-K    10-K/A     10-KT       6-K     6-K/A       8-K   DEF 14A   DEF 14C
77.625571  4.109589  0.456621  1.826484  0.456621  2.739726  9.132420  3.652968
So indeed, roughly 80% are 10-K's (including 10-K/A's and 10-KT's), about 5% are 6-K's or 8-K's, and the rest are DEF 14A's or DEF 14C's.
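The same percentages can be collapsed into families directly, which makes that comparison explicit (a small sketch using the mult_table_subset data frame from above; the family labels are just for illustration):

mult_table_subset %>%
    mutate(family = case_when(grepl("^10-K", form_type) ~ "10-K family",
                              grepl("^(6-K|8-K)", form_type) ~ "6-K/8-K",
                              grepl("^DEF 14", form_type) ~ "DEF 14A/C",
                              TRUE ~ "other")) %>%
    count(family) %>%
    mutate(pct = 100 * n / sum(n))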
OK. I used slightly different code:
library(dplyr, warn.conflicts = FALSE)
library(RPostgreSQL, quietly = TRUE)
library(rvest)
# Parse the EDGAR index page for a filing and return its <table> nodes
get_filing_docs_table_nodes <- function(file_name) {
    head_url <- get_index_url(file_name)
    table_nodes <-
        read_html(head_url, encoding = "Latin1") %>%
        html_nodes("table")
    return(table_nodes)
}

# Construct the URL of the filing's index page from its file_name
get_index_url <- function(file_name) {
    matches <- stringr::str_match(file_name, "/(\\d+)/(\\d{10}-\\d{2}-\\d{6})")
    cik <- matches[2]
    acc_no <- matches[3]
    path <- stringr::str_replace_all(acc_no, "[^\\d]", "")
    url <- paste0("https://www.sec.gov/Archives/edgar/data/", cik, "/", path, "/",
                  acc_no, "-index.htm")
    return(url)
}

# Number of tables on the filing's index page
get_num_tables <- function(file_name) {
    table_nodes <- get_filing_docs_table_nodes(file_name)
    num_tables <- length(table_nodes)
    return(num_tables)
}
pg <- dbConnect(PostgreSQL())
rs <- dbExecute(pg, "SET search_path TO edgar")

filing_docs <- tbl(pg, "filing_docs")
filings <- tbl(pg, "filings")

# Draw a random sample of 10,000 filings
rand_filings <-
    filing_docs %>%
    select(file_name) %>%
    distinct() %>%
    mutate(rand = random()) %>%
    top_n(10000) %>%
    compute()

problems <-
    rand_filings %>%
    group_by(file_name) %>%
    summarise(n_docs = n()) %>%
    ungroup() %>%
    collect()

# Count tables on each filing's index page (slow: one HTTP request per filing)
system.time({
    problems_mod <-
        problems %>%
        rowwise() %>%
        mutate(num_tables = get_num_tables(file_name))
})

problems_mod <- copy_to(pg, problems_mod, overwrite = TRUE)

# Proportion of filings of each form type with more than one table
problems_mod %>%
    mutate(problem = num_tables > 1) %>%
    inner_join(filings) %>%
    group_by(form_type) %>%
    summarise(prop_problems = sum(as.integer(problem), na.rm = TRUE) * 1.0 / n()) %>%
    filter(prop_problems > 0) %>%
    arrange(desc(prop_problems))

dbDisconnect(pg)
I also looked at a different statistic: for each form type, the proportion of filings that have more than one table.
> problems_mod %>%
+ mutate(problem = num_tables > 1) %>%
+ inner_join(filings) %>%
+ group_by(form_type) %>%
+ summarise(prop_problems = sum(as.integer(problem), na.rm=TRUE)*1.0/n()) %>%
+ filter(prop_problems > 0) %>%
+ arrange(desc(prop_problems))
Joining, by = "file_name"
# Source: lazy query [?? x 2]
# Database: postgres 9.6.8 [igow@localhost:5432/crsp]
# Ordered by: desc(prop_problems)
form_type prop_problems
<chr> <dbl>
1 10-K 0.291
2 DEF 14C 0.163
3 10-K/A 0.0769
4 DEF 14A 0.0165
5 8-K/A 0.00338
6 6-K 0.00287
7 8-K 0.00127
I think based on this we'd only worry about the first 3 or 4 form types. It might make sense to create an issue for 10-K and 10-K/A cases and to create a separate issue for DEF 14C cases. When there are multiple tables, is there any information that we'd want to download? (Not sure we care about .xml files at this stage.)
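To get a feel for whether the extra tables hold anything worth keeping, one could pull their contents into data frames. A rough sketch (not part of the code above; it reuses get_filing_docs_table_nodes and rvest's html_table, taking the first multi-table filing from mult_table_subset as an example):

library(rvest)

# First filing in the sample with more than one table
fname <- mult_table_subset$file_name[1]
tables <- get_filing_docs_table_nodes(fname)

# Convert each table after the first to a data frame and inspect by eye
extra_tables <- lapply(tables[-1], html_table)
extra_tables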
Previously, in issue #12, we handled filing documents with multiple tables where the sequence numbers exceeded the length of the first table downloaded. However, there still exist some cases where the sequence numbers run unbroken across the tables; see, for instance, this.
Some preliminary statistics on these cases suggest that the vast majority (roughly 80%) are 10-K's or 10-K/A's, with some 6-K's and 8-K's. There are also cases involving DEF 14's, where the additional table(s) do not contain files but do contain other information; see, for example, this.