mccgr / edgar

Code to manage data related to SEC EDGAR

Handling data files for filing documents with more than one table, with sequence unbroken #16

Closed bdcallen closed 6 years ago

bdcallen commented 6 years ago

Previously, in issue #12, we handled filing documents with multiple tables where the sequence numbers exceeded the length of the first table downloaded. However, there still exist cases where the sequence numbers run unbroken across the tables; see, for instance, this.

Having run some preliminary statistics on these cases, I find that the vast majority (roughly 80%) are 10-K's or 10-K/A's, with some 6-K's and 8-K's. There are also some cases with DEF 14's, where the additional table(s) do not contain files but do contain other information; for example, this.
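
For concreteness, a minimal sketch of what "unbroken" means here (the function name and the fill = TRUE parsing are mine; it assumes each document table on the filing's index page carries a Seq column, and uses the get_index_url() helper shown later in this thread):

library(rvest)

seq_is_unbroken <- function(file_name) {

    # Pull every <table> on the filing's index page and collect the values
    # of the Seq column (assumed present in each document table)
    seqs <-
        read_html(get_index_url(file_name), encoding = "Latin1") %>%
        html_nodes("table") %>%
        lapply(function(tab) html_table(tab, fill = TRUE)) %>%
        lapply(function(df) suppressWarnings(as.integer(df[["Seq"]]))) %>%
        unlist()

    seqs <- seqs[!is.na(seqs)]

    # TRUE when the sequence numbers form one consecutive run across tables
    all(diff(sort(seqs)) == 1)
}

When this returns TRUE, the sequence numbers continue seamlessly from one table into the next, so the table boundaries cannot be detected from the sequence numbers alone.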

iangow commented 6 years ago

@bdcallen This is good. I think we don't need to download .xml files (the former case), and I'm not sure what we would do with the extracted information for the DEF 14As (the latter case). So I think we can defer this one for now; there may be enough information here to make it easy to pick up later. If you have the code you used to produce the test sample and to run the tests, it would be good to paste it here if it doesn't have a natural place in the code repository.

bdcallen commented 6 years ago

@iangow Here's the code I used to generate a random sample.

library(dplyr, warn.conflicts = FALSE)
library(RPostgreSQL, quietly = TRUE)
library(rvest)
library(parallel)

pg <- dbConnect(PostgreSQL())

filing_docs <- tbl(pg, sql("SELECT * FROM edgar.filing_docs"))

problems <-
    filing_docs %>%
    group_by(file_name) %>%
    summarise(n_docs = n()) %>%
    data.frame()

# Shuffle the file names, then count the index-page tables for the first
# 10,000 of them in chunks of 1,000, using four cores per chunk.
# get_num_tables() is defined below.
rand_indices <- sample(1:nrow(problems), nrow(problems), replace = FALSE)
problems_shuffled <- problems[rand_indices, ]
problems <- NULL

prob_subsets <- list()

for (i in 1:10) {
    prob_subsets[[i]] <- problems_shuffled[((i - 1) * 1000 + 1):(i * 1000), ]
    prob_subsets[[i]]$num_tables <-
        unlist(mclapply(prob_subsets[[i]]$file_name, get_num_tables,
                        mc.cores = 4))
}

prob_subset <- bind_rows(prob_subsets)

files <- tbl(pg, sql("SELECT * FROM edgar.filings"))
files <- data.frame(files)  # collect edgar.filings locally

prob_subset <- inner_join(prob_subset, files, by = "file_name")

where

get_filing_docs_table_nodes <- function(file_name) {

    # Fetch the filing's index page and return its <table> nodes
    head_url <- get_index_url(file_name)

    table_nodes <-
        read_html(head_url, encoding = "Latin1") %>%
        html_nodes("table")

    return(table_nodes)
}

get_num_tables <- function(file_name) {

    # Number of tables on the filing's index page
    table_nodes <- get_filing_docs_table_nodes(file_name)
    num_tables <- length(table_nodes)
    return(num_tables)
}
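
As a usage sketch (the file_name below is made up for illustration; the format matches the edgar/data/<cik>/<accession>.txt paths these functions expect):

# Hypothetical file_name, made up for illustration
get_num_tables("edgar/data/1234567/0001234567-18-000123.txt")
#> [1] 1  # values greater than 1 flag the multi-table filings discussed above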

Funnily enough, the number of rows went down from 10,000 to 9,925 after the inner join, so some of the sampled file names seem to be missing from edgar.filings (the number of filings with more than one table did not change, though, so the missing files are evidently not among the multi-table ones). The sample contained 219 filings with more than one table, or 2.19% of the full 10,000.
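
If we wanted to pin down the discrepancy, an anti-join would do it (a hedged sketch, to be run before the inner join above reassigns prob_subset):

missing_files <- anti_join(prob_subset, files, by = "file_name")
nrow(missing_files)  # should be 75 = 10000 - 9925, per the counts above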

After this, I ran

mult_table_subset <- prob_subset[prob_subset$num_tables > 1, ]
100.00 * table(mult_table_subset$form_type)/nrow(mult_table_subset)

to obtain the percentage breakdown by form type of the filings with more than one table, and got the result

     10-K    10-K/A     10-KT       6-K     6-K/A       8-K   DEF 14A   DEF 14C 
77.625571  4.109589  0.456621  1.826484  0.456621  2.739726  9.132420  3.652968 

So indeed, roughly 80% are 10-K's or 10-K/A's, about 5% are 6-K's, 6-K/A's, or 8-K's, and most of the rest are DEF 14A's or DEF 14C's.

iangow commented 6 years ago

OK. I used slightly different code:

library(dplyr, warn.conflicts = FALSE)
library(RPostgreSQL, quietly = TRUE)
library(rvest)

get_filing_docs_table_nodes <- function(file_name) {

    head_url <- get_index_url(file_name)

    table_nodes <-
        read_html(head_url, encoding="Latin1") %>%
        html_nodes("table")
    return(table_nodes)

}

get_index_url <- function(file_name) {
    # Extract the CIK and accession number from the stored file_name
    # (e.g., edgar/data/<cik>/<accession>.txt), then build the URL of the
    # filing's index page on EDGAR
    matches <- stringr::str_match(file_name, "/(\\d+)/(\\d{10}-\\d{2}-\\d{6})")
    cik <- matches[2]
    acc_no <- matches[3]
    path <- stringr::str_replace_all(acc_no, "[^\\d]", "")

    url <- paste0("https://www.sec.gov/Archives/edgar/data/", cik, "/", path, "/",
                  acc_no, "-index.htm")
    return(url)
}
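
# Hypothetical example (accession number made up for illustration):
# get_index_url("edgar/data/1234567/0001234567-18-000123.txt")
# [1] "https://www.sec.gov/Archives/edgar/data/1234567/000123456718000123/0001234567-18-000123-index.htm"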

get_num_tables <- function(file_name) {

    table_nodes <- get_filing_docs_table_nodes(file_name)
    num_tables <- length(table_nodes)
    return(num_tables)
}

pg <- dbConnect(PostgreSQL())
rs <- dbExecute(pg, "SET search_path TO edgar")

filing_docs <- tbl(pg, "filing_docs")
filings <- tbl(pg, "filings")

# Draw a random sample of 10,000 distinct file names on the database side
rand_filings <-
    filing_docs %>%
    select(file_name) %>%
    distinct() %>%
    mutate(rand = random()) %>%
    top_n(10000, rand) %>%
    compute()

problems <- 
    rand_filings %>% 
    group_by(file_name) %>% 
    summarise(n_docs = n()) %>%
    ungroup() %>%
    collect() 

system.time({
    problems_mod <-
        problems %>%
        rowwise() %>%
        mutate(num_tables = get_num_tables(file_name))
})

problems_mod <- copy_to(pg, problems_mod, overwrite=TRUE)

problems_mod %>%
    mutate(problem = num_tables > 1) %>%
    inner_join(filings) %>% 
    group_by(form_type) %>%
    summarise(prop_problems = sum(as.integer(problem), na.rm=TRUE)*1.0/n()) %>%
    filter(prop_problems > 0) %>%
    arrange(desc(prop_problems))

dbDisconnect(pg)

I also looked at a different statistic.

> problems_mod %>%
+     mutate(problem = num_tables > 1) %>%
+     inner_join(filings) %>% 
+     group_by(form_type) %>%
+     summarise(prop_problems = sum(as.integer(problem), na.rm=TRUE)*1.0/n()) %>%
+     filter(prop_problems > 0) %>%
+     arrange(desc(prop_problems))
Joining, by = "file_name"
# Source:     lazy query [?? x 2]
# Database:   postgres 9.6.8 [igow@localhost:5432/crsp]
# Ordered by: desc(prop_problems)
  form_type prop_problems
  <chr>             <dbl>
1 10-K            0.291  
2 DEF 14C         0.163  
3 10-K/A          0.0769 
4 DEF 14A         0.0165 
5 8-K/A           0.00338
6 6-K             0.00287
7 8-K             0.00127

Based on this, I think we'd only worry about the first three or four form types.

iangow commented 6 years ago

It might make sense to create one issue for the 10-K and 10-K/A cases and a separate issue for the DEF 14C cases. When there are multiple tables, is there any information that we'd want to download? (Not sure we care about .xml files at this stage.)