@iangow I just ran this

fil345 <-
    filings %>%
    filter(form_type %~% "^(3|4|5|3/A|4/A|5/A)$")

and this worked as far as the regex is concerned: it collects just Forms 3, 4 and 5, as well as the corresponding amended forms. Note the dollar sign at the end; I tried the above without it, but many undesired forms were collected, since the pattern was then only anchored at the beginning of the strings.
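To see why the anchor matters, here is a quick standalone check in base R (the form types below are just illustrative examples, not drawn from the database):

# Without the trailing "$", any form type that merely starts with 3, 4 or 5 also matches.
forms <- c("3", "4/A", "5", "425", "485BPOS", "3/A")
grepl("^(3|4|5|3/A|4/A|5/A)", forms)   # TRUE for all six, including 425 and 485BPOS
grepl("^(3|4|5|3/A|4/A|5/A)$", forms)  # TRUE only for Forms 3, 4, 5 and their amendments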
I just redid the counts, and found pretty much the same answer as we had this morning:
> fil345 %>% count()
# Source: lazy query [?? x 1]
# Database: postgres 9.6.9 [bdcallen@/var/run/postgresql:5432/crsp]
n
<dbl>
1 7613711.
> fil345 %>% group_by(form_type) %>% count()
# Source: lazy query [?? x 2]
# Database: postgres 9.6.9 [bdcallen@/var/run/postgresql:5432/crsp]
# Groups: form_type
form_type n
<chr> <dbl>
1 5 195910.
2 4/A 213361.
3 4 6565113.
4 3/A 36586.
5 5/A 7716.
6 3 595025.
@bdcallen A simple ^[345](/A)?$ should cover all these cases. We should still process the (10-K|DEF 14|8-K) cases too, but it's OK to put them aside for now and add them back later.
@iangow Making a couple of small changes to get_filing_docs
get_filing_docs <- function(file_name) {
head_url <- get_index_url(file_name)
table_nodes <-
read_html(head_url, encoding="Latin1") %>%
html_nodes("table")
if (length(table_nodes) < 1) {
df <- tibble(seq = NA, description = NA, document = NA, type = NA,
size = NA, file_name = file_name)
} else {
df <- table_nodes %>% html_table() %>% bind_rows() %>% mutate(file_name = file_name)
colnames(df) <- tolower(colnames(df))
        # Converting to character is necessary: for Forms 3, 4 and 5,
        # html_table() usually parses the type column as integer.
        df$type <- as.character(df$type)
}
pg <- dbConnect(PostgreSQL())
dbWriteTable(pg, c("edgar", "filing_docs_test"),
df, append = TRUE, row.names = FALSE)
dbDisconnect(pg)
return(TRUE)
}
namely, having it write to a test table, filing_docs_test (which I will delete later). I then ran
> system.time(temp <- lapply(file_names$file_name[1:1000], get_filing_docs))
user system elapsed
120.718 50.162 483.547
So it took roughly 8 minutes to process 1,000 filings, which translates to about 42 days for the full set of around 7.6 million filings, going by the elapsed time. It is also important to note that we have not used multiple cores here.
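For reference, the back-of-the-envelope extrapolation from the elapsed time is:

# Extrapolating the elapsed time above to the full set of ~7.6 million filings
483.547 / 1000               # ~0.48 seconds per filing
0.48 * 7.6e6 / (3600 * 24)   # ~42 days in total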
Nowadays, I try to avoid the df$type
syntax, as it's less portable.
Going parallel is tricky, as the SEC website will quickly boot you off or cause timeouts, and this leads to errors (I got errors on two cores out of eight using the code below).
I don't think we need to worry too much about this for now. I think if we just take a random sample of filings (say, 100,000), we could develop the code to process those and think about how to get the filings in the background. Worst case: It just takes a month to scrape the data.
It might be possible to speed things up by collecting more pages before writing to the database.
library(dplyr, warn.conflicts = FALSE)
library(RPostgreSQL, quietly = TRUE)
library(rvest, quietly = TRUE)
get_index_url <- function(file_name) {
matches <- stringr::str_match(file_name, "/(\\d+)/(\\d{10}-\\d{2}-\\d{6})")
cik <- matches[2]
acc_no <- matches[3]
path <- stringr::str_replace_all(acc_no, "[^\\d]", "")
url <- paste0("https://www.sec.gov/Archives/edgar/data/", cik, "/", path, "/",
acc_no, "-index.htm")
return(url)
}
fix_names <- function(df) {
colnames(df) <- tolower(colnames(df))
df
}
get_filing_docs <- function(file_name) {
head_url <- get_index_url(file_name)
table_nodes <-
read_html(head_url, encoding="Latin1") %>%
html_nodes("table")
if (length(table_nodes) < 1) {
df <- tibble(seq = NA, description = NA, document = NA, type = NA,
size = NA, file_name = file_name)
} else {
df <-
table_nodes %>%
html_table() %>%
bind_rows() %>%
fix_names() %>%
mutate(file_name = file_name,
type = as.character(type))
df
}
pg <- dbConnect(PostgreSQL())
dbWriteTable(pg, c("edgar", "filing_docs_test"),
df, append = TRUE, row.names = FALSE)
dbDisconnect(pg)
return(TRUE)
}
pg <- dbConnect(PostgreSQL())
filings <- tbl(pg, sql("SELECT * FROM edgar.filings"))
def14_a <-
filings %>%
filter(form_type %~% "^[345](/A)?$")
new_table <- !dbExistsTable(pg, c("edgar", "filing_docs"))
if (!new_table) {
filing_docs <- tbl(pg, sql("SELECT * FROM edgar.filing_docs"))
def14_a <- def14_a %>% anti_join(filing_docs, by = "file_name")
}
file_names <-
def14_a %>%
select(file_name) %>%
distinct() %>%
collect(n = 1000)
rs <- dbDisconnect(pg)
library(parallel)
system.time(temp <- mclapply(file_names$file_name, get_filing_docs, mc.cores = 8))
This is about 20 seconds for 100 filings. I think that writing the resulting data frame to PostgreSQL would not take long. So maybe some variant of this would be faster:
get_filing_docs <- function(file_name) {
head_url <- get_index_url(file_name)
table_nodes <-
read_html(head_url, encoding="Latin1") %>%
html_nodes("table")
if (length(table_nodes) < 1) {
df <- tibble(seq = NA, description = NA, document = NA, type = NA,
size = NA, file_name = file_name)
} else {
df <-
table_nodes %>%
html_table() %>%
bind_rows() %>%
fix_names() %>%
mutate(file_name = file_name,
type = as.character(type))
df
}
df
}
pg <- dbConnect(PostgreSQL())
filings <- tbl(pg, sql("SELECT * FROM edgar.filings"))
def14_a <-
filings %>%
filter(form_type %~% "^[345](/A)?$")
new_table <- !dbExistsTable(pg, c("edgar", "filing_docs"))
if (!new_table) {
filing_docs <- tbl(pg, sql("SELECT * FROM edgar.filing_docs"))
def14_a <- def14_a %>% anti_join(filing_docs, by = "file_name")
}
file_names <-
def14_a %>%
select(file_name) %>%
distinct() %>%
collect(n = 1000)
rs <- dbDisconnect(pg)
system.time(temp <- bind_rows(mclapply(file_names$file_name[1:100], get_filing_docs)))
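If that proves quick enough, the combined data frame could then be written to the database in a single call (a sketch, reusing the dbWriteTable() call and the filing_docs_test table from the earlier code; temp here is the result of the bind_rows() above):

# Sketch: one database write for the whole batch, instead of one write per filing
pg <- dbConnect(PostgreSQL())
dbWriteTable(pg, c("edgar", "filing_docs_test"),
             temp, append = TRUE, row.names = FALSE)
dbDisconnect(pg)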
@iangow
A simple ^[345](/A)?$ should cover all these cases.
I redefined fil345 using this, and it indeed does the job, yielding the same result.
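For reference, the redefinition is presumably just the earlier pipeline with the shorter regex:

# fil345 redefined with the simpler regex suggested above
fil345 <-
    filings %>%
    filter(form_type %~% "^[345](/A)?$")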
@iangow I guess this covers other issues as well as this one, but in the course of approaching it, I started thinking about turning get_filing_docs.R into a module/package called get_filing_doc_functions.R, since then we can use the same functions on the Forms 3, 4 and 5. To do this, I encapsulated the part of the code which downloads the filings into a function, get_filing_docs_by_type:
get_filing_docs_by_type <- function(type_regex) {
    # Takes a regular expression matching one or more form types and downloads
    # the filing documents for the matching filings into the database.
pg <- dbConnect(PostgreSQL())
filings <- tbl(pg, sql("SELECT * FROM edgar.filings"))
type_filings <-
filings %>%
filter(form_type %~% type_regex)
new_table <- !dbExistsTable(pg, c("edgar", "filing_docs"))
if (!new_table) {
filing_docs <- tbl(pg, sql("SELECT * FROM edgar.filing_docs"))
type_filings <- type_filings %>% anti_join(filing_docs, by = "file_name")
}
file_names <-
type_filings %>%
select(file_name) %>%
distinct() %>%
collect()
rs <- dbDisconnect(pg)
system.time(temp <- lapply(file_names$file_name, get_filing_docs))
if (new_table) {
pg <- dbConnect(PostgreSQL())
rs <- dbExecute(pg, "CREATE INDEX ON edgar.filing_docs (file_name)")
rs <- dbExecute(pg, "ALTER TABLE edgar.filing_docs OWNER TO edgar")
rs <- dbExecute(pg, "GRANT SELECT ON TABLE edgar.filing_docs TO edgar_access")
rs <- dbDisconnect(pg)
}
temp <- unlist(temp)
}
I'm then thinking of redefining get_filing_docs.R
to be this
#!/usr/bin/env Rscript
library(dplyr, warn.conflicts = FALSE)
library(RPostgreSQL, quietly = TRUE)
library(rvest, quietly = TRUE)
source("get_filing_doc_functions.R")
def14_a_regex <- "^(10-K|DEF 14|8-K)"
time_taken <- system.time(get_filing_docs_by_type(def14_a_regex))
print(time_taken)
(we may want to call this something else, since this just does the 10-K's, DEF 14's and 8-K's).
What do you think?
@iangow I have just made a couple of commits improving on my suggestion above. I have named the new programs get_filing_doc_functions.R and get_form_345_filing_docs.R.
In get_filing_doc_functions.R, I essentially split the main program part of get_filing_docs.R into two functions. The function get_filings_by_type extracts from edgar.filings the file names of the filings whose form_type matches type_regex and which do not already appear in edgar.filing_docs. The function process_filings can then take a subset of these filings and apply get_filing_docs to them, which allows us to process the set in batches, as can be seen in get_form_345_filing_docs.R.
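As a rough sketch of how these pieces might fit together (the real definitions are in get_filing_doc_functions.R; this simply reuses the filter/anti_join/collect pattern shown earlier, and the batch driver is hypothetical):

get_filings_by_type <- function(type_regex) {
    # Pull the file names of filings matching type_regex that are not yet in
    # edgar.filing_docs.
    pg <- dbConnect(PostgreSQL())
    filings <- tbl(pg, sql("SELECT * FROM edgar.filings"))
    type_filings <-
        filings %>%
        filter(form_type %~% type_regex)
    if (dbExistsTable(pg, c("edgar", "filing_docs"))) {
        filing_docs <- tbl(pg, sql("SELECT * FROM edgar.filing_docs"))
        type_filings <- type_filings %>% anti_join(filing_docs, by = "file_name")
    }
    file_names <-
        type_filings %>%
        select(file_name) %>%
        distinct() %>%
        collect()
    dbDisconnect(pg)
    file_names
}

# Hypothetical batch driver in the style of get_form_345_filing_docs.R:
# process the outstanding filings in batches of 200.
fil345 <- get_filings_by_type("^[345](/A)?$")
batch_size <- 200
batches <- split(seq_len(nrow(fil345)), ceiling(seq_len(nrow(fil345)) / batch_size))
for (idx in batches) {
    process_filings(fil345[idx, ])
}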
@iangow Over yesterday and today, I have had a go at running get_form_345_filing_docs.R several times, and have taken note of how long it is taking to download the data. My times were consistent with the code you put above (around 20 seconds for 100 filings, i.e. 0.2 seconds per filing), which translates to about 17-18 days for the full set.
Furthermore, I think it would be useful to have some try/catch mechanism in the function get_filing_docs, since I keep getting HTML timeouts relatively early on (around 50-60 batches of 200 filings in).
@iangow I just tried process_filings
defined as
process_filings <- function(filings_df) {
pg <- dbConnect(PostgreSQL())
new_table <- !dbExistsTable(pg, c("edgar", "filing_docs"))
system.time(temp <- mclapply(filings_df$file_name, get_filing_docs, mc.cores = 24))
if (new_table) {
rs <- dbExecute(pg, "CREATE INDEX ON edgar.filing_docs (file_name)")
rs <- dbExecute(pg, "ALTER TABLE edgar.filing_docs OWNER TO edgar")
rs <- dbExecute(pg, "GRANT SELECT ON TABLE edgar.filing_docs TO edgar_access")
}
rs <- dbDisconnect(pg)
temp <- unlist(temp)
return(temp)
}
on an unprocessed sample of 200 filings, with get_filing_docs having the new definition I gave it this morning, including the try/catch block:
get_filing_docs <- function(file_name) {
    tryCatch({
        head_url <- get_index_url(file_name)
        table_nodes <-
            read_html(head_url, encoding = "Latin1") %>%
            html_nodes("table")
        if (length(table_nodes) < 1) {
            df <- tibble(seq = NA, description = NA, document = NA, type = NA,
                         size = NA, file_name = file_name)
        } else {
            df <-
                table_nodes %>%
                html_table() %>%
                bind_rows() %>%
                fix_names() %>%
                mutate(file_name = file_name,
                       type = as.character(type))
        }
        pg <- dbConnect(PostgreSQL())
        dbWriteTable(pg, c("edgar", "filing_docs"),
                     df, append = TRUE, row.names = FALSE)
        dbDisconnect(pg)
        TRUE
    }, error = function(e) FALSE)
}
Hence, get_filing_docs returns FALSE if there is an HTML error.
Using process_filings
> system.time(temp <- process_filings(fil345[14601:14800, ]))
user system elapsed
156.760 81.584 29.555
I got that
> 29.555/200
[1] 0.147775
> 0.147775 * 7600000
[1] 1123090
> 1123090/(3600 * 24)
[1] 12.99873
the program would take just under 13 days to run through the full set, with a failure-to-download rate of 6.5%:
> sum(temp)
[1] 187
> sum(temp)/length(temp)
[1] 0.935
Note that I made a mistake before when calculating the time it would take with lapply, since I had an error in defining the starting and ending points of each batch of 200 filings (see one of my commits). After correcting the mistake, I found it was not consistent with 0.2 seconds per filing, but rather 0.5 seconds per filing, which would take roughly 7 weeks. Using mclapply with 24 cores provides a significant speed-up on this, though at the price of an increased failure rate. I don't think the failure rate is too large, however, and it might be useful to just do a run through the whole set, knowing we will have to repeat the program on the filings which have not been processed. What do you think?
@iangow I have just been running the program using 24 cores; every 10,000 filings the program prints out how many have been successfully processed, as well as the total running time. So far I have got
[1] "9261 processed out of 10000"
[1] "Total time taken: \n"
user system elapsed
8837.529 4781.405 1631.536
[1] "18525 processed out of 20000"
[1] "Total time taken: \n"
user system elapsed
18078.734 9763.072 3245.932
Hence, it is taking about 27 minutes for each 10,000 filings, with a failure rate of around 7.5%.
7e6 / 1e4 = 700 batches, at roughly half an hour each => ~350 hours => ~15 days. But perhaps less if we focus on the ones we expect to have .xml documents.
@iangow I've been letting the program go for nearly a day now, and this
SELECT COUNT(DISTINCT file_name) FROM edgar.filing_docs WHERE type in ('3', '4', '5')
gave 501822, so half a million filings have been successfully processed. I am happy to just let it continue for the moment. I intend to come in on Friday to continue with some of the other tasks; it should be interesting to see the progress it has made by then.
@iangow Just another update. Doing this
SELECT COUNT(DISTINCT file_name) FROM edgar.filing_docs WHERE file_name IN (SELECT file_name FROM edgar.filings WHERE form_type IN ('3', '4', '5', '3/A', '4/A', '5/A'))
gave 4300390, while a little beforehand this
SELECT COUNT(*) FROM edgar.filing_docs WHERE file_name IN (SELECT file_name FROM edgar.filings WHERE form_type IN ('3', '4', '5', '3/A', '4/A', '5/A'))
gave 13132751. So 4.3 million filings have been processed, with about 3 documents per filing on average, meaning there should be around 23 million documents in edgar.filing_docs
by the time all filings have been processed.
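For reference, the arithmetic behind that estimate:

# Rough arithmetic behind the estimate above
13132751 / 4300390   # ~3.05 documents per filing so far
3.05 * 7.6e6         # ~23 million documents expected in total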
@iangow OK, the program has just finished; I ran it until it could not process any more filings. There turned out to be 13 filings which the program could not process:
urls345 <- unlist(lapply(fil345$file_name, get_index_url))
> urls345
[1] "https://www.sec.gov/Archives/edgar/data/1187816/000172385417000007/0001723854-17-000007-index.htm"
[2] "https://www.sec.gov/Archives/edgar/data/1437790/000143779508000004/0001437795-08-000004-index.htm"
[3] "https://www.sec.gov/Archives/edgar/data/1437793/000143779508000004/0001437795-08-000004-index.htm"
[4] "https://www.sec.gov/Archives/edgar/data/1437794/000143779508000004/0001437795-08-000004-index.htm"
[5] "https://www.sec.gov/Archives/edgar/data/1437795/000143779508000004/0001437795-08-000004-index.htm"
[6] "https://www.sec.gov/Archives/edgar/data/1466085/000120919118031834/0001209191-18-031834-index.htm"
[7] "https://www.sec.gov/Archives/edgar/data/1466085/000120919118031841/0001209191-18-031841-index.htm"
[8] "https://www.sec.gov/Archives/edgar/data/1466085/000120919118031846/0001209191-18-031846-index.htm"
[9] "https://www.sec.gov/Archives/edgar/data/1466085/000120919118031851/0001209191-18-031851-index.htm"
[10] "https://www.sec.gov/Archives/edgar/data/1466085/000120919118031852/0001209191-18-031852-index.htm"
[11] "https://www.sec.gov/Archives/edgar/data/1466085/000120919118031856/0001209191-18-031856-index.htm"
[12] "https://www.sec.gov/Archives/edgar/data/1716885/000172385417000007/0001723854-17-000007-index.htm"
[13] "https://www.sec.gov/Archives/edgar/data/789863/000143779508000004/0001437795-08-000004-index.htm"
Most of these give a page which states that the filing is "temporarily unavailable", but there are some others (e.g. this) which lead to filings with multiple tables.
I guess we could open a small issue for these filings, and we need another one on updating the filings continuously, but other than that I think we can consider this issue closed.
@iangow I think we can close this issue, as it was about the generic problem of scraping Form 3, 4 and 5 filings, and getting the bulk of them. We could open new issues to do an update and set up a cronjob.
https://github.com/iangow-public/edgar/blob/97810d5248b9731e399b4876797107468c45ac68/get_filing_docs.R#L51
At a simple level, this is just a matter of editing the line above to include the relevant forms and running the code. But it might be worth testing on a sample to estimate how long this will take.
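As a sketch of the kind of change involved (the combined regex below is an assumption based on the patterns used earlier in this thread, not the actual contents of that line):

# Hypothetical combined regex: the forms already processed plus Forms 3, 4, 5
# and their amendments.
form_regex <- "^(10-K|DEF 14|8-K)|^[345](/A)?$"
grepl(form_regex, c("10-K", "DEF 14A", "8-K", "4", "3/A", "SC 13D"))
# TRUE TRUE TRUE TRUE TRUE FALSE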