@iangow I just ran this

fil345 <-
    filings %>%
    filter(form_type %~% "^(3|4|5|3/A|4/A|5/A)$")

and this worked as far as the regex is concerned: it collects just Forms 3, 4 and 5, as well as the corresponding amended forms. Note the dollar sign at the end; I tried the above without it, but many undesired forms were collected, since the pattern was then only anchored at the beginning of the strings.
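To see why the anchor matters, here is a quick standalone check in base R (the form types below are just illustrative examples, not drawn from the database):

# Without the trailing "$", any form type that merely starts with 3, 4 or 5 also matches.
forms <- c("3", "4/A", "5", "425", "485BPOS", "3/A")
grepl("^(3|4|5|3/A|4/A|5/A)", forms)   # TRUE for all six, including 425 and 485BPOS
grepl("^(3|4|5|3/A|4/A|5/A)$", forms)  # TRUE only for Forms 3, 4, 5 and their amendments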
I just redid the counts, and found pretty much the same answer as we had this morning:
> fil345 %>% count()
# Source: lazy query [?? x 1]
# Database: postgres 9.6.9 [bdcallen@/var/run/postgresql:5432/crsp]
n
<dbl>
1 7613711.
> fil345 %>% group_by(form_type) %>% count()
# Source: lazy query [?? x 2]
# Database: postgres 9.6.9 [bdcallen@/var/run/postgresql:5432/crsp]
# Groups: form_type
form_type n
<chr> <dbl>
1 5 195910.
2 4/A 213361.
3 4 6565113.
4 3/A 36586.
5 5/A 7716.
6 3 595025.
@bdcallen A simple ^[345](/A)?$ should cover all these cases. We should still process the (10-K|DEF 14|8-K) cases too, but it's OK to put them aside for now and add them back later.
@iangow Making a couple of small changes to get_filing_docs
get_filing_docs <- function(file_name) {
head_url <- get_index_url(file_name)
table_nodes <-
read_html(head_url, encoding="Latin1") %>%
html_nodes("table")
if (length(table_nodes) < 1) {
df <- tibble(seq = NA, description = NA, document = NA, type = NA,
size = NA, file_name = file_name)
} else {
df <- table_nodes %>% html_table() %>% bind_rows() %>% mutate(file_name = file_name)
colnames(df) <- tolower(colnames(df))
        # Converting to character is necessary: for Forms 3, 4 and 5,
        # html_table() usually parses the type column as integer.
        df$type <- as.character(df$type)
}
pg <- dbConnect(PostgreSQL())
dbWriteTable(pg, c("edgar", "filing_docs_test"),
df, append = TRUE, row.names = FALSE)
dbDisconnect(pg)
return(TRUE)
}
namely, having it write to a test table, filing_docs_test (which I will delete later). I then ran
> system.time(temp <- lapply(file_names$file_name[1:1000], get_filing_docs))
user system elapsed
120.718 50.162 483.547
So it took roughly 8 minutes to process 1,000 filings, which translates to about 42 days for the full set of around 7.6 million filings, going by the elapsed time. It is also important to note that we have not used multiple cores here.
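For reference, the back-of-the-envelope extrapolation from the elapsed time is:

# Extrapolating the elapsed time above to the full set of ~7.6 million filings
483.547 / 1000               # ~0.48 seconds per filing
0.48 * 7.6e6 / (3600 * 24)   # ~42 days in total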
Nowadays, I try to avoid the df$type
syntax, as it's less portable.
Going parallel is tricky, as the SEC website will quickly boot you off or cause timeouts, and this leads to errors (I got errors on two cores out of eight using the code below).
I don't think we need to worry too much about this for now. I think if we just take a random sample of filings (say, 100,000), we could develop the code to process those and think about how to get the filings in the background. Worst case: It just takes a month to scrape the data.
It might be possible to speed things up by collecting more pages before writing to the database.
library(dplyr, warn.conflicts = FALSE)
library(RPostgreSQL, quietly = TRUE)
library(rvest, quietly = TRUE)
get_index_url <- function(file_name) {
matches <- stringr::str_match(file_name, "/(\\d+)/(\\d{10}-\\d{2}-\\d{6})")
cik <- matches[2]
acc_no <- matches[3]
path <- stringr::str_replace_all(acc_no, "[^\\d]", "")
url <- paste0("https://www.sec.gov/Archives/edgar/data/", cik, "/", path, "/",
acc_no, "-index.htm")
return(url)
}
fix_names <- function(df) {
colnames(df) <- tolower(colnames(df))
df
}
get_filing_docs <- function(file_name) {
head_url <- get_index_url(file_name)
table_nodes <-
read_html(head_url, encoding="Latin1") %>%
html_nodes("table")
if (length(table_nodes) < 1) {
df <- tibble(seq = NA, description = NA, document = NA, type = NA,
size = NA, file_name = file_name)
} else {
df <-
table_nodes %>%
html_table() %>%
bind_rows() %>%
fix_names() %>%
mutate(file_name = file_name,
type = as.character(type))
df
}
pg <- dbConnect(PostgreSQL())
dbWriteTable(pg, c("edgar", "filing_docs_test"),
df, append = TRUE, row.names = FALSE)
dbDisconnect(pg)
return(TRUE)
}
pg <- dbConnect(PostgreSQL())
filings <- tbl(pg, sql("SELECT * FROM edgar.filings"))
def14_a <-
filings %>%
filter(form_type %~% "^[345](/A)?$")
new_table <- !dbExistsTable(pg, c("edgar", "filing_docs"))
if (!new_table) {
filing_docs <- tbl(pg, sql("SELECT * FROM edgar.filing_docs"))
def14_a <- def14_a %>% anti_join(filing_docs, by = "file_name")
}
file_names <-
def14_a %>%
select(file_name) %>%
distinct() %>%
collect(n = 1000)
rs <- dbDisconnect(pg)
library(parallel)
system.time(temp <- mclapply(file_names$file_name, get_filing_docs, mc.cores = 8))
This is about 20 seconds for 100 filings. I think that writing the resulting data frame to PostgreSQL would not take long. So maybe some variant of this would be faster:
get_filing_docs <- function(file_name) {
head_url <- get_index_url(file_name)
table_nodes <-
read_html(head_url, encoding="Latin1") %>%
html_nodes("table")
if (length(table_nodes) < 1) {
df <- tibble(seq = NA, description = NA, document = NA, type = NA,
size = NA, file_name = file_name)
} else {
df <-
table_nodes %>%
html_table() %>%
bind_rows() %>%
fix_names() %>%
mutate(file_name = file_name,
type = as.character(type))
df
}
df
}
pg <- dbConnect(PostgreSQL())
filings <- tbl(pg, sql("SELECT * FROM edgar.filings"))
def14_a <-
filings %>%
filter(form_type %~% "^[345](/A)?$")
new_table <- !dbExistsTable(pg, c("edgar", "filing_docs"))
if (!new_table) {
filing_docs <- tbl(pg, sql("SELECT * FROM edgar.filing_docs"))
def14_a <- def14_a %>% anti_join(filing_docs, by = "file_name")
}
file_names <-
def14_a %>%
select(file_name) %>%
distinct() %>%
collect(n = 1000)
rs <- dbDisconnect(pg)
system.time(temp <- bind_rows(mclapply(file_names$file_name[1:100], get_filing_docs)))
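If that proves quick enough, the combined data frame could then be written to the database in a single call (a sketch, reusing the dbWriteTable() call and the filing_docs_test table from the earlier code; temp here is the result of the bind_rows() above):

# Sketch: one database write for the whole batch, instead of one write per filing
pg <- dbConnect(PostgreSQL())
dbWriteTable(pg, c("edgar", "filing_docs_test"),
             temp, append = TRUE, row.names = FALSE)
dbDisconnect(pg)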
@iangow
A simple ^[345](/A)?$ should cover all these cases.
I redefined fil345 using this, and it indeed does the job, yielding the same result.
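For reference, the redefinition is presumably just the earlier pipeline with the shorter regex:

# fil345 redefined with the simpler regex suggested above
fil345 <-
    filings %>%
    filter(form_type %~% "^[345](/A)?$")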
@iangow I guess this covers other issues as well as this one, but in the course of approaching it, I started thinking about turning get_filing_docs.R into a module/package called get_filing_doc_functions.R, since then we can use the same functions on the Forms 3, 4 and 5. To do this, I encapsulated the part of the code which downloads the filings into a function, get_filing_docs_by_type:
get_filing_docs_by_type <- function(type_regex) {
    # Takes a regular expression matching one or more form types and downloads
    # the filing documents for the matching filings into the database.
pg <- dbConnect(PostgreSQL())
filings <- tbl(pg, sql("SELECT * FROM edgar.filings"))
type_filings <-
filings %>%
filter(form_type %~% type_regex)
new_table <- !dbExistsTable(pg, c("edgar", "filing_docs"))
if (!new_table) {
filing_docs <- tbl(pg, sql("SELECT * FROM edgar.filing_docs"))
type_filings <- type_filings %>% anti_join(filing_docs, by = "file_name")
}
file_names <-
type_filings %>%
select(file_name) %>%
distinct() %>%
collect()
rs <- dbDisconnect(pg)
system.time(temp <- lapply(file_names$file_name, get_filing_docs))
if (new_table) {
pg <- dbConnect(PostgreSQL())
rs <- dbExecute(pg, "CREATE INDEX ON edgar.filing_docs (file_name)")
rs <- dbExecute(pg, "ALTER TABLE edgar.filing_docs OWNER TO edgar")
rs <- dbExecute(pg, "GRANT SELECT ON TABLE edgar.filing_docs TO edgar_access")
rs <- dbDisconnect(pg)
}
temp <- unlist(temp)
}
I'm then thinking of redefining get_filing_docs.R
to be this
#!/usr/bin/env Rscript
library(dplyr, warn.conflicts = FALSE)
library(RPostgreSQL, quietly = TRUE)
library(rvest, quietly = TRUE)
source("get_filing_doc_functions.R")
def14_a_regex <- "^(10-K|DEF 14|8-K)"
time_taken <- system.time(get_filing_docs_by_type(def14_a_regex))
print(time_taken)
(we may want to call this something else, since this just does the 10-K's, DEF 14's and 8-K's).
What do you think?
@iangow I have just made a couple of commits improving on my suggestion above. I have named the new programs get_filing_doc_functions.R and get_form_345_filing_docs.R.
In get_filing_doc_functions.R, I essentially split the main program part of get_filing_docs.R into two functions. The function get_filings_by_type extracts from edgar.filings the file names of the filings whose form_type matches type_regex and which do not already appear in edgar.filing_docs. The function process_filings can then take a subset of these filings and apply get_filing_docs to them, which allows us to process the set in batches, as can be seen in get_form_345_filing_docs.R.
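As a rough sketch of how these pieces might fit together (the real definitions are in get_filing_doc_functions.R; this simply reuses the filter/anti_join/collect pattern shown earlier, and the batch driver is hypothetical):

get_filings_by_type <- function(type_regex) {
    # Pull the file names of filings matching type_regex that are not yet in
    # edgar.filing_docs.
    pg <- dbConnect(PostgreSQL())
    filings <- tbl(pg, sql("SELECT * FROM edgar.filings"))
    type_filings <-
        filings %>%
        filter(form_type %~% type_regex)
    if (dbExistsTable(pg, c("edgar", "filing_docs"))) {
        filing_docs <- tbl(pg, sql("SELECT * FROM edgar.filing_docs"))
        type_filings <- type_filings %>% anti_join(filing_docs, by = "file_name")
    }
    file_names <-
        type_filings %>%
        select(file_name) %>%
        distinct() %>%
        collect()
    dbDisconnect(pg)
    file_names
}

# Hypothetical batch driver in the style of get_form_345_filing_docs.R:
# process the outstanding filings in batches of 200.
fil345 <- get_filings_by_type("^[345](/A)?$")
batch_size <- 200
batches <- split(seq_len(nrow(fil345)), ceiling(seq_len(nrow(fil345)) / batch_size))
for (idx in batches) {
    process_filings(fil345[idx, ])
}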
@iangow Over yesterday and today, I have had a go at running get_form_345_filing_docs.R several times, and have taken note of how long it is taking to download the data. My times were consistent with the code you put above (around 20 seconds for 100 filings, i.e. 0.2 seconds per filing), which translates to about 17-18 days for the full set.
Furthermore, I think it would be useful to have some try/catch mechanism in the function get_filing_docs, since I keep getting HTML timeouts relatively early on (around 50-60 batches of 200 filings in).
@iangow I just tried process_filings
defined as
process_filings <- function(filings_df) {
pg <- dbConnect(PostgreSQL())
new_table <- !dbExistsTable(pg, c("edgar", "filing_docs"))
system.time(temp <- mclapply(filings_df$file_name, get_filing_docs, mc.cores = 24))
if (new_table) {
rs <- dbExecute(pg, "CREATE INDEX ON edgar.filing_docs (file_name)")
rs <- dbExecute(pg, "ALTER TABLE edgar.filing_docs OWNER TO edgar")
rs <- dbExecute(pg, "GRANT SELECT ON TABLE edgar.filing_docs TO edgar_access")
}
rs <- dbDisconnect(pg)
temp <- unlist(temp)
return(temp)
}
on an unprocessed sample of 200 filings, with get_filing_docs having the new definition I gave it this morning, including the try/catch block:
get_filing_docs <- function(file_name) {
    tryCatch({
        head_url <- get_index_url(file_name)
        table_nodes <-
            read_html(head_url, encoding = "Latin1") %>%
            html_nodes("table")
        if (length(table_nodes) < 1) {
            df <- tibble(seq = NA, description = NA, document = NA, type = NA,
                         size = NA, file_name = file_name)
        } else {
            df <-
                table_nodes %>%
                html_table() %>%
                bind_rows() %>%
                fix_names() %>%
                mutate(file_name = file_name,
                       type = as.character(type))
        }
        pg <- dbConnect(PostgreSQL())
        dbWriteTable(pg, c("edgar", "filing_docs"),
                     df, append = TRUE, row.names = FALSE)
        dbDisconnect(pg)
        TRUE
    }, error = function(e) FALSE)
}
Hence, get_filing_docs returns FALSE if there is an HTML error.
Using process_filings
> system.time(temp <- process_filings(fil345[14601:14800, ]))
user system elapsed
156.760 81.584 29.555
I got that
> 29.555/200
[1] 0.147775
> 0.147775 * 7600000
[1] 1123090
> 1123090/(3600 * 24)
[1] 12.99873
the program would take just under 13 days to run through the full set, with a failure-to-download rate of 6.5%:
> sum(temp)
[1] 187
> sum(temp)/length(temp)
[1] 0.935
Note that I made a mistake before when calculating the time it would take with lapply, since I had an error in defining the starting and ending points of each batch of 200 filings (see one of my commits). After correcting the mistake, I found it was not consistent with 0.2 seconds per filing, but rather 0.5 seconds per filing, which would take roughly 7 weeks. Using mclapply with 24 cores provides a significant speed-up on this, though at the price of an increased failure rate. I don't think the failure rate is too large, however, and it might be useful to just do a run through the whole set, knowing we will have to repeat the program on the filings which have not been processed. What do you think?
@iangow I have just been running the program using 24 cores; every 10,000 filings the program prints out how many have been successfully processed, as well as the total running time. So far I have got
[1] "9261 processed out of 10000"
[1] "Total time taken: \n"
user system elapsed
8837.529 4781.405 1631.536
[1] "18525 processed out of 20000"
[1] "Total time taken: \n"
user system elapsed
18078.734 9763.072 3245.932
Hence, it is taking about 27 minutes for each 10,000 filings, with a failure rate of around 7.5%.
7e6 / 1e4 = 700 batches, at roughly half an hour each => ~350 hours => ~15 days. But perhaps less if we focus on the ones we expect to have .xml documents.
@iangow I've been letting the program go for nearly a day now, and this
SELECT COUNT(DISTINCT file_name) FROM edgar.filing_docs WHERE type in ('3', '4', '5')
gave 501822, so half a million filings have been successfully processed. I am happy to just let it continue for the moment. I intend to come in on Friday to continue with some of the other tasks; it should be interesting to see the progress it has made by then.
@iangow Just another update. Doing this
SELECT COUNT(DISTINCT file_name) FROM edgar.filing_docs WHERE file_name IN (SELECT file_name FROM edgar.filings WHERE form_type IN ('3', '4', '5', '3/A', '4/A', '5/A'))
gave 4300390, while a little beforehand this
SELECT COUNT(*) FROM edgar.filing_docs WHERE file_name IN (SELECT file_name FROM edgar.filings WHERE form_type IN ('3', '4', '5', '3/A', '4/A', '5/A'))
gave 13132751. So 4.3 million filings have been processed, with about 3 documents per filing on average, meaning there should be around 23 million documents in edgar.filing_docs
by the time all filings have been processed.
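For reference, the arithmetic behind that estimate:

# Rough arithmetic behind the estimate above
13132751 / 4300390   # ~3.05 documents per filing so far
3.05 * 7.6e6         # ~23 million documents expected in total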
@iangow OK, the program has just finished; I ran it until it could not process any more filings. There turned out to be 13 filings which the program could not process:
urls345 <- unlist(lapply(fil345$file_name, get_index_url))
> urls345
[1] "https://www.sec.gov/Archives/edgar/data/1187816/000172385417000007/0001723854-17-000007-index.htm"
[2] "https://www.sec.gov/Archives/edgar/data/1437790/000143779508000004/0001437795-08-000004-index.htm"
[3] "https://www.sec.gov/Archives/edgar/data/1437793/000143779508000004/0001437795-08-000004-index.htm"
[4] "https://www.sec.gov/Archives/edgar/data/1437794/000143779508000004/0001437795-08-000004-index.htm"
[5] "https://www.sec.gov/Archives/edgar/data/1437795/000143779508000004/0001437795-08-000004-index.htm"
[6] "https://www.sec.gov/Archives/edgar/data/1466085/000120919118031834/0001209191-18-031834-index.htm"
[7] "https://www.sec.gov/Archives/edgar/data/1466085/000120919118031841/0001209191-18-031841-index.htm"
[8] "https://www.sec.gov/Archives/edgar/data/1466085/000120919118031846/0001209191-18-031846-index.htm"
[9] "https://www.sec.gov/Archives/edgar/data/1466085/000120919118031851/0001209191-18-031851-index.htm"
[10] "https://www.sec.gov/Archives/edgar/data/1466085/000120919118031852/0001209191-18-031852-index.htm"
[11] "https://www.sec.gov/Archives/edgar/data/1466085/000120919118031856/0001209191-18-031856-index.htm"
[12] "https://www.sec.gov/Archives/edgar/data/1716885/000172385417000007/0001723854-17-000007-index.htm"
[13] "https://www.sec.gov/Archives/edgar/data/789863/000143779508000004/0001437795-08-000004-index.htm"
Most of these give a page which states that the filing is "temporarily unavailable", but there are some others (e.g. this) which lead to filings with multiple tables.
I guess we could open a small issue for these filings, and we need another one on updating the filings continuously, but other than that I think we can consider this issue closed.
@iangow I think we can close this issue, as it was about the generic problem of scraping Form 3, 4 and 5 filings, and getting the bulk of them. We could open new issues to do an update and set up a cronjob.
https://github.com/iangow-public/edgar/blob/97810d5248b9731e399b4876797107468c45ac68/get_filing_docs.R#L51
At a simple level, this is just a matter of editing the line above to include the relevant forms and running the code. But it might be worth testing on a sample to estimate how long this will take.
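As a sketch of the kind of change involved (the combined regex below is an assumption based on the patterns used earlier in this thread, not the actual contents of that line):

# Hypothetical combined regex: the forms already processed plus Forms 3, 4, 5
# and their amendments.
form_regex <- "^(10-K|DEF 14|8-K)|^[345](/A)?$"
grepl(form_regex, c("10-K", "DEF 14A", "8-K", "4", "3/A", "SC 13D"))
# TRUE TRUE TRUE TRUE TRUE FALSE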