Closed jamespkav closed 5 years ago
@jamespkav @iangow
I have just downloaded the htm documents for the 8-K's and 8-K/A's containing item 5. I first amended get_item_nos.R
to get the item_no
for the 8-K/A's and any 8-K's for which that hadn't been done, and then ran it. Afterwards, on Friday, I ran an amended version of get_filing_docs.R
which just downloads the document metadata for item 5's
#!/usr/bin/env Rscript
library(dplyr, warn.conflicts = FALSE)
library(RPostgreSQL, quietly = TRUE)
library(rvest, quietly = TRUE)
source(get_filing_doc_functions.R)
pg <- dbConnect(PostgreSQL())
item5 <- tbl(pg, sql("SELECT DISTINCT file_name FROM edgar.item_no WHERE left(item_no, 1) = '5'"))
filing_docs <- tbl(pg, sql("SELECT file_name FROM edgar.filing_docs"))
item5_to_process <- item5 %>% anti_join(filing_docs, by = "file_name") %>% collect()
num_filings <- dim(item5_to_process)[1]
batch_size <- 200
num_batches <- ceiling(num_filings/batch_size)
total_time <- 0
num_success <- 0
for(i in 1:num_batches) {
start <- (i - 1) * batch_size + 1
if(i < num_batches) {
finish <- i * batch_size
batch <- item5_to_process[start:finish, ]
} else {
batch <- item5_to_process[start:num_filings, ]
}
time_taken <- system.time(temp <- process_filings(batch))
num_success <- num_success + sum(temp)
total_time <- total_time + time_taken
print(paste0(num_success, " processed out of ", i * batch_size))
print("Total time taken: \n")
print(total_time)
}
Having done this, I made a version of download_filing_docs.R
which used this function
get_item5_file_list <- function(num_files = Inf) {
pg <- dbConnect(PostgreSQL())
filing_docs <- tbl(pg, sql("SELECT * FROM edgar.filing_docs"))
filing_docs_processed <- tbl(pg, sql("SELECT * FROM edgar.filing_docs_processed"))
item5 <- tbl(pg, sql("SELECT DISTINCT file_name FROM edgar.item_no WHERE left(item_no, 1) = '5'"))
filing_docs_to_get <- filing_docs %>% inner_join(item5, by = "file_name")
get_file_path <- function(file_name, document) {
url <- gsub("(\\d{10})-(\\d{2})-(\\d{6})\\.txt", "\\1\\2\\3", file_name)
file.path(url, document)
}
new_table <- !dbExistsTable(pg, c("edgar", "filing_docs_processed"))
if (!new_table) {
files <-
filing_docs_to_get %>%
anti_join(filing_docs_processed)
} else {
files <- filing_docs_to_get
}
files <-
files %>%
filter(document %~*% "htm$") %>%
collect(n = num_files) %>%
mutate(html_link = get_file_path(file_name, document))
dbDisconnect(pg)
return(files)
}
to get the file lists, and using the other functions from that program to download the files from these lists. To check the progress, I made the following lazy queries
filing_docs <- tbl(pg, sql("SELECT * FROM edgar.filing_docs"))
filing_docs_processed <- tbl(pg, sql("SELECT * FROM edgar.filing_docs_processed"))
item5 <- tbl(pg, sql("SELECT DISTINCT file_name FROM edgar.item_no WHERE left(item_no, 1) = '5'"))
item_5_docs_to_get <- item5 %>% inner_join(filing_docs, by = "file_name") %>% anti_join(filing_docs_processed, by = "file_name")
Doing
item_5_docs_to_get %>% filter(document %~*% "htm$") %>% count()
initially resulted in a number around 28000. After running the program, doing the same program resulted in 0, so all item 5's are in to the extent that edgar.filings
is updated.
@iangow I tried to update edgar.filings
before using get_filings.R
, and I got the error message
bdcallen@igow-z640:~/edgar$ ./get_filings.R
Loading required package: methods
Error in postgresqlExecStatement(conn, statement, ...) :
RS-DBI driver: (could not Retrieve the result : ERROR: must be owner of relation index_last_modified_new
)
Error in postgresqlpqExec(new.con, sql4) :
RS-DBI driver: (could not Retrieve the result : ERROR: permission denied for relation index_last_modified_new
)
Calls: dbWriteTable ... dbWriteTable -> postgresqlWriteTable -> postgresqlpqExec -> .Call
In addition: Warning message:
In postgresqlQuickSQL(conn, statement, ...) :
Could not create execute: DROP TABLE "index_last_modified_new"
Execution halted
It seems that we have another permissions problem here.
OK.
crsp=# ALTER TABLE edgar.index_last_modified_new OWNER TO edgar;
ALTER TABLE
crsp=#
Please check whether we need both index_last_modified_new
and index_last_modified
.
@bdcallen @iangow thanks for helping out on this. Much appreciated.
@bdcallen , when able, can you please arrange to download all 8-Ks containing Item 5?
@iangow