mccgr / edgar

Code to manage data related to SEC EDGAR
31 stars 15 forks source link

Download Item 5 #33

Closed jamespkav closed 5 years ago

jamespkav commented 5 years ago

@bdcallen , when able, can you please arrange to download all 8-Ks containing Item 5?

@iangow

bdcallen commented 5 years ago

@jamespkav @iangow

I have just downloaded the htm documents for the 8-K's and 8-K/A's containing item 5. I first amended get_item_nos.R to get the item_no for the 8-K/A's and any 8-K's for which that hadn't been done, and then ran it. Afterwards, on Friday, I ran an amended version of get_filing_docs.R which just downloads the document metadata for item 5's

#!/usr/bin/env Rscript
library(dplyr, warn.conflicts = FALSE)
library(RPostgreSQL, quietly = TRUE)
library(rvest, quietly = TRUE)
source(get_filing_doc_functions.R)

pg <- dbConnect(PostgreSQL())

item5 <- tbl(pg, sql("SELECT DISTINCT file_name FROM edgar.item_no WHERE left(item_no, 1) = '5'"))
filing_docs <- tbl(pg, sql("SELECT file_name FROM edgar.filing_docs"))

item5_to_process <- item5 %>% anti_join(filing_docs, by = "file_name") %>% collect()

num_filings <- dim(item5_to_process)[1]
batch_size <- 200

num_batches <- ceiling(num_filings/batch_size)
total_time <- 0
num_success <- 0

for(i in 1:num_batches) {

    start <- (i - 1) * batch_size + 1

    if(i < num_batches) {

        finish <- i * batch_size
        batch <- item5_to_process[start:finish, ]

    } else {

        batch <- item5_to_process[start:num_filings, ]

    }

    time_taken <- system.time(temp <- process_filings(batch))
    num_success <- num_success + sum(temp)
    total_time <- total_time + time_taken

    print(paste0(num_success, " processed out of ", i * batch_size))
    print("Total time taken: \n")
    print(total_time)

}

Having done this, I made a version of download_filing_docs.R which used this function

get_item5_file_list <- function(num_files = Inf) {

    pg <- dbConnect(PostgreSQL())

    filing_docs  <- tbl(pg, sql("SELECT * FROM edgar.filing_docs"))
    filing_docs_processed <- tbl(pg, sql("SELECT * FROM edgar.filing_docs_processed"))
    item5 <- tbl(pg, sql("SELECT DISTINCT file_name FROM edgar.item_no WHERE left(item_no, 1) = '5'"))

    filing_docs_to_get <- filing_docs %>% inner_join(item5, by = "file_name")

    get_file_path <- function(file_name, document) {
        url <- gsub("(\\d{10})-(\\d{2})-(\\d{6})\\.txt", "\\1\\2\\3", file_name)
        file.path(url, document)
    }

    new_table <- !dbExistsTable(pg, c("edgar", "filing_docs_processed"))
    if (!new_table) {
        files <-
            filing_docs_to_get  %>%
            anti_join(filing_docs_processed)
    } else {
        files <- filing_docs_to_get
    }

    files <-
        files %>%
        filter(document %~*% "htm$") %>%
        collect(n = num_files) %>%
        mutate(html_link = get_file_path(file_name, document))

    dbDisconnect(pg)

    return(files)
}

to get the file lists, and using the other functions from that program to download the files from these lists. To check the progress, I made the following lazy queries

filing_docs  <- tbl(pg, sql("SELECT * FROM edgar.filing_docs"))
filing_docs_processed <- tbl(pg, sql("SELECT * FROM edgar.filing_docs_processed"))
item5 <- tbl(pg, sql("SELECT DISTINCT file_name FROM edgar.item_no WHERE left(item_no, 1) = '5'"))
item_5_docs_to_get <- item5 %>% inner_join(filing_docs, by = "file_name") %>% anti_join(filing_docs_processed, by = "file_name")

Doing

item_5_docs_to_get %>% filter(document %~*% "htm$") %>% count()

initially resulted in a number around 28000. After running the program, doing the same program resulted in 0, so all item 5's are in to the extent that edgar.filings is updated.

bdcallen commented 5 years ago

@iangow I tried to update edgar.filings before using get_filings.R, and I got the error message

bdcallen@igow-z640:~/edgar$ ./get_filings.R
Loading required package: methods
Error in postgresqlExecStatement(conn, statement, ...) :
  RS-DBI driver: (could not Retrieve the result : ERROR:  must be owner of relation index_last_modified_new
)
Error in postgresqlpqExec(new.con, sql4) :
  RS-DBI driver: (could not Retrieve the result : ERROR:  permission denied for relation index_last_modified_new
)
Calls: dbWriteTable ... dbWriteTable -> postgresqlWriteTable -> postgresqlpqExec -> .Call
In addition: Warning message:
In postgresqlQuickSQL(conn, statement, ...) :
  Could not create execute: DROP TABLE "index_last_modified_new"
Execution halted

It seems that we have another permissions problem here.

iangow commented 5 years ago

OK.

crsp=# ALTER TABLE edgar.index_last_modified_new OWNER TO edgar;
ALTER TABLE
crsp=# 

Please check whether we need both index_last_modified_new and index_last_modified.

jamespkav commented 5 years ago

@bdcallen @iangow thanks for helping out on this. Much appreciated.