Use download_filing_docs.R to create partial mirror of EDGAR

iangow commented 6 years ago

This will be (part of) a big task. For now, I think we just need to clean up the code here. I think the goal should be a function that takes a file_name and downloads all the filing documents associated with that file_name to the extent that they're not already present.

Most of the pieces are already here, so this will mostly be about cleaning up and documenting the code.

bdcallen commented 6 years ago

@iangow I found out that some of the entries in edgar.filing_docs_processed has a sizeable number of entries with downloaded = FALSE. I entered the following code today

failed_to_download <- tbl(pg, sql("SELECT * FROM edgar.filing_docs_processed WHERE downloaded = FALSE"))
failed_to_download <- data.frame(failed_to_download)
head(failed_to_download)

which produced

                                    file_name        document downloaded
1 edgar/data/1004945/0001178913-14-001615.txt   zk1414915.htm      FALSE
2 edgar/data/1005531/0000950117-06-001586.txt      a41739.htm      FALSE
3 edgar/data/1005531/0000950117-06-004023.txt      ex99-1.htm      FALSE
4 edgar/data/1006224/0000899797-03-000459.txt  exhibit_99.htm      FALSE
5 edgar/data/1006224/0000899797-04-000406.txt exhibit99_1.htm      FALSE
6 edgar/data/1007023/0001062993-07-002172.txt      form6k.htm      FALSE

It seems the old code just sticks files for which the download has failed straight into edgar.filing_docs_processed. This means if we run over the file names again, these files will be ignored, since the result of get_file_list returns the anti_join of the edgar.filing_docs table and the full edgar.filing_docs_processed table, regardless of the downloaded value. Should we amend this, or deal with the false values some other way?

bdcallen commented 6 years ago

@iangow In light of this issue, I have written the function

get_filing_file_list <- function(file_name, num_files = Inf) {

    pg <- dbConnect(PostgreSQL())

    filing_docs  <- tbl(pg, sql(paste0("SELECT * FROM edgar.filing_docs WHERE file_name = '", file_name, "'")))
    filing_docs_processed <- tbl(pg, sql(paste0("SELECT * FROM edgar.filing_docs_processed WHERE file_name = '", file_name, "'", " AND downloaded = TRUE")))

    get_file_path <- function(file_name, document) {
        url <- gsub("(\\d{10})-(\\d{2})-(\\d{6})\\.txt", "\\1\\2\\3", file_name)
        file.path(url, document)
    }

    new_table <- !dbExistsTable(pg, c("edgar", "filing_docs_processed"))
    if (!new_table) {
        files <-
            filing_docs  %>%
            anti_join(filing_docs_processed)
    } else {
        files <- filing_docs
    }

    files <-
        files %>%
        filter(document %~*% "htm$") %>%
        collect(n = num_files) %>%
        mutate(html_link = get_file_path(file_name, document))

    dbDisconnect(pg)

    return(files)
}

which finds all the filing documents for a given file_name which have yet to be downloaded, including files for which downloaded = FALSE in edgar.filing_docs_processed.

Using this function on the input "edgar/data/1084869/0001437749-14-016921.txt" yielded

# A tibble: 7 x 7
    seq description  document             type     size file_name                                   html_link                                                 
  <int> <chr>        <chr>                <chr>   <int> <chr>                                       <chr>                                                     
1     1 FORM 10-K    flws20140629_10k.htm 10-K  2832556 edgar/data/1084869/0001437749-14-016921.txt edgar/data/1084869/000143774914016921/flws20140629_10k.htm
2     2 EXHIBIT 21.1 ex21-1.htm           EX-21    8406 edgar/data/1084869/0001437749-14-016921.txt edgar/data/1084869/000143774914016921/ex21-1.htm          
3     3 EXHIBIT 23.1 ex23-1.htm           EX-23    4336 edgar/data/1084869/0001437749-14-016921.txt edgar/data/1084869/000143774914016921/ex23-1.htm          
4     4 EXHIBIT 23.2 ex23-2.htm           EX-23    7594 edgar/data/1084869/0001437749-14-016921.txt edgar/data/1084869/000143774914016921/ex23-2.htm          
5     5 EXHIBIT 31.1 ex31-1.htm           EX-31   15783 edgar/data/1084869/0001437749-14-016921.txt edgar/data/1084869/000143774914016921/ex31-1.htm          
6     6 EXHIBIT 31.2 ex31-2.htm           EX-31   16288 edgar/data/1084869/0001437749-14-016921.txt edgar/data/1084869/000143774914016921/ex31-2.htm          
7     7 EXHIBIT 32.1 ex32-1.htm           EX-32    6938 edgar/data/1084869/0001437749-14-016921.txt edgar/data/1084869/000143774914016921/ex32-1.htm

iangow commented 6 years ago

I think the reasoning here (my reasoning, but it was some time ago) is that the filing_docs_processed table captures filings that created problems so that we don’t try to process them again. Unless we have changed the code, we will just get errors and gain nothing.

That said, if we take the anti-join of filing_docs with filing_docs_processed, we can identify problematic filings and perhaps tweak the code to attempt to process these again. Though, again, we’d want the ones that aren’t processed to be recorded in filing_docs_processed.

Does that make sense?

bdcallen commented 6 years ago

@iangow I just tested both of my functions on a random input file_name. I changed get_filing_file_list to

get_filing_file_list <- function(file_name, num_files = Inf) {

    pg <- dbConnect(PostgreSQL())

    filing_docs  <- tbl(pg, sql(paste0("SELECT * FROM edgar.filing_docs WHERE file_name = '", file_name, "'")))
    filing_docs_processed <- tbl(pg, sql(paste0("SELECT * FROM edgar.filing_docs_processed WHERE file_name = '", file_name, "'")))

    get_file_path <- function(file_name, document) {
        url <- gsub("(\\d{10})-(\\d{2})-(\\d{6})\\.txt", "\\1\\2\\3", file_name)
        file.path(url, document)
    }

    new_table <- !dbExistsTable(pg, c("edgar", "filing_docs_processed"))
    if (!new_table) {
        files <-
            filing_docs  %>%
            anti_join(filing_docs_processed)
    } else {
        files <- filing_docs
    }

    files <-
        files %>%
        filter(document %~*% "htm$") %>%
        collect(n = num_files) %>%
        mutate(html_link = get_file_path(file_name, document))

    dbDisconnect(pg)

    return(files)
}

and also defined the function download_filing_files

download_filing_files <- function(file_name, max_files = Inf) {

    pg <- dbConnect(PostgreSQL())
    new_table <- !dbExistsTable(pg, c("edgar", "filing_docs_processed"))
    dbDisconnect(pg)
    while (nrow(files <- get_filing_file_list(file_name, num_files = max_files))>0) {
        print("Getting files...")
        st <- system.time(files$downloaded <-
                              unlist(lapply(files$html_link, get_filing_docs)))

        print(sprintf("Downloaded %d files in %3.2f seconds",
                      nrow(files), st[["elapsed"]]))

        downloaded_files <-
            files %>%
            select(file_name, document, downloaded)

        pg <- dbConnect(PostgreSQL())
        dbWriteTable(pg, c("edgar", "filing_docs_processed"), downloaded_files,
                     append = !new_table,
                     row.names = FALSE)
        if (new_table) {
            dbGetQuery(pg, "CREATE INDEX ON edgar.filing_docs_processed (file_name)")
            dbGetQuery(pg, "ALTER TABLE edgar.filing_docs_processed OWNER TO edgar")
            dbGetQuery(pg, "GRANT SELECT ON TABLE edgar.filing_docs_processed TO edgar_access")
            new_table <- FALSE
        }
        dbDisconnect(pg)

}

}

To run a test, I first ran get_file_list with the command

get_file_list(num_files = 20)

and got the result

# A tibble: 20 x 7
     seq description                                                                      document                       type     size file_name       html_link        
   <int> <chr>                                                                            <chr>                          <chr>   <int> <chr>           <chr>            
 1     2 NEWS RELEASE DATED FEBRUARY 25, 2015                                             exhibit991tcc02252015.htm      EX-99.1 17693 edgar/data/123… edgar/data/12323…
 2     2 NEWS RELEASE DATED MAY 1, 2015                                                   exhibit99105012015.htm         EX-99.1 26539 edgar/data/123… edgar/data/12323…
 3     1 ""                                                                               lumina6k1222.htm               6-K     12538 edgar/data/123… edgar/data/12391…
 4     2 ""                                                                               oct2005.htm                    EX-1     8791 edgar/data/123… edgar/data/12391…
 5     4 FORM 52-109F2 - CERTIFICATION OF INTERIM FILINGS DURING TRANSITION PERIOD - PRES exhibit99-3.htm                EX-99.3  3661 edgar/data/123… edgar/data/12396…
 6     2 FORM OF PROXY                                                                    exhibit99-1.htm                EX-99.1   780 edgar/data/123… edgar/data/12396…
 7     3 MANAGEMENT DISCUSSION AND ANALYSIS                                               exhibit99-2.htm                EX-99.2 83799 edgar/data/123… edgar/data/12396…
 8     4 CERTIFICATION - CEO                                                              exhibit99-3.htm                EX-99.3  6884 edgar/data/123… edgar/data/12396…
 9     3 ""                                                                               arcelormittal6k-ex992_1204.htm EX-99.2 21397 edgar/data/124… edgar/data/12434…
10     2 ""                                                                               arcelor6kex991_0602.htm        EX-99.1  7238 edgar/data/124… edgar/data/12434…
11     2 ""                                                                               arcelor6kex991_0614.htm        EX-99.1 44524 edgar/data/124… edgar/data/12434…
12     5 EXHIBIT 5.2                                                                      d469451dex52.htm               EX-5.2   9558 edgar/data/124… edgar/data/12434…
13     1 FORM 6-K                                                                         d6k.htm                        6-K     11856 edgar/data/124… edgar/data/12434…
14     2 EXHIBIT 99.1                                                                     dex991.htm                     EX-99.1  5937 edgar/data/124… edgar/data/12434…
15     6 EXHIBIT 99.5                                                                     dex995.htm                     EX-99.5 17470 edgar/data/124… edgar/data/12434…
16     3 EXHIBIT 99.2                                                                     dex992.htm                     EX-99.2  6658 edgar/data/124… edgar/data/12434…
17     1 6-K                                                                              a2188836z6-k.htm               6-K     10803 edgar/data/125… edgar/data/12525…
18     2 EX-99.1                                                                          a08-8153_5ex99d1.htm           EX-99.1 12790 edgar/data/125… edgar/data/12525…
19     1 6-K                                                                              a08-21445_16k.htm              6-K     19774 edgar/data/125… edgar/data/12525…
20     1 6-K                                                                              a10-10243_26k.htm              6-K     20245 edgar/data/125… edgar/data/12525…

and then ran

get_file_list(num_files = 20)$file_name[1]

to get the first file_name off the list

[1] "edgar/data/1232384/0001263995-15-000014.txt"

Then, running

get_filing_file_list("edgar/data/1232384/0001263995-15-000014.txt")

I got

# A tibble: 2 x 7
    seq description                                              document                  type     size file_name                                   html_link          
  <int> <chr>                                                    <chr>                     <chr>   <int> <chr>                                       <chr>              
1     1 TRANSCANADA CORPORATION FORM 6-K DATED FEBRUARY 25, 2015 form6ktcc02252015.htm     6-K     16539 edgar/data/1232384/0001263995-15-000014.txt edgar/data/1232384…
2     2 NEWS RELEASE DATED FEBRUARY 25, 2015                     exhibit991tcc02252015.htm EX-99.1 17693 edgar/data/1232384/0001263995-15-000014.txt edgar/data/1232384…

so get_filing_file_list works. I then tested dowload_filing_files on this file

download_filing_files("edgar/data/1232384/0001263995-15-000014.txt")

obtaining the output

Joining, by = c("document", "file_name")
[1] "Getting files..."
[1] "Downloaded 2 files in 1.22 seconds"
Joining, by = c("document", "file_name")
 Error in mutate_impl(.data, dots) : 
  Evaluation error: object 'document' not found.

I then did

get_filing_file_list("edgar/data/1232384/0001263995-15-000014.txt")

Joining, by = c("document", "file_name")
 Error in mutate_impl(.data, dots) : 
  Evaluation error: object 'document' not found.

and then

get_file_list(num_files = 20)

which produced

Joining, by = c("document", "file_name")
# A tibble: 20 x 7
     seq description                                                                      document                       type     size file_name       html_link        
   <int> <chr>                                                                            <chr>                          <chr>   <int> <chr>           <chr>            
 1     2 NEWS RELEASE DATED MAY 1, 2015                                                   exhibit99105012015.htm         EX-99.1 26539 edgar/data/123… edgar/data/12323…
 2     1 ""                                                                               lumina6k1222.htm               6-K     12538 edgar/data/123… edgar/data/12391…
 3     2 ""                                                                               oct2005.htm                    EX-1     8791 edgar/data/123… edgar/data/12391…
 4     4 FORM 52-109F2 - CERTIFICATION OF INTERIM FILINGS DURING TRANSITION PERIOD - PRES exhibit99-3.htm                EX-99.3  3661 edgar/data/123… edgar/data/12396…
 5     2 FORM OF PROXY                                                                    exhibit99-1.htm                EX-99.1   780 edgar/data/123… edgar/data/12396…
 6     3 MANAGEMENT DISCUSSION AND ANALYSIS                                               exhibit99-2.htm                EX-99.2 83799 edgar/data/123… edgar/data/12396…
 7     4 CERTIFICATION - CEO                                                              exhibit99-3.htm                EX-99.3  6884 edgar/data/123… edgar/data/12396…
 8     3 ""                                                                               arcelormittal6k-ex992_1204.htm EX-99.2 21397 edgar/data/124… edgar/data/12434…
 9     2 ""                                                                               arcelor6kex991_0602.htm        EX-99.1  7238 edgar/data/124… edgar/data/12434…
10     2 ""                                                                               arcelor6kex991_0614.htm        EX-99.1 44524 edgar/data/124… edgar/data/12434…
11     5 EXHIBIT 5.2                                                                      d469451dex52.htm               EX-5.2   9558 edgar/data/124… edgar/data/12434…
12     1 FORM 6-K                                                                         d6k.htm                        6-K     11856 edgar/data/124… edgar/data/12434…
13     2 EXHIBIT 99.1                                                                     dex991.htm                     EX-99.1  5937 edgar/data/124… edgar/data/12434…
14     6 EXHIBIT 99.5                                                                     dex995.htm                     EX-99.5 17470 edgar/data/124… edgar/data/12434…
15     3 EXHIBIT 99.2                                                                     dex992.htm                     EX-99.2  6658 edgar/data/124… edgar/data/12434…
16     1 6-K                                                                              a2188836z6-k.htm               6-K     10803 edgar/data/125… edgar/data/12525…
17     2 EX-99.1                                                                          a08-8153_5ex99d1.htm           EX-99.1 12790 edgar/data/125… edgar/data/12525…
18     1 6-K                                                                              a08-21445_16k.htm              6-K     19774 edgar/data/125… edgar/data/12525…
19     1 6-K                                                                              a10-10243_26k.htm              6-K     20245 edgar/data/125… edgar/data/12525…
20     2 EX-15                                                                            l39350exv15.htm                EX-15    7371 edgar/data/125… edgar/data/12578…

Finally, I performed the SQL query

tbl(pg, sql(paste0("SELECT * FROM edgar.filing_docs_processed WHERE file_name = '", "edgar/data/1232384/0001263995-15-000014.txt", "'")))

getting

# Database: postgres 9.6.8 [bdcallen@10.101.13.99:5432/crsp]
  file_name                                   document                  downloaded
  <chr>                                       <chr>                     <lgl>     
1 edgar/data/1232384/0001263995-15-000014.txt form6ktcc02252015.htm     TRUE      
2 edgar/data/1232384/0001263995-15-000014.txt exhibit991tcc02252015.htm TRUE

So in other words, the files associated with the test case have been put into edgar.filing_docs_processed, and thus removed from the list of files to be processed produced by get_file_list. Furthermore, both the associated filing documents have dowloaded being equal to TRUE. I also finally checked

browseURL(file.path("/home/shared", "edgar/data/1232384/000126399515000014/form6ktcc02252015.htm"))

and

browseURL(file.path("/home/shared", "edgar/data/1232384/000126399515000014/exhibit991tcc02252015.htm"))

with both working in my browser, so the documents have been successfully downloaded.

So we have a function which downloads the documents by file_name now, though as we discussed, it is probably more prudent to download by form_type. Anyway, I will commit the code now.

mccgr / edgar

Use download_filing_docs.R to create partial mirror of EDGAR #15