Closed iangow closed 6 years ago
@iangow I found out that some of the entries in edgar.filing_docs_processed
has a sizeable number of entries with downloaded = FALSE
. I entered the following code today
failed_to_download <- tbl(pg, sql("SELECT * FROM edgar.filing_docs_processed WHERE downloaded = FALSE"))
failed_to_download <- data.frame(failed_to_download)
head(failed_to_download)
which produced
file_name document downloaded
1 edgar/data/1004945/0001178913-14-001615.txt zk1414915.htm FALSE
2 edgar/data/1005531/0000950117-06-001586.txt a41739.htm FALSE
3 edgar/data/1005531/0000950117-06-004023.txt ex99-1.htm FALSE
4 edgar/data/1006224/0000899797-03-000459.txt exhibit_99.htm FALSE
5 edgar/data/1006224/0000899797-04-000406.txt exhibit99_1.htm FALSE
6 edgar/data/1007023/0001062993-07-002172.txt form6k.htm FALSE
It seems the old code just sticks files for which the download has failed straight into edgar.filing_docs_processed
. This means if we run over the file names again, these files will be ignored, since the result of get_file_list returns the anti_join of the edgar.filing_docs
table and the full edgar.filing_docs_processed
table, regardless of the downloaded
value. Should we amend this, or deal with the false values some other way?
@iangow In light of this issue, I have written the function
get_filing_file_list <- function(file_name, num_files = Inf) {
pg <- dbConnect(PostgreSQL())
filing_docs <- tbl(pg, sql(paste0("SELECT * FROM edgar.filing_docs WHERE file_name = '", file_name, "'")))
filing_docs_processed <- tbl(pg, sql(paste0("SELECT * FROM edgar.filing_docs_processed WHERE file_name = '", file_name, "'", " AND downloaded = TRUE")))
get_file_path <- function(file_name, document) {
url <- gsub("(\\d{10})-(\\d{2})-(\\d{6})\\.txt", "\\1\\2\\3", file_name)
file.path(url, document)
}
new_table <- !dbExistsTable(pg, c("edgar", "filing_docs_processed"))
if (!new_table) {
files <-
filing_docs %>%
anti_join(filing_docs_processed)
} else {
files <- filing_docs
}
files <-
files %>%
filter(document %~*% "htm$") %>%
collect(n = num_files) %>%
mutate(html_link = get_file_path(file_name, document))
dbDisconnect(pg)
return(files)
}
which finds all the filing documents for a given file_name
which have yet to be downloaded, including files for which downloaded = FALSE
in edgar.filing_docs_processed
.
Using this function on the input "edgar/data/1084869/0001437749-14-016921.txt"
yielded
# A tibble: 7 x 7
seq description document type size file_name html_link
<int> <chr> <chr> <chr> <int> <chr> <chr>
1 1 FORM 10-K flws20140629_10k.htm 10-K 2832556 edgar/data/1084869/0001437749-14-016921.txt edgar/data/1084869/000143774914016921/flws20140629_10k.htm
2 2 EXHIBIT 21.1 ex21-1.htm EX-21 8406 edgar/data/1084869/0001437749-14-016921.txt edgar/data/1084869/000143774914016921/ex21-1.htm
3 3 EXHIBIT 23.1 ex23-1.htm EX-23 4336 edgar/data/1084869/0001437749-14-016921.txt edgar/data/1084869/000143774914016921/ex23-1.htm
4 4 EXHIBIT 23.2 ex23-2.htm EX-23 7594 edgar/data/1084869/0001437749-14-016921.txt edgar/data/1084869/000143774914016921/ex23-2.htm
5 5 EXHIBIT 31.1 ex31-1.htm EX-31 15783 edgar/data/1084869/0001437749-14-016921.txt edgar/data/1084869/000143774914016921/ex31-1.htm
6 6 EXHIBIT 31.2 ex31-2.htm EX-31 16288 edgar/data/1084869/0001437749-14-016921.txt edgar/data/1084869/000143774914016921/ex31-2.htm
7 7 EXHIBIT 32.1 ex32-1.htm EX-32 6938 edgar/data/1084869/0001437749-14-016921.txt edgar/data/1084869/000143774914016921/ex32-1.htm
I think the reasoning here (my reasoning, but it was some time ago) is that the filing_docs_processed
table captures filings that created problems so that we don’t try to process them again. Unless we have changed the code, we will just get errors and gain nothing.
That said, if we take the anti-join of filing_docs
with filing_docs_processed
, we can identify problematic filings and perhaps tweak the code to attempt to process these again. Though, again, we’d want the ones that aren’t processed to be recorded in filing_docs_processed
.
Does that make sense?
@iangow I just tested both of my functions on a random input file_name
. I changed get_filing_file_list
to
get_filing_file_list <- function(file_name, num_files = Inf) {
pg <- dbConnect(PostgreSQL())
filing_docs <- tbl(pg, sql(paste0("SELECT * FROM edgar.filing_docs WHERE file_name = '", file_name, "'")))
filing_docs_processed <- tbl(pg, sql(paste0("SELECT * FROM edgar.filing_docs_processed WHERE file_name = '", file_name, "'")))
get_file_path <- function(file_name, document) {
url <- gsub("(\\d{10})-(\\d{2})-(\\d{6})\\.txt", "\\1\\2\\3", file_name)
file.path(url, document)
}
new_table <- !dbExistsTable(pg, c("edgar", "filing_docs_processed"))
if (!new_table) {
files <-
filing_docs %>%
anti_join(filing_docs_processed)
} else {
files <- filing_docs
}
files <-
files %>%
filter(document %~*% "htm$") %>%
collect(n = num_files) %>%
mutate(html_link = get_file_path(file_name, document))
dbDisconnect(pg)
return(files)
}
and also defined the function download_filing_files
download_filing_files <- function(file_name, max_files = Inf) {
pg <- dbConnect(PostgreSQL())
new_table <- !dbExistsTable(pg, c("edgar", "filing_docs_processed"))
dbDisconnect(pg)
while (nrow(files <- get_filing_file_list(file_name, num_files = max_files))>0) {
print("Getting files...")
st <- system.time(files$downloaded <-
unlist(lapply(files$html_link, get_filing_docs)))
print(sprintf("Downloaded %d files in %3.2f seconds",
nrow(files), st[["elapsed"]]))
downloaded_files <-
files %>%
select(file_name, document, downloaded)
pg <- dbConnect(PostgreSQL())
dbWriteTable(pg, c("edgar", "filing_docs_processed"), downloaded_files,
append = !new_table,
row.names = FALSE)
if (new_table) {
dbGetQuery(pg, "CREATE INDEX ON edgar.filing_docs_processed (file_name)")
dbGetQuery(pg, "ALTER TABLE edgar.filing_docs_processed OWNER TO edgar")
dbGetQuery(pg, "GRANT SELECT ON TABLE edgar.filing_docs_processed TO edgar_access")
new_table <- FALSE
}
dbDisconnect(pg)
}
}
To run a test, I first ran get_file_list
with the command
get_file_list(num_files = 20)
and got the result
# A tibble: 20 x 7
seq description document type size file_name html_link
<int> <chr> <chr> <chr> <int> <chr> <chr>
1 2 NEWS RELEASE DATED FEBRUARY 25, 2015 exhibit991tcc02252015.htm EX-99.1 17693 edgar/data/123… edgar/data/12323…
2 2 NEWS RELEASE DATED MAY 1, 2015 exhibit99105012015.htm EX-99.1 26539 edgar/data/123… edgar/data/12323…
3 1 "" lumina6k1222.htm 6-K 12538 edgar/data/123… edgar/data/12391…
4 2 "" oct2005.htm EX-1 8791 edgar/data/123… edgar/data/12391…
5 4 FORM 52-109F2 - CERTIFICATION OF INTERIM FILINGS DURING TRANSITION PERIOD - PRES exhibit99-3.htm EX-99.3 3661 edgar/data/123… edgar/data/12396…
6 2 FORM OF PROXY exhibit99-1.htm EX-99.1 780 edgar/data/123… edgar/data/12396…
7 3 MANAGEMENT DISCUSSION AND ANALYSIS exhibit99-2.htm EX-99.2 83799 edgar/data/123… edgar/data/12396…
8 4 CERTIFICATION - CEO exhibit99-3.htm EX-99.3 6884 edgar/data/123… edgar/data/12396…
9 3 "" arcelormittal6k-ex992_1204.htm EX-99.2 21397 edgar/data/124… edgar/data/12434…
10 2 "" arcelor6kex991_0602.htm EX-99.1 7238 edgar/data/124… edgar/data/12434…
11 2 "" arcelor6kex991_0614.htm EX-99.1 44524 edgar/data/124… edgar/data/12434…
12 5 EXHIBIT 5.2 d469451dex52.htm EX-5.2 9558 edgar/data/124… edgar/data/12434…
13 1 FORM 6-K d6k.htm 6-K 11856 edgar/data/124… edgar/data/12434…
14 2 EXHIBIT 99.1 dex991.htm EX-99.1 5937 edgar/data/124… edgar/data/12434…
15 6 EXHIBIT 99.5 dex995.htm EX-99.5 17470 edgar/data/124… edgar/data/12434…
16 3 EXHIBIT 99.2 dex992.htm EX-99.2 6658 edgar/data/124… edgar/data/12434…
17 1 6-K a2188836z6-k.htm 6-K 10803 edgar/data/125… edgar/data/12525…
18 2 EX-99.1 a08-8153_5ex99d1.htm EX-99.1 12790 edgar/data/125… edgar/data/12525…
19 1 6-K a08-21445_16k.htm 6-K 19774 edgar/data/125… edgar/data/12525…
20 1 6-K a10-10243_26k.htm 6-K 20245 edgar/data/125… edgar/data/12525…
and then ran
get_file_list(num_files = 20)$file_name[1]
to get the first file_name
off the list
[1] "edgar/data/1232384/0001263995-15-000014.txt"
Then, running
get_filing_file_list("edgar/data/1232384/0001263995-15-000014.txt")
I got
# A tibble: 2 x 7
seq description document type size file_name html_link
<int> <chr> <chr> <chr> <int> <chr> <chr>
1 1 TRANSCANADA CORPORATION FORM 6-K DATED FEBRUARY 25, 2015 form6ktcc02252015.htm 6-K 16539 edgar/data/1232384/0001263995-15-000014.txt edgar/data/1232384…
2 2 NEWS RELEASE DATED FEBRUARY 25, 2015 exhibit991tcc02252015.htm EX-99.1 17693 edgar/data/1232384/0001263995-15-000014.txt edgar/data/1232384…
so get_filing_file_list works. I then tested dowload_filing_files on this file
download_filing_files("edgar/data/1232384/0001263995-15-000014.txt")
obtaining the output
Joining, by = c("document", "file_name")
[1] "Getting files..."
[1] "Downloaded 2 files in 1.22 seconds"
Joining, by = c("document", "file_name")
Error in mutate_impl(.data, dots) :
Evaluation error: object 'document' not found.
I then did
get_filing_file_list("edgar/data/1232384/0001263995-15-000014.txt")
Joining, by = c("document", "file_name")
Error in mutate_impl(.data, dots) :
Evaluation error: object 'document' not found.
and then
get_file_list(num_files = 20)
which produced
Joining, by = c("document", "file_name")
# A tibble: 20 x 7
seq description document type size file_name html_link
<int> <chr> <chr> <chr> <int> <chr> <chr>
1 2 NEWS RELEASE DATED MAY 1, 2015 exhibit99105012015.htm EX-99.1 26539 edgar/data/123… edgar/data/12323…
2 1 "" lumina6k1222.htm 6-K 12538 edgar/data/123… edgar/data/12391…
3 2 "" oct2005.htm EX-1 8791 edgar/data/123… edgar/data/12391…
4 4 FORM 52-109F2 - CERTIFICATION OF INTERIM FILINGS DURING TRANSITION PERIOD - PRES exhibit99-3.htm EX-99.3 3661 edgar/data/123… edgar/data/12396…
5 2 FORM OF PROXY exhibit99-1.htm EX-99.1 780 edgar/data/123… edgar/data/12396…
6 3 MANAGEMENT DISCUSSION AND ANALYSIS exhibit99-2.htm EX-99.2 83799 edgar/data/123… edgar/data/12396…
7 4 CERTIFICATION - CEO exhibit99-3.htm EX-99.3 6884 edgar/data/123… edgar/data/12396…
8 3 "" arcelormittal6k-ex992_1204.htm EX-99.2 21397 edgar/data/124… edgar/data/12434…
9 2 "" arcelor6kex991_0602.htm EX-99.1 7238 edgar/data/124… edgar/data/12434…
10 2 "" arcelor6kex991_0614.htm EX-99.1 44524 edgar/data/124… edgar/data/12434…
11 5 EXHIBIT 5.2 d469451dex52.htm EX-5.2 9558 edgar/data/124… edgar/data/12434…
12 1 FORM 6-K d6k.htm 6-K 11856 edgar/data/124… edgar/data/12434…
13 2 EXHIBIT 99.1 dex991.htm EX-99.1 5937 edgar/data/124… edgar/data/12434…
14 6 EXHIBIT 99.5 dex995.htm EX-99.5 17470 edgar/data/124… edgar/data/12434…
15 3 EXHIBIT 99.2 dex992.htm EX-99.2 6658 edgar/data/124… edgar/data/12434…
16 1 6-K a2188836z6-k.htm 6-K 10803 edgar/data/125… edgar/data/12525…
17 2 EX-99.1 a08-8153_5ex99d1.htm EX-99.1 12790 edgar/data/125… edgar/data/12525…
18 1 6-K a08-21445_16k.htm 6-K 19774 edgar/data/125… edgar/data/12525…
19 1 6-K a10-10243_26k.htm 6-K 20245 edgar/data/125… edgar/data/12525…
20 2 EX-15 l39350exv15.htm EX-15 7371 edgar/data/125… edgar/data/12578…
Finally, I performed the SQL query
tbl(pg, sql(paste0("SELECT * FROM edgar.filing_docs_processed WHERE file_name = '", "edgar/data/1232384/0001263995-15-000014.txt", "'")))
getting
# Database: postgres 9.6.8 [bdcallen@10.101.13.99:5432/crsp]
file_name document downloaded
<chr> <chr> <lgl>
1 edgar/data/1232384/0001263995-15-000014.txt form6ktcc02252015.htm TRUE
2 edgar/data/1232384/0001263995-15-000014.txt exhibit991tcc02252015.htm TRUE
So in other words, the files associated with the test case have been put into edgar.filing_docs_processed
, and thus removed from the list of files to be processed produced by get_file_list
. Furthermore, both the associated filing documents have dowloaded
being equal to TRUE
. I also finally checked
browseURL(file.path("/home/shared", "edgar/data/1232384/000126399515000014/form6ktcc02252015.htm"))
and
browseURL(file.path("/home/shared", "edgar/data/1232384/000126399515000014/exhibit991tcc02252015.htm"))
with both working in my browser, so the documents have been successfully downloaded.
So we have a function which downloads the documents by file_name
now, though as we discussed, it is probably more prudent to download by form_type
. Anyway, I will commit the code now.
This will be (part of) a big task. For now, I think we just need to clean up the code here. I think the goal should be a function that takes a
file_name
and downloads all the filing documents associated with thatfile_name
to the extent that they're not already present.Most of the pieces are already here, so this will mostly be about cleaning up and documenting the code.