mccgr / edgar

Code to manage data related to SEC EDGAR

Adapt download_filing_docs to deal with non-standard html document links #38

Closed: bdcallen closed this issue 4 years ago

bdcallen commented 5 years ago

@iangow I ran the program to download the Item 5s and 8s over the weekend, this time with EDGAR_DIR set correctly. All but a few hundred filings were downloaded, so I explored why these few hundred weren't getting downloaded. It turns out that these filings often have html links which do not conform to the edgar/data/[filing info]/[document_name] form extracted by the function get_file_path, like this one.
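(For context, the standard form works roughly as follows. This is a sketch reconstructed from the get_html_link helper that appears later in this thread, not the repo's actual get_file_path.)

get_file_path_sketch <- function(file_name, document) {
    # Drop the dashes and .txt extension from the accession number in
    # file_name to get the directory, then append the document name.
    # (Assumed reconstruction, not the repo's code.)
    dir <- gsub("(\\d{10})-(\\d{2})-(\\d{6})\\.txt", "\\1\\2\\3", file_name)
    file.path(dir, document)
}

get_file_path_sketch("edgar/data/1021949/0000950128-00-000880.txt", "0001.htm")
#> [1] "edgar/data/1021949/000095012800000880/0001.htm"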

iangow commented 5 years ago

Are you able to work out the URL for these?

bdcallen commented 5 years ago

@iangow I've added some alternative functions which handle these cases. In particular, I defined get_filing_docs_alt which extends get_filing_docs by extracting the hrefs for the documents. At the moment, I have it writing to a new table filing_docs_alt to differentiate it from filing_docs.

get_filing_docs_alt <- function(file_name) {

    tryCatch({
        head_url <- get_index_url(file_name)

        table_nodes <-
            read_html(head_url, encoding = "Latin1") %>%
            html_nodes("table")

        if (length(table_nodes) < 1) {
            df <- tibble(seq = NA, description = NA, document = NA, type = NA,
                         size = NA, file_name = file_name)
        } else {
            df <-
                table_nodes %>%
                html_table() %>%
                bind_rows() %>%
                fix_names() %>%
                mutate(file_name = file_name, type = as.character(type))

            colnames(df) <- tolower(colnames(df))

            # Pull the href for each document link and prepend the SEC host
            hrefs <-
                table_nodes %>%
                html_nodes("tr") %>%
                html_nodes("a") %>%
                html_attr("href")

            df$html_link <- paste0("https://www.sec.gov", hrefs)
        }

        pg <- dbConnect(PostgreSQL())
        dbWriteTable(pg, c("edgar", "filing_docs_alt"),
                     df, append = TRUE, row.names = FALSE)
        dbDisconnect(pg)

        TRUE
    }, error = function(e) FALSE)
}

I have also defined process_filings_alt

process_filings_alt <- function(filings_df) {

    pg <- dbConnect(PostgreSQL())

    # Does the target table exist yet? If not, set up index and permissions below.
    new_table <- !dbExistsTable(pg, c("edgar", "filing_docs_alt"))

    # Scrape and write each filing's documents in parallel
    system.time(temp <- mclapply(filings_df$file_name, get_filing_docs_alt,
                                 mc.cores = 24))

    if (new_table) {
        rs <- dbExecute(pg, "CREATE INDEX ON edgar.filing_docs_alt (file_name)")
        rs <- dbExecute(pg, "ALTER TABLE edgar.filing_docs_alt OWNER TO edgar")
        rs <- dbExecute(pg, "GRANT SELECT ON TABLE edgar.filing_docs_alt TO edgar_access")
    }

    rs <- dbDisconnect(pg)

    unlist(temp)
}
iangow commented 5 years ago

@iangow I've added some alternative functions which handle these cases. In particular, I defined get_filing_docs_alt which extends get_filing_docs by extracting the hrefs for the documents. At the moment, I have it writing to a new table filing_docs_alt to differentiate it from filing_docs.

I don't think we want two sets of code and two tables. I think it would be OK to "backfill" the current table with the extra field. Note that I don't think this has to be the full URL. For example, for https://www.sec.gov/Archives/edgar/data/1041623/000104544701000021/0001045447-01-000021-0001.htm

I think 0001045447-01-000021-0001.htm would suffice to allow us to generate the link (much as the current code does using the filename). I think using basename() would allow you to extract this piece.
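For example, basename() treats the URL as a path and returns its final component:

basename("https://www.sec.gov/Archives/edgar/data/1041623/000104544701000021/0001045447-01-000021-0001.htm")
#> [1] "0001045447-01-000021-0001.htm"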

bdcallen commented 5 years ago

@iangow

> filing_docs_alt <- tbl(pg, sql("SELECT * FROM edgar.filing_docs_alt")) %>% collect()
> 
> exists_in_mem <- c()
> 
> for(i in 1:nrow(filing_docs_alt)) {
+     
+     
+     path <- get_file_path(filing_docs_alt$file_name[i], filing_docs_alt$document[i])
+     local_filename <- file.path(raw_directory, path)
+     exists_in_mem <- c(exists_in_mem, file.exists(local_filename))
+     
+ }
> 
> filing_docs_alt$exists_in_mem <- exists_in_mem
> 
> 
> exists_in_mem_alt <- c()
> 
> for(i in 1:nrow(filing_docs_alt)) {
+     
+     
+     
+     local_filename <- file.path(raw_directory, filing_docs_alt$html_link[i])
+     exists_in_mem_alt <- c(exists_in_mem_alt, file.exists(local_filename))
+     
+ }
> 
> filing_docs_alt$exists_in_mem_alt <- exists_in_mem_alt
> 
> 
> 
> exists_in_mem_alt2 <- c()
> 
> for(i in 1:nrow(filing_docs_alt)) {
+     
+     
+     
+     local_filename <- file.path(raw_directory, gsub("https://www.sec.gov/Archives", "", filing_docs_alt$html_link[i]))
+     exists_in_mem_alt2 <- c(exists_in_mem_alt2, file.exists(local_filename))
+     
+ }
> 
> filing_docs_alt$exists_in_mem_alt2 <- exists_in_mem_alt2
> 
> 
> sum(filing_docs_alt$exists_in_mem)
[1] 756
> 
> sum(filing_docs_alt$exists_in_mem_alt)
[1] 0
> 
> sum(filing_docs_alt$exists_in_mem_alt2)
[1] 0
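(As an aside, file.exists() is vectorized, so the three loops above can be collapsed into a single mutate(); a sketch, assuming get_file_path has to be mapped row by row:)

library(dplyr, warn.conflicts = FALSE)

filing_docs_alt <-
    filing_docs_alt %>%
    mutate(
        # Standard-form local paths, built row by row via get_file_path
        exists_in_mem = file.exists(
            file.path(raw_directory, mapply(get_file_path, file_name, document))),
        # Full html_link appended to raw_directory
        exists_in_mem_alt = file.exists(file.path(raw_directory, html_link)),
        # html_link with the Archives stem stripped
        exists_in_mem_alt2 = file.exists(
            file.path(raw_directory,
                      gsub("https://www.sec.gov/Archives", "", html_link))))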
bdcallen commented 5 years ago

@iangow In the above, I am figuring out which documents from filing_docs_alt actually exist in memory, as I downloaded some documents from here a few months ago. After setting EDGAR_DIR correctly to the 2-terabyte hard drive, I found that 756 documents exist in memory under paths of the first type, as counted in exists_in_mem. These paths have the same form as for filing documents which do not appear in filing_docs_alt, i.e. documents that have the usual form for the html link and have downloaded = TRUE in filing_docs_processed.

iangow commented 5 years ago

By "memory", you mean on the HDD?

I think it might be worthwhile also to investigate how quickly we can do HTTP requests to test link validity. In R (it need not be R), this may provide some ideas.

As a start, pull 1000 documents from filing_docs, create URLs from them, then time how long it takes just to check that they're valid.

iangow commented 5 years ago

BTW, I have this line in my ~/.profile:

export EDGAR_DIR=/media/igow/2TB/

And this line in my ~/.Rprofile (this might be empty for you):

Sys.setenv(EDGAR_DIR="/media/igow/2TB/")

So these already point to the "right" place.
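The scripts can then pick this up with something like the following (a sketch; raw_directory is the name used in the loops above, though the repo's actual assignment may differ):

raw_directory <- Sys.getenv("EDGAR_DIR")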

bdcallen commented 5 years ago

@iangow Just defined this function (using the httr package)

get_url_status_code <- function(url) {

    r <- GET(url)

    return(status_code(r))

}

then did

> system.time(stat_vec <- unlist(lapply(filing_docs_alt$html_link, get_url_status_code)))
   user  system elapsed 
 12.767   0.357  31.130 
> stat_vec
   [1] 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200
 ... (entries 1-425 are all 200; entries 426-1000 are all 429) ...
 [991] 429 429 429 429 429 429 429 429 429 429
 [ reached getOption("max.print") -- omitted 141 entries ]

So it took 31 seconds to do 1141 filing documents, though obviously there are many 429 codes (429 is the code for Too Many Requests).

iangow commented 5 years ago

OK. Using HEAD instead of GET seems to be faster. Note that things slow down once you start getting 429 codes. The following seems to run fine up to about 100 filings.

I think we could adapt the approach used in scrape_filing_docs.R to create a table of status_codes that we could populate for everything on filing_docs. Then we could delete problematic cases from filing_docs, then run a modified version of scrape_filing_docs.R again. The modified version would get the href and where this doesn't match the document, put the correct partial URL in a different table or field.

iangow commented 5 years ago

Edited code to use Vectorize:

library(dplyr, warn.conflicts = FALSE)
library(DBI)
library(httr)

pg <- dbConnect(RPostgres::Postgres())

rs <- dbExecute(pg, "SET work_mem = '5GB'")
rs <- dbExecute(pg, "SET search_path TO edgar, public")

filing_docs <- tbl(pg, "filing_docs") 

get_html_link <- function(file_name, document) {
    # Remove the dashes and .txt extension from the accession number to get
    # the directory name, then append the document name
    url <- gsub("(\\d{10})-(\\d{2})-(\\d{6})\\.txt", "\\1\\2\\3", file_name)
    file.path("http://www.sec.gov/Archives", file.path(url, document))
}

get_status_code <- function(file_name, document) {

    url <- get_html_link(file_name, document)
    r <- HEAD(url)
    return(status_code(r))

}

get_status_code <- Vectorize(get_status_code)

system.time({
    status_codes <-
        filing_docs %>% 
        select(file_name, document) %>%
        collect(n = 100) %>%
        mutate(status_code = get_status_code(file_name, document)) 
})
#>    user  system elapsed 
#>   0.487   0.036   1.752

status_codes %>% count(status_code)
#> # A tibble: 1 x 2
#>   status_code     n
#>         <int> <int>
#> 1         200   100

Created on 2019-01-24 by the reprex package (v0.2.1)

bdcallen commented 5 years ago

@iangow I don't seem to be getting the same performance with the above code. I got

> filing_docs <- tbl(pg, sql("SELECT * FROM edgar.filing_docs"))
> system.time({
+     status_codes <-
+         filing_docs %>% 
+         select(file_name, document) %>%
+         collect(n = 100) %>%
+         mutate(status_code = get_status_code(file_name, document)) 
+ })
   user  system elapsed 
  0.763   0.195  31.644 

> status_codes %>% count(status_code)
# A tibble: 2 x 2
  status_code     n
        <int> <int>
1         200    71
2         403    29

for a batch of 100, and

> system.time({
+     status_codes2 <-
+         filing_docs %>% 
+         select(file_name, document) %>%
+         collect(n = 1000) %>%
+         mutate(status_code = get_status_code(file_name, document)) 
+ })
   user  system elapsed 
  7.257   1.457 303.850 
> status_codes2 %>% count(status_code)
# A tibble: 2 x 2
  status_code     n
        <int> <int>
1         200   727
2         403   273

for a batch of 1000.

iangow commented 5 years ago

I am already hitting EDGAR from 10.101.13.99, so you will hit 403s sooner if you are coming from that address. Performance seems to take a hit with 403 errors (perhaps the site responds more slowly to stop DoS attacks).

iangow commented 5 years ago

So, maybe try it using RStudio on your computer.

iangow commented 5 years ago

I think there are about 60 million rows in filing_docs. So at 2 seconds per 100, I calculate:

60e6/100*2/(24*60*60) = 13.89 days.

So once we have filing_docs done, we could run this code to detect bad links that need to be replaced. No huge hurry on this (I guess 99.9% of links are good as is).

iangow commented 4 years ago

I think there are about 60 million rows in filing_docs. So at 2 seconds per 100, I calculate:

60e6/100*2/(24*60*60) = 13.89 days.

So once we have filing_docs done, we could run this code to detect bad links that need to be replaced. No huge hurry on this (I guess 99.9% of links are good as is).

@bdcallen Are we able to identify problematic filings somehow? You said above:

@iangow I ran the program to download the Item 5s and 8s over the weekend, this time with EDGAR_DIR set correctly. All but a few hundred filings were downloaded, so I explored why these few hundred weren't getting downloaded. It turns out that these filings often have html links which do not conform to the edgar/data/[filing info]/[document_name] form extracted by the function get_file_path, like this one.

How did you identify "these few hundred"?

I think the right idea would be to have a function that can handle these cases and to have a table that identifies the filings with HTML links that do not conform to the edgar/data/[filing info]/[document_name] form. With such a table it should be easy to make a virtual table that contains everything.

iangow commented 4 years ago

@bdcallen Also see #39.

bdcallen commented 4 years ago

@iangow this issue has come up again in my downloading of Schedules 13D and 13G. About 40000 documents didn't get downloaded, and the same issue occurred. Essentially, I identified these cases by looking at the entries for which the downloaded field in filing_docs_processed was FALSE, then looking at some of these cases manually, like this one, and looking at the html code for the tables. Note that if you click on the 0001.txt document for this filing, the url you are taken to is not of the standard form. I wrote some functions which can handle and download these cases a while ago (see some of the posts above); this issue (and #39 I guess) is just a case of deciding what we want the table or column (I initially proposed a new column to handle this) to look like for these cases.

iangow commented 4 years ago

I don't see 40,000 cases with downloaded equal to FALSE in filing_docs_processed:

library(dplyr, warn.conflicts = FALSE)
library(DBI)
pg <- dbConnect(RPostgres::Postgres())

rs <- dbExecute(pg, "SET work_mem='3GB'")
rs <- dbExecute(pg, "SET search_path TO edgar")

filing_docs_processed <- tbl(pg, "filing_docs_processed")
problems <-
    filing_docs_processed %>% 
    filter(!downloaded)

problems %>%
    count(document) %>%
    arrange(desc(n))
#> # Source:     lazy query [?? x 2]
#> # Database:   postgres [igow@10.101.13.99:5432/crsp]
#> # Ordered by: desc(n)
#>    document                 n      
#>    <chr>                    <int64>
#>  1 0001.txt                 398    
#>  2 0002.txt                  85    
#>  3 0003.txt                  37    
#>  4 0004.txt                  16    
#>  5 0005.txt                  10    
#>  6 0006.txt                   4    
#>  7 0007.txt                   3    
#>  8 mbot_sc13d.txt             2    
#>  9 0001172661-18-000468.txt   2    
#> 10 0001748828-18-000002.txt   2    
#> # … with more rows

problems %>%
    count()
#> # Source:   lazy query [?? x 1]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#>   n      
#>   <int64>
#> 1 566

Created on 2020-01-07 by the reprex package (v0.3.0)

If all the problem cases can be flagged by looking at filing_docs_processed, then one approach would be to write a small function that takes the problem cases, figures out the correct URL (could be extracted from the link on the HTML index page), and downloads from there. I think if we thereby created a table filing_docs_non_standard with file_name, document, and (correct!)url we would have all we need. I think the downloaded file should be stored in (say) edgar/data/921030/000092103000000039/0000921030-00-000039-0001.txt so that our file structure mimics that of SEC.
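A minimal sketch of such a function, assuming the repo's get_index_url helper; the function name and the "^/Archives/" prefix handling are assumptions:

library(dplyr, warn.conflicts = FALSE)
library(rvest)

get_nonstandard_url <- function(file_name, document) {
    # For one problem case, read the filing's index page, find the link
    # whose final component matches the document name, and return the path
    # relative to the Archives root. (Hypothetical sketch.)
    index_page <- read_html(get_index_url(file_name))
    hrefs <-
        index_page %>%
        html_nodes("table a") %>%
        html_attr("href")
    href <- hrefs[basename(hrefs) == document][1]
    sub("^/Archives/", "", href)
}

Rows of (file_name, document, url) built this way would then populate the proposed edgar.filing_docs_non_standard table.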

iangow commented 4 years ago

… a case of deciding what we want the table or column (I initially proposed a new column to handle this) to look like for these cases.

I think adding a column would be redundant in the overwhelming majority of cases. If you can make the "filing_docs_non_standard" table above, we can make a view that has the additional column easily.
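Such a view might be as simple as the following sketch (filing_docs_non_standard and its url column are the proposal above, not an existing table):

library(DBI)

pg <- dbConnect(RPostgres::Postgres())

# url is NULL for the (overwhelming majority of) standard cases
dbExecute(pg, "
    CREATE VIEW edgar.filing_docs_with_urls AS
    SELECT fd.*, ns.url
    FROM edgar.filing_docs AS fd
    LEFT JOIN edgar.filing_docs_non_standard AS ns
    USING (file_name, document)")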

bdcallen commented 4 years ago

@iangow Sorry, I deleted the 40000 cases where downloaded is false, as I wanted to run the program over that set again. Seeing as the majority were not getting downloaded, I halted the program, so that's why there are only a few hundred now.

bdcallen commented 4 years ago

… a case of deciding what we want the table or column (I initially proposed a new column to handle this) to look like for these cases.

I think adding a column would be redundant in the overwhelming majority of cases. If you can make the "filing_docs_non_standard" table above, we can make a view that has the additional column easily.

I agree: an additional table is preferable, since for the vast majority of cases the information contained in an additional column would be redundant. I actually made a couple of tables, filing_docs_alt and edgar.filing_docs_processed_alt:

crsp=> SELECT * FROM edgar.filing_docs_processed_alt LIMIT 10;
                  file_name                  | document | downloaded 
---------------------------------------------+----------+------------
 edgar/data/1021949/0000950128-00-000880.txt | 0001.htm | t
 edgar/data/1028262/0000897101-00-001194.txt | 0001.htm | t
 edgar/data/1028262/0000897101-00-001194.txt | 0002.htm | t
 edgar/data/1012770/0000950128-00-000892.txt | 0001.htm | t
 edgar/data/1003130/0000906318-00-000064.txt | 0001.htm | t
 edgar/data/1007797/0001068238-01-000055.txt | 0001.htm | t
 edgar/data/1021848/0000899681-00-000451.txt | 0001.htm | t
 edgar/data/1021848/0000899681-00-000451.txt | 0002.htm | t
 edgar/data/1021848/0000899681-01-000008.txt | 0001.htm | t
 edgar/data/1021848/0000899681-01-000008.txt | 0002.htm | t
(10 rows)

crsp=> SELECT * FROM edgar.filing_docs_alt LIMIT 10;
 seq |                description                |         document         |  type   | size  |                  file_name                  |                                            html_link                                             
-----+-------------------------------------------+--------------------------+---------+-------+---------------------------------------------+--------------------------------------------------------------------------------------------------
   1 | MELLON PREMIUM FINANCE LOAN MASTER TRUST  | 0001.htm                 | 8-K     | 17475 | edgar/data/1021949/0000950128-00-000880.txt | https://www.sec.gov/Archives/edgar/data/1021949/000095012800000880/0000950128-00-000880-0001.htm
   2 | UPDATED TABLES                            | 0002.txt                 | EX-19.1 | 15430 | edgar/data/1021949/0000950128-00-000880.txt | https://www.sec.gov/Archives/edgar/data/1021949/000095012800000880/0000950128-00-000880-0002.txt
     | Complete submission text file             | 0000950128-00-000880.txt |         | 34801 | edgar/data/1021949/0000950128-00-000880.txt | https://www.sec.gov/Archives/edgar/data/1021949/000095012800000880/0000950128-00-000880.txt
   1 |                                           | 0001.htm                 | 8-K     |  6201 | edgar/data/1028262/0000897101-00-001194.txt | https://www.sec.gov/Archives/edgar/data/1028262/000089710100001194/0000897101-00-001194-0001.htm
   2 | PRESS RELEASE                             | 0002.htm                 | EX-99.1 |  3736 | edgar/data/1028262/0000897101-00-001194.txt | https://www.sec.gov/Archives/edgar/data/1028262/000089710100001194/0000897101-00-001194-0002.htm
     | Complete submission text file             | 0000897101-00-001194.txt |         | 11501 | edgar/data/1028262/0000897101-00-001194.txt | https://www.sec.gov/Archives/edgar/data/1028262/000089710100001194/0000897101-00-001194.txt
   1 | MELLON BANK HOME EQUITY LOAN TRUST 1996-1 | 0001.htm                 | 8-K     |  6729 | edgar/data/1012770/0000950128-00-000892.txt | https://www.sec.gov/Archives/edgar/data/1012770/000095012800000892/0000950128-00-000892-0001.htm
   2 | MONTHLY SERVICER & INVESTOR REPORT        | 0002.txt                 | EX-20   | 51932 | edgar/data/1012770/0000950128-00-000892.txt | https://www.sec.gov/Archives/edgar/data/1012770/000095012800000892/0000950128-00-000892-0002.txt
     | Complete submission text file             | 0000950128-00-000892.txt |         | 60507 | edgar/data/1012770/0000950128-00-000892.txt | https://www.sec.gov/Archives/edgar/data/1012770/000095012800000892/0000950128-00-000892.txt
   1 |                                           | 0001.htm                 | 8-K     |  6952 | edgar/data/1003130/0000906318-00-000064.txt | https://www.sec.gov/Archives/edgar/data/1003130/000090631800000064/0000906318-00-000064-0001.htm

which I made to test this idea when I was helping James with downloading the 8-Ks (a while ago). It seems I also included code for this job in the file filing_docs/download_filing_doc_exceptions.R.

bdcallen commented 4 years ago

@iangow I have amended filing_docs/download_filing_doc_exceptions.R in the commit above so that it writes to a new form I've decided on for filing_docs_alt, and I have deleted the old filing_docs_alt. The new filing_docs_alt has a column path_alt, which contains the path to the downloaded file; this is the html link with the https://www.sec.gov/Archives/ stem removed. It also has its own downloaded column, which is set to true when a file is downloaded successfully. I'm running the program now, and it seems to be mostly doing the right thing (apart from around 200 duplicates it produced at the start of the run, which it is no longer doing). It should take around 6.5 hours to run. I will analyze the table and results tomorrow.
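Concretely, path_alt is the html link with the Archives stem stripped; for example, using one of the links from the filing_docs_alt table shown earlier:

sub("https://www.sec.gov/Archives/", "",
    "https://www.sec.gov/Archives/edgar/data/1021949/000095012800000880/0000950128-00-000880-0001.htm")
#> [1] "edgar/data/1021949/000095012800000880/0000950128-00-000880-0001.htm"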

bdcallen commented 4 years ago

@iangow After fixing some errors with the table joins in download_filing_doc_exceptions.R, which were leading to unnecessary duplication in filing_docs_alt, I got rid of the duplication and ran the program. It has run successfully, with just a handful of documents not being processed:

crsp=# SELECT COUNT(*) FROM edgar.filing_docs_alt;
 count 
-------
 47799
(1 row)

crsp=# SELECT COUNT(*) FROM edgar.filing_docs_processed 
crsp-# WHERE NOT downloaded;
 count 
-------
 47807
(1 row)
bdcallen commented 4 years ago

Furthermore, doing

failed_to_download <- tbl(pg, sql("SELECT * FROM edgar.filing_docs_processed WHERE NOT downloaded"))
filing_docs_alt <- tbl(pg, sql("SELECT * FROM edgar.filing_docs_alt"))
files <-
        failed_to_download %>%
        anti_join(filing_docs_alt, by = c("file_name", "document")) %>%
        # dbplyr passes %~*% through as Postgres's case-insensitive
        # regex-match operator (~*)
        filter(document %~*% "txt$") %>%
        collect()

showed that these eight documents are

> files
# A tibble: 8 x 3
  file_name                                   document                 downloaded
* <chr>                                       <chr>                    <lgl>     
1 edgar/data/1748828/0001748828-18-000002.txt mbot_sc13d.txt           FALSE     
2 edgar/data/1748828/0001748828-18-000002.txt 0001748828-18-000002.txt FALSE     
3 edgar/data/883975/0001748828-18-000002.txt  mbot_sc13d.txt           FALSE     
4 edgar/data/883975/0001748828-18-000002.txt  0001748828-18-000002.txt FALSE     
5 edgar/data/1094742/0001172661-17-000448.txt 0001172661-17-000448.txt FALSE     
6 edgar/data/1332905/0001172661-17-000448.txt 0001172661-17-000448.txt FALSE     
7 edgar/data/1094742/0001172661-18-000468.txt 0001172661-18-000468.txt FALSE     
8 edgar/data/1332905/0001172661-18-000468.txt 0001172661-18-000468.txt FALSE     

Having had a look at the pages of the filings involved, it seems these are filings which are not on the system anymore (i.e. they lead to the "temporarily unavailable" page).

bdcallen commented 4 years ago

@iangow Also, the downloading was very successful: there are only 24 entries for which downloaded is false.

crsp=# SELECT * FROM edgar.filing_docs_alt WHERE NOT downloaded;
                  file_name                  |         document         | downloaded | seq |          description          |   type   |  size  |                            path_alt                            
---------------------------------------------+--------------------------+------------+-----+-------------------------------+----------+--------+----------------------------------------------------------------
 edgar/data/919549/0000065103-95-000120.txt  | 0000065103-95-000120.txt | f          |     | Complete submission text file |          |      0 | edgar/data/65100/0000065103-95-000120.txt
 edgar/data/65100/0000065103-95-000120.txt   | 0000065103-95-000120.txt | f          |     | Complete submission text file |          |      0 | edgar/data/65100/0000065103-95-000120.txt
 edgar/data/909465/0000909465-95-000005.txt  | 0000909465-95-000005.txt | f          |     | Complete submission text file |          | 127586 | edgar/data/789625/0000909465-95-000005.txt
 edgar/data/789625/0000909465-95-000005.txt  | 0000909465-95-000005.txt | f          |     | Complete submission text file |          | 127586 | edgar/data/789625/0000909465-95-000005.txt
 edgar/data/887777/0000887777-03-000003.txt  | phar2-2003.txt           | f          |   1 |                               | SC 13G   |   8573 | edgar/data/887777/000088777703000003/phar2-2003.txt
 edgar/data/1072546/0000887777-03-000003.txt | phar2-2003.txt           | f          |   1 |                               | SC 13G   |   8573 | edgar/data/887777/000088777703000003/phar2-2003.txt
 edgar/data/1256394/0001256394-03-000004.txt | her2003.txt              | f          |   1 |                               | SC 13G   |   7014 | edgar/data/1021604/000125639403000004/her2003.txt
 edgar/data/1021604/0001256394-03-000004.txt | her2003.txt              | f          |   1 |                               | SC 13G   |   7014 | edgar/data/1021604/000125639403000004/her2003.txt
 edgar/data/1063085/0001172661-03-000034.txt | westcap1103.txt          | f          |   1 | FORM 13G HOLDING REPORT       | SC 13G   |    519 | edgar/data/1010614/000117266103000034/westcap1103.txt
 edgar/data/1010614/0001172661-03-000034.txt | westcap1103.txt          | f          |   1 | FORM 13G HOLDING REPORT       | SC 13G   |    519 | edgar/data/1010614/000117266103000034/westcap1103.txt
 edgar/data/1010614/0001172661-03-000035.txt | westcap11603.txt         | f          |   1 | FORM 13G HOLDINGS REPORT      | SC 13G   |    512 | edgar/data/722392/000117266103000035/westcap11603.txt
 edgar/data/1010614/0001172661-03-000050.txt | westcap1203.txt          | f          |   1 | FORM 13G FILING               | SC 13G   |    527 | edgar/data/870826/000117266103000050/westcap1203.txt
 edgar/data/1013149/0001172661-04-000001.txt | westcap0104.txt          | f          |   1 | FORM 13G HOLDINGS REPORT      | SC 13G   |    507 | edgar/data/1010614/000117266104000001/westcap0104.txt
 edgar/data/722392/0001172661-03-000035.txt  | westcap11603.txt         | f          |   1 | FORM 13G HOLDINGS REPORT      | SC 13G   |    512 | edgar/data/722392/000117266103000035/westcap11603.txt
 edgar/data/870826/0001172661-03-000050.txt  | westcap1203.txt          | f          |   1 | FORM 13G FILING               | SC 13G   |    527 | edgar/data/870826/000117266103000050/westcap1203.txt
 edgar/data/1010614/0001172661-04-000001.txt | westcap0104.txt          | f          |   1 | FORM 13G HOLDINGS REPORT      | SC 13G   |    507 | edgar/data/1010614/000117266104000001/westcap0104.txt
 edgar/data/1072006/0001072006-01-500007.txt | form13da-091701.txt      | f          |   1 | BLUE LAKE 13D/A               | SC 13D/A |   8231 | edgar/data/1000285/000107200601500007/form13da-091701.txt
 edgar/data/1072006/0001072006-01-500007.txt | 0001072006-01-500007.txt | f          |     | Complete submission text file |          |   9820 | edgar/data/1000285/000107200601500007/0001072006-01-500007.txt
 edgar/data/1000285/0001072006-01-500007.txt | form13da-091701.txt      | f          |   1 | BLUE LAKE 13D/A               | SC 13D/A |   8231 | edgar/data/1000285/000107200601500007/form13da-091701.txt
 edgar/data/1000285/0001072006-01-500007.txt | 0001072006-01-500007.txt | f          |     | Complete submission text file |          |   9820 | edgar/data/1000285/000107200601500007/0001072006-01-500007.txt
 edgar/data/37785/0000940180-01-500261.txt   | dsc13d.txt               | f          |   1 | SCHEDULE 13D                  | SC 13D   |  31868 | edgar/data/37785/000094018001500261/dsc13d.txt
 edgar/data/37785/0000940180-01-500261.txt   | 0000940180-01-500261.txt | f          |     | Complete submission text file |          |  33505 | edgar/data/37785/000094018001500261/0000940180-01-500261.txt
 edgar/data/906193/0000940180-01-500261.txt  | dsc13d.txt               | f          |   1 | SCHEDULE 13D                  | SC 13D   |  31868 | edgar/data/37785/000094018001500261/dsc13d.txt
 edgar/data/906193/0000940180-01-500261.txt  | 0000940180-01-500261.txt | f          |     | Complete submission text file |          |  33505 | edgar/data/37785/000094018001500261/0000940180-01-500261.txt
(24 rows)

I have checked these documents by path_alt, and they all lead to pages like this one, so they seem to be cases that genuinely should have downloaded set to false.

bdcallen commented 4 years ago

@iangow

crsp=# SELECT * FROM edgar.filing_docs_alt
WHERE path_alt ~ 'edgar/data/[0-9]+/[0-9]+/[0-9]{10}-[0-9]{2}-[0-9]{6}\.txt';
                  file_name                  |         document         | downloaded | seq |          description          | type |  size  |                            path_alt                            
---------------------------------------------+--------------------------+------------+-----+-------------------------------+------+--------+----------------------------------------------------------------
 edgar/data/1064122/0000806085-08-000112.txt | 0000806085-08-000112.txt | t          |     | Complete submission text file |      | 138677 | edgar/data/806085/000080608508000112/0000806085-08-000112.txt
 edgar/data/1072006/0001072006-01-500007.txt | 0001072006-01-500007.txt | f          |     | Complete submission text file |      |   9820 | edgar/data/1000285/000107200601500007/0001072006-01-500007.txt
 edgar/data/1000285/0001072006-01-500007.txt | 0001072006-01-500007.txt | f          |     | Complete submission text file |      |   9820 | edgar/data/1000285/000107200601500007/0001072006-01-500007.txt
 edgar/data/37785/0000940180-01-500261.txt   | 0000940180-01-500261.txt | f          |     | Complete submission text file |      |  33505 | edgar/data/37785/000094018001500261/0000940180-01-500261.txt
 edgar/data/906193/0000940180-01-500261.txt  | 0000940180-01-500261.txt | f          |     | Complete submission text file |      |  33505 | edgar/data/37785/000094018001500261/0000940180-01-500261.txt
(5 rows)

So there are 5 entries in which path_alt has the usual form for the link to the document, of which only these two

                  file_name                  |         document         | downloaded | seq |          description          | type |  size  |                            path_alt                            
---------------------------------------------+--------------------------+------------+-----+-------------------------------+------+--------+----------------------------------------------------------------
edgar/data/1000285/0001072006-01-500007.txt  | 0001072006-01-500007.txt | f          |     | Complete submission text file |      |   9820 | edgar/data/1000285/000107200601500007/0001072006-01-500007.txt
edgar/data/37785/0000940180-01-500261.txt    | 0000940180-01-500261.txt | f          |     | Complete submission text file |      |  33505 | edgar/data/37785/000094018001500261/0000940180-01-500261.txt

match the usual form derived from their file_name and document. It's worth mentioning that these two failed to download. So there is the possibility that cases with the usual form get written into filing_docs_alt. It seems, however, that these will be cases in which there is no document to download from the link, that there are relatively few of them, and that they simply correspond to reprocessing by the program that constructs filing_docs_alt. If we can accept this, I think we can close this issue. Otherwise, I could do one last commit to make sure these cases don't appear in the table.

iangow commented 4 years ago

@bdcallen I agree that this is fine.