mccgr / edgar

Code to manage data related to SEC EDGAR

Create filing_doc scrape code for test sample #94

Closed iangow closed 4 years ago

iangow commented 4 years ago

@bdcallen

Please prioritise this project. Please advise which days you think you can work on this this week. Please provide a daily update on the days on which you can work on it.

I have made a table comprising a sample of EDGAR filings from 2020. When I apply the download-filing code to these, most of them download, but some do not. It seems the issue is the one that filing_docs_alt was meant to address. However, filing_docs_alt has no entries for these filings.

The problem is that nothing is currently populating filing_docs_alt. Please create code that populates the tables edgar_test.filing_docs and edgar_test.filing_docs_alt for only the file_name values in edgar_test.test_sample. I will add some clarification shortly. (A sketch of one way to restrict the scrape to the sample follows the reprex below.)

library(DBI)
library(dplyr, warn.conflicts = FALSE)
library(parallel)

source(file.path("https://raw.githubusercontent.com",
                 "mccgr/edgar/master/filing_docs",
                 "download_filing_docs_functions.R"))

# Code to download the filing documents for the test sample

# Get list of files to download ----
pg <- dbConnect(RPostgres::Postgres())

rs <- dbExecute(pg, "SET search_path TO edgar_test, edgar")
filing_docs <- tbl(pg, "filing_docs_test")
filing_docs_alt <- tbl(pg, "filing_docs_alt")
filings <- tbl(pg, "filings")

filing_docs_full <-
    filing_docs %>%
    left_join(filing_docs_alt)
#> Joining, by = c("seq", "description", "document", "type", "size", "file_name")

dbExecute(pg, "DROP TABLE IF EXISTS test_sample")
#> [1] 0

test_sample <-
    filing_docs_full %>%
    inner_join(filings) %>%
    collect() %>%
    mutate(file_path = coalesce(path_alt, get_file_path(file_name, document))) %>%
    select(file_name, document, file_path) %>%
    copy_to(pg, ., name="test_sample", temporary = FALSE) %>%
    collect()
#> Joining, by = "file_name"

dbExecute(pg, "ALTER TABLE test_sample OWNER TO edgar")
#> [1] 0

dbDisconnect(pg)

# Download files from list ----
system.time(test_sample$downloaded <- unlist(mclapply(test_sample$file_path,
                                                      get_filing_docs, mc.cores = 12)))
#>    user  system elapsed 
#>   5.131   1.289  30.453

test_sample %>% count(downloaded)
#> # A tibble: 2 x 2
#>   downloaded     n
#>   <lgl>      <int>
#> 1 FALSE        254
#> 2 TRUE       26993

test_sample %>%
    filter(!downloaded) %>%
    select(file_name, document)
#> # A tibble: 254 x 2
#>    file_name                                   document        
#>    <chr>                                       <chr>           
#>  1 edgar/data/1000045/0001398344-20-005055.txt fp0051482_4.html
#>  2 edgar/data/1000045/0001398344-20-005566.txt fp0051764_4.html
#>  3 edgar/data/1000045/0001398344-20-012371.txt fp0054690_4.html
#>  4 edgar/data/1000045/0001398344-20-012371.txt fp0054690_4.html
#>  5 edgar/data/1000045/0001496701-20-000001.txt primary_doc.html
#>  6 edgar/data/1000045/0001496701-20-000002.txt primary_doc.html
#>  7 edgar/data/1000045/0001496701-20-000002.txt primary_doc.html
#>  8 edgar/data/1000045/0001794162-20-000001.txt primary_doc.html
#>  9 edgar/data/1000045/0001794162-20-000002.txt primary_doc.html
#> 10 edgar/data/1000045/0001794162-20-000002.txt primary_doc.html
#> # … with 244 more rows

Created on 2020-06-23 by the reprex package (v0.3.0)
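
As a starting point for the requested code, here is a minimal sketch of restricting the scrape to the sample: pull the distinct file_name values from test_sample, then apply filing_docs_df (the scraper in filing_docs/scrape_filing_docs_functions.R) to each. The wiring here is an assumption, not code from the repo.

library(DBI)
library(dplyr, warn.conflicts = FALSE)

pg <- dbConnect(RPostgres::Postgres())
rs <- dbExecute(pg, "SET search_path TO edgar_test, edgar")

# Sketch: the to-do list of filings to scrape. filing_docs_df() would then be
# applied to each file_name and the results written to edgar_test.filing_docs.
to_scrape <-
    tbl(pg, "test_sample") %>%
    distinct(file_name) %>%
    collect()

dbDisconnect(pg)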

iangow commented 4 years ago

@bdcallen Please use git pull --rebase to grab the latest code before proceeding. I want to get the filing_doc scraping and download code working on the test sample before applying it more broadly.

This relates to #39. We want to create a version of edgar_test.filing_docs_test that can support downloading.

For now, I note simply that edgar_test.filing_docs_test has many, many duplicate rows. This needs to be fixed first.
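
A minimal sketch of one way to fix this in place (the DISTINCT-into-a-new-table approach is an assumption, not the repo's code):

library(DBI)

pg <- dbConnect(RPostgres::Postgres())

# Sketch: rebuild the table keeping one copy of each row
dbExecute(pg, "
    CREATE TABLE edgar_test.filing_docs_test_dedup AS
    SELECT DISTINCT * FROM edgar_test.filing_docs_test")

dbDisconnect(pg)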

iangow commented 4 years ago

@bdcallen I have updated the reprex above so that it uses the current data in edgar_test.filing_docs_test. I think the next steps are:

  1. Fix the code creating edgar_test.filing_docs_test (filing_docs/scrape_filing_docs_test.R) to remove duplicates.
  2. Work out how to populate edgar_test.filing_docs_test (perhaps using the "_alt_" code) with the correct URLs. Using the first example at the bottom of the reprex above, you may recall that the issue is that the URL for fp0051482_4.html is not simply https://www.sec.gov/Archives/edgar/data/1000045/000139834420005055/fp0051482_4.html, but is instead found here to be https://www.sec.gov/Archives/edgar/data/1000045/000139834420005055/xslF345X03/fp0051482_4.xml; a sketch of the usual construction follows this list. I believe you wrote code that extracts these URLs, but I am not sure where it is.
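
For reference, here is a minimal sketch of that "usual" URL construction (get_default_url is a hypothetical name; the same logic appears as get_filing_doc_url later in this thread). The exceptional cases are exactly those where the index page's href differs from this.

# Sketch: the document URL we would construct by default from file_name and
# document; problem filings are those whose index page links elsewhere
# (e.g., under xslF345X03/)
get_default_url <- function(file_name, document) {
    # Strip the dashes and ".txt" from the accession number to get the folder
    folder <- gsub("(\\d{10})-(\\d{2})-(\\d{6})\\.txt", "\\1\\2\\3", file_name)
    paste("https://www.sec.gov/Archives", folder, document, sep = "/")
}

get_default_url("edgar/data/1000045/0001398344-20-005055.txt", "fp0051482_4.html")
#> [1] "https://www.sec.gov/Archives/edgar/data/1000045/000139834420005055/fp0051482_4.html"
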
bdcallen commented 4 years ago

@iangow

I believe you wrote code that extracts these URLs, but I am not sure where it is.

It is in the file filing_docs/download_filing_doc_exceptions.R. I wrote it precisely to download the exceptional cases. I wrote and ran it when I was getting the full set of documents for Schedules 13D and 13G earlier in the year, and I have not run it since, which explains why the vast majority of cases from 2020 are not in the table.

bdcallen commented 4 years ago

@iangow

Please advise which days you think you can work on this this week. Please provide a daily update on the days on which you can work on it.

I will work on this for the rest of the week, or until it is done. I will be working Wednesday, Thursday and Friday.

bdcallen commented 4 years ago

@iangow Sorry, I'm not following how edgar_test.filing_docs_test was made. Could you clarify this? I'm trying to work out the reason for the duplication, and I want to make sure it wasn't caused by the generation process.

bdcallen commented 4 years ago

@iangow

crsp=# SELECT COUNT(DISTINCT(file_name, document))
crsp-# FROM edgar_test.filing_docs_test;
 count
-------
   983
(1 row)

crsp=# SELECT COUNT(DISTINCT(file_name))
FROM edgar_test.filing_docs_test;
 count
-------
   206
(1 row)

How did you choose your sample? What was your initial random sample size? Did you sample with or without replacement?

iangow commented 4 years ago

See here.

bdcallen commented 4 years ago

Also, where is test_sample?

crsp=# SELECT * FROM edgar.test_sample;
ERROR:  relation "edgar.test_sample" does not exist
LINE 1: SELECT * FROM edgar.test_sample;
                      ^
crsp=# SELECT * FROM edgar_test.test_sample;
ERROR:  relation "edgar_test.test_sample" does not exist
LINE 1: SELECT * FROM edgar_test.test_sample;
                      ^
crsp=# SELECT * FROM public.test_sample;
ERROR:  relation "public.test_sample" does not exist
LINE 1: SELECT * FROM public.test_sample;

iangow commented 4 years ago

And test_sample comes from here. The idea is that test_sample is just a grab of 1,000 rows. It may be that this should really be 1,000 rows grabbed from filings.

If test_sample is the source of duplicates, then just tweak the code that creates it to eliminate them (one possible approach is sketched below).
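
For instance, a duplicate-free random sample could be drawn server-side, as in this sketch (ORDER BY random() is standard PostgreSQL; this is an illustration, not the repo's code):

library(DBI)

pg <- dbConnect(RPostgres::Postgres())

# Sketch: 1,000 distinct file_name values from 2020 filings, in random order
test_sample <- dbGetQuery(pg, "
    SELECT file_name
    FROM (SELECT DISTINCT file_name
          FROM edgar.filings
          WHERE date_filed >= '2020-01-01') AS f
    ORDER BY random()
    LIMIT 1000")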

iangow commented 4 years ago

The test_sample may have been deleted. So just re-create it using the code supplied (perhaps eliminating duplicates if necessary).

What I want to do is have a test sample to ensure that our filing_docs and download-filings code is working properly before applying that to a larger sample.

Regarding filing_docs, we should not have separate "exceptions" code. The main code for filing_docs should examine the URLs for documents and populate a field with those URLs (perhaps only when they differ from the URL we'd construct from file_name and document).

iangow commented 4 years ago

I think the following code:

test_sample <-
    filing_docs_full %>%
    inner_join(filings) %>%
    filter(date_filed >= "2020-01-01") %>%
    collect(n=1000) %>%
    mutate(file_path = coalesce(path_alt, get_file_path(file_name, document))) %>%
    select(file_name, document, file_path) %>%
    copy_to(pg, ., name="test_sample", temporary = FALSE) %>%
    collect()

probably should be

test_sample <-
    filings %>%
    filter(date_filed >= "2020-01-01") %>%
    collect(n=1000) %>%
    select(file_name) %>%
    copy_to(pg, ., name="test_sample", temporary = FALSE) %>%
    collect()

The downstream code using this table only uses file_name. I'd guess that the earlier code is the source of the duplicates: joining filing_docs_full to filings yields one row per document (and a many-to-many match would multiply rows further), so the same file_name appears many times.

iangow commented 4 years ago

The idea is as follows:

  1. Generate test sample (1,000 filings from 2020).
  2. Apply filing_docs code to test sample.
  3. Apply download-filings code to test sample filing_docs.

Your work is needed in Step 2. The filing_docs code should fetch the correct URL for all documents. I think we may want to evaluate the option of not storing the URL when we can infer it from file_name and document, but for now it is fine if you populate that column with values even when we can so infer.

iangow commented 4 years ago

If you are in any way confused by test_sample, put this aside for now. I can clarify and repost the issue. I would rather you focus on fixing the filing_docs code on a clean sample rather than trying to understand where the sample comes from.

iangow commented 4 years ago

Note that /media/igow/2TB seems to be full, so the database is down. I am moving stuff to /media/igow/6TB to get things working again.

bdcallen commented 4 years ago

@iangow We could change filing_docs_df inside scrape_filing_docs_functions.R to this:

#!/usr/bin/env Rscript
library(parallel)
library(rvest)
#> Loading required package: xml2
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

# Functions ----
# Construct the URL of a filing's index page from its EDGAR file_name
get_index_url <- function(file_name) {
    matches <- stringr::str_match(file_name, "/(\\d+)/(\\d{10}-\\d{2}-\\d{6})")
    cik <- matches[2]
    acc_no <- matches[3]
    path <- stringr::str_replace_all(acc_no, "[^\\d]", "")

    url <- paste0("https://www.sec.gov/Archives/edgar/data/", cik, "/", path, "/",
                  acc_no, "-index.htm")
    return(url)
}

# Parse each table node, coercing Type to character so the tables bind cleanly
html_table_mod <- function(table) {
    lapply(html_table(table), function(x) mutate(x, Type = as.character(Type)))
}

fix_names <- function(df) {
    colnames(df) <- tolower(colnames(df))
    df
}

# Scrape a filing's document table (and each document's href) from its index page
filing_docs_df <- function(file_name) {
    head_url <- get_index_url(file_name)

    table_nodes <-
        try({
            read_html(head_url, encoding="Latin1") %>%
                html_nodes("table")
        })

    if (length(table_nodes) < 1 | is(table_nodes, "try-error")) {
        df <- tibble(seq = NA, description = NA, document = NA, type = NA,
                     size = NA, file_name = file_name)
        return(df)
    } else {
        filing_doc_table_indices <-
            which(table_nodes %>% html_attr("class") == "tableFile")

        file_tables <- table_nodes[filing_doc_table_indices]
        df <-
            file_tables %>%
            html_table_mod() %>%
            bind_rows() %>%
            fix_names() %>%
            mutate(file_name = file_name,
                   type = as.character(type),
                   description = as.character(description))

        df$url <- file_tables %>% html_nodes(xpath = 'tr/td/a[@href]') %>%
                    html_attr('href') %>% paste0('https://www.sec.gov', .)

        colnames(df) <- tolower(colnames(df))
    }

    return(df)
}

filing_docs_df('edgar/data/1000045/0001398344-20-005055.txt')
#>   seq                   description                 document type  size
#> 1   1                                       fp0051482_4.html    4    NA
#> 2   1                                        fp0051482_4.xml    4 10969
#> 3  NA Complete submission text file 0001398344-20-005055.txt <NA> 13503
#>                                     file_name
#> 1 edgar/data/1000045/0001398344-20-005055.txt
#> 2 edgar/data/1000045/0001398344-20-005055.txt
#> 3 edgar/data/1000045/0001398344-20-005055.txt
#>                                                                                             url
#> 1 https://www.sec.gov/Archives/edgar/data/1000045/000139834420005055/xslF345X03/fp0051482_4.xml
#> 2            https://www.sec.gov/Archives/edgar/data/1000045/000139834420005055/fp0051482_4.xml
#> 3   https://www.sec.gov/Archives/edgar/data/1000045/000139834420005055/0001398344-20-005055.txt

Created on 2020-06-24 by the reprex package (v0.3.0)

We have probably discussed this idea before (in fact, I have a function filing_docs_df_with_href which is very similar, though written slightly less elegantly). If we were to write filing_docs over again to account for the different URLs, this would be the easiest way to do it, since the hrefs are simply scraped along with the data collected by html_table in the line I added. The issue, though, is that we would rather not have to reprocess filing_docs, as it is a very large table (which is why I went the filing_docs_alt route in the first place).

iangow commented 4 years ago

OK. Let me create a new test_sample. Then you would tweak the code to produce a filing_docs_test along the lines of what you have above.

I think we might be able to get away with simply adding a text field url to the existing table (with NULL in all existing rows) and there would be no need to reprocess anything until we identify "errors" in filing_docs by trying to download filings from bad URLs.

I think we might write the code so that https://www.sec.gov/Archives/edgar/data/1000045/000139834420005055/0001398344-20-005055.txt (third row in your test sample above) doesn't need to be stored in the table (we can infer this from file_name and document). It would be good to do all this on a test sample to see how much space the different options use. (My belief is that a mostly empty text field should not use a lot of space in PostgreSQL.)
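
The schema change itself is small. A sketch, assuming the production table is edgar.filing_docs:

library(DBI)

pg <- dbConnect(RPostgres::Postgres())

# Sketch: add a nullable text column; existing rows get NULL automatically,
# and NULLs cost essentially nothing in storage
dbExecute(pg, "ALTER TABLE edgar.filing_docs ADD COLUMN url text")

dbDisconnect(pg)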

iangow commented 4 years ago

OK. This is what I did:

  1. Made test_sample by running core/create_test_sample.R.
  2. Scraped filing_docs by running filing_docs/scrape_filing_docs_test.R.
  3. Downloaded documents by running core/download_filing_docs_test.R.

Recall that the goal is to make sure that we can download everything available by (if necessary) tweaking the code used in Step 2. For now, just focus on getting the table working as well as possible as a one-shot run and don't worry about size of the table, etc.

I didn't run this as a reprex, but this is what I get for the third step:

Joining, by = c("document", "file_name")
[1] "Getting files..."
Error in download.file(url = link, destfile = local_filename, quiet = TRUE) : 
  cannot open URL 'https://www.sec.gov/Archives/edgar/data/1027596/000089418920001547/semper485bxtmarch20201.htm'
In addition: Warning message:
In download.file(url = link, destfile = local_filename, quiet = TRUE) :
  cannot open URL 'https://www.sec.gov/Archives/edgar/data/1027596/000089418920001547/semper485bxtmarch20201.htm': HTTP status was '503 Service Unavailable'
Error in download.file(url = link, destfile = local_filename, quiet = TRUE) : 
  cannot open URL 'https://www.sec.gov/Archives/edgar/data/819793/000089109220002167/ain10kex10uvi.htm'
In addition: Warning message:
In download.file(url = link, destfile = local_filename, quiet = TRUE) :
  cannot open URL 'https://www.sec.gov/Archives/edgar/data/819793/000089109220002167/ain10kex10uvi.htm': HTTP status was '503 Service Unavailable'
Error in download.file(url = link, destfile = local_filename, quiet = TRUE) : 
  cannot open URL 'https://www.sec.gov/Archives/edgar/data/899866/000089986620000002/jpm2020deck11jan2020fbda.htm'
In addition: Warning message:
In download.file(url = link, destfile = local_filename, quiet = TRUE) :
  cannot open URL 'https://www.sec.gov/Archives/edgar/data/899866/000089986620000002/jpm2020deck11jan2020fbda.htm': HTTP status was '503 Service Unavailable'
[1] "Downloaded 3264 files in 1256.71 seconds"
Joining, by = c("document", "file_name")

To be honest, I am not sure that these failed to download because the URLs did not follow the usual form; it may be some other issue. But if you (@bdcallen) could investigate and tweak the code to incorporate the fixes we discussed, that would be great.
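
One possibility (an assumption, not something established above) is transient throttling by the SEC server. A retry wrapper along these lines could help distinguish genuinely bad URLs from intermittent 503s; download_with_retry is a hypothetical helper, not code from the repo.

# Sketch: retry a download a few times before concluding the URL is bad
download_with_retry <- function(url, destfile, tries = 3, wait = 5) {
    for (i in seq_len(tries)) {
        result <- try(download.file(url = url, destfile = destfile, quiet = TRUE),
                      silent = TRUE)
        if (!inherits(result, "try-error") && result == 0) return(TRUE)
        Sys.sleep(wait)  # back off before the next attempt
    }
    FALSE
}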

bdcallen commented 4 years ago

@iangow OK, I've renamed filing_docs_test to filing_docs_test_old for now, and then I ran scrape_filing_docs_test.R with filing_docs_df set to what I wrote above:

bdcallen@igow-z640:~/edgar$ Rscript filing_docs/scrape_filing_docs_test.R
Loading required package: xml2
Processing batch 1 
Error in open.connection(x, "rb") : 
  Timeout was reached: [www.sec.gov] Connection timed out after 10001 milliseconds
Writing data ...
76.16045 seconds
Processing batch 2 
Writing data ...
68.46953 seconds
Processing batch 3 
Writing data ...
69.66006 seconds

I had one HTML connection error, so I did the following:

crsp=# SELECT COUNT(*) FROM edgar_test.filing_docs_test
crsp-# WHERE document IS NULL;
 count
-------
     1
(1 row)

crsp=# DELETE FROM edgar_test.filing_docs_test
WHERE document IS NULL;
DELETE 1
crsp=# SELECT COUNT(DISTINCT(file_name)) FROM edgar_test.filing_docs_test;
 count
-------
  2999
(1 row)

and then ran the program again

bdcallen@igow-z640:~/edgar$ Rscript filing_docs/scrape_filing_docs_test.R
Loading required package: xml2
Processing batch 1 
Writing data ...
0.8023918 seconds

and then the raw numbers looked good, with no duplicates

crsp=# SELECT COUNT(DISTINCT(file_name)) FROM edgar_test.filing_docs_test;
 count
-------
  3000
(1 row)

crsp=# SELECT COUNT(*) FROM edgar_test.filing_docs_test;
 count
-------
 16171
(1 row)

crsp=# SELECT COUNT(DISTINCT(file_name, document))
FROM edgar_test.filing_docs_test;
 count
-------
 16171
(1 row)

Also, the table has turned out with the expected form:

crsp=# SELECT * FROM edgar_test.filing_docs_test LIMIT 10;
 seq |          description          |         document         |    type    |  size   |                 file_name                  |                                            url
-----+-------------------------------+--------------------------+------------+---------+--------------------------------------------+--------------------------------------------------------------------------------------------
   1 | FORM 10-Q                     | e1679_10q.htm            | 10-Q       |  258311 | edgar/data/352991/0001731122-20-000065.txt | https://www.sec.gov/Archives/edgar/data/352991/000173112220000065/e1679_10q.htm
   2 | EXHIBIT 31.1                  | e1679_ex311.htm          | EX-31.1    |   10524 | edgar/data/352991/0001731122-20-000065.txt | https://www.sec.gov/Archives/edgar/data/352991/000173112220000065/e1679_ex311.htm
   3 | EXHIBIT 31.2                  | e1679_ex312.htm          | EX-31.2    |   10481 | edgar/data/352991/0001731122-20-000065.txt | https://www.sec.gov/Archives/edgar/data/352991/000173112220000065/e1679_ex312.htm
   4 | EXHIBIT 32.1                  | e1679_ex321.htm          | EX-32.1    |    4572 | edgar/data/352991/0001731122-20-000065.txt | https://www.sec.gov/Archives/edgar/data/352991/000173112220000065/e1679_ex321.htm
   5 | EXHIBIT 32.2                  | e1679_ex322.htm          | EX-32.2    |    4504 | edgar/data/352991/0001731122-20-000065.txt | https://www.sec.gov/Archives/edgar/data/352991/000173112220000065/e1679_ex322.htm
     | Complete submission text file | 0001731122-20-000065.txt |            | 1550208 | edgar/data/352991/0001731122-20-000065.txt | https://www.sec.gov/Archives/edgar/data/352991/000173112220000065/0001731122-20-000065.txt
   6 | XBRL INSTANCE FILE            | aoxy-20191231.xml        | EX-101.INS |  178333 | edgar/data/352991/0001731122-20-000065.txt | https://www.sec.gov/Archives/edgar/data/352991/000173112220000065/aoxy-20191231.xml
   7 | XBRL SCHEMA FILE              | aoxy-20191231.xsd        | EX-101.SCH |   19864 | edgar/data/352991/0001731122-20-000065.txt | https://www.sec.gov/Archives/edgar/data/352991/000173112220000065/aoxy-20191231.xsd
   8 | XBRL CALCULATION FILE         | aoxy-20191231_cal.xml    | EX-101.CAL |   33901 | edgar/data/352991/0001731122-20-000065.txt | https://www.sec.gov/Archives/edgar/data/352991/000173112220000065/aoxy-20191231_cal.xml
   9 | XBRL DEFINITION FILE          | aoxy-20191231_def.xml    | EX-101.DEF |   74242 | edgar/data/352991/0001731122-20-000065.txt | https://www.sec.gov/Archives/edgar/data/352991/000173112220000065/aoxy-20191231_def.xml
(10 rows)

bdcallen commented 4 years ago

@iangow

I think we might be able to get away with simply adding a text field url to the existing table (with NULL in all existing rows) and there would be no need to reprocess anything until we identify "errors" in filing_docs by trying to download filings from bad URLs.

Here's a version that could work with what you suggested.

#!/usr/bin/env Rscript
library(parallel)
library(rvest)
#> Loading required package: xml2
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

# Functions ----
get_index_url <- function(file_name) {
    matches <- stringr::str_match(file_name, "/(\\d+)/(\\d{10}-\\d{2}-\\d{6})")
    cik <- matches[2]
    acc_no <- matches[3]
    path <- stringr::str_replace_all(acc_no, "[^\\d]", "")

    url <- paste0("https://www.sec.gov/Archives/edgar/data/", cik, "/", path, "/",
                  acc_no, "-index.htm")
    return(url)
}

html_table_mod <- function(table) {
    lapply(html_table(table), function(x) mutate(x, Type = as.character(Type)))
}

fix_names <- function(df) {
    colnames(df) <- tolower(colnames(df))
    df
}

get_filing_doc_url <- function(file_name, document) {
    # The URL implied by file_name and document (the "usual" form)
    url <- paste('https://www.sec.gov/Archives',
                 gsub("(\\d{10})-(\\d{2})-(\\d{6})\\.txt", "\\1\\2\\3", file_name),
                 document, sep = '/')
    return(url)
}

filing_docs_df <- function(file_name) {
    head_url <- get_index_url(file_name)

    table_nodes <-
        try({
            read_html(head_url, encoding="Latin1") %>%
                html_nodes("table")
        })

    if (length(table_nodes) < 1 | is(table_nodes, "try-error")) {
        df <- tibble(seq = NA, description = NA, document = NA, type = NA,
                     size = NA, file_name = file_name)
        return(df)
    } else {
        filing_doc_table_indices <-
            which(table_nodes %>% html_attr("class") == "tableFile")

        file_tables <- table_nodes[filing_doc_table_indices]
        df <-
            file_tables %>%
            html_table_mod() %>%
            bind_rows() %>%
            fix_names() %>%
            mutate(file_name = file_name,
                   type = as.character(type),
                   description = as.character(description))

        df$url <- file_tables %>% html_nodes(xpath = 'tr/td/a[@href]') %>%
                    html_attr('href') %>% paste0('https://www.sec.gov', .)

        norm_url <- get_filing_doc_url(df$file_name, df$document)

        df$url[df$url == norm_url] <- NA

        colnames(df) <- tolower(colnames(df))
    }

    return(df)
}

filing_docs_df('edgar/data/1000209/0001564590-20-014200.txt')
#>   seq                   description                 document    type   size
#> 1   1                         8-K/A    mfin-8ka_20200220.htm   8-K/A 155056
#> 2   2                       EX-99.1        mfin-ex991_52.htm EX-99.1  41478
#> 3   3                       GRAPHIC   gicsxsdjv1no000001.jpg GRAPHIC   1797
#> 4   4                       GRAPHIC   gicsxsdjv1no000003.jpg GRAPHIC  93885
#> 5   5                       GRAPHIC   gicsxsdjv1no000002.jpg GRAPHIC 100369
#> 6   6                       GRAPHIC   goy21vsscben000001.jpg GRAPHIC  75495
#> 7  NA Complete submission text file 0001564590-20-014200.txt         572198
#>                                     file_name  url
#> 1 edgar/data/1000209/0001564590-20-014200.txt <NA>
#> 2 edgar/data/1000209/0001564590-20-014200.txt <NA>
#> 3 edgar/data/1000209/0001564590-20-014200.txt <NA>
#> 4 edgar/data/1000209/0001564590-20-014200.txt <NA>
#> 5 edgar/data/1000209/0001564590-20-014200.txt <NA>
#> 6 edgar/data/1000209/0001564590-20-014200.txt <NA>
#> 7 edgar/data/1000209/0001564590-20-014200.txt <NA>

filing_docs_df('edgar/data/1000045/0001398344-20-005055.txt')
#>   seq                   description                 document type  size
#> 1   1                                       fp0051482_4.html    4    NA
#> 2   1                                        fp0051482_4.xml    4 10969
#> 3  NA Complete submission text file 0001398344-20-005055.txt <NA> 13503
#>                                     file_name
#> 1 edgar/data/1000045/0001398344-20-005055.txt
#> 2 edgar/data/1000045/0001398344-20-005055.txt
#> 3 edgar/data/1000045/0001398344-20-005055.txt
#>                                                                                             url
#> 1 https://www.sec.gov/Archives/edgar/data/1000045/000139834420005055/xslF345X03/fp0051482_4.xml
#> 2                                                                                          <NA>
#> 3                                                                                          <NA>

Created on 2020-06-24 by the reprex package (v0.3.0)

So we could run over the whole set of filing_docs, identify the ones which have a document with an alternative URL, then reprocess them.

iangow commented 4 years ago

OK. The issue is that the table is almost double the size.

igow@igow-z640:~$ psql
psql (12.3 (Ubuntu 12.3-1.pgdg18.04+1), server 11.8 (Ubuntu 11.8-1.pgdg18.04+1))
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, bits: 256, compression: off)
Type "help" for help.

crsp=# \dt+ edgar_test.filing_docs_*
                                 List of relations
   Schema   |            Name            | Type  |  Owner   |  Size   | Description 
------------+----------------------------+-------+----------+---------+-------------
 edgar_test | filing_docs_processed_test | table | edgar    | 456 kB  | 
 edgar_test | filing_docs_test           | table | bdcallen | 3456 kB | 
 edgar_test | filing_docs_test_old       | table | igow     | 2016 kB | 
(3 rows)

So we could run over the whole set of filing_docs, identify the ones which have a document with an alternative URL, then reprocess them.

Not for now. Reprocessing filing_docs would take too long.

iangow commented 4 years ago

Most of the file size increase can be eliminated by (1) eliminating url values that can be inferred and (2) removing the extraneous elements of the URL (I kept everything after Archives to be consistent with file_name).

crsp=# \dt+ edgar_test.filing_docs_*
                                 List of relations
   Schema   |            Name            | Type  |  Owner   |  Size   | Description 
------------+----------------------------+-------+----------+---------+-------------
 edgar_test | filing_docs_processed_test | table | edgar    | 456 kB  | 
 edgar_test | filing_docs_test           | table | bdcallen | 3456 kB | 
 edgar_test | filing_docs_test_new       | table | igow     | 2192 kB | 
 edgar_test | filing_docs_test_old       | table | igow     | 2016 kB | 
(4 rows)

Here is code to make a "new" table out of the one you have. It would be great if you could incorporate the logic of this into the code for making the table in the first place (you might not literally use this code, of course). Then this code could become the "main" code for filing_docs. We'd just need to add a column url to that table and keep adding to it. Some of the values in the existing table would be "wrong" (i.e., the URLs generated from file_name and document would not be valid), but we could address those when identified by deleting the associated entries (by file_name) from filing_docs and filing_docs_processed, which would trigger reprocessing the next time the cron job runs (a sketch of this repair follows the reprex below).

For now, just prepare the code. Actually implementing it will be a separate issue.

library(dplyr, warn.conflicts = FALSE)
library(DBI)

pg <- dbConnect(RPostgres::Postgres())

rs <- dbExecute(pg, "SET search_path TO edgar_test")

filing_docs_test <- tbl(pg, "filing_docs_test")

get_file_path <- function(file_name, document) {
    url <- gsub("(\\d{10})-(\\d{2})-(\\d{6})\\.txt", "\\1\\2\\3", file_name)
    file.path(url, document)
}

filing_docs_test_new <-
    filing_docs_test %>%
    collect() %>%
    mutate(url_alt = paste0("https://www.sec.gov/Archives/", get_file_path(file_name, document))) %>%
    mutate(url = if_else(url == url_alt, NA_character_, url)) %>%
    mutate(url = gsub("https://www.sec.gov/Archives/", "", url)) %>%
    select(-url_alt) %>%
    copy_to(pg, ., name = "filing_docs_test_new",
            temporary = FALSE, overwrite=TRUE)

filing_docs_test_new %>% 
    anti_join(filing_docs_test, by = c("file_name", "document"))
#> # Source:   lazy query [?? x 7]
#> # Database: postgres [igow@localhost:5432/crsp]
#> # … with 7 variables: seq <int>, description <chr>, document <chr>, type <chr>,
#> #   size <int>, file_name <chr>, url <chr>

filing_docs_test %>% 
    anti_join(filing_docs_test_new, by = c("file_name", "document"))
#> # Source:   lazy query [?? x 7]
#> # Database: postgres [igow@localhost:5432/crsp]
#> # … with 7 variables: seq <int>, description <chr>, document <chr>, type <chr>,
#> #   size <int>, file_name <chr>, url <chr>

Created on 2020-06-25 by the reprex package (v0.3.0)
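
As an aside, here is a sketch of the delete-to-trigger-reprocessing repair described above; bad_file_names is a hypothetical character vector of filings whose constructed URLs turned out to be invalid.

library(DBI)

pg <- dbConnect(RPostgres::Postgres())

# Sketch: removing a filing's rows from both tables makes the cron job
# re-scrape that filing on its next run
for (fn in bad_file_names) {
    dbExecute(pg, "DELETE FROM edgar.filing_docs WHERE file_name = $1",
              params = list(fn))
    dbExecute(pg, "DELETE FROM edgar.filing_docs_processed WHERE file_name = $1",
              params = list(fn))
}

dbDisconnect(pg)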

bdcallen commented 4 years ago

@iangow Yes, I anticipated the lower disk usage with the above form of the table: a NULL is recorded in the row's null bitmap rather than stored as a string, so it takes essentially no space, as opposed to one byte per character for a full URL. The code in my latter reprex sets url to NA for those entries that have the usual form, so adding a small step that strips the leading 'Archives/' portion from the hyperlinks afterwards should mean that the code generates filing_docs_test_new directly from the sample.

bdcallen commented 4 years ago

@iangow This should do the trick

#!/usr/bin/env Rscript
library(parallel)
library(rvest)
#> Loading required package: xml2
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

# Functions ----
get_index_url <- function(file_name) {
    matches <- stringr::str_match(file_name, "/(\\d+)/(\\d{10}-\\d{2}-\\d{6})")
    cik <- matches[2]
    acc_no <- matches[3]
    path <- stringr::str_replace_all(acc_no, "[^\\d]", "")

    url <- paste0("https://www.sec.gov/Archives/edgar/data/", cik, "/", path, "/",
                  acc_no, "-index.htm")
    return(url)
}

html_table_mod <- function(table) {
    lapply(html_table(table), function(x) mutate(x, Type = as.character(Type)))
}

fix_names <- function(df) {
    colnames(df) <- tolower(colnames(df))
    df
}

get_filing_doc_url <- function(file_name, document) {
    url <- paste('https://www.sec.gov/Archives',
                 gsub("(\\d{10})-(\\d{2})-(\\d{6})\\.txt", "\\1\\2\\3", file_name),
                 document, sep = '/')
    return(url)
}

filing_docs_df <- function(file_name) {
    head_url <- get_index_url(file_name)

    table_nodes <-
        try({
            read_html(head_url, encoding="Latin1") %>%
                html_nodes("table")
        })

    if (length(table_nodes) < 1 | is(table_nodes, "try-error")) {
        df <- tibble(seq = NA, description = NA, document = NA, type = NA,
                     size = NA, file_name = file_name)
        return(df)
    } else {
        filing_doc_table_indices <-
            which(table_nodes %>% html_attr("class") == "tableFile")

        file_tables <- table_nodes[filing_doc_table_indices]
        df <-
            file_tables %>%
            html_table_mod() %>%
            bind_rows() %>%
            fix_names() %>%
            mutate(file_name = file_name,
                   type = as.character(type),
                   description = as.character(description))

        df$url <- file_tables %>% html_nodes(xpath = 'tr/td/a[@href]') %>%
                    html_attr('href') %>% stringr::str_replace('^/Archives/', '')

        url_full <- paste0('https://www.sec.gov/Archives/', df$url)

        norm_url <- get_filing_doc_url(df$file_name, df$document)

        df$url[url_full == norm_url] <- NA

        colnames(df) <- tolower(colnames(df))
    }

    return(df)
}

filing_docs_df('edgar/data/1000209/0001564590-20-014200.txt')
#>   seq                   description                 document    type   size
#> 1   1                         8-K/A    mfin-8ka_20200220.htm   8-K/A 155056
#> 2   2                       EX-99.1        mfin-ex991_52.htm EX-99.1  41478
#> 3   3                       GRAPHIC   gicsxsdjv1no000001.jpg GRAPHIC   1797
#> 4   4                       GRAPHIC   gicsxsdjv1no000003.jpg GRAPHIC  93885
#> 5   5                       GRAPHIC   gicsxsdjv1no000002.jpg GRAPHIC 100369
#> 6   6                       GRAPHIC   goy21vsscben000001.jpg GRAPHIC  75495
#> 7  NA Complete submission text file 0001564590-20-014200.txt         572198
#>                                     file_name  url
#> 1 edgar/data/1000209/0001564590-20-014200.txt <NA>
#> 2 edgar/data/1000209/0001564590-20-014200.txt <NA>
#> 3 edgar/data/1000209/0001564590-20-014200.txt <NA>
#> 4 edgar/data/1000209/0001564590-20-014200.txt <NA>
#> 5 edgar/data/1000209/0001564590-20-014200.txt <NA>
#> 6 edgar/data/1000209/0001564590-20-014200.txt <NA>
#> 7 edgar/data/1000209/0001564590-20-014200.txt <NA>

filing_docs_df('edgar/data/1000045/0001398344-20-005055.txt')
#>   seq                   description                 document type  size
#> 1   1                                       fp0051482_4.html    4    NA
#> 2   1                                        fp0051482_4.xml    4 10969
#> 3  NA Complete submission text file 0001398344-20-005055.txt <NA> 13503
#>                                     file_name
#> 1 edgar/data/1000045/0001398344-20-005055.txt
#> 2 edgar/data/1000045/0001398344-20-005055.txt
#> 3 edgar/data/1000045/0001398344-20-005055.txt
#>                                                                url
#> 1 edgar/data/1000045/000139834420005055/xslF345X03/fp0051482_4.xml
#> 2                                                             <NA>
#> 3                                                             <NA>

Created on 2020-06-25 by the reprex package (v0.3.0)

bdcallen commented 4 years ago

@iangow Also, are we intending to still have a filing_docs_processed table? Or are we going to add a downloaded column, initialized to FALSE for new entries, to filing_docs? The latter could feasibly be done with some modest rewriting of the code for Step 3.

iangow commented 4 years ago

This should do the trick

OK. So please run the code and check that it does do the trick, then commit and push the code.

Also, are we intending to still have a filing_docs_processed table?

Perhaps not. But let's finish this issue before worrying about that.

iangow commented 4 years ago

Or are we going to add a downloaded column, initialized to FALSE for new entries, to filing_docs?

I would rather not do that because, as it is, the filing_docs table is machine-independent. So it can be backed-up and moved from one database to another. The moment you put a downloaded column in the table, the table becomes tied to the machine (and more). The reality is that the filing_docs_processed table is not very helpful, as the download code can check the file system about as quickly as PostgreSQL can check the filing_docs_processed table.
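
By way of illustration, a sketch of the file-system check; raw_directory and already_downloaded are hypothetical names, and the root path is an assumption.

# Sketch: instead of consulting filing_docs_processed, test whether each
# document already exists on disk under a local root directory
raw_directory <- "/media/igow/6TB/edgar"

already_downloaded <- function(file_paths) {
    file.exists(file.path(raw_directory, file_paths))
}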

bdcallen commented 4 years ago

@iangow

I would rather not do that because, as it is, the filing_docs table is machine-independent. So it can be backed-up and moved from one database to another. The moment you put a downloaded column in the table, the table becomes tied to the machine (and more). The reality is that the filing_docs_processed table is not very helpful, as the download code can check the file system about as quickly as PostgreSQL can check the filing_docs_processed table.

Could you supply a reference for this? This is something I didn't know about Postgres.

bdcallen commented 4 years ago

@iangow So I dropped filing_docs_test, then reran the code with the updated form of filing_docs_df:

bdcallen@igow-z640:~/edgar$ Rscript filing_docs/scrape_filing_docs_test.R
Loading required package: xml2
Processing batch 1 
Writing data ...
76.4425 seconds
Processing batch 2 
Writing data ...
80.15625 seconds
Processing batch 3 
Writing data ...
77.05907 seconds

and got

crsp=# SELECT COUNT(*) FROM edgar_test.filing_docs_test
WHERE url IS NOT NULL;
 count
-------
  3033
(1 row)

crsp=# SELECT COUNT(*) FROM edgar_test.filing_docs_test_new
WHERE url IS NOT NULL;
 count
-------
  3033
(1 row)

crsp=# SELECT COUNT(*) FROM edgar_test.filing_docs_test;
 count
-------
 16171
(1 row)

crsp=# SELECT COUNT(*) FROM edgar_test.filing_docs_test_new;
 count
-------
 16171
(1 row)

crsp=# \dt+ edgar_test.filing_docs_*
                                 List of relations
   Schema   |            Name            | Type  |  Owner   |  Size   | Description
------------+----------------------------+-------+----------+---------+-------------
 edgar_test | filing_docs_processed_test | table | edgar    | 456 kB  |
 edgar_test | filing_docs_test           | table | bdcallen | 2208 kB |
 edgar_test | filing_docs_test_new       | table | igow     | 2192 kB |
 edgar_test | filing_docs_test_old       | table | igow     | 2016 kB |
(4 rows)

crsp=# SELECT * FROM edgar_test.filing_docs_test LIMIT 30;
 seq |                  description                  |         document          |       type        |  size   |                  file_name                  |                                 url
-----+-----------------------------------------------+---------------------------+-------------------+---------+---------------------------------------------+----------------------------------------------------------------------
   1 | FORM 10-Q                                     | e1679_10q.htm             | 10-Q              |  258311 | edgar/data/352991/0001731122-20-000065.txt  |
   2 | EXHIBIT 31.1                                  | e1679_ex311.htm           | EX-31.1           |   10524 | edgar/data/352991/0001731122-20-000065.txt  |
   3 | EXHIBIT 31.2                                  | e1679_ex312.htm           | EX-31.2           |   10481 | edgar/data/352991/0001731122-20-000065.txt  |
   4 | EXHIBIT 32.1                                  | e1679_ex321.htm           | EX-32.1           |    4572 | edgar/data/352991/0001731122-20-000065.txt  |
   5 | EXHIBIT 32.2                                  | e1679_ex322.htm           | EX-32.2           |    4504 | edgar/data/352991/0001731122-20-000065.txt  |
     | Complete submission text file                 | 0001731122-20-000065.txt  |                   | 1550208 | edgar/data/352991/0001731122-20-000065.txt  |
   6 | XBRL INSTANCE FILE                            | aoxy-20191231.xml         | EX-101.INS        |  178333 | edgar/data/352991/0001731122-20-000065.txt  |
   7 | XBRL SCHEMA FILE                              | aoxy-20191231.xsd         | EX-101.SCH        |   19864 | edgar/data/352991/0001731122-20-000065.txt  |
   8 | XBRL CALCULATION FILE                         | aoxy-20191231_cal.xml     | EX-101.CAL        |   33901 | edgar/data/352991/0001731122-20-000065.txt  |
   9 | XBRL DEFINITION FILE                          | aoxy-20191231_def.xml     | EX-101.DEF        |   74242 | edgar/data/352991/0001731122-20-000065.txt  |
  10 | XBRL LABEL FILE                               | aoxy-20191231_lab.xml     | EX-101.LAB        |  146101 | edgar/data/352991/0001731122-20-000065.txt  |
  11 | XBRL PRESENTATION FILE                        | aoxy-20191231_pre.xml     | EX-101.PRE        |  122081 | edgar/data/352991/0001731122-20-000065.txt  |
   1 |                                               | primary_doc.html          | 13F-HR            |         | edgar/data/1802867/0001104659-20-016885.txt | edgar/data/1802867/000110465920016885/xslForm13F_X01/primary_doc.xml
   1 |                                               | primary_doc.xml           | 13F-HR            |    2103 | edgar/data/1802867/0001104659-20-016885.txt |
   2 |                                               | infotable.html            | INFORMATION TABLE |         | edgar/data/1802867/0001104659-20-016885.txt | edgar/data/1802867/000110465920016885/xslForm13F_X01/infotable.xml
   2 |                                               | infotable.xml             | INFORMATION TABLE |   10071 | edgar/data/1802867/0001104659-20-016885.txt |
     | Complete submission text file                 | 0001104659-20-016885.txt  |                   |   13408 | edgar/data/1802867/0001104659-20-016885.txt |
   1 |                                               | AdvancedSeriesTr1231.txt  | 24F-2NT           |    3780 | edgar/data/814679/0000940400-20-000126.txt  |
     | Complete submission text file                 | 0000940400-20-000126.txt  |                   |   28529 | edgar/data/814679/0000940400-20-000126.txt  |
   1 | ADVANCED SERIES TRUST                         | d855719d485bpos.htm       | 485BPOS           |   34442 | edgar/data/814679/0001193125-20-001237.txt  |
     | Complete submission text file                 | 0001193125-20-001237.txt  |                   |  827013 | edgar/data/814679/0001193125-20-001237.txt  |
   2 | XBRL INSTANCE DOCUMENT                        | ast-20191217.xml          | EX-101.INS        |   26435 | edgar/data/814679/0001193125-20-001237.txt  |
   3 | XBRL TAXONOMY EXTENSION SCHEMA                | ast-20191217.xsd          | EX-101.SCH        |    7359 | edgar/data/814679/0001193125-20-001237.txt  |
   4 | XBRL TAXONOMY EXTENSION CALCULATION LINKBASE  | ast-20191217_cal.xml      | EX-101.CAL        |    4779 | edgar/data/814679/0001193125-20-001237.txt  |
   5 | XBRL TAXONOMY EXTENSION DEFINITION LINKBASE   | ast-20191217_def.xml      | EX-101.DEF        |  216158 | edgar/data/814679/0001193125-20-001237.txt  |
   6 | XBRL TAXONOMY EXTENSION LABEL LINKBASE        | ast-20191217_lab.xml      | EX-101.LAB        |  139058 | edgar/data/814679/0001193125-20-001237.txt  |
   7 | XBRL TAXONOMY EXTENSION PRESENTATION LINKBASE | ast-20191217_pre.xml      | EX-101.PRE        |  166257 | edgar/data/814679/0001193125-20-001237.txt  |
   1 | ADVANCED STRATEGIES PM SUPPLEMENT             | f2312d1.htm               | 497               |   36435 | edgar/data/814679/0001683863-20-000018.txt  |
   2 | GRAPHIC                                       | gu1wmk4v3t13jjhrale0a.jpg | GRAPHIC           |     997 | edgar/data/814679/0001683863-20-000018.txt  |
     | Complete submission text file                 | 0001683863-20-000018.txt  |                   |   39595 | edgar/data/814679/0001683863-20-000018.txt  |
(30 rows)

crsp=# SELECT * FROM edgar_test.filing_docs_test_new LIMIT 30;
 seq |                  description                  |         document          |       type        |  size   |                  file_name                  |                                 url
-----+-----------------------------------------------+---------------------------+-------------------+---------+---------------------------------------------+----------------------------------------------------------------------
   1 | FORM 10-Q                                     | e1679_10q.htm             | 10-Q              |  258311 | edgar/data/352991/0001731122-20-000065.txt  |
   2 | EXHIBIT 31.1                                  | e1679_ex311.htm           | EX-31.1           |   10524 | edgar/data/352991/0001731122-20-000065.txt  |
   3 | EXHIBIT 31.2                                  | e1679_ex312.htm           | EX-31.2           |   10481 | edgar/data/352991/0001731122-20-000065.txt  |
   4 | EXHIBIT 32.1                                  | e1679_ex321.htm           | EX-32.1           |    4572 | edgar/data/352991/0001731122-20-000065.txt  |
   5 | EXHIBIT 32.2                                  | e1679_ex322.htm           | EX-32.2           |    4504 | edgar/data/352991/0001731122-20-000065.txt  |
     | Complete submission text file                 | 0001731122-20-000065.txt  |                   | 1550208 | edgar/data/352991/0001731122-20-000065.txt  |
   6 | XBRL INSTANCE FILE                            | aoxy-20191231.xml         | EX-101.INS        |  178333 | edgar/data/352991/0001731122-20-000065.txt  |
   7 | XBRL SCHEMA FILE                              | aoxy-20191231.xsd         | EX-101.SCH        |   19864 | edgar/data/352991/0001731122-20-000065.txt  |
   8 | XBRL CALCULATION FILE                         | aoxy-20191231_cal.xml     | EX-101.CAL        |   33901 | edgar/data/352991/0001731122-20-000065.txt  |
   9 | XBRL DEFINITION FILE                          | aoxy-20191231_def.xml     | EX-101.DEF        |   74242 | edgar/data/352991/0001731122-20-000065.txt  |
  10 | XBRL LABEL FILE                               | aoxy-20191231_lab.xml     | EX-101.LAB        |  146101 | edgar/data/352991/0001731122-20-000065.txt  |
  11 | XBRL PRESENTATION FILE                        | aoxy-20191231_pre.xml     | EX-101.PRE        |  122081 | edgar/data/352991/0001731122-20-000065.txt  |
   1 |                                               | primary_doc.html          | 13F-HR            |         | edgar/data/1802867/0001104659-20-016885.txt | edgar/data/1802867/000110465920016885/xslForm13F_X01/primary_doc.xml
   1 |                                               | primary_doc.xml           | 13F-HR            |    2103 | edgar/data/1802867/0001104659-20-016885.txt |
   2 |                                               | infotable.html            | INFORMATION TABLE |         | edgar/data/1802867/0001104659-20-016885.txt | edgar/data/1802867/000110465920016885/xslForm13F_X01/infotable.xml
   2 |                                               | infotable.xml             | INFORMATION TABLE |   10071 | edgar/data/1802867/0001104659-20-016885.txt |
     | Complete submission text file                 | 0001104659-20-016885.txt  |                   |   13408 | edgar/data/1802867/0001104659-20-016885.txt |
   1 |                                               | AdvancedSeriesTr1231.txt  | 24F-2NT           |    3780 | edgar/data/814679/0000940400-20-000126.txt  |
     | Complete submission text file                 | 0000940400-20-000126.txt  |                   |   28529 | edgar/data/814679/0000940400-20-000126.txt  |
   1 | ADVANCED SERIES TRUST                         | d855719d485bpos.htm       | 485BPOS           |   34442 | edgar/data/814679/0001193125-20-001237.txt  |
     | Complete submission text file                 | 0001193125-20-001237.txt  |                   |  827013 | edgar/data/814679/0001193125-20-001237.txt  |
   2 | XBRL INSTANCE DOCUMENT                        | ast-20191217.xml          | EX-101.INS        |   26435 | edgar/data/814679/0001193125-20-001237.txt  |
   3 | XBRL TAXONOMY EXTENSION SCHEMA                | ast-20191217.xsd          | EX-101.SCH        |    7359 | edgar/data/814679/0001193125-20-001237.txt  |
   4 | XBRL TAXONOMY EXTENSION CALCULATION LINKBASE  | ast-20191217_cal.xml      | EX-101.CAL        |    4779 | edgar/data/814679/0001193125-20-001237.txt  |
   5 | XBRL TAXONOMY EXTENSION DEFINITION LINKBASE   | ast-20191217_def.xml      | EX-101.DEF        |  216158 | edgar/data/814679/0001193125-20-001237.txt  |
   6 | XBRL TAXONOMY EXTENSION LABEL LINKBASE        | ast-20191217_lab.xml      | EX-101.LAB        |  139058 | edgar/data/814679/0001193125-20-001237.txt  |
   7 | XBRL TAXONOMY EXTENSION PRESENTATION LINKBASE | ast-20191217_pre.xml      | EX-101.PRE        |  166257 | edgar/data/814679/0001193125-20-001237.txt  |
   1 | ADVANCED STRATEGIES PM SUPPLEMENT             | f2312d1.htm               | 497               |   36435 | edgar/data/814679/0001683863-20-000018.txt  |
   2 | GRAPHIC                                       | gu1wmk4v3t13jjhrale0a.jpg | GRAPHIC           |     997 | edgar/data/814679/0001683863-20-000018.txt  |
     | Complete submission text file                 | 0001683863-20-000018.txt  |                   |   39595 | edgar/data/814679/0001683863-20-000018.txt  |
(30 rows)

So the tables are pretty much the same, as I claimed. I am not sure what explains the small remaining difference in size, however. I will commit the code now.

iangow commented 4 years ago

@iangow

I would rather not do that because, as it is, the filing_docs table is machine-independent. So it can be backed-up and moved from one database to another. The moment you put a downloaded column in the table, the table becomes tied to the machine (and more). The reality is that the filing_docs_processed table is not very helpful, as the download code can check the file system about as quickly as PostgreSQL can check the filing_docs_processed table.

Could you supply a reference for this? This is something I didn't know about postgres.

This is not about PostgreSQL, but about EDGAR. The filing_docs table only contains information that can be scraped from EDGAR and does not say anything about what has been downloaded to a separate machine.

For now, I have retained the idea of filing_docs_processed even though it consumes a decent amount of disk space (I guess as much as filing_docs if we download everything, though I don't see that happening any time soon).

I will close this issue. I will create new issues for the next steps.