Closed iangow closed 4 years ago
@bdcallen Please use git pull --rebase to grab the latest code before proceeding. What I want to do is get the filing_doc-scraping/download code working on the test sample before proceeding.
This relates to #39. We want to create a version of edgar_test.filing_docs_test that can support downloaded.
For now, I note simply that edgar_test.filing_docs_test has many, many duplicate rows. This needs to be fixed first.
@bdcallen I have updated the reprex above so that it uses the current data in edgar_test.filing_docs_test. I think the next steps are:
1. Fix the code that creates edgar_test.filing_docs_test (filing_docs/scrape_filing_docs_test.R) to remove duplicates.
2. Update edgar_test.filing_docs_test (perhaps using the "_alt_" code) with the correct URLs. Using the first example at the bottom of the reprex above, you may recall that the issue is that the URL for fp0051482_4.html is not simply https://www.sec.gov/Archives/edgar/data/1000045/000139834420005055/fp0051482_4.html, but is instead found here to be https://www.sec.gov/Archives/edgar/data/1000045/000139834420005055/xslF345X03/fp0051482_4.xml. I believe you wrote code that extracts these URLs, but I am not sure where it is.
@iangow
> I believe you wrote code that extracts these URLs, but I am not sure where it is.
It is in the file filing_docs/download_filing_doc_exceptions.R. I wrote it precisely to download the exceptional cases. I wrote and ran it when I was getting the full set of documents for Schedule 13D and 13G earlier in the year, and I have not run it since, which explains why the vast majority of cases from 2020 are not in the table.
@iangow
> Please advise which days you think you can work on this this week. Please provide a daily update on the days on which you can work on it.
I will work on this for the rest of the week, or to when it is done. I will be working Wednesday, Thursday and Friday.
@iangow Sorry, I'm not following how edgar_test.filing_docs_test was made. Could you clarify this? I'm trying to work out the reason for the duplication; I want to make sure it wasn't because of the generation process.
@iangow
crsp=# SELECT COUNT(DISTINCT(file_name, document))
crsp-# FROM edgar_test.filing_docs_test;
count
-------
983
(1 row)
crsp=# SELECT COUNT(DISTINCT(file_name))
FROM edgar_test.filing_docs_test;
count
-------
206
(1 row)
How did you choose your sample? What was your initial random sample size? Did you sample with or without replacement?
Also, where is test_sample?
crsp=# SELECT * FROM edgar.test_sample;
ERROR: relation "edgar.test_sample" does not exist
LINE 1: SELECT * FROM edgar.test_sample;
^
crsp=# SELECT * FROM edgar_test.test_sample;
ERROR: relation "edgar_test.test_sample" does not exist
LINE 1: SELECT * FROM edgar_test.test_sample;
^
crsp=# SELECT * FROM public.test_sample;
ERROR: relation "public.test_sample" does not exist
LINE 1: SELECT * FROM public.test_sample;
And test_sample comes from here. The idea is that test_sample is just a grab of 1000 rows. It may be that this should really be 1,000 rows grabbed from filings.
If test_sample is the source of duplicates, then just tweak the code using it to eliminate them.
The test_sample may have been deleted. So just re-create it using the code supplied (perhaps eliminating duplicates if necessary).
What I want to do is have a test sample to ensure that our filing_docs and download-filings code is working properly before applying that to a larger sample.
Regarding filing_docs, we should not have some separate "exceptions" code. The main code for filing_docs should examine the URLs for documents and populate a field with those (perhaps only when different from the URL we'd construct from file_name and document).
I think the following code:
test_sample <-
    filing_docs_full %>%
    inner_join(filings) %>%
    filter(date_filed >= "2020-01-01") %>%
    collect(n=1000) %>%
    mutate(file_path = coalesce(path_alt, get_file_path(file_name, document))) %>%
    select(file_name, document, file_path) %>%
    copy_to(pg, ., name="test_sample", temporary = FALSE) %>%
    collect()
probably should be
test_sample <-
    filings %>%
    filter(date_filed >= "2020-01-01") %>%
    collect(n=1000) %>%
    select(file_name) %>%
    copy_to(pg, ., name="test_sample", temporary = FALSE) %>%
    collect()
The downstream code using this table only uses file_name. I'd guess that the earlier code is the source of the duplicates.
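To illustrate why a document-level sample could produce duplicates, here is a minimal sketch in base R. The rows are made up for illustration and are not taken from the database. Because each filing has several documents, keeping only file_name repeats a filing once per document; unique() (or distinct() in dplyr) collapses the repeats.

```r
# Toy document-level sample: one row per (file_name, document) pair.
# The values are illustrative only.
docs <- data.frame(
  file_name = c("edgar/data/1/0000000001-20-000001.txt",
                "edgar/data/1/0000000001-20-000001.txt",
                "edgar/data/2/0000000002-20-000002.txt"),
  document = c("a.htm", "b.htm", "c.htm"),
  stringsAsFactors = FALSE
)

# Keeping only file_name leaves one row per document, hence duplicates.
file_names <- docs["file_name"]
n_dupes <- sum(duplicated(file_names))

# One row per filing, which is what code that only uses file_name needs.
test_sample <- unique(file_names)
```

Sampling directly from filings, as suggested above, avoids the problem at the source because that table is already at the filing level.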
The idea is as follows:
1. Create a test sample.
2. Apply the filing_docs code to the test sample.
3. Download the documents listed in filing_docs.
Where your work is needed is Step 2. The filing_docs code should fetch the correct URL for all documents. I think we may want to evaluate the option of not including an entry for the URL when we can infer that from file_name and document. But for now it is fine if you populate that column with values even when we can so infer.
If you are in any way confused by test_sample, put this aside for now. I can clarify and repost the issue. I would rather you focus on fixing the filing_docs code on a clean sample rather than trying to understand where the sample comes from.
Note that /media/igow/2TB seems to be full, so the database is down. I am moving stuff to /media/igow/6TB to get things working again.
@iangow We could change filing_docs_df inside scrape_filing_docs_functions.R to this
#!/usr/bin/env Rscript
library(parallel)
library(rvest)
#> Loading required package: xml2
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
# Functions ----
# Build the URL of the filing index page from a file_name such as
# "edgar/data/1000045/0001398344-20-005055.txt".
get_index_url <- function(file_name) {
    matches <- stringr::str_match(file_name, "/(\\d+)/(\\d{10}-\\d{2}-\\d{6})")
    cik <- matches[2]
    acc_no <- matches[3]
    path <- stringr::str_replace_all(acc_no, "[^\\d]", "")
    url <- paste0("https://www.sec.gov/Archives/edgar/data/", cik, "/", path, "/",
                  acc_no, "-index.htm")
    return(url)
}
# Parse HTML tables, forcing Type to character so that bind_rows() works.
html_table_mod <- function(table) {
    lapply(html_table(table), function(x) mutate(x, Type = as.character(Type)))
}
fix_names <- function(df) {
    colnames(df) <- tolower(colnames(df))
    df
}
filing_docs_df <- function(file_name) {
    head_url <- get_index_url(file_name)
    table_nodes <-
        try({
            read_html(head_url, encoding="Latin1") %>%
                html_nodes("table")
        })
    # Check for a failed fetch first, then for a page without tables.
    if (inherits(table_nodes, "try-error") || length(table_nodes) < 1) {
        df <- tibble(seq = NA, description = NA, document = NA, type = NA,
                     size = NA, file_name = file_name)
        return(df)
    } else {
        filing_doc_table_indices <-
            which(table_nodes %>% html_attr("class") == "tableFile")
        file_tables <- table_nodes[filing_doc_table_indices]
        df <-
            file_tables %>%
            html_table_mod() %>%
            bind_rows() %>%
            fix_names() %>%
            mutate(file_name = file_name,
                   type = as.character(type),
                   description = as.character(description))
        # Scrape the href of each document link alongside the table data.
        df$url <- file_tables %>% html_nodes(xpath = 'tr/td/a[@href]') %>%
            html_attr('href') %>% paste0('https://www.sec.gov', .)
        colnames(df) <- tolower(colnames(df))
    }
    return(df)
}
filing_docs_df('edgar/data/1000045/0001398344-20-005055.txt')
#> seq description document type size
#> 1 1 fp0051482_4.html 4 NA
#> 2 1 fp0051482_4.xml 4 10969
#> 3 NA Complete submission text file 0001398344-20-005055.txt <NA> 13503
#> file_name
#> 1 edgar/data/1000045/0001398344-20-005055.txt
#> 2 edgar/data/1000045/0001398344-20-005055.txt
#> 3 edgar/data/1000045/0001398344-20-005055.txt
#> url
#> 1 https://www.sec.gov/Archives/edgar/data/1000045/000139834420005055/xslF345X03/fp0051482_4.xml
#> 2 https://www.sec.gov/Archives/edgar/data/1000045/000139834420005055/fp0051482_4.xml
#> 3 https://www.sec.gov/Archives/edgar/data/1000045/000139834420005055/0001398344-20-005055.txt
Created on 2020-06-24 by the reprex package (v0.3.0)
We have probably discussed this idea before (in fact, I have a function filing_docs_df_with_href, which is very similar, though written slightly less elegantly). If we were to write filing_docs over again to account for the different URLs, this would be the easiest way to do it, since the hrefs are simply scraped along with the data collected by html_table in the line I added. I guess the issue though is that we would rather not have to reprocess filing_docs, as it is a very large table (which is why I went the filing_docs_alt route in the first place).
OK. Let me create a new test_sample. Then you would tweak the code to produce a filing_docs_test along the lines of what you have above.
I think we might be able to get away with simply adding a text field url to the existing table (with NULL in all existing rows) and there would be no need to reprocess anything until we identify "errors" in filing_docs by trying to download filings from bad URLs.
I think we might write the code so that https://www.sec.gov/Archives/edgar/data/1000045/000139834420005055/0001398344-20-005055.txt (third row in your test sample above) doesn't need to be stored in the table (we can infer this from file_name and document). It would be good to do all this on a test sample to see how much space the different options use. (My belief is that a mostly empty text field should not use a lot of space in PostgreSQL.)
OK. This is what I did:
1. Created test_sample by running core/create_test_sample.R.
2. Created filing_docs by running filing_docs/scrape_filing_docs_test.R.
3. Ran core/download_filing_docs_test.R.
Recall that the goal is to make sure that we can download everything available by (if necessary) tweaking the code used in Step 2. For now, just focus on getting the table working as well as possible as a one-shot run and don't worry about size of the table, etc.
I didn't run this as a reprex, but this is what I get for the third step:
Joining, by = c("document", "file_name")
[1] "Getting files..."
Error in download.file(url = link, destfile = local_filename, quiet = TRUE) :
cannot open URL 'https://www.sec.gov/Archives/edgar/data/1027596/000089418920001547/semper485bxtmarch20201.htm'
In addition: Warning message:
In download.file(url = link, destfile = local_filename, quiet = TRUE) :
cannot open URL 'https://www.sec.gov/Archives/edgar/data/1027596/000089418920001547/semper485bxtmarch20201.htm': HTTP status was '503 Service Unavailable'
Error in download.file(url = link, destfile = local_filename, quiet = TRUE) :
cannot open URL 'https://www.sec.gov/Archives/edgar/data/819793/000089109220002167/ain10kex10uvi.htm'
In addition: Warning message:
In download.file(url = link, destfile = local_filename, quiet = TRUE) :
cannot open URL 'https://www.sec.gov/Archives/edgar/data/819793/000089109220002167/ain10kex10uvi.htm': HTTP status was '503 Service Unavailable'
Error in download.file(url = link, destfile = local_filename, quiet = TRUE) :
cannot open URL 'https://www.sec.gov/Archives/edgar/data/899866/000089986620000002/jpm2020deck11jan2020fbda.htm'
In addition: Warning message:
In download.file(url = link, destfile = local_filename, quiet = TRUE) :
cannot open URL 'https://www.sec.gov/Archives/edgar/data/899866/000089986620000002/jpm2020deck11jan2020fbda.htm': HTTP status was '503 Service Unavailable'
[1] "Downloaded 3264 files in 1256.71 seconds"
Joining, by = c("document", "file_name")
To be honest, I am not sure that these don't download because the URLs did not follow the usual form. It may be some other issue. But if you (@bdcallen) could investigate and tweak the code to incorporate the fixes we discussed, that would be great.
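One possibility, which is only an assumption here, is that the 503 Service Unavailable responses are transient (for example, the server throttling requests) rather than a sign of bad URLs. A minimal sketch of a retry wrapper follows. The helper download_with_retry and its arguments are hypothetical and not part of the repository; download_fun defaults to download.file but can be replaced with a stub for testing.

```r
# Hypothetical helper: try a download up to max_tries times, waiting longer
# after each failure (wait, 2 * wait, 4 * wait, ... seconds).
# Returns TRUE on success and FALSE if every attempt failed.
download_with_retry <- function(url, destfile, max_tries = 3, wait = 1,
                                download_fun = utils::download.file) {
  for (i in seq_len(max_tries)) {
    ok <- tryCatch({
      download_fun(url = url, destfile = destfile, quiet = TRUE)
      TRUE
    },
    error = function(e) FALSE,
    warning = function(w) FALSE)
    if (ok) return(TRUE)
    # Back off before the next attempt, but not after the last one.
    if (i < max_tries) Sys.sleep(wait * 2^(i - 1))
  }
  FALSE
}
```

If files that fail with a 503 download fine on a later attempt, that would suggest the failures above are unrelated to the form of the URL.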
@iangow Ok, I've renamed filing_docs_test to filing_docs_test_old for now, and then I ran scrape_filing_docs_test.R, with filing_docs_df set to what I wrote above
bdcallen@igow-z640:~/edgar$ Rscript filing_docs/scrape_filing_docs_test.R
Loading required package: xml2
Processing batch 1
Error in open.connection(x, "rb") :
Timeout was reached: [www.sec.gov] Connection timed out after 10001 milliseconds
Writing data ...
76.16045 seconds
Processing batch 2
Writing data ...
68.46953 seconds
Processing batch 3
Writing data ...
69.66006 seconds
I had one html connection error, so I did the following
crsp=# SELECT COUNT(*) FROM edgar_test.filing_docs_test
crsp-# WHERE document IS NULL;
count
-------
1
(1 row)
crsp=# DELETE FROM edgar_test.filing_docs_test
WHERE document IS NULL;
DELETE 1
crsp=# SELECT COUNT(DISTINCT(file_name)) FROM edgar_test.filing_docs_test;
count
-------
2999
(1 row)
and then ran the program again
bdcallen@igow-z640:~/edgar$ Rscript filing_docs/scrape_filing_docs_test.R
Loading required package: xml2
Processing batch 1
Writing data ...
0.8023918 seconds
and then the raw numbers looked good, with no duplicates
crsp=# SELECT COUNT(DISTINCT(file_name)) FROM edgar_test.filing_docs_test;
count
-------
3000
(1 row)
crsp=# SELECT COUNT(*) FROM edgar_test.filing_docs_test;
count
-------
16171
(1 row)
crsp=# SELECT COUNT(DISTINCT(file_name, document))
FROM edgar_test.filing_docs_test;
count
-------
16171
(1 row)
Also, the table has turned out with the expected form
crsp=# SELECT * FROM edgar_test.filing_docs_test LIMIT 10;
seq | description | document | type | size | file_name | url
-----+-------------------------------+--------------------------+------------+---------+--------------------------------------------+--------------------------------------------------------------------------------------------
1 | FORM 10-Q | e1679_10q.htm | 10-Q | 258311 | edgar/data/352991/0001731122-20-000065.txt | https://www.sec.gov/Archives/edgar/data/352991/000173112220000065/e1679_10q.htm
2 | EXHIBIT 31.1 | e1679_ex311.htm | EX-31.1 | 10524 | edgar/data/352991/0001731122-20-000065.txt | https://www.sec.gov/Archives/edgar/data/352991/000173112220000065/e1679_ex311.htm
3 | EXHIBIT 31.2 | e1679_ex312.htm | EX-31.2 | 10481 | edgar/data/352991/0001731122-20-000065.txt | https://www.sec.gov/Archives/edgar/data/352991/000173112220000065/e1679_ex312.htm
4 | EXHIBIT 32.1 | e1679_ex321.htm | EX-32.1 | 4572 | edgar/data/352991/0001731122-20-000065.txt | https://www.sec.gov/Archives/edgar/data/352991/000173112220000065/e1679_ex321.htm
5 | EXHIBIT 32.2 | e1679_ex322.htm | EX-32.2 | 4504 | edgar/data/352991/0001731122-20-000065.txt | https://www.sec.gov/Archives/edgar/data/352991/000173112220000065/e1679_ex322.htm
| Complete submission text file | 0001731122-20-000065.txt | | 1550208 | edgar/data/352991/0001731122-20-000065.txt | https://www.sec.gov/Archives/edgar/data/352991/000173112220000065/0001731122-20-000065.txt
6 | XBRL INSTANCE FILE | aoxy-20191231.xml | EX-101.INS | 178333 | edgar/data/352991/0001731122-20-000065.txt | https://www.sec.gov/Archives/edgar/data/352991/000173112220000065/aoxy-20191231.xml
7 | XBRL SCHEMA FILE | aoxy-20191231.xsd | EX-101.SCH | 19864 | edgar/data/352991/0001731122-20-000065.txt | https://www.sec.gov/Archives/edgar/data/352991/000173112220000065/aoxy-20191231.xsd
8 | XBRL CALCULATION FILE | aoxy-20191231_cal.xml | EX-101.CAL | 33901 | edgar/data/352991/0001731122-20-000065.txt | https://www.sec.gov/Archives/edgar/data/352991/000173112220000065/aoxy-20191231_cal.xml
9 | XBRL DEFINITION FILE | aoxy-20191231_def.xml | EX-101.DEF | 74242 | edgar/data/352991/0001731122-20-000065.txt | https://www.sec.gov/Archives/edgar/data/352991/000173112220000065/aoxy-20191231_def.xml
(10 rows)
@iangow
> I think we might be able to get away with simply adding a text field url to the existing table (with NULL in all existing rows) and there would be no need to reprocess anything until we identify "errors" in filing_docs by trying to download filings from bad URLs.
Here's a version that could work with what you suggested.
#!/usr/bin/env Rscript
library(parallel)
library(rvest)
#> Loading required package: xml2
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
# Functions ----
# Build the URL of the filing index page from a file_name such as
# "edgar/data/1000045/0001398344-20-005055.txt".
get_index_url <- function(file_name) {
    matches <- stringr::str_match(file_name, "/(\\d+)/(\\d{10}-\\d{2}-\\d{6})")
    cik <- matches[2]
    acc_no <- matches[3]
    path <- stringr::str_replace_all(acc_no, "[^\\d]", "")
    url <- paste0("https://www.sec.gov/Archives/edgar/data/", cik, "/", path, "/",
                  acc_no, "-index.htm")
    return(url)
}
# Parse HTML tables, forcing Type to character so that bind_rows() works.
html_table_mod <- function(table) {
    lapply(html_table(table), function(x) mutate(x, Type = as.character(Type)))
}
fix_names <- function(df) {
    colnames(df) <- tolower(colnames(df))
    df
}
# URL we would construct for a document from file_name and document alone.
get_filing_doc_url <- function(file_name, document) {
    url <- paste('https://www.sec.gov/Archives',
                 gsub("(\\d{10})-(\\d{2})-(\\d{6})\\.txt", "\\1\\2\\3", file_name),
                 document, sep = '/')
    return(url)
}
filing_docs_df <- function(file_name) {
    head_url <- get_index_url(file_name)
    table_nodes <-
        try({
            read_html(head_url, encoding="Latin1") %>%
                html_nodes("table")
        })
    # Check for a failed fetch first, then for a page without tables.
    if (inherits(table_nodes, "try-error") || length(table_nodes) < 1) {
        df <- tibble(seq = NA, description = NA, document = NA, type = NA,
                     size = NA, file_name = file_name)
        return(df)
    } else {
        filing_doc_table_indices <-
            which(table_nodes %>% html_attr("class") == "tableFile")
        file_tables <- table_nodes[filing_doc_table_indices]
        df <-
            file_tables %>%
            html_table_mod() %>%
            bind_rows() %>%
            fix_names() %>%
            mutate(file_name = file_name,
                   type = as.character(type),
                   description = as.character(description))
        df$url <- file_tables %>% html_nodes(xpath = 'tr/td/a[@href]') %>%
            html_attr('href') %>% paste0('https://www.sec.gov', .)
        # Keep url only where it differs from the URL we can infer.
        norm_url <- get_filing_doc_url(df$file_name, df$document)
        df$url[df$url == norm_url] <- NA
        colnames(df) <- tolower(colnames(df))
    }
    return(df)
}
filing_docs_df('edgar/data/1000209/0001564590-20-014200.txt')
#> seq description document type size
#> 1 1 8-K/A mfin-8ka_20200220.htm 8-K/A 155056
#> 2 2 EX-99.1 mfin-ex991_52.htm EX-99.1 41478
#> 3 3 GRAPHIC gicsxsdjv1no000001.jpg GRAPHIC 1797
#> 4 4 GRAPHIC gicsxsdjv1no000003.jpg GRAPHIC 93885
#> 5 5 GRAPHIC gicsxsdjv1no000002.jpg GRAPHIC 100369
#> 6 6 GRAPHIC goy21vsscben000001.jpg GRAPHIC 75495
#> 7 NA Complete submission text file 0001564590-20-014200.txt 572198
#> file_name url
#> 1 edgar/data/1000209/0001564590-20-014200.txt <NA>
#> 2 edgar/data/1000209/0001564590-20-014200.txt <NA>
#> 3 edgar/data/1000209/0001564590-20-014200.txt <NA>
#> 4 edgar/data/1000209/0001564590-20-014200.txt <NA>
#> 5 edgar/data/1000209/0001564590-20-014200.txt <NA>
#> 6 edgar/data/1000209/0001564590-20-014200.txt <NA>
#> 7 edgar/data/1000209/0001564590-20-014200.txt <NA>
filing_docs_df('edgar/data/1000045/0001398344-20-005055.txt')
#> seq description document type size
#> 1 1 fp0051482_4.html 4 NA
#> 2 1 fp0051482_4.xml 4 10969
#> 3 NA Complete submission text file 0001398344-20-005055.txt <NA> 13503
#> file_name
#> 1 edgar/data/1000045/0001398344-20-005055.txt
#> 2 edgar/data/1000045/0001398344-20-005055.txt
#> 3 edgar/data/1000045/0001398344-20-005055.txt
#> url
#> 1 https://www.sec.gov/Archives/edgar/data/1000045/000139834420005055/xslF345X03/fp0051482_4.xml
#> 2 <NA>
#> 3 <NA>
Created on 2020-06-24 by the reprex package (v0.3.0)
So we could run over the whole set of filing_docs, identify the ones which have a document with an alternative URL, then reprocess them.
OK. The issue is that the table is almost double the size.
igow@igow-z640:~$ psql
psql (12.3 (Ubuntu 12.3-1.pgdg18.04+1), server 11.8 (Ubuntu 11.8-1.pgdg18.04+1))
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, bits: 256, compression: off)
Type "help" for help.
crsp=# \dt+ edgar_test.filing_docs_*
List of relations
Schema | Name | Type | Owner | Size | Description
------------+----------------------------+-------+----------+---------+-------------
edgar_test | filing_docs_processed_test | table | edgar | 456 kB |
edgar_test | filing_docs_test | table | bdcallen | 3456 kB |
edgar_test | filing_docs_test_old | table | igow | 2016 kB |
(3 rows)
crsp=#
> So we could run over the whole set of filing_docs, identify the ones which have a document with an alternative URL, then reprocess them.
Not for now. Reprocessing filing_docs would take too long.
Most of the file size increase can be eliminated by (1) eliminating url values that can be inferred and (2) removing the extraneous elements of the URL (I kept everything after Archives to be consistent with file_name).
crsp=# \dt+ edgar_test.filing_docs_*
List of relations
Schema | Name | Type | Owner | Size | Description
------------+----------------------------+-------+----------+---------+-------------
edgar_test | filing_docs_processed_test | table | edgar | 456 kB |
edgar_test | filing_docs_test | table | bdcallen | 3456 kB |
edgar_test | filing_docs_test_new | table | igow | 2192 kB |
edgar_test | filing_docs_test_old | table | igow | 2016 kB |
(4 rows)
Here is code to make a "new" table out of the one you have. It would be great if you could incorporate the logic of this into the code for making the table in the first place (you might not literally use this code, of course). Then this code could become the "main" code for filing_docs. We'd just need to add a column url to that table and just keep adding to it. Some of the values in the existing table would be "wrong" (i.e., the URLs generated from file_name and document would not be valid), but we could address those when identified by deleting the associated entries (by file_name) from filing_docs and filing_docs_processed, which would trigger reprocessing the next time the cron job runs.
For now, just prepare the code. Actually implementing it will be a separate issue.
library(dplyr, warn.conflicts = FALSE)
library(DBI)
pg <- dbConnect(RPostgres::Postgres())
rs <- dbExecute(pg, "SET search_path TO edgar_test")
filing_docs_test <- tbl(pg, "filing_docs_test")
# Path (relative to Archives/) that we would infer from file_name and document.
get_file_path <- function(file_name, document) {
    url <- gsub("(\\d{10})-(\\d{2})-(\\d{6})\\.txt", "\\1\\2\\3", file_name)
    file.path(url, document)
}
filing_docs_test_new <-
    filing_docs_test %>%
    collect() %>%
    mutate(url_alt = paste0("https://www.sec.gov/Archives/", get_file_path(file_name, document))) %>%
    # Drop url values that can be inferred, then strip the common prefix.
    mutate(url = if_else(url == url_alt, NA_character_, url)) %>%
    mutate(url = gsub("https://www.sec.gov/Archives/", "", url)) %>%
    select(-url_alt) %>%
    copy_to(pg, ., name = "filing_docs_test_new",
            temporary = FALSE, overwrite=TRUE)
filing_docs_test_new %>%
anti_join(filing_docs_test, by = c("file_name", "document"))
#> # Source: lazy query [?? x 7]
#> # Database: postgres [igow@localhost:5432/crsp]
#> # … with 7 variables: seq <int>, description <chr>, document <chr>, type <chr>,
#> # size <int>, file_name <chr>, url <chr>
filing_docs_test %>%
anti_join(filing_docs_test_new, by = c("file_name", "document"))
#> # Source: lazy query [?? x 7]
#> # Database: postgres [igow@localhost:5432/crsp]
#> # … with 7 variables: seq <int>, description <chr>, document <chr>, type <chr>,
#> # size <int>, file_name <chr>, url <chr>
Created on 2020-06-25 by the reprex package (v0.3.0)
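On the download side, a table in this compact form would need its url column expanded back into full URLs. The following is a minimal sketch of one way that could work, assuming the convention above (url is NULL/NA when it can be inferred, and otherwise holds the path after Archives/). The helper resolve_url is hypothetical and not part of the repository; get_file_path follows the same logic as the function of that name above.

```r
# Same logic as get_file_path() above: turn
# "edgar/data/<cik>/<accession-with-dashes>.txt" into the filing folder path.
get_file_path <- function(file_name, document) {
  folder <- gsub("(\\d{10})-(\\d{2})-(\\d{6})\\.txt", "\\1\\2\\3", file_name)
  file.path(folder, document)
}

# Hypothetical helper: use the stored url when present, otherwise infer it
# from file_name and document.
resolve_url <- function(file_name, document, url) {
  path <- ifelse(is.na(url), get_file_path(file_name, document), url)
  paste0("https://www.sec.gov/Archives/", path)
}

# The two documents from the example filing discussed in this thread.
file_name <- rep("edgar/data/1000045/0001398344-20-005055.txt", 2)
document  <- c("fp0051482_4.html", "fp0051482_4.xml")
url       <- c("edgar/data/1000045/000139834420005055/xslF345X03/fp0051482_4.xml",
               NA)

full_urls <- resolve_url(file_name, document, url)
```

In dplyr code the same idea could be written with coalesce(url, get_file_path(file_name, document)).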
@iangow Yes, I anticipated the lower storage use with the above form of the table, as a NULL entry takes up almost no space (PostgreSQL records it in the row's null bitmap), as opposed to roughly 1 byte per character for a full string. The code in my latter reprex sets url to NA for those entries that have the usual form, so adding a little piece which eliminates the 'Archives/' piece of the hyperlinks afterwards should mean that the code will generate filing_docs_test_new from the sample.
@iangow This should do the trick
#!/usr/bin/env Rscript
library(parallel)
library(rvest)
#> Loading required package: xml2
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
# Functions ----
# Build the URL of the filing index page from a file_name such as
# "edgar/data/1000045/0001398344-20-005055.txt".
get_index_url <- function(file_name) {
    matches <- stringr::str_match(file_name, "/(\\d+)/(\\d{10}-\\d{2}-\\d{6})")
    cik <- matches[2]
    acc_no <- matches[3]
    path <- stringr::str_replace_all(acc_no, "[^\\d]", "")
    url <- paste0("https://www.sec.gov/Archives/edgar/data/", cik, "/", path, "/",
                  acc_no, "-index.htm")
    return(url)
}
# Parse HTML tables, forcing Type to character so that bind_rows() works.
html_table_mod <- function(table) {
    lapply(html_table(table), function(x) mutate(x, Type = as.character(Type)))
}
fix_names <- function(df) {
    colnames(df) <- tolower(colnames(df))
    df
}
# URL we would construct for a document from file_name and document alone.
get_filing_doc_url <- function(file_name, document) {
    url <- paste('https://www.sec.gov/Archives',
                 gsub("(\\d{10})-(\\d{2})-(\\d{6})\\.txt", "\\1\\2\\3", file_name),
                 document, sep = '/')
    return(url)
}
filing_docs_df <- function(file_name) {
    head_url <- get_index_url(file_name)
    table_nodes <-
        try({
            read_html(head_url, encoding="Latin1") %>%
                html_nodes("table")
        })
    # Check for a failed fetch first, then for a page without tables.
    if (inherits(table_nodes, "try-error") || length(table_nodes) < 1) {
        df <- tibble(seq = NA, description = NA, document = NA, type = NA,
                     size = NA, file_name = file_name)
        return(df)
    } else {
        filing_doc_table_indices <-
            which(table_nodes %>% html_attr("class") == "tableFile")
        file_tables <- table_nodes[filing_doc_table_indices]
        df <-
            file_tables %>%
            html_table_mod() %>%
            bind_rows() %>%
            fix_names() %>%
            mutate(file_name = file_name,
                   type = as.character(type),
                   description = as.character(description))
        # Store the href relative to Archives/, consistent with file_name.
        df$url <- file_tables %>% html_nodes(xpath = 'tr/td/a[@href]') %>%
            html_attr('href') %>% stringr::str_replace('^/Archives/', '')
        # Keep url only where it differs from the URL we can infer.
        url_full <- paste0('https://www.sec.gov/Archives/', df$url)
        norm_url <- get_filing_doc_url(df$file_name, df$document)
        df$url[url_full == norm_url] <- NA
        colnames(df) <- tolower(colnames(df))
    }
    return(df)
}
filing_docs_df('edgar/data/1000209/0001564590-20-014200.txt')
#> seq description document type size
#> 1 1 8-K/A mfin-8ka_20200220.htm 8-K/A 155056
#> 2 2 EX-99.1 mfin-ex991_52.htm EX-99.1 41478
#> 3 3 GRAPHIC gicsxsdjv1no000001.jpg GRAPHIC 1797
#> 4 4 GRAPHIC gicsxsdjv1no000003.jpg GRAPHIC 93885
#> 5 5 GRAPHIC gicsxsdjv1no000002.jpg GRAPHIC 100369
#> 6 6 GRAPHIC goy21vsscben000001.jpg GRAPHIC 75495
#> 7 NA Complete submission text file 0001564590-20-014200.txt 572198
#> file_name url
#> 1 edgar/data/1000209/0001564590-20-014200.txt <NA>
#> 2 edgar/data/1000209/0001564590-20-014200.txt <NA>
#> 3 edgar/data/1000209/0001564590-20-014200.txt <NA>
#> 4 edgar/data/1000209/0001564590-20-014200.txt <NA>
#> 5 edgar/data/1000209/0001564590-20-014200.txt <NA>
#> 6 edgar/data/1000209/0001564590-20-014200.txt <NA>
#> 7 edgar/data/1000209/0001564590-20-014200.txt <NA>
filing_docs_df('edgar/data/1000045/0001398344-20-005055.txt')
#> seq description document type size
#> 1 1 fp0051482_4.html 4 NA
#> 2 1 fp0051482_4.xml 4 10969
#> 3 NA Complete submission text file 0001398344-20-005055.txt <NA> 13503
#> file_name
#> 1 edgar/data/1000045/0001398344-20-005055.txt
#> 2 edgar/data/1000045/0001398344-20-005055.txt
#> 3 edgar/data/1000045/0001398344-20-005055.txt
#> url
#> 1 edgar/data/1000045/000139834420005055/xslF345X03/fp0051482_4.xml
#> 2 <NA>
#> 3 <NA>
Created on 2020-06-25 by the reprex package (v0.3.0)
@iangow Also, are we intending to still have a filing_docs_processed table? Or are we going to add a downloaded column, initialized to FALSE for new entries, in filing_docs? The latter could feasibly be done with some modest rewriting of the code for step 3.
> This should do the trick
OK. So please run the code and check that it does do the trick, then commit and push the code.
> Also, are we intending to still have a filing_docs_processed table?
Perhaps not. But let's finish this issue before worrying about that.
> Or are we going to add a downloaded column, initialized to FALSE for new entries, in filing_docs?
I would rather not do that because, as it is, the filing_docs table is machine-independent. So it can be backed up and moved from one database to another. The moment you put a downloaded column in the table, the table becomes tied to the machine (and more). The reality is that the filing_docs_processed table is not very helpful, as the download code can check the file system about as quickly as PostgreSQL can check the filing_docs_processed table.
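As a rough illustration of checking the file system instead of a "processed" table, here is a minimal sketch. The helper paths_to_download and the directory layout are assumptions made for the example and are not taken from the repository; a temporary directory stands in for the real download folder.

```r
# Hypothetical helper: given document paths (relative to the download
# directory), return those that are not yet on disk.
paths_to_download <- function(paths, download_dir) {
  paths[!file.exists(file.path(download_dir, paths))]
}

# Illustration: a temporary directory with one document already "downloaded".
download_dir <- tempfile("edgar_")
dir.create(file.path(download_dir, "edgar/data/1"), recursive = TRUE)
file.create(file.path(download_dir, "edgar/data/1/a.htm"))

paths <- c("edgar/data/1/a.htm", "edgar/data/1/b.htm")
pending <- paths_to_download(paths, download_dir)
```

With this approach the state of "what has been downloaded" lives only on the machine that holds the files, so filing_docs itself stays portable.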
@iangow
> I would rather not do that because, as it is, the filing_docs table is machine-independent. So it can be backed-up and moved from one database to another. The moment you put a downloaded column in the table, the table becomes tied to the machine (and more). The reality is that the filing_docs_processed table is not very helpful, as the download code can check the file system about as quickly as PostgreSQL can check the filing_docs_processed table.
Could you supply a reference for this? This is something I didn't know about postgres.
@iangow So I dropped filing_docs_test, then reran the code with the updated form of filing_docs_df
bdcallen@igow-z640:~/edgar$ Rscript filing_docs/scrape_filing_docs_test.R
Loading required package: xml2
Processing batch 1
Writing data ...
76.4425 seconds
Processing batch 2
Writing data ...
80.15625 seconds
Processing batch 3
Writing data ...
77.05907 seconds
and got
crsp=# SELECT COUNT(*) FROM edgar_test.filing_docs_test
WHERE url IS NOT NULL;
count
-------
3033
(1 row)
crsp=# SELECT COUNT(*) FROM edgar_test.filing_docs_test_new
WHERE url IS NOT NULL;
count
-------
3033
(1 row)
crsp=# SELECT COUNT(*) FROM edgar_test.filing_docs_test;
count
-------
16171
(1 row)
crsp=# SELECT COUNT(*) FROM edgar_test.filing_docs_test_new;
count
-------
16171
(1 row)
crsp=# \dt+ edgar_test.filing_docs_*
List of relations
Schema | Name | Type | Owner | Size | Description
------------+----------------------------+-------+----------+---------+-------------
edgar_test | filing_docs_processed_test | table | edgar | 456 kB |
edgar_test | filing_docs_test | table | bdcallen | 2208 kB |
edgar_test | filing_docs_test_new | table | igow | 2192 kB |
edgar_test | filing_docs_test_old | table | igow | 2016 kB |
(4 rows)
crsp=# SELECT * FROM edgar_test.filing_docs_test LIMIT 30;
seq | description | document | type | size | file_name | url
-----+-----------------------------------------------+---------------------------+-------------------+---------+---------------------------------------------+----------------------------------------------------------------------
1 | FORM 10-Q | e1679_10q.htm | 10-Q | 258311 | edgar/data/352991/0001731122-20-000065.txt |
2 | EXHIBIT 31.1 | e1679_ex311.htm | EX-31.1 | 10524 | edgar/data/352991/0001731122-20-000065.txt |
3 | EXHIBIT 31.2 | e1679_ex312.htm | EX-31.2 | 10481 | edgar/data/352991/0001731122-20-000065.txt |
4 | EXHIBIT 32.1 | e1679_ex321.htm | EX-32.1 | 4572 | edgar/data/352991/0001731122-20-000065.txt |
5 | EXHIBIT 32.2 | e1679_ex322.htm | EX-32.2 | 4504 | edgar/data/352991/0001731122-20-000065.txt |
| Complete submission text file | 0001731122-20-000065.txt | | 1550208 | edgar/data/352991/0001731122-20-000065.txt |
6 | XBRL INSTANCE FILE | aoxy-20191231.xml | EX-101.INS | 178333 | edgar/data/352991/0001731122-20-000065.txt |
7 | XBRL SCHEMA FILE | aoxy-20191231.xsd | EX-101.SCH | 19864 | edgar/data/352991/0001731122-20-000065.txt |
8 | XBRL CALCULATION FILE | aoxy-20191231_cal.xml | EX-101.CAL | 33901 | edgar/data/352991/0001731122-20-000065.txt |
9 | XBRL DEFINITION FILE | aoxy-20191231_def.xml | EX-101.DEF | 74242 | edgar/data/352991/0001731122-20-000065.txt |
10 | XBRL LABEL FILE | aoxy-20191231_lab.xml | EX-101.LAB | 146101 | edgar/data/352991/0001731122-20-000065.txt |
11 | XBRL PRESENTATION FILE | aoxy-20191231_pre.xml | EX-101.PRE | 122081 | edgar/data/352991/0001731122-20-000065.txt |
1 | | primary_doc.html | 13F-HR | | edgar/data/1802867/0001104659-20-016885.txt | edgar/data/1802867/000110465920016885/xslForm13F_X01/primary_doc.xml
1 | | primary_doc.xml | 13F-HR | 2103 | edgar/data/1802867/0001104659-20-016885.txt |
2 | | infotable.html | INFORMATION TABLE | | edgar/data/1802867/0001104659-20-016885.txt | edgar/data/1802867/000110465920016885/xslForm13F_X01/infotable.xml
2 | | infotable.xml | INFORMATION TABLE | 10071 | edgar/data/1802867/0001104659-20-016885.txt |
| Complete submission text file | 0001104659-20-016885.txt | | 13408 | edgar/data/1802867/0001104659-20-016885.txt |
1 | | AdvancedSeriesTr1231.txt | 24F-2NT | 3780 | edgar/data/814679/0000940400-20-000126.txt |
| Complete submission text file | 0000940400-20-000126.txt | | 28529 | edgar/data/814679/0000940400-20-000126.txt |
1 | ADVANCED SERIES TRUST | d855719d485bpos.htm | 485BPOS | 34442 | edgar/data/814679/0001193125-20-001237.txt |
| Complete submission text file | 0001193125-20-001237.txt | | 827013 | edgar/data/814679/0001193125-20-001237.txt |
2 | XBRL INSTANCE DOCUMENT | ast-20191217.xml | EX-101.INS | 26435 | edgar/data/814679/0001193125-20-001237.txt |
3 | XBRL TAXONOMY EXTENSION SCHEMA | ast-20191217.xsd | EX-101.SCH | 7359 | edgar/data/814679/0001193125-20-001237.txt |
4 | XBRL TAXONOMY EXTENSION CALCULATION LINKBASE | ast-20191217_cal.xml | EX-101.CAL | 4779 | edgar/data/814679/0001193125-20-001237.txt |
5 | XBRL TAXONOMY EXTENSION DEFINITION LINKBASE | ast-20191217_def.xml | EX-101.DEF | 216158 | edgar/data/814679/0001193125-20-001237.txt |
6 | XBRL TAXONOMY EXTENSION LABEL LINKBASE | ast-20191217_lab.xml | EX-101.LAB | 139058 | edgar/data/814679/0001193125-20-001237.txt |
7 | XBRL TAXONOMY EXTENSION PRESENTATION LINKBASE | ast-20191217_pre.xml | EX-101.PRE | 166257 | edgar/data/814679/0001193125-20-001237.txt |
1 | ADVANCED STRATEGIES PM SUPPLEMENT | f2312d1.htm | 497 | 36435 | edgar/data/814679/0001683863-20-000018.txt |
2 | GRAPHIC | gu1wmk4v3t13jjhrale0a.jpg | GRAPHIC | 997 | edgar/data/814679/0001683863-20-000018.txt |
| Complete submission text file | 0001683863-20-000018.txt | | 39595 | edgar/data/814679/0001683863-20-000018.txt |
(30 rows)
crsp=# SELECT * FROM edgar_test.filing_docs_test_new LIMIT 30;
seq | description | document | type | size | file_name | url
-----+-----------------------------------------------+---------------------------+-------------------+---------+---------------------------------------------+----------------------------------------------------------------------
1 | FORM 10-Q | e1679_10q.htm | 10-Q | 258311 | edgar/data/352991/0001731122-20-000065.txt |
2 | EXHIBIT 31.1 | e1679_ex311.htm | EX-31.1 | 10524 | edgar/data/352991/0001731122-20-000065.txt |
3 | EXHIBIT 31.2 | e1679_ex312.htm | EX-31.2 | 10481 | edgar/data/352991/0001731122-20-000065.txt |
4 | EXHIBIT 32.1 | e1679_ex321.htm | EX-32.1 | 4572 | edgar/data/352991/0001731122-20-000065.txt |
5 | EXHIBIT 32.2 | e1679_ex322.htm | EX-32.2 | 4504 | edgar/data/352991/0001731122-20-000065.txt |
| Complete submission text file | 0001731122-20-000065.txt | | 1550208 | edgar/data/352991/0001731122-20-000065.txt |
6 | XBRL INSTANCE FILE | aoxy-20191231.xml | EX-101.INS | 178333 | edgar/data/352991/0001731122-20-000065.txt |
7 | XBRL SCHEMA FILE | aoxy-20191231.xsd | EX-101.SCH | 19864 | edgar/data/352991/0001731122-20-000065.txt |
8 | XBRL CALCULATION FILE | aoxy-20191231_cal.xml | EX-101.CAL | 33901 | edgar/data/352991/0001731122-20-000065.txt |
9 | XBRL DEFINITION FILE | aoxy-20191231_def.xml | EX-101.DEF | 74242 | edgar/data/352991/0001731122-20-000065.txt |
10 | XBRL LABEL FILE | aoxy-20191231_lab.xml | EX-101.LAB | 146101 | edgar/data/352991/0001731122-20-000065.txt |
11 | XBRL PRESENTATION FILE | aoxy-20191231_pre.xml | EX-101.PRE | 122081 | edgar/data/352991/0001731122-20-000065.txt |
1 | | primary_doc.html | 13F-HR | | edgar/data/1802867/0001104659-20-016885.txt | edgar/data/1802867/000110465920016885/xslForm13F_X01/primary_doc.xml
1 | | primary_doc.xml | 13F-HR | 2103 | edgar/data/1802867/0001104659-20-016885.txt |
2 | | infotable.html | INFORMATION TABLE | | edgar/data/1802867/0001104659-20-016885.txt | edgar/data/1802867/000110465920016885/xslForm13F_X01/infotable.xml
2 | | infotable.xml | INFORMATION TABLE | 10071 | edgar/data/1802867/0001104659-20-016885.txt |
| Complete submission text file | 0001104659-20-016885.txt | | 13408 | edgar/data/1802867/0001104659-20-016885.txt |
1 | | AdvancedSeriesTr1231.txt | 24F-2NT | 3780 | edgar/data/814679/0000940400-20-000126.txt |
| Complete submission text file | 0000940400-20-000126.txt | | 28529 | edgar/data/814679/0000940400-20-000126.txt |
1 | ADVANCED SERIES TRUST | d855719d485bpos.htm | 485BPOS | 34442 | edgar/data/814679/0001193125-20-001237.txt |
| Complete submission text file | 0001193125-20-001237.txt | | 827013 | edgar/data/814679/0001193125-20-001237.txt |
2 | XBRL INSTANCE DOCUMENT | ast-20191217.xml | EX-101.INS | 26435 | edgar/data/814679/0001193125-20-001237.txt |
3 | XBRL TAXONOMY EXTENSION SCHEMA | ast-20191217.xsd | EX-101.SCH | 7359 | edgar/data/814679/0001193125-20-001237.txt |
4 | XBRL TAXONOMY EXTENSION CALCULATION LINKBASE | ast-20191217_cal.xml | EX-101.CAL | 4779 | edgar/data/814679/0001193125-20-001237.txt |
5 | XBRL TAXONOMY EXTENSION DEFINITION LINKBASE | ast-20191217_def.xml | EX-101.DEF | 216158 | edgar/data/814679/0001193125-20-001237.txt |
6 | XBRL TAXONOMY EXTENSION LABEL LINKBASE | ast-20191217_lab.xml | EX-101.LAB | 139058 | edgar/data/814679/0001193125-20-001237.txt |
7 | XBRL TAXONOMY EXTENSION PRESENTATION LINKBASE | ast-20191217_pre.xml | EX-101.PRE | 166257 | edgar/data/814679/0001193125-20-001237.txt |
1 | ADVANCED STRATEGIES PM SUPPLEMENT | f2312d1.htm | 497 | 36435 | edgar/data/814679/0001683863-20-000018.txt |
2 | GRAPHIC | gu1wmk4v3t13jjhrale0a.jpg | GRAPHIC | 997 | edgar/data/814679/0001683863-20-000018.txt |
| Complete submission text file | 0001683863-20-000018.txt | | 39595 | edgar/data/814679/0001683863-20-000018.txt |
(30 rows)
So the tables are pretty much the same, as I claimed. I have no idea about the very minimal difference in memory usage, however. I will commit the code now.
@iangow
I would rather not do that because, as it is, the filing_docs table is machine-independent, so it can be backed up and moved from one database to another. The moment you put a downloaded column in the table, the table becomes tied to the machine (and more). The reality is that the filing_docs_processed table is not very helpful, as the download code can check the file system about as quickly as PostgreSQL can check the filing_docs_processed table.
Could you supply a reference for this? This is something I didn't know about postgres.
This is not about PostgreSQL, but about EDGAR. The filing_docs table only contains information that can be scraped from EDGAR and does not say anything about what has been downloaded to a separate machine.
For now, I have retained the idea of filing_docs_processed even though it consumes a decent amount of disk space (I guess as much as filing_docs if we download everything, though I don't see that happening any time soon).
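To illustrate the point about checking the file system directly rather than a table of processed files, here is a minimal sketch in Python (not the repository's actual download code). The function name missing_documents and the download_root argument are hypothetical; it assumes each document is saved under a local root directory at the same relative path it has on EDGAR.

```python
from pathlib import Path
from typing import Iterable, List


def missing_documents(rel_paths: Iterable[str], download_root: str) -> List[str]:
    """Return the relative paths that have no file under download_root.

    Assumption: documents are stored locally at download_root/<relative path>,
    mirroring their location on EDGAR. Nothing is read from the database here,
    so the scraped table stays free of machine-specific state.
    """
    root = Path(download_root)
    return [p for p in rel_paths if not (root / p).is_file()]
```

The download step would then loop over whatever missing_documents returns, so a table like filing_docs_processed is only needed if looking up files on disk turns out to be too slow.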
I will close this issue. I will create new issues for the next steps.
@bdcallen
Please prioritise this project. Please advise which days you think you can work on this this week. Please provide a daily update on the days on which you can work on it.
I have made a table comprising a sample of EDGAR filings from 2020. When I apply the download-filing code to these, most of them download. But some do not. It seems the issue is the one that filing_docs_alt was meant to address. However, there are no entries in filing_docs_alt for these filings.
The problem is that nothing is currently populating filing_docs_alt. Please create code that populates tables edgar_test.filing_docs and edgar_test.filing_docs_alt for only the file_name values in edgar_test.test_sample. I will add some clarification in the next little while.
Created on 2020-06-23 by the reprex package (v0.3.0)
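For reference, the path logic implied by the rows in the query output above can be sketched as follows. This is a hedged Python illustration, not the code in the repository: document_path is a hypothetical helper, and it assumes that a non-empty url value is the exceptional location to use, and that otherwise the document sits in a folder named after the accession number with the hyphens removed.

```python
from typing import Optional


def document_path(file_name: str, document: str, url: Optional[str] = None) -> str:
    """Build the EDGAR-relative path for one document of a filing.

    Assumptions (not confirmed by the thread):
    - a non-empty url is the alternative location and takes precedence;
    - otherwise the path is <folder of file_name>/<accession, no hyphens>/<document>.
    """
    if url:
        return url
    # Drop the ".txt" suffix of the complete submission text file.
    stem = file_name[:-4] if file_name.endswith(".txt") else file_name
    head, accession = stem.rsplit("/", 1)
    return f"{head}/{accession.replace('-', '')}/{document}"
```

For the 13F-HR rows above, the default rule is enough for primary_doc.xml, while the primary_doc.html row only resolves through the xslForm13F_X01 path in its url column, which is the kind of case filing_docs_alt would need to cover.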