Closed bdcallen closed 4 years ago
Are you able to work out the URL for these?
@iangow I've added some alternative functions which handle these cases. In particular, I defined get_filing_docs_alt, which extends get_filing_docs by extracting the hrefs for the documents. At the moment, I have it writing to a new table, filing_docs_alt, to differentiate it from filing_docs.
library(dplyr)       # tibble(), bind_rows(), mutate(), %>%
library(rvest)       # read_html(), html_nodes(), html_table(), html_attr()
library(RPostgreSQL) # dbConnect(), PostgreSQL(), dbWriteTable()

# get_index_url() and fix_names() are helper functions not shown here
get_filing_docs_alt <- function(file_name) {
    # Return TRUE on success and FALSE if any step raises an error
    tryCatch({
        head_url <- get_index_url(file_name)
        table_nodes <-
            read_html(head_url, encoding = "Latin1") %>%
            html_nodes("table")

        if (length(table_nodes) < 1) {
            # No tables on the index page: write a placeholder row
            df <- tibble(seq = NA, description = NA, document = NA, type = NA,
                         size = NA, file_name = file_name, html_link = NA)
        } else {
            df <-
                table_nodes %>%
                html_table() %>%
                bind_rows() %>%
                fix_names() %>%
                mutate(file_name = file_name, type = as.character(type))
            colnames(df) <- tolower(colnames(df))
            # hrefs on the index page are relative to the site root
            hrefs <- table_nodes %>% html_nodes("tr") %>% html_nodes("a") %>% html_attr("href")
            df$html_link <- paste0("https://www.sec.gov", hrefs)
        }

        pg <- dbConnect(PostgreSQL())
        dbWriteTable(pg, c("edgar", "filing_docs_alt"),
                     df, append = TRUE, row.names = FALSE)
        dbDisconnect(pg)
        TRUE
    }, error = function(e) FALSE)
}
I have also defined process_filings_alt:
process_filings_alt <- function(filings_df) {
    pg <- dbConnect(PostgreSQL())
    # Check before scraping whether the table will need to be set up afterwards
    new_table <- !dbExistsTable(pg, c("edgar", "filing_docs_alt"))

    # mclapply() comes from the parallel package
    system.time(temp <- mclapply(filings_df$file_name, get_filing_docs_alt, mc.cores = 24))

    if (new_table) {
        rs <- dbExecute(pg, "CREATE INDEX ON edgar.filing_docs_alt (file_name)")
        rs <- dbExecute(pg, "ALTER TABLE edgar.filing_docs_alt OWNER TO edgar")
        rs <- dbExecute(pg, "GRANT SELECT ON TABLE edgar.filing_docs_alt TO edgar_access")
    }
    rs <- dbDisconnect(pg)
    temp <- unlist(temp)
    return(temp)
}
I don't think we want two sets of code and two tables. I think it would be OK to "backfill" the current table with the extra field. Note that I don't think this has to be the full URL. For example, for
https://www.sec.gov/Archives/edgar/data/1041623/000104544701000021/0001045447-01-000021-0001.htm
I think 0001045447-01-000021-0001.htm would suffice to allow us to generate the link (much as the current code does using the filename). I think using basename() would allow you to extract this piece.
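As a rough illustration of the basename() suggestion, the sketch below extracts the final component of the example URL and then rebuilds the full link from it. The file_name value is an assumption (it is not given in the thread) and simply follows the edgar/data/<cik>/<accession>.txt pattern; the variable names are illustrative only.

```r
# URL taken from the comment above.
url <- "https://www.sec.gov/Archives/edgar/data/1041623/000104544701000021/0001045447-01-000021-0001.htm"

# basename() keeps only the final path component.
document_alt <- basename(url)

# ASSUMPTION: the file_name for this filing, in the usual
# edgar/data/<cik>/<accession>.txt form.
file_name <- "edgar/data/1041623/0001045447-01-000021.txt"

# Drop the dashes and ".txt" from the accession number to get the filing
# directory, then append the stored document name.
filing_dir <- gsub("(\\d{10})-(\\d{2})-(\\d{6})\\.txt$", "\\1\\2\\3", file_name)
rebuilt <- file.path("https://www.sec.gov/Archives", filing_dir, document_alt)

identical(rebuilt, url)
```

If the stored piece plus file_name always round-trips like this, the full URL would not need to be stored.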
@iangow
> filing_docs_alt <- tbl(pg, sql("SELECT * FROM edgar.filing_docs_alt")) %>% collect()
>
> exists_in_mem <- c()
>
> for(i in 1:nrow(filing_docs_alt)) {
+
+
+ path <- get_file_path(filing_docs_alt$file_name[i], filing_docs_alt$document[i])
+ local_filename <- file.path(raw_directory, path)
+ exists_in_mem <- c(exists_in_mem, file.exists(local_filename))
+
+ }
>
> filing_docs_alt$exists_in_mem <- exists_in_mem
>
>
> exists_in_mem_alt <- c()
>
> for(i in 1:nrow(filing_docs_alt)) {
+
+
+
+ local_filename <- file.path(raw_directory, filing_docs_alt$html_link[i])
+ exists_in_mem_alt <- c(exists_in_mem_alt, file.exists(local_filename))
+
+ }
>
> filing_docs_alt$exists_in_mem_alt <- exists_in_mem_alt
>
>
>
> exists_in_mem_alt2 <- c()
>
> for(i in 1:nrow(filing_docs_alt)) {
+
+
+
+ local_filename <- file.path(raw_directory, gsub("https://www.sec.gov/Archives", "", filing_docs_alt$html_link[i]))
+ exists_in_mem_alt2 <- c(exists_in_mem_alt2, file.exists(local_filename))
+
+ }
>
> filing_docs_alt$exists_in_mem_alt2 <- exists_in_mem_alt2
>
>
> sum(filing_docs_alt$exists_in_mem)
[1] 756
>
> sum(filing_docs_alt$exists_in_mem_alt)
[1] 0
>
> sum(filing_docs_alt$exists_in_mem_alt2)
[1] 0
@iangow In the above, I am figuring out which documents from filing_docs_alt actually exist in memory, as I downloaded some documents from here a few months ago. After setting EDGAR_DIR correctly to the 2 terabyte hard drive, I found that 756 documents exist in memory with the addresses of the first type counted in exists_in_mem. These memory addresses are of the same form as for those filing documents which do not exist in filing_docs_alt, i.e. they have the usual form for the html link and have downloaded = TRUE in filing_docs_processed.
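The three loops above grow a vector one element at a time. Since file.path() and file.exists() are both vectorised in base R, the same check can probably be written without a loop. The sketch below uses a temporary folder and made-up relative paths as stand-ins for raw_directory and the real document paths.

```r
# ASSUMPTION: a temporary folder standing in for raw_directory.
raw_directory <- file.path(tempdir(), "edgar_demo")

# Made-up relative paths, for illustration only.
paths <- c("edgar/data/1/a.txt", "edgar/data/1/b.txt")

# Create only the first file so that the two results differ.
dir.create(file.path(raw_directory, "edgar/data/1"),
           recursive = TRUE, showWarnings = FALSE)
invisible(file.create(file.path(raw_directory, paths[1])))

# One vectorised call replaces the whole for loop.
exists_on_disk <- file.exists(file.path(raw_directory, paths))
sum(exists_on_disk)
```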
By "memory", you mean on the HDD?
I think it might be worthwhile also to investigate how quickly we can do HTTP requests to test link validity. In R (it need not be R), this may provide some ideas.
As a start, pull 1000 documents from filing_docs, create URLs from them, then time how long it takes just to check that they're valid.
BTW, I have this line in my ~/.profile:
export EDGAR_DIR=/media/igow/2TB/
And this line in my ~/.Rprofile (this might be empty for you):
Sys.setenv(EDGAR_DIR="/media/igow/2TB/")
So these already point to the "right" place.
@iangow Just defined this function (using the httr package)
get_url_status_code <- function(url) {
    r <- GET(url)
    return(status_code(r))
}
then did
> system.time(stat_vec <- unlist(lapply(filing_docs_alt$html_link, get_url_status_code)))
user system elapsed
12.767 0.357 31.130
> stat_vec
[1] 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200
[46] 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200
[91] 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200
[136] 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200
[181] 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200
[226] 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200
[271] 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200
[316] 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200
[361] 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200
[406] 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429
[451] 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429
[496] 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429
[541] 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429
[586] 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429
[631] 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429
[676] 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429
[721] 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429
[766] 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429
[811] 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429
[856] 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429
[901] 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429
[946] 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429 429
[991] 429 429 429 429 429 429 429 429 429 429
[ reached getOption("max.print") -- omitted 141 entries ]
So it took 31 seconds to do 1141 filing documents, though obviously there are many 429 error codes (429 is the code for Too Many Requests).
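Because the output above switches to 429 and stays there, one option is to pause between requests and stop at the first 429 instead of continuing to send requests. This is only a sketch: check_urls, fake_fetch, and the delay value are hypothetical names and settings, not part of the thread's code or of httr. A real fetch function could wrap the calls already used here, for example function(u) httr::status_code(httr::HEAD(u)).

```r
# Check URLs one at a time, pausing between requests and stopping at the
# first 429 so that the remaining entries stay NA.
check_urls <- function(urls, fetch, delay = 0.2) {
    codes <- rep(NA_integer_, length(urls))
    for (i in seq_along(urls)) {
        codes[i] <- as.integer(fetch(urls[i]))
        if (identical(codes[i], 429L)) break  # rate-limited: stop here
        Sys.sleep(delay)                      # placeholder pause, not a known limit
    }
    codes
}

# Stub standing in for a real HTTP call: the first three requests succeed,
# every later one is rate-limited.
fake_fetch <- local({
    n <- 0
    function(url) {
        n <<- n + 1
        if (n <= 3) 200L else 429L
    }
})

codes <- check_urls(paste0("https://example.com/", 1:6), fake_fetch, delay = 0)
codes
```

The NA entries mark URLs that were never requested, so a later run can pick up where this one stopped.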
OK. Using HEAD instead of GET seems to be faster. Note that things slow down once you start getting 429 codes. The following seems to run fine up to about 100 filings.
I think we could adapt the approach used in scrape_filing_docs.R to create a table of status_codes that we could populate for everything on filing_docs. Then we could delete problematic cases from filing_docs, then run a modified version of scrape_filing_docs.R again. The modified version would get the href and, where this doesn't match the document, put the correct partial URL in a different table or field.
Edited code to use Vectorize:
library(dplyr, warn.conflicts = FALSE)
library(DBI)
library(httr)
pg <- dbConnect(RPostgres::Postgres())
rs <- dbExecute(pg, "SET work_mem = '5GB'")
rs <- dbExecute(pg, "SET search_path TO edgar, public")
filing_docs <- tbl(pg, "filing_docs")
get_html_link <- function(file_name, document) {
    url <- gsub("(\\d{10})-(\\d{2})-(\\d{6})\\.txt", "\\1\\2\\3", file_name)
    file.path("http://www.sec.gov/Archives", file.path(url, document))
}

get_status_code <- function(file_name, document) {
    url <- get_html_link(file_name, document)
    r <- HEAD(url)
    return(status_code(r))
}

get_status_code <- Vectorize(get_status_code)

system.time({
    status_codes <-
        filing_docs %>%
        select(file_name, document) %>%
        collect(n = 100) %>%
        mutate(status_code = get_status_code(file_name, document))
})
#> user system elapsed
#> 0.487 0.036 1.752
status_codes %>% count(status_code)
#> # A tibble: 1 x 2
#> status_code n
#> <int> <int>
#> 1 200 100
Created on 2019-01-24 by the reprex package (v0.2.1)
@iangow I don't seem to be getting the same performance with the above code. I got
> filing_docs <- tbl(pg, sql("SELECT * FROM edgar.filing_docs"))
> system.time({
+ status_codes <-
+ filing_docs %>%
+ select(file_name, document) %>%
+ collect(n = 100) %>%
+ mutate(status_code = get_status_code(file_name, document))
+ })
user system elapsed
0.763 0.195 31.644
> status_codes %>% count(status_code)
# A tibble: 2 x 2
status_code n
<int> <int>
1 200 71
2 403 29
for a batch of 100, and
> system.time({
+ status_codes2 <-
+ filing_docs %>%
+ select(file_name, document) %>%
+ collect(n = 1000) %>%
+ mutate(status_code = get_status_code(file_name, document))
+ })
user system elapsed
7.257 1.457 303.850
> status_codes2 %>% count(status_code)
# A tibble: 2 x 2
status_code n
<int> <int>
1 200 727
2 403 273
for a batch of 1000.
I am already hitting EDGAR from 10.101.13.99, so you will hit 403s sooner if you are coming from that address. Performance seems to take a hit with 403 errors (perhaps the site responds more slowly to stop DoS attacks).
So, maybe try it using RStudio on your computer.
I think there are about 60 million rows in filing_docs. So at 2 seconds per 100, I calculate 60e6/100*2/(24*60*60) = 13.89 days.
So once we have filing_docs done, we could run this code to detect bad links that need to be replaced. No huge hurry on this (I guess 99.9% of links are good as is).
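To show how sensitive that estimate is to throughput, the sketch below repeats the calculation for the two rates reported in this thread: roughly 2 seconds per 100 links, and the 303.850 seconds per 1000 links measured in the slower run (which included 403 responses). The helper name days_to_check is illustrative, and the 60 million row count is the approximate figure quoted above.

```r
# Days needed to check n_rows links when `per` links take `seconds` seconds.
days_to_check <- function(n_rows, seconds, per) {
    n_rows / per * seconds / (24 * 60 * 60)
}

n_rows <- 60e6  # approximate size of filing_docs, as quoted above

fast <- days_to_check(n_rows, seconds = 2, per = 100)         # rate from the reprex
slow <- days_to_check(n_rows, seconds = 303.850, per = 1000)  # rate from the slower run

round(c(fast = fast, slow = slow), 2)
```

At the slower rate the full pass would take on the order of 211 days rather than 14, so avoiding the 403s matters far more than the choice between HEAD and GET.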
@bdcallen Are we able to identify problematic filings somehow? You said above:
@iangow I ran the program to download the item 5's and 8's on the weekend, this time with EDGAR_DIR set correctly. All but a few hundred filings were downloaded, so I explored why these few hundred weren't getting downloaded. It turns out these filings often have html links which do not conform to the edgar/data/[filing info]/[document_name] form which is extracted by the function get_file_path, like this one.
How did you identify "these few hundred"?
I think the right idea would be to have a function that can handle these cases and to have a table that identifies the filings with HTML links that do not conform to the edgar/data/[filing info]/[document_name] form. With such a table it should be easy to make a virtual table that contains everything.
@bdcallen Also see #39.
@iangow This issue has come up again in my downloading of Schedules 13D and 13G. About 40,000 documents didn't get downloaded, and the same issue occurred. Essentially, I identified these cases by looking at the entries for which the downloaded field in filing_docs_processed was FALSE, then looking at some of these cases manually, like this one, and looking at the html code for the tables. Note that if you click on the 0001.txt document for this filing, the URL you are taken to is not of the standard form. I wrote some functions which can handle and download these cases a while ago (see some of the posts above). This issue (and #39, I guess) is just a case of deciding what we want the table or column (I initially proposed a new column to handle this) to look like for these cases.
I don't see 40,000 cases with downloaded equal to FALSE in filing_docs_processed:
library(dplyr, warn.conflicts = FALSE)
library(DBI)
pg <- dbConnect(RPostgres::Postgres())
rs <- dbExecute(pg, "SET work_mem='3GB'")
rs <- dbExecute(pg, "SET search_path TO edgar")
filing_docs_processed <- tbl(pg, "filing_docs_processed")
problems <-
filing_docs_processed %>%
filter(!downloaded)
problems %>%
count(document) %>%
arrange(desc(n))
#> # Source: lazy query [?? x 2]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#> # Ordered by: desc(n)
#> document n
#> <chr> <int64>
#> 1 0001.txt 398
#> 2 0002.txt 85
#> 3 0003.txt 37
#> 4 0004.txt 16
#> 5 0005.txt 10
#> 6 0006.txt 4
#> 7 0007.txt 3
#> 8 mbot_sc13d.txt 2
#> 9 0001172661-18-000468.txt 2
#> 10 0001748828-18-000002.txt 2
#> # … with more rows
problems %>%
count()
#> # Source: lazy query [?? x 1]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#> n
#> <int64>
#> 1 566
Created on 2020-01-07 by the reprex package (v0.3.0)
If all the problem cases can be flagged by looking at filing_docs_processed, then one approach would be to write a small function that takes the problem cases, figures out the correct URL (could be extracted from the link on the HTML index page), and downloads from there. I think if we thereby created a table filing_docs_non_standard with file_name, document, and (correct!) url, we would have all we need. I think the downloaded file should be stored in (say) edgar/data/921030/000092103000000039/0000921030-00-000039-0001.txt so that our file structure mimics that of the SEC.
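One way a table like filing_docs_non_standard could be combined with filing_docs is a left join that falls back to the derived standard path when there is no exception row. The sketch below does this in base R on made-up rows. The data, the choice to store url as a partial path, and the combined name are assumptions for illustration, not the project's actual schema.

```r
# Made-up rows standing in for filing_docs (key columns only).
filing_docs <- data.frame(
    file_name = c("edgar/data/1/0000000001-00-000001.txt",
                  "edgar/data/2/0000000002-00-000002.txt"),
    document = c("doc.htm", "0001.txt"),
    stringsAsFactors = FALSE
)

# Hypothetical filing_docs_non_standard: one row per exception.
filing_docs_non_standard <- data.frame(
    file_name = "edgar/data/2/0000000002-00-000002.txt",
    document = "0001.txt",
    url = "edgar/data/2/000000000200000002/0000000002-00-000002-0001.txt",
    stringsAsFactors = FALSE
)

# Left join: every filing_docs row is kept; url is NA without an exception.
combined <- merge(filing_docs, filing_docs_non_standard,
                  by = c("file_name", "document"), all.x = TRUE)

# Standard path: accession number without dashes, then the document name.
standard <- file.path(
    gsub("(\\d{10})-(\\d{2})-(\\d{6})\\.txt$", "\\1\\2\\3", combined$file_name),
    combined$document
)
combined$url <- ifelse(is.na(combined$url), standard, combined$url)
combined
```

In the database itself the same idea would presumably be a view built from a LEFT JOIN with COALESCE, which keeps the extra column out of the main table.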
… a case of deciding what we want the table or column (I initially proposed a new column to handle this) to look like for these cases.
I think adding a column would be redundant in the overwhelming majority of cases. If you can make the "filing_docs_non_standard" table above, we can make a view that has the additional column easily.
@iangow Sorry, I deleted the 40,000 cases where downloaded is false, as I wanted to run the program over that set again. Seeing as the majority were not getting downloaded, I halted the program, so that's why there are only a few hundred now.
I agree, an additional table is preferable, since for the vast majority of cases the information contained in an additional column would be redundant. I actually made a couple of tables, filing_docs_alt and edgar.filing_docs_processed_alt:
crsp=> SELECT * FROM edgar.filing_docs_processed_alt LIMIT 10;
file_name | document | downloaded
---------------------------------------------+----------+------------
edgar/data/1021949/0000950128-00-000880.txt | 0001.htm | t
edgar/data/1028262/0000897101-00-001194.txt | 0001.htm | t
edgar/data/1028262/0000897101-00-001194.txt | 0002.htm | t
edgar/data/1012770/0000950128-00-000892.txt | 0001.htm | t
edgar/data/1003130/0000906318-00-000064.txt | 0001.htm | t
edgar/data/1007797/0001068238-01-000055.txt | 0001.htm | t
edgar/data/1021848/0000899681-00-000451.txt | 0001.htm | t
edgar/data/1021848/0000899681-00-000451.txt | 0002.htm | t
edgar/data/1021848/0000899681-01-000008.txt | 0001.htm | t
edgar/data/1021848/0000899681-01-000008.txt | 0002.htm | t
(10 rows)
crsp=> SELECT * FROM edgar.filing_docs_alt LIMIT 10;
seq | description | document | type | size | file_name | html_link
-----+-------------------------------------------+--------------------------+---------+-------+---------------------------------------------+--------------------------------------------------------------------------------------------------
1 | MELLON PREMIUM FINANCE LOAN MASTER TRUST | 0001.htm | 8-K | 17475 | edgar/data/1021949/0000950128-00-000880.txt | https://www.sec.gov/Archives/edgar/data/1021949/000095012800000880/0000950128-00-000880-0001.htm
2 | UPDATED TABLES | 0002.txt | EX-19.1 | 15430 | edgar/data/1021949/0000950128-00-000880.txt | https://www.sec.gov/Archives/edgar/data/1021949/000095012800000880/0000950128-00-000880-0002.txt
| Complete submission text file | 0000950128-00-000880.txt | | 34801 | edgar/data/1021949/0000950128-00-000880.txt | https://www.sec.gov/Archives/edgar/data/1021949/000095012800000880/0000950128-00-000880.txt
1 | | 0001.htm | 8-K | 6201 | edgar/data/1028262/0000897101-00-001194.txt | https://www.sec.gov/Archives/edgar/data/1028262/000089710100001194/0000897101-00-001194-0001.htm
2 | PRESS RELEASE | 0002.htm | EX-99.1 | 3736 | edgar/data/1028262/0000897101-00-001194.txt | https://www.sec.gov/Archives/edgar/data/1028262/000089710100001194/0000897101-00-001194-0002.htm
| Complete submission text file | 0000897101-00-001194.txt | | 11501 | edgar/data/1028262/0000897101-00-001194.txt | https://www.sec.gov/Archives/edgar/data/1028262/000089710100001194/0000897101-00-001194.txt
1 | MELLON BANK HOME EQUITY LOAN TRUST 1996-1 | 0001.htm | 8-K | 6729 | edgar/data/1012770/0000950128-00-000892.txt | https://www.sec.gov/Archives/edgar/data/1012770/000095012800000892/0000950128-00-000892-0001.htm
2 | MONTHLY SERVICER & INVESTOR REPORT | 0002.txt | EX-20 | 51932 | edgar/data/1012770/0000950128-00-000892.txt | https://www.sec.gov/Archives/edgar/data/1012770/000095012800000892/0000950128-00-000892-0002.txt
| Complete submission text file | 0000950128-00-000892.txt | | 60507 | edgar/data/1012770/0000950128-00-000892.txt | https://www.sec.gov/Archives/edgar/data/1012770/000095012800000892/0000950128-00-000892.txt
1 | | 0001.htm | 8-K | 6952 | edgar/data/1003130/0000906318-00-000064.txt | https://www.sec.gov/Archives/edgar/data/1003130/000090631800000064/0000906318-00-000064-0001.htm
I made these to test this idea when I was helping James with downloading the 8-Ks (which was a while ago). Furthermore, it seems I also included code for this job in the file filing_docs/download_filing_doc_exceptions.R.
@iangow I have amended filing_docs/download_filing_doc_exceptions.R in the commit above, so that it writes to a new form I've decided on for filing_docs_alt. I have deleted the old filing_docs_alt. The new filing_docs_alt is intended to have a column path_alt, which is the html link with the https://www.sec.gov/Archives/ stem removed and doubles as the path to the downloaded file. It also contains its own downloaded column, which is set to true if the file is downloaded successfully. I'm running the program now, and it seems to be mostly doing the right thing (apart from around 200 duplicates it wrote at the start of the run, which it is no longer producing). It should take around 6.5 hours to run. I will analyze the table and results tomorrow.
@iangow After fixing some errors in the table joins in download_filing_doc_exceptions.R, which were leading to unnecessary duplication in filing_docs_alt, I got rid of the duplication and then ran the program. It has run successfully, with just a handful of documents not being processed:
crsp=# SELECT COUNT(*) FROM edgar.filing_docs_alt;
count
-------
47799
(1 row)
crsp=# SELECT COUNT(*) FROM edgar.filing_docs_processed
crsp-# WHERE NOT downloaded;
count
-------
47807
(1 row)
Furthermore, doing
failed_to_download <- tbl(pg, sql("SELECT * FROM edgar.filing_docs_processed WHERE NOT downloaded"))
filing_docs_alt <- tbl(pg, sql("SELECT * FROM edgar.filing_docs_alt"))
files <-
    failed_to_download %>%
    anti_join(filing_docs_alt, by = c("file_name", "document")) %>%
    filter(document %~*% "txt$") %>%
    collect()
showed that the eight unprocessed documents (47,807 - 47,799 = 8) are
> files
# A tibble: 8 x 3
file_name document downloaded
* <chr> <chr> <lgl>
1 edgar/data/1748828/0001748828-18-000002.txt mbot_sc13d.txt FALSE
2 edgar/data/1748828/0001748828-18-000002.txt 0001748828-18-000002.txt FALSE
3 edgar/data/883975/0001748828-18-000002.txt mbot_sc13d.txt FALSE
4 edgar/data/883975/0001748828-18-000002.txt 0001748828-18-000002.txt FALSE
5 edgar/data/1094742/0001172661-17-000448.txt 0001172661-17-000448.txt FALSE
6 edgar/data/1332905/0001172661-17-000448.txt 0001172661-17-000448.txt FALSE
7 edgar/data/1094742/0001172661-18-000468.txt 0001172661-18-000468.txt FALSE
8 edgar/data/1332905/0001172661-18-000468.txt 0001172661-18-000468.txt FALSE
Having had a look at the pages of the filings involved, it seems these are filings which are not on the system anymore (i.e. they lead to the temporarily unavailable page).
@iangow Also, the downloading was very successful; there are only 24 entries for which downloaded is false:
crsp=# SELECT * FROM edgar.filing_docs_alt WHERE NOT downloaded;
file_name | document | downloaded | seq | description | type | size | path_alt
---------------------------------------------+--------------------------+------------+-----+-------------------------------+----------+--------+----------------------------------------------------------------
edgar/data/919549/0000065103-95-000120.txt | 0000065103-95-000120.txt | f | | Complete submission text file | | 0 | edgar/data/65100/0000065103-95-000120.txt
edgar/data/65100/0000065103-95-000120.txt | 0000065103-95-000120.txt | f | | Complete submission text file | | 0 | edgar/data/65100/0000065103-95-000120.txt
edgar/data/909465/0000909465-95-000005.txt | 0000909465-95-000005.txt | f | | Complete submission text file | | 127586 | edgar/data/789625/0000909465-95-000005.txt
edgar/data/789625/0000909465-95-000005.txt | 0000909465-95-000005.txt | f | | Complete submission text file | | 127586 | edgar/data/789625/0000909465-95-000005.txt
edgar/data/887777/0000887777-03-000003.txt | phar2-2003.txt | f | 1 | | SC 13G | 8573 | edgar/data/887777/000088777703000003/phar2-2003.txt
edgar/data/1072546/0000887777-03-000003.txt | phar2-2003.txt | f | 1 | | SC 13G | 8573 | edgar/data/887777/000088777703000003/phar2-2003.txt
edgar/data/1256394/0001256394-03-000004.txt | her2003.txt | f | 1 | | SC 13G | 7014 | edgar/data/1021604/000125639403000004/her2003.txt
edgar/data/1021604/0001256394-03-000004.txt | her2003.txt | f | 1 | | SC 13G | 7014 | edgar/data/1021604/000125639403000004/her2003.txt
edgar/data/1063085/0001172661-03-000034.txt | westcap1103.txt | f | 1 | FORM 13G HOLDING REPORT | SC 13G | 519 | edgar/data/1010614/000117266103000034/westcap1103.txt
edgar/data/1010614/0001172661-03-000034.txt | westcap1103.txt | f | 1 | FORM 13G HOLDING REPORT | SC 13G | 519 | edgar/data/1010614/000117266103000034/westcap1103.txt
edgar/data/1010614/0001172661-03-000035.txt | westcap11603.txt | f | 1 | FORM 13G HOLDINGS REPORT | SC 13G | 512 | edgar/data/722392/000117266103000035/westcap11603.txt
edgar/data/1010614/0001172661-03-000050.txt | westcap1203.txt | f | 1 | FORM 13G FILING | SC 13G | 527 | edgar/data/870826/000117266103000050/westcap1203.txt
edgar/data/1013149/0001172661-04-000001.txt | westcap0104.txt | f | 1 | FORM 13G HOLDINGS REPORT | SC 13G | 507 | edgar/data/1010614/000117266104000001/westcap0104.txt
edgar/data/722392/0001172661-03-000035.txt | westcap11603.txt | f | 1 | FORM 13G HOLDINGS REPORT | SC 13G | 512 | edgar/data/722392/000117266103000035/westcap11603.txt
edgar/data/870826/0001172661-03-000050.txt | westcap1203.txt | f | 1 | FORM 13G FILING | SC 13G | 527 | edgar/data/870826/000117266103000050/westcap1203.txt
edgar/data/1010614/0001172661-04-000001.txt | westcap0104.txt | f | 1 | FORM 13G HOLDINGS REPORT | SC 13G | 507 | edgar/data/1010614/000117266104000001/westcap0104.txt
edgar/data/1072006/0001072006-01-500007.txt | form13da-091701.txt | f | 1 | BLUE LAKE 13D/A | SC 13D/A | 8231 | edgar/data/1000285/000107200601500007/form13da-091701.txt
edgar/data/1072006/0001072006-01-500007.txt | 0001072006-01-500007.txt | f | | Complete submission text file | | 9820 | edgar/data/1000285/000107200601500007/0001072006-01-500007.txt
edgar/data/1000285/0001072006-01-500007.txt | form13da-091701.txt | f | 1 | BLUE LAKE 13D/A | SC 13D/A | 8231 | edgar/data/1000285/000107200601500007/form13da-091701.txt
edgar/data/1000285/0001072006-01-500007.txt | 0001072006-01-500007.txt | f | | Complete submission text file | | 9820 | edgar/data/1000285/000107200601500007/0001072006-01-500007.txt
edgar/data/37785/0000940180-01-500261.txt | dsc13d.txt | f | 1 | SCHEDULE 13D | SC 13D | 31868 | edgar/data/37785/000094018001500261/dsc13d.txt
edgar/data/37785/0000940180-01-500261.txt | 0000940180-01-500261.txt | f | | Complete submission text file | | 33505 | edgar/data/37785/000094018001500261/0000940180-01-500261.txt
edgar/data/906193/0000940180-01-500261.txt | dsc13d.txt | f | 1 | SCHEDULE 13D | SC 13D | 31868 | edgar/data/37785/000094018001500261/dsc13d.txt
edgar/data/906193/0000940180-01-500261.txt | 0000940180-01-500261.txt | f | | Complete submission text file | | 33505 | edgar/data/37785/000094018001500261/0000940180-01-500261.txt
(24 rows)
I have checked these documents by path_alt, and they all lead to pages like this one, so they seem to be cases that genuinely should have downloaded set to false.
@iangow
crsp=# SELECT * FROM edgar.filing_docs_alt
WHERE path_alt ~ 'edgar/data/[0-9]+/[0-9]+/[0-9]{10}-[0-9]{2}-[0-9]{6}\.txt';
file_name | document | downloaded | seq | description | type | size | path_alt
---------------------------------------------+--------------------------+------------+-----+-------------------------------+------+--------+----------------------------------------------------------------
edgar/data/1064122/0000806085-08-000112.txt | 0000806085-08-000112.txt | t | | Complete submission text file | | 138677 | edgar/data/806085/000080608508000112/0000806085-08-000112.txt
edgar/data/1072006/0001072006-01-500007.txt | 0001072006-01-500007.txt | f | | Complete submission text file | | 9820 | edgar/data/1000285/000107200601500007/0001072006-01-500007.txt
edgar/data/1000285/0001072006-01-500007.txt | 0001072006-01-500007.txt | f | | Complete submission text file | | 9820 | edgar/data/1000285/000107200601500007/0001072006-01-500007.txt
edgar/data/37785/0000940180-01-500261.txt | 0000940180-01-500261.txt | f | | Complete submission text file | | 33505 | edgar/data/37785/000094018001500261/0000940180-01-500261.txt
edgar/data/906193/0000940180-01-500261.txt | 0000940180-01-500261.txt | f | | Complete submission text file | | 33505 | edgar/data/37785/000094018001500261/0000940180-01-500261.txt
(5 rows)
So there are 5 entries in which path_alt has the usual form for the link to a document. Only the following two
file_name | document | downloaded | seq | description | type | size | path_alt
---------------------------------------------+--------------------------+------------+-----+-------------------------------+------+--------+----------------------------------------------------------------
edgar/data/1000285/0001072006-01-500007.txt | 0001072006-01-500007.txt | f | | Complete submission text file | | 9820 | edgar/data/1000285/000107200601500007/0001072006-01-500007.txt
edgar/data/37785/0000940180-01-500261.txt | 0000940180-01-500261.txt | f | | Complete submission text file | | 33505 | edgar/data/37785/000094018001500261/0000940180-01-500261.txt
match the usual form derived from their file_name and document. It's worth mentioning that these two failed to download. So it is possible for cases with the usual form to get written into filing_docs_alt. It seems, however, that these are cases in which there is no document to be downloaded from the link, that there are relatively few of them, and that they just correspond to a reprocessing by the program which constructs filing_docs_alt. If we can accept this, I think we can close this issue. Otherwise, I could do one last commit to make sure these cases don't appear in the table.
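The check described above (does path_alt equal the path one would derive from file_name and document?) can be sketched in a few lines of base R. The two rows are copied from the query output above. standard_path is a hypothetical helper that assumes the usual form is edgar/data/<cik>/<accession without dashes>/<document>, which is how get_html_link builds its URLs; get_file_path itself is not shown in this thread.

```r
# ASSUMPTION: the "usual form" is the filing directory (accession number
# without dashes or ".txt") followed by the document name.
standard_path <- function(file_name, document) {
    file.path(
        gsub("(\\d{10})-(\\d{2})-(\\d{6})\\.txt$", "\\1\\2\\3", file_name),
        document
    )
}

# Two rows from the query output above (same filing, two CIKs).
rows <- data.frame(
    file_name = c("edgar/data/1072006/0001072006-01-500007.txt",
                  "edgar/data/1000285/0001072006-01-500007.txt"),
    document = "0001072006-01-500007.txt",
    path_alt = "edgar/data/1000285/000107200601500007/0001072006-01-500007.txt",
    stringsAsFactors = FALSE
)

# TRUE where path_alt is exactly what the standard derivation would give.
rows$is_standard <- standard_path(rows$file_name, rows$document) == rows$path_alt
rows[, c("file_name", "is_standard")]
```

Rows flagged TRUE are the ones a final commit could filter out before writing to filing_docs_alt.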
@bdcallen I agree that this is fine.