Closed iangow closed 6 years ago
@bdcallen I made this into an issue as a way of tracking things and keeping things focused. Once you get this going, we could go to the next step (the remaining steps should be pretty easy if you can get this one working).
@iangow I have finished a couple of functions which produce the desired data frame. The function getLastUpdate
getLastUpdate <- function(year, quarter) {
library(rvest)
library(lubridate)
# Scrape the html table from the website for the given year and quarter
filetbl <- read_html(paste("https://www.sec.gov/Archives/edgar/full-index/",
year,"/QTR", quarter, "/",sep="")) %>% html_nodes("table") %>% .[[1]] %>% html_table()
colnames(filetbl)[[3]] <- "Last_Modified"
# company.gz corresponds to the first row of the html table, so take the first element of filetbl$Last_Modified
# to return the date of the last update for the year and quarter
return(mdy_hms(filetbl$Last_Modified[1], tz = "UTC"))
}
gets the time of the last modification to the file company.gz for the given year and quarter, scraped from the SEC website.
The function makeUpdatesDataframe
makeUpdatesDataframe <- function() {
library(lubridate)
library(dplyr)
date <- today(tz = "UTC")
numrows <- (year(date) - 1993) * 4 + quarter(date)
index <- 1:numrows
years <- 1993 + (index - 1)%/%4L
quarters <- 1 + (index - 1)%%4L
updates <- data.frame(year = years, quarter = quarters)
updates <- updates %>% rowwise() %>% mutate(last_modified = getLastUpdate(year, quarter))
updates$year <- as.integer(updates$year)
updates$quarter <- as.integer(updates$quarter)
return(updates)
}
then applies getLastUpdate to each row of a dataframe updates, which initially has two columns year and quarter, to produce the last_modified column, and then returns the data frame. All times are assumed to be in UTC
@bdcallen Thanks. I committed your code. I then modified and expanded on it. See the two commits above. Still a little work to be done to actually get the index files using the results of this.
It may make sense to put the functions from get_filings.R
into a separate file that can be source()
ed by other files (a bit like import some_function from get_filings
in Python).
@bdcallen
Actually, the data are stored in PostgreSQL as timestamp with time zone
(this is as if they are stored in UTC, which I guess they are). The data are output in the local time zone when you look at them. See below for proof:
> Sys.setenv(TZ = "America/New_York")
> head(index_last_modified)
# Source: lazy query [?? x 3]
# Database: postgres 9.6.7 [igow@localhost:5432/crsp]
year quarter last_modified
<int> <int> <dttm>
1 1993 2 2014-04-16 15:58:38
2 1993 3 2014-04-16 15:58:55
3 1993 4 2014-04-16 15:59:12
4 1994 1 2014-04-16 15:59:30
5 1994 2 2014-04-16 15:59:51
6 1994 3 2014-04-16 16:00:10
> Sys.setenv(TZ = "Australia/Melbourne")
> head(index_last_modified)
# Source: lazy query [?? x 3]
# Database: postgres 9.6.7 [igow@localhost:5432/crsp]
year quarter last_modified
<int> <int> <dttm>
1 1993 2 2014-04-17 05:58:38
2 1993 3 2014-04-17 05:58:55
3 1993 4 2014-04-17 05:59:12
4 1994 1 2014-04-17 05:59:30
5 1994 2 2014-04-17 05:59:51
6 1994 3 2014-04-17 06:00:10
And here's the same thing in SQL:
igowmbp15:~ igow$ psql -h iangow.me -d crsp
psql (9.6.5, server 9.6.7)
Type "help" for help.
crsp=# SET time zone 'Australia/Melbourne';
SET
crsp=# SELECT * FROM edgar.index_last_modified WHERE year=1994 AND quarter=3;
year | quarter | last_modified
------+---------+------------------------
1994 | 3 | 2014-04-17 06:00:10+10
(1 row)
crsp=# SET time zone 'America/New_York';
SET
crsp=# SELECT * FROM edgar.index_last_modified WHERE year=1994 AND quarter=3;
year | quarter | last_modified
------+---------+------------------------
1994 | 3 | 2014-04-16 16:00:10-04
(1 row)
Note that the original text scraped from the site is 04/16/2014 04:00:10 PM
.
See here for an example.
Basic idea is to create a data frame with
year
,quarter
, andlast_modified
columns.