mccgr / edgar

Code to manage data related to SEC EDGAR
31 stars 15 forks source link

Scrape "last modified" data for index files. #9

Closed iangow closed 6 years ago

iangow commented 6 years ago

See here for an example.

Basic idea is to create a data frame with year, quarter, and last_modified columns.

iangow commented 6 years ago

@bdcallen I made this into an issue as a way of tracking things and keeping things focused. Once you get this going, we could go to the next step (the remaining steps should be pretty easy if you can get this one working).

bdcallen commented 6 years ago

@iangow I have finished a couple of functions which produce the desired data frame. The function getLastUpdate

getLastUpdate <- function(year, quarter) {

    library(rvest)
    library(lubridate)
    # Scrape the html table from the website for the given year and quarter
    filetbl <- read_html(paste("https://www.sec.gov/Archives/edgar/full-index/",
                               year,"/QTR", quarter, "/",sep="")) %>% html_nodes("table") %>% .[[1]] %>% html_table()
    colnames(filetbl)[[3]] <- "Last_Modified"

    # company.gz corresponds to the first row of the html table, so take the first element of filetbl$Last_Modified
    # to return the date of the last update for the year and quarter
    return(mdy_hms(filetbl$Last_Modified[1], tz = "UTC"))

}

gets the time of the last modification to the file company.gz for the given year and quarter, scraped from the SEC website.

The function makeUpdatesDataframe

makeUpdatesDataframe <- function() {

    library(lubridate)
    library(dplyr)

    date <- today(tz = "UTC")
    numrows <- (year(date) - 1993) * 4 + quarter(date)
    index <- 1:numrows
    years <- 1993 + (index - 1)%/%4L
    quarters <- 1 + (index - 1)%%4L
    updates <- data.frame(year = years, quarter = quarters)
    updates <- updates %>% rowwise() %>% mutate(last_modified = getLastUpdate(year, quarter))
    updates$year <- as.integer(updates$year)
    updates$quarter <- as.integer(updates$quarter)
    return(updates)

}

then applies getLastUpdate to each row of a dataframe updates, which initially has two columns year and quarter, to produce the last_modified column, and then returns the data frame. All times are assumed to be in UTC

iangow commented 6 years ago

@bdcallen Thanks. I committed your code. I then modified and expanded on it. See the two commits above. Still a little work to be done to actually get the index files using the results of this.

It may make sense to put the functions from get_filings.R into a separate file that can be source()ed by other files (a bit like import some_function from get_filings in Python).

iangow commented 6 years ago

@bdcallen

Actually, the data are stored in PostgreSQL as timestamp with time zone (this is as if they are stored in UTC, which I guess they are). The data are output in the local time zone when you look at them. See below for proof:

> Sys.setenv(TZ = "America/New_York")
> head(index_last_modified)
# Source:   lazy query [?? x 3]
# Database: postgres 9.6.7 [igow@localhost:5432/crsp]
   year quarter last_modified      
  <int>   <int> <dttm>             
1  1993       2 2014-04-16 15:58:38
2  1993       3 2014-04-16 15:58:55
3  1993       4 2014-04-16 15:59:12
4  1994       1 2014-04-16 15:59:30
5  1994       2 2014-04-16 15:59:51
6  1994       3 2014-04-16 16:00:10
> Sys.setenv(TZ = "Australia/Melbourne")
> head(index_last_modified)
# Source:   lazy query [?? x 3]
# Database: postgres 9.6.7 [igow@localhost:5432/crsp]
   year quarter last_modified      
  <int>   <int> <dttm>             
1  1993       2 2014-04-17 05:58:38
2  1993       3 2014-04-17 05:58:55
3  1993       4 2014-04-17 05:59:12
4  1994       1 2014-04-17 05:59:30
5  1994       2 2014-04-17 05:59:51
6  1994       3 2014-04-17 06:00:10
iangow commented 6 years ago

And here's the same thing in SQL:

igowmbp15:~ igow$ psql -h iangow.me -d crsp
psql (9.6.5, server 9.6.7)
Type "help" for help.

crsp=# SET time zone 'Australia/Melbourne';
SET
crsp=# SELECT * FROM edgar.index_last_modified WHERE year=1994 AND quarter=3;
 year | quarter |     last_modified      
------+---------+------------------------
 1994 |       3 | 2014-04-17 06:00:10+10
(1 row)

crsp=# SET time zone 'America/New_York';
SET
crsp=# SELECT * FROM edgar.index_last_modified WHERE year=1994 AND quarter=3;
 year | quarter |     last_modified      
------+---------+------------------------
 1994 |       3 | 2014-04-16 16:00:10-04
(1 row)

Note that the original text scraped from the site is 04/16/2014 04:00:10 PM.