vickyyin1493 / Replication-FFJR

0 stars 0 forks source link

Left_join will give us all the splits but inner_join will drop 12 splits #7

Closed iangow closed 1 year ago

iangow commented 1 year ago

This is related to #5, but more focused.

Here you need to zoom in on the specific issue. The following code does this. Nothing extraneous; just the code needed to reproduce the issue.

From the following, it seems that there are cases where dclrdt is not on day_indexes. For example, the announcements occur on a weekend or another non-trading date. This is precisely the issue addressed by the code below, which comes from here. This was not discussed in the FFJR chapter, as it is introduced later.

annc_dates <-
  tibble(annc_date = seq(min(trading_dates$date), 
                         max(trading_dates$date), 1)) %>%
  left_join(trading_dates, by = join_by(annc_date == date)) %>%
  fill(td, .direction = "up")

But there's a function get_event_dates() in the farr package that implements this. So you could use this.

nyse_splits <-
  nyse_splits_raw %>%
  collect() %>%
  get_event_dates(pg, permno = "permno", event_date = "dclrdt") %>%
  rename(ex_day_index = start_date) %>%
  select(permno, ex_day_index)

Using this gives you announcement dates for all rows.

Minimal code to produce issue

library(DBI)
library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)      # For window_order

pg <- dbConnect(RPostgres::Postgres(), bigint = "integer")

rs <- dbExecute(pg, "SET search_path TO crsp")
dsi <- tbl(pg, "dsi")
stocknames <- tbl(pg, "stocknames")
dsedist <- tbl(pg, "dsedist")

splits <-
  dsedist %>%
  filter(dclrdt >= "1990-01-01", dclrdt <= "2022-12-31",
         distcd %in% c(5523L, 5533L),
         facshr >= 0.25) %>%
  mutate(split = TRUE) %>%
  select(permno, split, dclrdt, facshr) 

nyse_stocks <-
  stocknames %>%
  filter(exchcd == 1,
         substr(as.character(shrcd), 1L, 1L) == "1") %>%
  select(permno, namedt, nameenddt) 

nyse_splits_raw <-
  splits %>%
  inner_join(nyse_stocks, by = "permno") %>%
  filter(dclrdt >= namedt, dclrdt <= nameenddt) 

## Problem 1: Left_join will give us all the splits but inner_join will drop 12 splits.

#| include: false
day_indexes <-
  dsi %>%
  mutate(day = date) %>%
  window_order(day) %>%
  mutate(day_index = row_number()) %>%
  select(date, day, day_index)

nyse_splits_inner <-
  nyse_splits_raw %>%
  mutate(day = dclrdt) %>%
  inner_join(day_indexes, by = "day") %>% 
  rename(ex_day_index = day_index) %>% 
  select(-namedt, -nameenddt, -date) %>%
  compute()

nyse_splits_left <-
  nyse_splits_raw %>%
  mutate(day = dclrdt) %>%
  left_join(day_indexes, by = "day") %>% 
  rename(ex_day_index = day_index) %>% 
  select(-namedt, -nameenddt, -date) %>%
  compute()

nyse_splits_left %>%
  anti_join(nyse_splits_inner) %>%
  collect()
#> Joining with `by = join_by(permno, split, dclrdt, facshr, day, ex_day_index)`
#> # A tibble: 12 × 6
#>    permno split dclrdt     facshr day        ex_day_index
#>     <int> <lgl> <date>      <dbl> <date>            <int>
#>  1  80599 TRUE  2000-09-23    1   2000-09-23           NA
#>  2  70332 TRUE  1998-05-30    1   1998-05-30           NA
#>  3  34948 TRUE  2003-10-05    1   2003-10-05           NA
#>  4  49429 TRUE  1992-05-16    2   1992-05-16           NA
#>  5  24248 TRUE  1999-01-18    1   1999-01-18           NA
#>  6  20124 TRUE  1998-05-10    2   1998-05-10           NA
#>  7  32986 TRUE  1994-04-16    0.5 1994-04-16           NA
#>  8  11403 TRUE  1996-05-26    0.5 1996-05-26           NA
#>  9  12140 TRUE  1996-02-19    1   1996-02-19           NA
#> 10  91393 TRUE  2014-05-03    2   2014-05-03           NA
#> 11  77584 TRUE  1998-03-22    0.5 1998-03-22           NA
#> 12  14714 TRUE  2021-10-30    3   2021-10-30           NA

Created on 2023-06-08 with reprex v2.0.2

Code with correction

You can see at the bottom here that I have all rows without needing a left_join. If you check you will see no rows with NA values in ex_day_index. This may be what is causing your later issue when you are using left_join.

library(DBI)
library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)      # For window_order
library(farr)
#> 
#> Attaching package: 'farr'
#> The following object is masked from 'package:base':
#> 
#>     truncate

pg <- dbConnect(RPostgres::Postgres(), bigint = "integer")

rs <- dbExecute(pg, "SET search_path TO crsp")
dsi <- tbl(pg, "dsi")
stocknames <- tbl(pg, "stocknames")
dsedist <- tbl(pg, "dsedist")

splits <-
  dsedist %>%
  filter(dclrdt >= "1990-01-01", dclrdt <= "2022-12-31",
         distcd %in% c(5523L, 5533L),
         facshr >= 0.25) %>%
  mutate(split = TRUE) %>%
  select(permno, split, dclrdt, facshr) 

nyse_stocks <-
  stocknames %>%
  filter(exchcd == 1,
         substr(as.character(shrcd), 1L, 1L) == "1") %>%
  select(permno, namedt, nameenddt) 

nyse_splits_raw <-
  splits %>%
  inner_join(nyse_stocks, by = "permno") %>%
  filter(dclrdt >= namedt, dclrdt <= nameenddt) %>%
  collect()

## Problem 1: Left_join will give us all the splits but inner_join will drop 12 splits.

#| include: false
day_indexes <-
  dsi %>%
  mutate(day = date) %>%
  window_order(day) %>%
  mutate(day_index = row_number()) %>%
  select(date, day, day_index) %>%
  collect()

nyse_splits <-
  nyse_splits_raw %>%
  get_event_dates(pg, permno = "permno", event_date = "dclrdt") %>%
  inner_join(nyse_splits_raw, by = join_by(permno, dclrdt)) %>%
  rename(ex_day_index = start_date) %>%
  select(permno, ex_day_index, dclrdt, facshr, split, ex_day_index)

nyse_splits_left <-
  nyse_splits_raw %>%
  mutate(day = dclrdt) %>%
  left_join(day_indexes, by = "day") %>% 
  rename(ex_day_index = day_index) %>% 
  select(-namedt, -nameenddt, -date) 

nyse_splits_left %>%
  anti_join(nyse_splits, by = join_by(permno, split, dclrdt, facshr))
#> # A tibble: 0 × 6
#> # ℹ 6 variables: permno <int>, split <lgl>, dclrdt <date>, facshr <dbl>,
#> #   day <date>, ex_day_index <int>

Created on 2023-06-08 with reprex v2.0.2

iangow commented 1 year ago

@vickyyin1493 @shizhe75 See above.

vickyyin1493 commented 1 year ago

We were wondering why are we renaming start_date as ex_day_index in the following code:

nyse_splits <- nyse_splits_raw %>% get_event_dates(pg, permno = "permno", event_date = "dclrdt") %>% inner_join(nyse_splits_raw, by = join_by(permno, dclrdt)) %>% rename(ex_day_index = start_date) %>% select(permno, ex_day_index, dclrdt, facshr, split, ex_day_index)

When we proceeded, this caused problems with the calculation for day_rel_ex as ex_day_index is a date rather than an integer.

ℹ In argument: day_rel_ex = day_index - ex_day_index. Caused by error in -.Date: ! can only subtract from "Date" objects Backtrace:

  1. ... %>% select(permno, dclrdt, date, ret, day_rel_ex)
    1. base::-.Date(day_index, ex_day_index)
    2. base::stop("can only subtract from \"Date\" objects")

@iangow @shizhe75

iangow commented 1 year ago

You might need to use get_event_rets(), not get_event_dates() to get relative_td and rename that to ex_day_index.

iangow commented 1 year ago

See here for an illustration of get_event_rets().

iangow commented 1 year ago

Alternatively, the function allows one to turn a date into a "trading date":

> farr::get_trading_dates(pg)
# A tibble: 25,549 × 2
   date          td
   <date>     <int>
 1 1925-12-31     1
 2 1926-01-02     2
 3 1926-01-04     3
 4 1926-01-05     4
 5 1926-01-06     5
 6 1926-01-07     6
 7 1926-01-08     7
 8 1926-01-09     8
 9 1926-01-11     9
10 1926-01-12    10
# ℹ 25,539 more rows
# ℹ Use `print(n = ...)` to see more rows
iangow commented 1 year ago

Actually looking at your code, you might just use day_indexes (inner_join() should now work), as it's effectively the same thing as farr::get_trading_dates(pg).