Problem 4 We are losing a lot of data points in this step

iangow commented 1 year ago

We think the following codes are fine and suspect that the problem occurs when we left_join nyse_splits and nyse_divs

You actually say "the following code is fine …" (code in this context is a mass noun).

I'd be pretty confident that the issue is this line

filter(days_trailing == 365, days_forward == 365) %>%

With mostly data, it makes sense to require 12 months of date before and after. But exactly 365 days is going to be problematic, I'd guess. You may want to aggregate the data by month before calculating.

Alternatively, look at the distribution of days_trailing and days_forward before applying a filter.

iangow commented 1 year ago

@vickyyin1493 @shizhe75 See above

vickyyin1493 commented 1 year ago

We looked at the distribution of days_trailing and days_forward and found that 365 days might be the most appropriate to use as it has the highest occurrence.

(We were trying to have a reproducible example but it takes forever to run, we'll update this once it runs.)

library(DBI)
library(dplyr, warn.conflicts = FALSE)
library(dbplyr)      # For window_order
library(ggplot2)
library(farr)

pg <- dbConnect(RPostgres::Postgres(), bigint = "integer")

rs <- dbExecute(pg, "SET search_path TO crsp")
dsi <- tbl(pg, "dsi")
dsf <- tbl(pg, "dsf")
stocknames <- tbl(pg, "stocknames")
dsedist <- tbl(pg, "dsedist")

splits <-
  dsedist %>%
  filter(dclrdt >= "1990-01-01", dclrdt <= "2022-12-31",
         distcd %in% c(5523L, 5533L),
         facshr >= 0.25) %>%
  mutate(split = TRUE) %>%
  select(permno, split, dclrdt, facshr) 

nyse_stocks <-
  stocknames %>%
  filter(exchcd == 1,
         substr(as.character(shrcd), 1L, 1L) == "1") %>%
  select(permno, namedt, nameenddt) 

nyse_splits_raw <-
  splits %>%
  inner_join(nyse_stocks, by = "permno") %>%
  filter(dclrdt >= namedt, dclrdt <= nameenddt) %>%
  collect() 

div_days <-
  dsedist %>%
  filter(substr(as.character(distcd), 1L, 1L) == "1") %>%
  mutate(day =  dclrdt) %>%
  group_by(permno, day) %>%
  summarize(divamt = sum(divamt, na.rm = TRUE), .groups = "drop")

nyse_divs_raw <-
  dsf %>%
  inner_join(nyse_stocks, by = "permno") %>%
  filter(between(date, namedt, nameenddt)) %>%
  mutate(day =  date) %>%
  select(permno, date, day, cfacshr) %>%
  left_join(div_days, by = c("permno", "day")) %>%
  mutate(divamt = coalesce(divamt/cfacshr, 0)) %>%
  select(permno, day, divamt)

#investigate distribution for days_forward
nyse_divs_raw %>%
  group_by(permno) %>%
  window_order(day) %>%
  window_frame(from = -364, to = 0) %>%
  mutate(div_trailing = sum(divamt, na.rm = TRUE),
         days_trailing = sum(1, na.rm = TRUE)) %>%
  window_frame(from = 1, to = 365) %>%
  mutate(div_forward = sum(divamt, na.rm = TRUE),
         days_forward = sum(1, na.rm = TRUE)) %>%
  ggplot(aes(days_forward)) +
  geom_histogram() 

#investigate distribution for days_trailing
nyse_divs_raw %>%
  group_by(permno) %>%
  window_order(day) %>%
  window_frame(from = -364, to = 0) %>%
  mutate(div_trailing = sum(divamt, na.rm = TRUE),
         days_trailing = sum(1, na.rm = TRUE)) %>%
  window_frame(from = 1, to = 365) %>%
  mutate(div_forward = sum(divamt, na.rm = TRUE),
         days_forward = sum(1, na.rm = TRUE)) %>%
  ggplot(aes(days_trailing)) +
  geom_histogram()

We tried to remove the filter which gave us much more observations, however, there are more FALSEs than TRUEs. We were not sure whether it was appropriate.

@iangow @shizhe75

vickyyin1493 commented 1 year ago

We decided to aggregate by month before calculating. We haven't tested the code yet as the previous sections are still running. We created another issue #12 for (potentially) speeding up this process.

nyse_divs <-
  nyse_divs_raw %>%
  group_by(permno) %>%
  window_order(month) %>%
  window_frame(from = -11, to = 0) %>%
  mutate(mths_trailing = sum(1, na.rm = TRUE)) %>%
  window_frame(from = 1, to = 12) %>%
  mutate(mths_forward = sum(1, na.rm = TRUE)) %>%
  filter(mths_trailing == 12, mths_forward == 12) %>%
  ungroup() %>%
  group_by(permno) %>%
  window_order(day) %>%
  window_frame(from = -364, to = 0) %>%
  mutate(div_trailing = sum(divamt, na.rm = TRUE),
         days_trailing = sum(1, na.rm = TRUE)) %>%
  window_frame(from = 1, to = 365) %>%
  mutate(div_forward = sum(divamt, na.rm = TRUE),
         days_forward = sum(1, na.rm = TRUE)) %>%
  ungroup() %>%
  select(permno, day, div_trailing, div_forward) %>%
  collect()

@iangow @shizhe75

vickyyin1493 / Replication-FFJR

Problem 4 We are losing a lot of data points in this step #9