Open iangow opened 1 year ago
@vickyyin1493 @shizhe75 See above
We looked at the distribution of days_trailing
and days_forward
and found that 365 days might be the most appropriate to use as it has the highest occurrence.
(We were trying to have a reproducible example but it takes forever to run, we'll update this once it runs.)
library(DBI)
library(dplyr, warn.conflicts = FALSE)
library(dbplyr) # For window_order
library(ggplot2)
library(farr)
pg <- dbConnect(RPostgres::Postgres(), bigint = "integer")
rs <- dbExecute(pg, "SET search_path TO crsp")
dsi <- tbl(pg, "dsi")
dsf <- tbl(pg, "dsf")
stocknames <- tbl(pg, "stocknames")
dsedist <- tbl(pg, "dsedist")
splits <-
dsedist %>%
filter(dclrdt >= "1990-01-01", dclrdt <= "2022-12-31",
distcd %in% c(5523L, 5533L),
facshr >= 0.25) %>%
mutate(split = TRUE) %>%
select(permno, split, dclrdt, facshr)
nyse_stocks <-
stocknames %>%
filter(exchcd == 1,
substr(as.character(shrcd), 1L, 1L) == "1") %>%
select(permno, namedt, nameenddt)
nyse_splits_raw <-
splits %>%
inner_join(nyse_stocks, by = "permno") %>%
filter(dclrdt >= namedt, dclrdt <= nameenddt) %>%
collect()
div_days <-
dsedist %>%
filter(substr(as.character(distcd), 1L, 1L) == "1") %>%
mutate(day = dclrdt) %>%
group_by(permno, day) %>%
summarize(divamt = sum(divamt, na.rm = TRUE), .groups = "drop")
nyse_divs_raw <-
dsf %>%
inner_join(nyse_stocks, by = "permno") %>%
filter(between(date, namedt, nameenddt)) %>%
mutate(day = date) %>%
select(permno, date, day, cfacshr) %>%
left_join(div_days, by = c("permno", "day")) %>%
mutate(divamt = coalesce(divamt/cfacshr, 0)) %>%
select(permno, day, divamt)
#investigate distribution for days_forward
nyse_divs_raw %>%
group_by(permno) %>%
window_order(day) %>%
window_frame(from = -364, to = 0) %>%
mutate(div_trailing = sum(divamt, na.rm = TRUE),
days_trailing = sum(1, na.rm = TRUE)) %>%
window_frame(from = 1, to = 365) %>%
mutate(div_forward = sum(divamt, na.rm = TRUE),
days_forward = sum(1, na.rm = TRUE)) %>%
ggplot(aes(days_forward)) +
geom_histogram()
#investigate distribution for days_trailing
nyse_divs_raw %>%
group_by(permno) %>%
window_order(day) %>%
window_frame(from = -364, to = 0) %>%
mutate(div_trailing = sum(divamt, na.rm = TRUE),
days_trailing = sum(1, na.rm = TRUE)) %>%
window_frame(from = 1, to = 365) %>%
mutate(div_forward = sum(divamt, na.rm = TRUE),
days_forward = sum(1, na.rm = TRUE)) %>%
ggplot(aes(days_trailing)) +
geom_histogram()
We tried to remove the filter which gave us much more observations, however, there are more FALSEs than TRUEs. We were not sure whether it was appropriate.
@iangow @shizhe75
We decided to aggregate by month before calculating. We haven't tested the code yet as the previous sections are still running. We created another issue #12 for (potentially) speeding up this process.
nyse_divs <-
nyse_divs_raw %>%
group_by(permno) %>%
window_order(month) %>%
window_frame(from = -11, to = 0) %>%
mutate(mths_trailing = sum(1, na.rm = TRUE)) %>%
window_frame(from = 1, to = 12) %>%
mutate(mths_forward = sum(1, na.rm = TRUE)) %>%
filter(mths_trailing == 12, mths_forward == 12) %>%
ungroup() %>%
group_by(permno) %>%
window_order(day) %>%
window_frame(from = -364, to = 0) %>%
mutate(div_trailing = sum(divamt, na.rm = TRUE),
days_trailing = sum(1, na.rm = TRUE)) %>%
window_frame(from = 1, to = 365) %>%
mutate(div_forward = sum(divamt, na.rm = TRUE),
days_forward = sum(1, na.rm = TRUE)) %>%
ungroup() %>%
select(permno, day, div_trailing, div_forward) %>%
collect()
@iangow @shizhe75
You actually say "the following code is fine …" (code in this context is a mass noun).
I'd be pretty confident that the issue is this line
With mostly data, it makes sense to require 12 months of date before and after. But exactly 365 days is going to be problematic, I'd guess. You may want to aggregate the data by month before calculating.
Alternatively, look at the distribution of
days_trailing
anddays_forward
before applying a filter.