kolesarm / RDHonest

Honest inference in regression discontinuity designs

Weird bug with bandwidth selection #4

Closed paulgp closed 3 years ago

paulgp commented 3 years ago

Running into the following error:

Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : 0 (non-NA) cases

This is showing up in subsets of RDHonest estimation where the data is relatively sparse, and the bandwidth is being selected using RDOptBW. If I specify the command with a bandwidth, there are no errors.

rdh.all <- RDHonest(outcome ~ age, data = reg_data, kern = "uniform", weight=pop, opt.criterion = "MSE", M = B_reg, cutoff = 65)

(I assume this is an issue with cross-validation, but I'm not sure.)

kolesarm commented 3 years ago

This looks like RDOptBW is trying a bandwidth that's too small, but I am not sure how it's happening since the smallest bandwidth that the function tries is chosen so the regression can always be run: https://github.com/kolesarm/RDHonest/blob/711d70fc00c350253a00f5f769989356841057f0/R/NPR_lp.R#L168-L175

Two questions:

  1. Are there missing values in your data?
  2. What is the exact command you ran? If you run traceback() immediately after you get the error, what does it return?
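
A minimal sketch of the first check in base R. The data frame below is a made-up stand-in (only the column names `outcome`, `age`, and `pop` come from the command above). For the second question, `traceback()` is called interactively right after the error appears.

```r
# Hypothetical stand-in for the poster's data; only the column names
# are taken from the RDHonest() call in this issue.
reg_data <- data.frame(
  outcome = c(1, 0, NA, 1),
  age     = c(60, 64, 66, 70),
  pop     = c(120, 300, 250, NA)
)
model_vars <- c("outcome", "age", "pop")

# Count missing values in each variable used by the model.
na_counts <- colSums(is.na(reg_data[, model_vars]))
print(na_counts)

# Keep only rows that are complete in those variables.
complete <- reg_data[complete.cases(reg_data[, model_vars]), ]
print(nrow(complete))
```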

paulgp commented 3 years ago

  1. No missing vals -- I specifically filter those out. It happens particularly in very sparse settings. Here's the skimr output for one that fails:
    
    ── Variable type: character ────────────────────────────────────────────────────────────────────────────────────────────────────────
    skim_variable n_missing complete_rate   min   max empty n_unique whitespace
    1 state_name            0             1     6     6     0        1          0
    2 state_abbrev          0             1     2     2     0        1          0
    3 variable              0             1    10    10     0        1          0

    ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────
    skim_variable n_missing complete_rate     mean      sd     p0   p25   p50   p75  p100 hist
    1 region                0             1    4       0        4     4     4     4     4 ▁▁▇▁▁
    2 year                  0             1 2014.      2.75  2008  2012  2014  2016  2018 ▃▅▇▇▆
    3 age                   0             1   60.4     7.11    51    55    60    64    79 ▇▇▃▂▂
    4 new_race              0             1    2       0        2     2     2     2     2 ▁▁▇▁▁
    5 outcome               0             1    0.845   0.362    0     1     1     1     1 ▂▁▁▁▇
    6 pop                   0             1  373.    376.      13.7  131.  265.  475. 2553. ▇▂▁▁▁


2. Here's an example:

RDHonest(outcome ~ age, data = test, kern = "uniform",
         weight = pop, opt.criterion = "MSE", M = B_reg, cutoff = 65)
Error in if (h["m"] <= 0) 0 * d$Xm else K(d$Xm/h["m"]) : missing value where TRUE/FALSE needed

traceback()
4: NPRreg.fit(d, h1, se.method = se.initial)
3: NPRPrelimVar.fit(d, se.initial = se.initial)
2: NPRHonest.fit(d, M, kern, opt.criterion = opt.criterion, bw.equal = bw.equal, alpha = alpha, beta = beta, se.method = se.method, J = J, sclass = sclass, order = order, se.initial = se.initial)
1: RDHonest(outcome ~ age, data = test, kern = "uniform", weight = pop, opt.criterion = "MSE", M = B_reg, cutoff = 65)

kolesarm commented 3 years ago

Sorry for the late reply. The issue appears to be that the package uses the IK bandwidth when computing a preliminary variance estimate. In this case, the IK bandwidth comes back (I am guessing) as NaN, perhaps because there is no variation in the outcome in one of the preliminary estimation windows, or because there is no data in it.
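
One rough way to check this guess on a failing subset is to count the observations, and the distinct outcome values, on each side of the cutoff within a window. The sketch below uses base R only. The data, the helper `window_summary`, and the half-width `h = 5` are all made up for illustration and are not the windows the package itself uses.

```r
# Hypothetical sparse data; column names and the cutoff follow the issue.
test <- data.frame(
  age     = c(55, 58, 60, 63, 64, 66, 67, 70, 75),
  outcome = c(1, 0, 1, 1, 0, 1, 1, 1, 1)
)
cutoff <- 65

# Summarize one side of the cutoff within a window of half-width h:
# number of observations and number of distinct outcome values.
window_summary <- function(d, cutoff, h, side = c("below", "above")) {
  side <- match.arg(side)
  x <- d$age - cutoff
  keep <- if (side == "below") x < 0 & x >= -h else x >= 0 & x <= h
  y <- d$outcome[keep]
  c(n = length(y), n_distinct = length(unique(y)))
}

below <- window_summary(test, cutoff, h = 5, side = "below")
above <- window_summary(test, cutoff, h = 5, side = "above")
print(rbind(below, above))
```

A side with `n` equal to 0, or with `n_distinct` equal to 1, has either no data or no outcome variation, which is the kind of window that could plausibly produce a NaN.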

In commit 6de8ae1, I have changed the code so that the bandwidth is reset to Inf in such cases. In those cases, one can use "nn" for se.initial rather than "EHW"; I think that should be more stable in these situations. Does it resolve the issue?
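
To illustrate why a NaN bandwidth triggers the error in the traceback, and what a reset to Inf does, here is a small base-R sketch. The helper `reset_bw` is hypothetical and is not the code from the commit; it only mirrors the idea described above.

```r
# Comparing NaN with 0 yields NA, and `if` cannot branch on NA.
h <- c(m = NaN, p = NaN)
msg <- tryCatch(
  if (h["m"] <= 0) "zero weights" else "kernel weights",
  error = function(e) conditionMessage(e)
)
print(msg)

# Hypothetical guard: replace NA/NaN bandwidths with Inf so that
# the comparison is well defined and every observation is used.
reset_bw <- function(h) {
  h[is.na(h)] <- Inf  # is.na() is TRUE for NaN as well as NA
  h
}

h_safe <- reset_bw(h)
branch <- if (h_safe["m"] <= 0) "zero weights" else "kernel weights"
print(branch)
```

With an infinite bandwidth, the scaled running variable `d$Xm / h["m"]` is 0 for every observation, so the kernel is evaluated at 0 throughout rather than failing.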