Weird bug with bandwidth selection

paulgp commented 3 years ago

Running into the following error:

Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : 0 (non-NA) cases

This shows up when I run RDHonest on subsets where the data is relatively sparse and the bandwidth is selected using RDOptBW. If I specify a bandwidth explicitly in the command, there are no errors.

rdh.all <- RDHonest(outcome ~ age, data = reg_data, kern = "uniform", weight=pop, opt.criterion = "MSE", M = B_reg, cutoff = 65)

(I assume this is an issue with cross-validation, but I'm not sure.)

kolesarm commented 3 years ago

This looks like RDOptBW is trying a bandwidth that's too small, but I am not sure how it's happening since the smallest bandwidth that the function tries is chosen so the regression can always be run: https://github.com/kolesarm/RDHonest/blob/711d70fc00c350253a00f5f769989356841057f0/R/NPR_lp.R#L168-L175

Two questions:

1. Are there missing values in your data?
2. What is the exact command you ran? If you run traceback() immediately after you get the error, what does it return?

paulgp commented 3 years ago

1. No missing values; I specifically filter those out. The error shows up particularly in very sparse settings. Here's the skimr output for one subset that fails:


── Variable type: character ────────────────────────────────────────────────────────────────────────────────────────────────────────
skim_variable n_missing complete_rate   min   max empty n_unique whitespace
1 state_name            0             1     6     6     0        1          0
2 state_abbrev          0             1     2     2     0        1          0
3 variable              0             1    10    10     0        1          0

── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────
  skim_variable n_missing complete_rate     mean      sd     p0   p25   p50   p75  p100 hist
1 region                0             1    4       0        4     4     4     4     4   ▁▁▇▁▁
2 year                  0             1 2014.      2.75  2008    2012  2014  2016  2018   ▃▅▇▇▆
3 age                   0             1   60.4     7.11    51      55    60    64    79   ▇▇▃▂▂
4 new_race              0             1    2       0        2       2     2     2     2   ▁▁▇▁▁
5 outcome               0             1    0.845   0.362    0       1     1     1     1   ▂▁▁▁▇
6 pop                   0             1  373.    376.      13.7   131.  265.  475. 2553.  ▇▂▁▁▁


2. Here's an example:

RDHonest(outcome ~ age, data = test, kern = "uniform",
         weight = pop, opt.criterion = "MSE", M = B_reg, cutoff = 65)
Error in if (h["m"] <= 0) 0 * d$Xm else K(d$Xm/h["m"]) : missing value where TRUE/FALSE needed

traceback()
4: NPRreg.fit(d, h1, se.method = se.initial)
3: NPRPrelimVar.fit(d, se.initial = se.initial)
2: NPRHonest.fit(d, M, kern, opt.criterion = opt.criterion, bw.equal = bw.equal, alpha = alpha, beta = beta, se.method = se.method, J = J, sclass = sclass, order = order, se.initial = se.initial)
1: RDHonest(outcome ~ age, data = test, kern = "uniform", weight = pop, opt.criterion = "MSE", M = B_reg, cutoff = 65)

kolesarm commented 3 years ago

Sorry for the late reply. It looks like the issue is that the package uses the IK bandwidth when computing a preliminary variance estimate. In this case, the IK bandwidth comes out (I am guessing) as NaN, perhaps because there is no variation in the outcome in one of the preliminary estimation windows, or because one of those windows contains no data.
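One way to check this guess on a failing subset is to count the observations and compute the outcome variance on each side of the cutoff within a window. The sketch below uses base R only; `side_summary`, the window half-width `h`, and the toy data are hypothetical and made up for illustration, and the convention that observations at the cutoff belong to the right-hand side is an assumption of the sketch. It is not how the package computes the IK bandwidth.

```r
# Hypothetical helper (not part of RDHonest): summarise the data on each side
# of the cutoff within a window of half-width h.
# Assumption: observations exactly at the cutoff count as the right-hand side.
side_summary <- function(x, y, cutoff, h) {
  left  <- y[x <  cutoff & x >= cutoff - h]
  right <- y[x >= cutoff & x <= cutoff + h]
  data.frame(
    side  = c("left", "right"),
    n     = c(length(left), length(right)),
    # A variance needs at least two observations; report NA otherwise.
    var_y = c(if (length(left)  > 1) var(left)  else NA_real_,
              if (length(right) > 1) var(right) else NA_real_)
  )
}

# Toy data, invented for illustration: binary outcome, sparse above the cutoff.
x <- c(51, 55, 58, 60, 62, 64, 64, 66, 70, 79)
y <- c(1, 0, 1, 1, 0, 1, 1, 1, 1, 1)

res <- side_summary(x, y, cutoff = 65, h = 5)
print(res)
```

A side with `n` of 0 or 1, or with `var_y` equal to 0, is the kind of window in which a variance-based bandwidth formula can break down.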

In commit 6de8ae1, I have changed the code so that the bandwidth is reset to Inf in such cases. In those cases, one can also use "nn" for se.initial rather than "EHW"; I think that should be more stable in these situations. Does it resolve the issue?
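The idea behind that change can be illustrated in base R. This is a minimal sketch of the general technique, not the code from the commit: `safe_bandwidth` is a hypothetical name, and the `if` condition only mimics the shape of the check that appears in the error message above.

```r
# Hypothetical guard, not the package's actual code: fall back to an
# infinite bandwidth when the preliminary bandwidth is NA or NaN.
safe_bandwidth <- function(h) {
  if (is.na(h)) Inf else h   # is.na() is TRUE for both NA and NaN
}

h_bad <- 0 / 0   # NaN, standing in for a failed preliminary bandwidth

# Without a guard, using a NaN in a condition raises an error like the one
# reported in this thread.
msg <- tryCatch(
  if (h_bad <= 0) "non-positive" else "positive",
  error = function(e) conditionMessage(e)
)
print(msg)

# With the guard, the NaN is replaced by Inf and valid values pass through.
print(safe_bandwidth(h_bad))
print(safe_bandwidth(2.5))
```

With an infinite bandwidth, a uniform kernel puts equal weight on every observation on that side of the cutoff, so the preliminary step can still run on sparse data.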

kolesarm / RDHonest

Weird bug with bandwidth selection #4