From Jonathan Golub:
I haven’t thought much about when the warning should trigger, but I have done some work on potential metrics for “do warnings matter?”.
Option 1:
For each data frame we can set the true betas, maxT, and the censoring level, then calculate the number and proportion of observations at maxT and minT, and the number and proportion of warnings, however we decide to calculate that. Then for each data frame we can calculate the bias in each of the estimated betas, and also whether we see coverage (i.e., “yes” or “no” for each covariate). I’ve seen various measures of bias, but I started simple with bias = trueB – estimatedB and pctbias = (trueB – estimatedB)/trueB × 100. I also calculated absolutebias = |bias| and abspctbias = |pctbias|, since I’ve seen those used as well.

I simulated 1000 data frames for each maxT [100, 250, 500, 1000, 5000] and censoring level [.5, .95] and combined all of the results into a spreadsheet with 1000 × 5 × 19 = 95,000 rows. We can then model the level of bias, pctbias, or absolute bias in each covariate as biasX1 = maxT + censored + warnings, or as biasX1 = maxT + censored + pctmaxT + pctminT. We can model coverage in a similar manner, but with logistic regression to capture the yes/no dependent variable; a sketch of both models follows below.

I also initially wondered, given how you described it in your warning, whether “truncation” was increasing the number of ties in the data: warnings → more ties → increased bias in the estimated betas. We can explore this with biasX1 = maxT + censored + warnings + ties, although “warnings” and “ties” might be highly collinear. By the way, I’ve done all of the above with different true betas and with different combinations of binary and continuous covariates.
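A minimal sketch of what the Option 1 calculations and models might look like in Python with pandas and statsmodels; the input file and the column names (true_b1, est_b1, se_b1, maxT, censored, warnings) are hypothetical stand-ins for however the simulation output is actually stored:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format results: one row per simulated data frame.
results = pd.read_csv("sim_results.csv")

# Per-data-frame bias measures for covariate X1.
results["bias_x1"] = results["true_b1"] - results["est_b1"]
results["pctbias_x1"] = results["bias_x1"] / results["true_b1"] * 100
results["absbias_x1"] = results["bias_x1"].abs()
results["abspctbias_x1"] = results["pctbias_x1"].abs()

# Coverage: does the 95% confidence interval contain the true beta?
ci_low = results["est_b1"] - 1.96 * results["se_b1"]
ci_high = results["est_b1"] + 1.96 * results["se_b1"]
results["cover_x1"] = ((ci_low <= results["true_b1"])
                       & (results["true_b1"] <= ci_high)).astype(int)

# biasX1 = maxT + censored + warnings, estimated by OLS.
bias_fit = smf.ols("bias_x1 ~ maxT + censored + warnings", data=results).fit()
print(bias_fit.summary())

# Coverage modeled the same way, but with logistic regression
# because the dependent variable is yes/no.
cover_fit = smf.logit("cover_x1 ~ maxT + censored + warnings", data=results).fit()
print(cover_fit.summary())
```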
Option 2:
I’m less sure about this approach, as it might reflect a serious ecological fallacy, but at least for the coverage rate it seems better than Option 1. For each run of 1000 data frames we can set the true betas, maxT, and the censoring level; then for each set of 1000 runs we can calculate the average number and average proportion of observations at maxT and minT, the average coverage rate, and the average number and proportion of warnings, however we decide to calculate that. Then for each run of 1000 data frames we can calculate the RMSE for each of the estimated betas.

I simulated 1000 data frames for each maxT [100, 250, 500, 1000, 5000] and censoring level [.5, .95], calculated the average values for each of the various indicators, then combined all of the results into a spreadsheet with 5 × 19 = 95 rows. We can then model the RMSE for each covariate as RMSEX1 = maxT + censored + warnings, or as RMSEX1 = maxT + censored + pctmaxT + pctminT. We can model the coverage rate with CoverX1 = maxT + censored + warnings. We can explore the truncation issue with RMSEX1 = maxT + censored + warnings + ties, although “warnings” and “ties” might be highly collinear. A sketch of the aggregation and these models follows below.
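A corresponding sketch for the Option 2 aggregation, collapsing the hypothetical per-frame results from the Option 1 sketch to one row per simulation condition; for illustration the grouping is only on maxT and the censoring level, and the cover_x1, pct_maxT, and pct_minT columns are again assumed names:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical per-data-frame results, as in the Option 1 sketch.
results = pd.read_csv("sim_results.csv")

# Collapse to one row per simulation condition: RMSE of the X1
# estimates plus the averages of the other indicators.
agg = (
    results.groupby(["maxT", "censored"])
    .apply(lambda g: pd.Series({
        "rmse_x1": np.sqrt(np.mean((g["est_b1"] - g["true_b1"]) ** 2)),
        "cover_rate_x1": g["cover_x1"].mean(),  # average coverage rate
        "warnings": g["warnings"].mean(),       # average warnings per frame
        "pct_maxT": g["pct_maxT"].mean(),       # avg. share of obs at maxT
        "pct_minT": g["pct_minT"].mean(),
    }))
    .reset_index()
)

# RMSEX1 = maxT + censored + warnings, on the condition-level rows.
rmse_fit = smf.ols("rmse_x1 ~ maxT + censored + warnings", data=agg).fit()
print(rmse_fit.summary())

# CoverX1 = maxT + censored + warnings.
cover_fit = smf.ols("cover_rate_x1 ~ maxT + censored + warnings", data=agg).fit()
print(cover_fit.summary())
```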
Let me know whether either of these options seems like an appropriate metric to address the “who cares about warnings?” question. And it would be great to talk in mid-December.