Deleetdk opened 6 years ago
Thanks for the report. There was a bug in `validate` and `calibrate` for `ols` where singular fits were reporting NAs instead of setting `fail=TRUE` so that the sample would be ignored. This is fixed for the next release.
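To make the failure mode concrete, here is a toy illustration with base `lm` rather than the rms internals: a rare dummy can end up constant in a bootstrap resample, which leaves the design rank-deficient, and that is the kind of singular fit that should now set `fail=TRUE` and be dropped.

```r
set.seed(1)
n <- 30
d <- data.frame(y  = rnorm(n),
                x1 = rnorm(n),
                x2 = c(rep(1, 2), rep(0, n - 2)))  # rare dummy: only 2 nonzero cases

# Construct a resample that misses both nonzero cases of x2 (an ordinary
# bootstrap draw will do this fairly often when the dummy is this rare).
idx <- sample(which(d$x2 == 0), n, replace = TRUE)
db  <- d[idx, ]

# x2 is constant in the resample, so the design is rank-deficient and the
# coefficient for x2 comes back NA.
coef(lm(y ~ x1 + x2, data = db))
```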
After updating to the GitHub version, `validate` no longer throws an error, but it gives useless output for my use case, as all 40 runs failed:
```
> validate(ols_fit)
Divergence or singularity in 40 samples
          index.orig training test optimism index.corrected n
R-square       0.572      NaN  NaN      NaN             NaN 0
MSE            0.425      NaN  NaN      NaN             NaN 0
g              0.000      NaN  NaN      NaN             NaN 0
Intercept      0.000      NaN  NaN      NaN             NaN 0
Slope          1.000      NaN  NaN      NaN             NaN 0
```
In the `iris` example case, it is also almost useless. Despite 40 runs, only 2 completed:
```
> validate(fit)
Divergence or singularity in 38 samples
          index.orig training   test optimism index.corrected n
R-square      0.5504   0.8728 -0.931    1.804         -1.2536 2
MSE           0.0848   0.0234  0.364   -0.341          0.4258 2
g             0.3504   0.4573  0.191    0.266          0.0845 2
Intercept     0.0000   0.0000  2.177   -2.177          2.1766 2
Slope         1.0000   1.0000  0.289    0.711          0.2886 2
```
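Assuming the failures come from resamples that contain none of the positive cases of some rare dummy (the guess discussed in this thread), rough arithmetic shows why so few runs complete: a dummy with only k positive cases out of n is missed entirely by an ordinary bootstrap resample with probability (1 - k/n)^n, which is roughly e^(-k), and one such miss is enough to make that fit singular. A quick check with made-up numbers:

```r
n <- 150   # sample size (iris-sized, purely for illustration)
k <- 3     # positive cases of one rare dummy

p_miss <- (1 - k / n)^n             # prob. a resample has none of the k cases
c(exact = p_miss, approx = exp(-k))

# With m such rare dummies, the chance that a resample keeps at least one
# positive case of every dummy drops fast (treating the dummies as independent,
# which is only a rough approximation):
m <- 20
(1 - p_miss)^m
```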
My guess is the same as before: one has to use special sampling to avoid the issue. As someone on Cross Validated suggested:
> You could look into stratified sampling, i.e. constraining your train/test splits so that they have (approximately) the same relative frequencies for your predictor levels.
>
> However, I think it is worth considering whether the current behavior is actually wanted: random splitting will, with non-negligible frequency, result in sets that do not cover all predictor levels. Can you consider such a set representative for whatever the application is? I've been working with such small sample sizes and went for stratified splitting. But I insist that thinking hard about the data and the consequences of working with such small samples is at least as necessary as fixing the pure computational error.
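A minimal sketch of that idea for the bootstrap case, hand-rolled with made-up data: draw the indices separately within each level of the rare dummy so both levels appear in every resample. (If I remember correctly, `predab.resample`, which `validate` calls, has a `group` argument for stratified bootstrapping, but that is worth verifying.)

```r
# Stratified bootstrap: sample indices within each level of a rare 0/1 dummy,
# so every resample is guaranteed to contain both levels.
stratified_boot_idx <- function(strata) {
  unlist(lapply(split(seq_along(strata), strata),
                function(i) sample(i, length(i), replace = TRUE)),
         use.names = FALSE)
}

set.seed(1)
x2  <- c(rep(1, 3), rep(0, 147))   # a rare dummy in an iris-sized sample
idx <- stratified_boot_idx(x2)
table(x2[idx])                     # both levels always represented
```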
The behavior you saw is the intended behavior when the sample size does not support a large number of parameters. You'll need to reduce the number of parameters in the model.
How do you recommend that I validate models that contain a large number of logical predictors without running into this issue?
You have too many parameters in the model.
In addition:

```
Warning messages:
1: In lsfit(x, y) : 16 missing values deleted
2: In lsfit(x, y) : 16 missing values deleted
```

How do I solve this issue?
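Those warnings usually mean the data handed to the fit contain NAs, which `lsfit` drops row-wise during resampling. A generic first step (with `mydata` standing in for the actual data frame) is to locate them and decide how to handle them before calling `validate`:

```r
colSums(is.na(mydata))        # which variables carry the missing values
d_complete <- na.omit(mydata) # simplest option: refit on complete cases only
# (or impute first, e.g. with Hmisc::aregImpute, and then validate)
```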
I get a strange-sounding error when trying to use `validate()` on a fitted `ols`:

```
Error in lsfit(x, y) : only 0 cases, but 2 variables
```
The dataset has n=1890 with about 400 predictors in the model. Almost all the predictors are dichotomous dummies indicating whether some regex pattern matched a name or not. Some of these have only a few true cases (but at least 10). This is a preliminary fit before I do some penalization to improve the model fit and the final set of predictors (done with LASSO in glmnet), but I wanted to validate the initial model first. My guess is that the error occurs because the resampling ends up with no true cases for a given variable in the training set, which causes the fit to fail, or makes that variable unusable for prediction in the test set.
For a reproducible example, a similar dataset based on iris gives the same kind of error. The dataset has no missing data.
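A sketch of that kind of setup (the dummy definitions below are hypothetical and chosen only so that each has very few positive cases; only the general shape matters):

```r
library(rms)

d <- iris
# Rare 0/1 dummies in the spirit of the regex-based predictors described above
d$rare1 <- as.numeric(d$Petal.Width  > 2.4)
d$rare2 <- as.numeric(d$Sepal.Length < 4.4)
colSums(d[c("rare1", "rare2")])   # confirm both dummies are rare

fit <- ols(Sepal.Width ~ Petal.Length + rare1 + rare2, data = d,
           x = TRUE, y = TRUE)    # x, y needed for validate()
set.seed(1)
validate(fit)   # B = 40 by default; resamples that miss the rare cases fail
```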
In my own simple cross-validation implementation (discussed in this question: https://stats.stackexchange.com/questions/213837/k-fold-cross-validation-nominal-predictor-level-appears-in-the-test-data-but-no), I got around this issue by simply ignoring runs that produce errors. Maybe rms should do this too?
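For reference, that workaround can be sketched roughly like this (a hand-rolled k-fold loop with a made-up helper name, not rms code): any fold whose fit or prediction throws an error is simply dropped from the summary.

```r
# k-fold CV that skips folds which error out (e.g. a factor level present in
# the test fold but absent from the training fold).
cv_r2 <- function(data, formula, k = 10) {
  folds <- sample(rep(seq_len(k), length.out = nrow(data)))
  out <- vapply(seq_len(k), function(i) {
    tryCatch({
      train <- data[folds != i, ]
      test  <- data[folds == i, ]
      fit   <- lm(formula, data = train)
      pred  <- predict(fit, newdata = test)
      y     <- test[[all.vars(formula)[1]]]
      1 - sum((y - pred)^2) / sum((y - mean(y))^2)   # out-of-sample R^2
    }, error = function(e) NA_real_)
  }, numeric(1))
  out[!is.na(out)]   # ignore the folds that failed
}

# e.g. cv_r2(iris, Sepal.Width ~ Petal.Length + Species, k = 10)
```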