harrelfe / rms

Regression Modeling Strategies
https://hbiostat.org/R/rms

validate() for rms::ols: Error in lsfit(x, y) : only 0 cases, but 2 variables #52

Deleetdk opened this issue 6 years ago

Deleetdk commented 6 years ago

I get a strange-sounding error when trying to use validate() on a fitted ols model:

Error in lsfit(x, y) : only 0 cases, but 2 variables

The dataset has n = 1890 with about 400 predictors in the model. Almost all of the predictors are dichotomous dummies indicating whether some regex pattern matched a name or not. Some of these have only a few true cases (but at least 10). This is a preliminary fit; afterwards I apply penalization (LASSO via glmnet) to improve the model fit and select the final predictors. However, I wanted to validate the initial model first. My guess is that the error occurs because a resample ends up with no true cases for a given variable in the training set, so the fit either fails or cannot use that variable when predicting on the test set.
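To illustrate that guess, here is a minimal sketch (an editor's illustration, not part of the original report) of how often a bootstrap resample loses at least one observed level of a 26-level factor drawn into 150 rows, as in the iris example below; a fit on such a resample then contains an all-zero dummy column and becomes singular:

# Editor's sketch: fraction of bootstrap resamples that drop at least one
# observed level of a 26-level factor in 150 rows.
set.seed(1)
letter1 <- sample(letters, size = 150, replace = TRUE)
dropped <- replicate(1000, {
  boot_rows <- sample(150, replace = TRUE)
  length(unique(letter1[boot_rows])) < length(unique(letter1))
})
mean(dropped)  # typically a sizable fraction of resamples lose a level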

For a reproducible example, here's a similar dataset based on iris:

# sim some data
library(rms)
iris2 = iris
set.seed(1)
iris2$letter1 = sample(letters, size = 150, replace = TRUE)
iris2$letter2 = sample(letters, size = 150, replace = TRUE)
iris2$letter3 = sample(letters, size = 150, replace = TRUE)

# fit (x = TRUE, y = TRUE are needed for validate())
(fit = ols(Sepal.Width ~ letter1 + letter2 + letter3 + Petal.Width + Petal.Length,
           data = iris2, x = TRUE, y = TRUE))
validate(fit)

Gives:

Error in lsfit(x, y) : only 0 cases, but 2 variables
In addition: Warning message:
In lsfit(x, y) : 150 missing values deleted

The dataset has no missing data.

In my own simple cross-validation implementation, I got around this issue by simply ignoring runs that produce errors; see this Cross Validated question: https://stats.stackexchange.com/questions/213837/k-fold-cross-validation-nominal-predictor-level-appears-in-the-test-data-but-no Maybe rms should do the same?
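For reference, that workaround might look something like the following rough sketch (an editor's illustration using the iris example above, not the poster's actual code; the fold assignment and error handling are assumptions):

# Editor's sketch of "ignore runs that produce errors": folds whose fit or
# prediction throws an error contribute NA and are dropped from the summary.
library(rms)

k <- 10
set.seed(2)
folds <- sample(rep(seq_len(k), length.out = nrow(iris2)))
r2 <- sapply(seq_len(k), function(i) {
  train <- iris2[folds != i, ]
  test  <- iris2[folds == i, ]
  tryCatch({
    f    <- ols(Sepal.Width ~ letter1 + letter2 + letter3 + Petal.Width + Petal.Length,
                data = train)
    pred <- predict(f, newdata = test)
    cor(pred, test$Sepal.Width)^2
  }, error = function(e) NA_real_)   # skip folds that fail
})
mean(r2, na.rm = TRUE)               # summarize only the folds that completed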

harrelfe commented 6 years ago

Thanks for the report. There was a bug in validate and calibrate for ols where singular fits reported NAs instead of setting fail=TRUE so that the sample would be ignored. This is fixed for the next release.
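In outline, the pattern described here looks something like the following (an editor's sketch of the general idea only, not the actual rms source):

# Editor's sketch, not the actual rms code: a resample whose least-squares fit
# is singular is flagged with fail = TRUE so the caller drops that sample
# instead of propagating NAs into the validation summary.
fit_one_resample <- function(x, y) {
  f <- tryCatch(lsfit(x, y), error = function(e) NULL)
  if (is.null(f) || any(is.na(f$coefficients))) {
    return(list(fail = TRUE))
  }
  list(fail = FALSE, coef = f$coefficients)
}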

Deleetdk commented 6 years ago

After updating to the GitHub version, validate no longer throws an error, but it gives useless output for my use case because all 40 runs failed:

> validate(ols_fit)

Divergence or singularity in 40 samples
          index.orig training test optimism index.corrected n
R-square       0.572      NaN  NaN      NaN             NaN 0
MSE            0.425      NaN  NaN      NaN             NaN 0
g              0.000      NaN  NaN      NaN             NaN 0
Intercept      0.000      NaN  NaN      NaN             NaN 0
Slope          1.000      NaN  NaN      NaN             NaN 0

In the iris example it is also nearly useless: despite 40 runs, only 2 completed:

> validate(fit)

Divergence or singularity in 38 samples
          index.orig training   test optimism index.corrected n
R-square      0.5504   0.8728 -0.931    1.804         -1.2536 2
MSE           0.0848   0.0234  0.364   -0.341          0.4258 2
g             0.3504   0.4573  0.191    0.266          0.0845 2
Intercept     0.0000   0.0000  2.177   -2.177          2.1766 2
Slope         1.0000   1.0000  0.289    0.711          0.2886 2

My guess is the same as before: one has to use special sampling to avoid the issue. As someone on Cross Validated suggested:

You could look into stratified sampling, i.e. constraining your train/test splits so that they have (approximately) the same relative frequencies for your predictor levels.

However, I think it's worth considering whether the current behavior is actually what is wanted: random splitting will, with non-negligible frequency, produce sets that do not cover all predictor levels. Can such a set be considered representative for whatever the application is? I've been working with such small sample sizes and went for stratified splitting, but I insist that thinking hard about the data and the consequences of working with such small samples is at least as necessary as fixing the purely computational error.
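For a single factor, the stratified splitting mentioned above could be sketched roughly as follows (an editor's illustration; stratifying on hundreds of sparse dummies at once can only ever be approximate):

# Editor's sketch of stratified fold assignment: assign folds within each
# level of a chosen factor so that, where level counts allow, every fold
# sees every level.
stratified_folds <- function(strata, k) {
  folds <- integer(length(strata))
  for (lev in unique(strata)) {
    idx <- which(strata == lev)
    folds[idx] <- sample(rep(seq_len(k), length.out = length(idx)))
  }
  folds
}

folds <- stratified_folds(iris2$letter1, k = 5)
table(iris2$letter1, folds)   # check how the levels spread across folds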

harrelfe commented 6 years ago

The behavior you saw is the intended behavior when the sample size does not support a large number of parameters. You'll need to reduce the number of parameters in the model.

Deleetdk commented 6 years ago

How do you recommend that I validate models that contain a large number of logical predictors without running into this issue?

harrelfe commented 6 years ago

You have too many parameters in the model.
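Not prescribed in this thread, but one common way to reduce the parameter count before validating is to lump rare dummy levels into a single "other" category (or to rely on the penalized glmnet fit mentioned earlier). A rough editor's sketch, with an arbitrary count threshold:

# Editor's suggestion, not from the thread: collapse levels with fewer than
# min_n cases into "other" before fitting, reducing the number of dummies.
lump_rare <- function(x, min_n = 10) {
  tab  <- table(x)
  rare <- names(tab)[tab < min_n]
  x <- as.character(x)
  x[x %in% rare] <- "other"
  factor(x)
}

iris2$letter1 <- lump_rare(iris2$letter1)
iris2$letter2 <- lump_rare(iris2$letter2)
iris2$letter3 <- lump_rare(iris2$letter3)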

mirhassan121 commented 2 months ago

I get:

In addition: Warning messages:
1: In lsfit(x, y) : 16 missing values deleted
2: In lsfit(x, y) : 16 missing values deleted

How can I solve this issue?