issue with boundary estimation for regress "odds"

susannemay commented 1 month ago

When running a logistic regression model and making the mistake and use a factor(variable), where the variable is continuous (happened as part of a student project), rigr returns estimates when it should not. Below is an example using rigr and what Stata does by contrast.

Please let me know if you need any additional information.

Susanne (Susanne May, sjmay@uw.edu)

> m3<-regress("odds", formula=coclft~factor(canlft)*factor(age45)+factor(sex)+factor(insure)+factor(smoke100)+factor(educyrs), data=data)
> m3
( 1  cases deleted due to missing values)

Call:
regress(fnctl = "odds", formula = coclft ~ factor(canlft) * factor(age45) + 
    factor(sex) + factor(insure) + factor(smoke100) + factor(educyrs), 
    data = data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.8837  -1.1331   0.6981   1.0200   2.0141  

Coefficients:

Raw Model:
                                     Estimate  Naive SE  Robust SE       F stat    df Pr(>F)   
[1]  Intercept                         14.90      1455     1.458            104.49  1 < 0.00005
[2]  factor(canlft)                    1.258     0.2901    0.2979            17.82  1 < 0.00005
[3]  factor(age45)                    -0.2149    0.5460    0.5509             0.15  1   0.6967 
[4]  factor(sex)                       0.4936    0.2074    0.2084             5.61  1   0.0182 
[5]  factor(insure)                   -0.4891    0.2111    0.2228             4.82  1   0.0286 
[6]  factor(smoke100)                  1.153     0.3890    0.4009             8.28  1   0.0042 
     factor(educyrs)                                                         74.48 14 < 0.00005
[7]     factor(educyrs)6              -1.747      2058     1.694              1.06  1   0.3030 
[8]     factor(educyrs)8              -16.73      1455     1.455            132.22  1 < 0.00005
[9]     factor(educyrs)9              -16.98      1455     1.408            145.50  1 < 0.00005
[10]    factor(educyrs)10             -17.21      1455     1.368            158.34  1 < 0.00005
[11]    factor(educyrs)11             -16.93      1455     1.401            145.96  1 < 0.00005
[12]    factor(educyrs)12             -16.68      1455     1.302            164.11  1 < 0.00005
[13]    factor(educyrs)13             -16.93      1455     1.438            138.63  1 < 0.00005
[14]    factor(educyrs)14             -16.57      1455     1.384            143.50  1 < 0.00005
[15]    factor(educyrs)15             -16.52      1455     1.431            133.38  1 < 0.00005
[16]    factor(educyrs)16             -16.20      1455     1.429            128.51  1 < 0.00005
[17]    factor(educyrs)17             -16.15      1455     1.475            119.82  1 < 0.00005
[18]    factor(educyrs)18             -1.258      1782     1.488              0.71  1   0.3983 
[19]    factor(educyrs)19             -16.82      1455     1.967             73.17  1 < 0.00005
[20]    factor(educyrs)21             -32.39      2058     1.650            385.23  1 < 0.00005
[21] factor(canlft):factor(age45)      1.168     0.6335    0.6478             3.25  1   0.0718 

Transformed Model:
                                     e(Est)     e(95%L)    e(95%H)          F stat    df Pr(>F)   
[1]  Intercept                        2.963e+06  1.691e+05  5.193e+07          104.49  1 < 0.00005
[2]  factor(canlft)                     3.517      1.959      6.315             17.82  1 < 0.00005
[3]  factor(age45)                     0.8067     0.2733      2.380              0.15  1   0.6967 
[4]  factor(sex)                        1.638      1.088      2.467              5.61  1   0.0182 
[5]  factor(insure)                    0.6132     0.3958     0.9500              4.82  1   0.0286 
[6]  factor(smoke100)                   3.169      1.442      6.964              8.28  1   0.0042 
     factor(educyrs)                                                            74.48 14 < 0.00005
[7]     factor(educyrs)6               0.1744    6.256e-03    4.859              1.06  1   0.3030 
[8]     factor(educyrs)8              5.437e-08  3.121e-09  9.47e-07           132.22  1 < 0.00005
[9]     factor(educyrs)9              4.21e-08   2.65e-09   6.69e-07           145.50  1 < 0.00005
[10]    factor(educyrs)10             3.342e-08  2.275e-09  4.91e-07           158.34  1 < 0.00005
[11]    factor(educyrs)11             4.446e-08  2.835e-09  6.971e-07          145.96  1 < 0.00005
[12]    factor(educyrs)12             5.713e-08  4.428e-09  7.37e-07           164.11  1 < 0.00005
[13]    factor(educyrs)13             4.437e-08  2.633e-09  7.478e-07          138.63  1 < 0.00005
[14]    factor(educyrs)14             6.337e-08  4.184e-09  9.599e-07          143.50  1 < 0.00005
[15]    factor(educyrs)15             6.666e-08  4.012e-09  1.108e-06          133.38  1 < 0.00005
[16]    factor(educyrs)16             9.168e-08  5.531e-09  1.52e-06           128.51  1 < 0.00005
[17]    factor(educyrs)17             9.678e-08  5.334e-09  1.756e-06          119.82  1 < 0.00005
[18]    factor(educyrs)18              0.2843     0.01530     5.286              0.71  1   0.3983 
[19]    factor(educyrs)19             4.938e-08  1.037e-09  2.352e-06           73.17  1 < 0.00005
[20]    factor(educyrs)21             8.576e-15  3.354e-16  2.193e-13          385.23  1 < 0.00005
[21] factor(canlft):factor(age45)       3.217     0.9011      11.48              3.25  1   0.0718 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 784.91  on 568  degrees of freedom
Residual deviance: 709.96  on 548  degrees of freedom
  (1 observation deleted due to missingness)
AIC: 751.96

Number of Fisher Scoring iterations: 14

> lincom(mod1, c(0, 1, 0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1)) 
Error in "uRegress" %in% class(reg) : object 'mod1' not found
> lincom(m3, c(0, 1, 0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1)) 

H0: 1*factor(canlft)1+1*factor(canlft)1:factor(age45)1   =  0 
Ha: 1*factor(canlft)1+1*factor(canlft)1:factor(age45)1  !=  0 
      e(Est) Std. Err. e(95%L) e(95%H)     T Pr(T > |t|)    
[1,] 11.3134    0.5785  3.6314 35.2459 4.194     3.2e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> lincom(m3, c(0, 1, 2, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1)) 

H0: 1*factor(canlft)1+2*factor(age45)1+1*factor(canlft)1:factor(age45)1   =  0 
Ha: 1*factor(canlft)1+2*factor(age45)1+1*factor(canlft)1:factor(age45)1  !=  0 
      e(Est) Std. Err. e(95%L) e(95%H)     T Pr(T > |t|)  
[1,]  7.3615    0.7969  1.5387 35.2184 2.505      0.0125 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> lincom(m3, c(0, 1, 1, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1)) 

H0: 1*factor(canlft)1+1*factor(age45)1+1*factor(canlft)1:factor(age45)1   =  0 
Ha: 1*factor(canlft)1+1*factor(age45)1+1*factor(canlft)1:factor(age45)1  !=  0 
      e(Est) Std. Err. e(95%L) e(95%H)     T Pr(T > |t|)    
[1,]  9.1260    0.4258  3.9537 21.0645 5.192    2.93e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

STATA output

. tab educyrs

   AEEDCPYR |      Freq.     Percent        Cum.
------------+-----------------------------------
          4 |          1        0.18        0.18
          6 |          1        0.18        0.35
          8 |         12        2.11        2.46
          9 |         23        4.04        6.49
         10 |         30        5.26       11.75
         11 |         56        9.82       21.58
         12 |        228       40.00       61.58
         13 |         78       13.68       75.26
         14 |         80       14.04       89.30
         15 |         17        2.98       92.28
         16 |         32        5.61       97.89
         17 |          7        1.23       99.12
         18 |          2        0.35       99.47
         19 |          2        0.35       99.82
         21 |          1        0.18      100.00
------------+-----------------------------------
      Total |        570      100.00

. logistic coclft i.canlft##i.age45 sex smoke100 i.insure i.educyrs
note: 4.educyrs != 0 predicts success perfectly
      4.educyrs dropped and 1 obs not used

note: 6.educyrs != 0 predicts success perfectly
      6.educyrs dropped and 1 obs not used

note: 18.educyrs != 0 predicts success perfectly
      18.educyrs dropped and 2 obs not used

note: 21.educyrs != 0 predicts failure perfectly
      21.educyrs dropped and 1 obs not used

note: 19.educyrs omitted because of collinearity

Logistic regression                             Number of obs     =        564
                                                LR chi2(16)       =      68.47
                                                Prob > chi2       =     0.0000
Log likelihood = -354.98136                     Pseudo R2         =     0.0880

------------------------------------------------------------------------------
      coclft | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    1.canlft |   3.517022   1.020284     4.34   0.000     1.991785     6.21023
     1.age45 |   .8066514   .4404418    -0.39   0.694     .2766426    2.352083
             |
canlft#age45 |
        1 1  |    3.21676   2.037728     1.84   0.065     .9294073    11.13349
             |
         sex |   1.638129   .3397451     2.38   0.017     1.090964    2.459721
    smoke100 |   3.168737    1.23273     2.96   0.003     1.478238    6.792475
    1.insure |   .6132005   .1294229    -2.32   0.020     .4054589    .9273809
             |
     educyrs |
          4  |          1  (empty)
          6  |          1  (empty)
          8  |   1.100961   1.712434     0.06   0.951     .0522173    23.21291
          9  |   .8526251   1.264489    -0.11   0.914     .0465994    15.60041
         10  |   .6768345   .9926229    -0.27   0.790     .0382066     11.9902
         11  |   .9002385   1.300783    -0.07   0.942     .0530187    15.28572
         12  |   1.156832   1.647657     0.10   0.919     .0709444     18.8635
         13  |   .8985245   1.293251    -0.07   0.941     .0535046     15.0893
         14  |   1.283299   1.847665     0.17   0.862      .076346    21.57093
         15  |   1.349889   2.047701     0.20   0.843     .0690361    26.39489
         16  |     1.8566   2.727956     0.42   0.674     .1042374     33.0684
         17  |    1.95979   3.321746     0.40   0.691     .0707075    54.31922
         18  |          1  (empty)
         19  |          1  (omitted)
         21  |          1  (empty)
             |
       _cons |    .146331   .2194055    -1.28   0.200     .0077457    2.764457
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.

adw96 commented 1 month ago

@svteichman What does lm() do by default here? I'd like for rigr to mimic lm's behaviour 🙏

svteichman commented 1 month ago

From a quick toy example, it seems that lm() and glm() will use the variable as a factor if it is input that way, even if the original variable appears to be continuous. Here is an example below:

> df <- data.frame(y = rbinom(20, 1, 0.7), x = 1:20)
> mod <- glm(y ~ x, data = df, family = "binomial")
> coef(mod)
 (Intercept)            x 
 0.471398179 -0.006266904 
> mod_factor <- glm(y ~ as.factor(x), data = df, family = "binomial")
> coef(mod_factor)
   (Intercept)  as.factor(x)2  as.factor(x)3  as.factor(x)4  as.factor(x)5  as.factor(x)6 
  2.556607e+01  -5.113214e+01   2.202667e-06  -5.113214e+01   2.205239e-06   2.201228e-06 
 as.factor(x)7  as.factor(x)8  as.factor(x)9 as.factor(x)10 as.factor(x)11 as.factor(x)12 
 -5.113214e+01   2.304767e-06  -5.113214e+01   2.201405e-06   2.201662e-06  -5.113214e+01 
as.factor(x)13 as.factor(x)14 as.factor(x)15 as.factor(x)16 as.factor(x)17 as.factor(x)18 
 -5.113214e+01   2.206395e-06   2.202988e-06   2.307846e-06  -6.726091e-09  -5.113214e+01 
as.factor(x)19 as.factor(x)20 
  2.202157e-06  -5.113214e+01

adw96 commented 1 month ago

Thanks @svteichman! So confirming that regress and lm are aligned, even if different to STATA?

PS. just confirming the above behaviour still applies even for non-integer x?

svteichman commented 1 month ago

Yes!

Here we have a non-integer x and I directly compare to rigr::regress() and it gives the same results as glm() when given the non-integer variable x coerced into a factor.

> df <- data.frame(y = rbinom(20, 1, 0.7), x = seq(0.1, 2, 0.1))
> mod <- glm(y ~ x, data = df, family = "binomial")
> coef(mod)
(Intercept)           x 
  1.0100427  -0.3660353 
> mod_factor <- glm(y ~ as.factor(x), data = df, family = "binomial")
> coef(mod_factor)
    (Intercept) as.factor(x)0.2 as.factor(x)0.3 as.factor(x)0.4 as.factor(x)0.5 
   2.556607e+01    2.307129e-06    2.304752e-06    2.307313e-06   -1.717587e-09 
as.factor(x)0.6 as.factor(x)0.7 as.factor(x)0.8 as.factor(x)0.9   as.factor(x)1 
  -5.113214e+01   -5.113214e+01   -5.113214e+01   -1.399749e-09    2.303245e-06 
as.factor(x)1.1 as.factor(x)1.2 as.factor(x)1.3 as.factor(x)1.4 as.factor(x)1.5 
  -5.113214e+01   -5.113214e+01    9.584600e-08   -1.992438e-09    2.306903e-06 
as.factor(x)1.6 as.factor(x)1.7 as.factor(x)1.8 as.factor(x)1.9   as.factor(x)2 
   9.507097e-08   -5.113214e+01   -5.113214e+01    2.303539e-06    2.303539e-06 
> regress_mod_factor <- regress("odds", y ~ as.factor(x), data = df)
Warning messages:
1: In stats::pf(LRStat, p - intercept, n - p) : NaNs produced
2: In stats::qt((1 - conf.level)/2, df = n - p) : NaNs produced
> regress_mod_factor$coefficients[,1]
    (Intercept) as.factor(x)0.2 as.factor(x)0.3 as.factor(x)0.4 as.factor(x)0.5 
   2.556607e+01    2.307129e-06    2.304752e-06    2.307313e-06   -1.717587e-09 
as.factor(x)0.6 as.factor(x)0.7 as.factor(x)0.8 as.factor(x)0.9   as.factor(x)1 
  -5.113214e+01   -5.113214e+01   -5.113214e+01   -1.399749e-09    2.303245e-06 
as.factor(x)1.1 as.factor(x)1.2 as.factor(x)1.3 as.factor(x)1.4 as.factor(x)1.5 
  -5.113214e+01   -5.113214e+01    9.584600e-08   -1.992438e-09    2.306903e-06 
as.factor(x)1.6 as.factor(x)1.7 as.factor(x)1.8 as.factor(x)1.9   as.factor(x)2 
   9.507097e-08   -5.113214e+01   -5.113214e+01    2.303539e-06    2.303539e-06

adw96 commented 1 month ago

Thanks @svteichman.

@susannemay even though rigr does something different to STATA, it does the same thing as R's lm, which is its natural comparator. For this reason, I don't see this as an issue that needs fixing so I am going to close.

statdivlab / rigr

issue with boundary estimation for regress "odds" #155