jmboehm / GLFixedEffectModels.jl

Fast estimation of generalized linear models with high dimensional categorical variables in Julia
Other
33 stars 6 forks source link

Different results for binomial model from glm() and nlreg() #57

Open droodman opened 5 months ago

droodman commented 5 months ago

I'm getting different results where I think there should be a match, unless I'm doing something wrong. This uses the file https://www.stata-press.com/data/r18/lbw.dta.

julia> df = DataFrame(load("lbw.dta"));

julia> glm(@formula(low ~ age+lwt+smoke+ptl+ht+ui  + race ), df, Binomial(), LogitLink(), contrasts=Dict(:race=>DummyCoding()))
StatsModels.TableRegressionModel{GeneralizedLinearModel{GLM.GlmResp{Vector{Float64}, Binomial{Float64}, LogitLink}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

low ~ 1 + age + lwt + smoke + ptl + ht + ui + race

Coefficients:
──────────────────────────────────────────────────────────────────────────────
                  Coef.  Std. Error      z  Pr(>|z|)    Lower 95%    Upper 95%
──────────────────────────────────────────────────────────────────────────────
(Intercept)   0.461224   1.20457      0.38    0.7018  -1.8997       2.82215
age          -0.0271003  0.0364499   -0.74    0.4572  -0.0985408    0.0443402
lwt          -0.0151508  0.00692576  -2.19    0.0287  -0.0287251   -0.00157657
smoke         0.923345   0.400821     2.30    0.0212   0.13775      1.70894
ptl           0.541837   0.346247     1.56    0.1176  -0.136795     1.22047
ht            1.83252    0.691624     2.65    0.0081   0.47696      3.18808
ui            0.758513   0.459374     1.65    0.0987  -0.141843     1.65887
race: 2       1.26265    0.526405     2.40    0.0165   0.230912     2.29438
race: 3       0.862079   0.439147     1.96    0.0496   0.00136736   1.72279
──────────────────────────────────────────────────────────────────────────────

julia> nlreg(df, @formula(low ~ age+lwt+smoke+ptl+ht+ui  + fe(race)), Binomial(), LogitLink())
                Generalized Linear Fixed Effect Model
=====================================================================
Distribution:          "Binomial"   Link:                 "LogitLink"
Number of obs:                189   Degrees of freedom:             9
Deviance:                 734.954   Pseudo-R2:                 -2.132
Pseudo-Adj. R2:            -2.200   Iterations:                     8
Converged:                   true
=====================================================================
         Estimate  Std.Error  t value Pr(>|t|)  Lower 95%   Upper 95%
---------------------------------------------------------------------
age    -0.0138435  0.0371528 -0.37261    0.710 -0.0866616   0.0589746
lwt    -0.0128319 0.00695652 -1.84459    0.067 -0.0264665 0.000802576
smoke     1.06455   0.410814  2.59131    0.010   0.259366     1.86973
ptl      0.482117    0.34589  1.39385    0.165  -0.195814     1.16005
ht        2.12143   0.736751  2.87944    0.004   0.677426     3.56544
ui       0.837479   0.467677  1.79072    0.075 -0.0791521     1.75411
=====================================================================
jmboehm commented 4 months ago

Thanks for the report, and sorry that it took a while to respond. I'm trying to at least ensure that GLFixedEffectModels is not giving wrong answers, and your issue got me worried :)

The difference is due to GLFixedEffectModels.jl using different starting values than GLM.jl. Try the following, which yields for me the same estimates as GLM:

nlreg(df, @formula(low ~ age+lwt+smoke+ptl+ht+ui  + fe(race)), Binomial(), LogitLink(), start = zeros(Float64, 6))

Picking good starting values is not straightforward, see e.g. here . R's glm does something complicated, I believe. I haven't quite understood what GLM.jl is doing, but it seems closer to choosing zeros. I could change the default start values to zero, which would help in this situation, but it's quite possible that there will be other situations where the outcome is worse (right now it's using 0.1 .* ones(...), which is probably not a good idea).

droodman commented 4 months ago

Would it make sense to call GLM to pick the starting values? My cmp package for Stata does something like that. Before fitting a multi-equation model, where the equations could be linear, probit, tobit, etc., it runs the relevant Stata command on each equation separately to pick starting values for coefficients. Indeed, I might even make reghdfejl do that before calling GLFixedEffectModels.