jmboehm / GLFixedEffectModels.jl

Fast estimation of generalized linear models with high dimensional categorical variables in Julia

Matrix multiplication errors when fitting model #25

Closed nilshg closed 2 years ago

nilshg commented 3 years ago

I ran into this odd issue today: Consider the data set

julia> using CSV, DataFrames, GLFixedEffectModels

julia> datastring = "Outcome,Treatment,Group\nfalse,New,IL\nfalse,Old,IL\nfalse,Old,EL\nfalse,Old,IL\nfalse,New,IL\nfalse,Old,IL\nfalse,Old,IL\nfalse,Old,IL\nfalse,Old,IL\nfalse,Old,IL\nfalse,Old,IL\nfalse,Old,IL\nfalse,Old,IL\nfalse,Old,IL\nfalse,Old,IL\nfalse,Old,IL\nfalse,Old,IL\nfalse,Old,IL\nfalse,Old,IL\nfalse,Old,IL\nfalse,Old,IL\nfalse,Old,IL\nfalse,New,EL\nfalse,Old,IL\nfalse,Old,IL\nfalse,Old,IL\nfalse,New,IL\nfalse,Old,IL\nfalse,Old,IL\nfalse,Old,IL\n";

julia> data = CSV.read(IOBuffer(datastring), DataFrame)
30×3 DataFrame 
 Row │ Outcome  Treatment  Group       
     │ Bool     String     String 
─────┼────────────────────────────
   1 │   false  New        IL
   2 │   false  Old        IL
   3 │   false  Old        EL
   4 │   false  Old        IL
   5 │   false  New        IL
   6 │   false  Old        IL
   7 │   false  Old        IL
   8 │   false  Old        IL
   9 │   false  Old        IL
  10 │   false  Old        IL
  11 │   false  Old        IL
  12 │   false  Old        IL
  13 │   false  Old        IL
  14 │   false  Old        IL
  15 │   false  Old        IL
  16 │   false  Old        IL
  17 │   false  Old        IL
  18 │   false  Old        IL
  19 │   false  Old        IL
  20 │   false  Old        IL
  21 │   false  Old        IL
  22 │   false  Old        IL
  23 │   false  New        EL
  24 │   false  Old        IL
  25 │   false  Old        IL
  26 │   false  Old        IL
  27 │   false  New        IL
  28 │   false  Old        IL
  29 │   false  Old        IL
  30 │   false  Old        IL

I appreciate that this is degenerate as all outcomes are zero, but the issue is the same in my full data (which has about 0.3% positives).

Running a regression on this is fine:

julia> nlreg(data,
             @formula(Outcome ~ Treatment + fe(Group)),
             Binomial(), LogitLink())
                   Generalized Linear Fixed Effect Model
===========================================================================
Distribution:             "Binomial"   Link:                    "LogitLink"
Number of obs:                    30   Degrees of freedom:                4
Deviance:                      7.579   Iterations:                        1
Converged:                      true
===========================================================================
                Estimate Std.Error    t value Pr(>|t|)  Lower 95% Upper 95%
---------------------------------------------------------------------------
Treatment: New       0.1 2.73971e7 3.65002e-9    1.000 -5.36973e7 5.36973e7
Treatment: Old       0.1 2.73971e7 3.65002e-9    1.000 -5.36973e7 5.36973e7
===========================================================================

However for subsets of the data, I get a matrix multiplication error:

julia> nlreg(data[1:12, :],
            @formula(Outcome ~ Treatment + fe(Group)),
            Binomial(), LogitLink())
ERROR: DimensionMismatch("second dimension of A, 2, does not match length of x, 1")
Stacktrace:
 [1] gemv!(y::Vector{Float64}, tA::Char, A::Matrix{Float64}, x::Vector{Float64}, α::Bool, β::Bool)
   @ LinearAlgebra C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\LinearAlgebra\src\matmul.jl:530      
 [2] mul!
   @ C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\LinearAlgebra\src\matmul.jl:66 [inlined]
 [3] mul!
   @ C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\LinearAlgebra\src\matmul.jl:275 [inlined]
 [4] *(A::Matrix{Float64}, x::Vector{Float64})
   @ LinearAlgebra C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\LinearAlgebra\src\matmul.jl:47       
 [5] nlreg(df::Any, formula::FormulaTerm, distribution::Binomial{Float64}, link::LogitLink, vcov::CovarianceEstimator; weights::Union{Nothing, Symbol}, subset::Union{Nothing, AbstractVector{T} where T}, start::Union{Nothing, AbstractVector{T} where T}, maxiter_center::Int64, maxiter::Int64, contrasts::Dict{Symbol, Any}, dof_add::Int64, save::Union{Bool, Symbol}, method::Symbol, drop_singletons::Bool, double_precision::Bool, dev_tol::Float64, rho_tol::Float64, step_tol::Float64, center_tol::Float64, vcovformula::Union{Nothing, Expr, Symbol}, subsetformula::Union{Nothing, Expr, Symbol}, verbose::Bool)
   @ GLFixedEffectModels C:\Users\ngudat\.julia\packages\GLFixedEffectModels\isPxH\src\fit.jl:281
 [6] nlreg (repeats 2 times)
   @ C:\Users\ngudat\.julia\packages\GLFixedEffectModels\isPxH\src\fit.jl:61 [inlined]
 [7] top-level scope
   @ REPL[7]:1

Now, interestingly, this happens seemingly at random for some subsets of the data: data[1:13, :] is fine, as is data[1:14, :], while data[1:15, :] errors again. Initially I thought this was related to some levels not being present in the subset, somehow messing up the dimensions of a matrix in the fitting procedure, but in the examples above no additional levels appear when going from rows 1:12 to rows 1:13.

Any ideas what might be causing this? I'm getting this error on my full data set (55m rows) so it's not an issue that's due to me generating a weird test data set.

nilshg commented 3 years ago

I have traced this down some of the way: the issue appears to be in the basecol function, so presumably in the factorization. When using data[1:14, :] from the above data, the Xdemean produced by FixedEffects.solve_residuals! is:

Xdemean1 = [0.42254862724499154 -0.4225486272449915; 
           -0.076827023135453 0.07682702313545303; 
           -0.076827023135453 0.07682702313545303; 
            0.42254862724499154 -0.4225486272449915; 
           -0.076827023135453 0.07682702313545303; 
           -0.076827023135453 0.07682702313545303; 
           -0.076827023135453 0.07682702313545303; 
           -0.076827023135453 0.07682702313545303; 
           -0.076827023135453 0.07682702313545303; 
           -0.076827023135453 0.07682702313545303; 
           -0.076827023135453 0.07682702313545303; 
           -0.076827023135453 0.07682702313545303; 
           -0.076827023135453 0.07682702313545303]

while with rows 1 to 15 it is:

Xdemean2 = [0.4280362717546667 -0.4280362717546668; 
           -0.07133937862577783 0.07133937862577772; 
           -0.07133937862577783 0.07133937862577772; 
            0.4280362717546667 -0.4280362717546668; 
           -0.07133937862577783 0.07133937862577772; 
           -0.07133937862577783 0.07133937862577772; 
           -0.07133937862577783 0.07133937862577772; 
           -0.07133937862577783 0.07133937862577772; 
           -0.07133937862577783 0.07133937862577772; 
           -0.07133937862577783 0.07133937862577772; 
           -0.07133937862577783 0.07133937862577772; 
           -0.07133937862577783 0.07133937862577772; 
           -0.07133937862577783 0.07133937862577772; 
           -0.07133937862577783 0.07133937862577772]

and I get:

julia> GLFixedEffectModels.basecol(Xdemean1)
2-element BitVector:
 1
 1

julia> GLFixedEffectModels.basecol(Xdemean2)
2-element BitVector:
 0
 1

The zero produced here means getcols drops one column from Xdemean when forming Xhat. Since Xhat is what's used to compute beta_update, beta_update ends up with only one element, and the multiplication Xdemean * beta_update in eta_update = nu_orig - (nudemean - Xdemean * beta_update) ./ sqrtw fails with the dimension mismatch above.
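The mechanics of that failure can be reproduced in isolation, independent of the package (a minimal sketch with made-up numbers, using only LinearAlgebra):

```julia
using LinearAlgebra

# A rank-deficient two-column matrix like Xdemean above: the second
# column is exactly the negative of the first.
X = [0.4 -0.4; -0.1 0.1; -0.1 0.1; 0.4 -0.4]
y = [1.0, 0.0, 0.0, 1.0]

# Suppose collinearity detection keeps only the second column:
keep = [false, true]
Xhat = X[:, keep]        # 4×1 matrix
beta_update = Xhat \ y   # length-1 coefficient vector

# Multiplying the *full* two-column X by the reduced coefficient
# vector then throws the same error as in the stacktrace:
err = try
    X * beta_update
catch e
    e
end
println(typeof(err))     # DimensionMismatch
```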

I thought there might be an issue with rank deficiency, but both of the Xdemean matrices above are in fact rank deficient:

julia> using LinearAlgebra

julia> rank(Xdemean1)
1

julia> rank(Xdemean2)
1

which is perhaps obvious from just looking at them, since each column is the negative of the other:

julia> all(Xdemean2[:, 1] .≈ -1 .* Xdemean2[:, 2])
true

julia> all(Xdemean1[:, 1] .≈ -1 .* Xdemean1[:, 2])
true

I noticed that the code has a "TO DO" note about potential roundoff error; could that play a role here?
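On the roundoff question: one way to see how a tolerance enters is that basis-column detection typically compares the diagonal of a pivoted factorization against a threshold, and a numerically rank-1 matrix sits right at that threshold. Here is a hypothetical basecol-style check via pivoted QR (a sketch, not the package's actual implementation; `ColumnNorm` requires Julia ≥ 1.7):

```julia
using LinearAlgebra

# Hypothetical basecol-style detection: keep the columns whose
# pivoted-QR diagonal exceeds a relative tolerance.
function basis_columns(X; rtol = sqrt(eps(eltype(X))))
    F = qr(X, ColumnNorm())
    d = abs.(diag(F.R))             # decreasing by construction
    keep = falses(size(X, 2))
    keep[F.p[d .> rtol * d[1]]] .= true
    return keep
end

# A matrix like Xdemean (second column = -first column) is rank 1,
# so only one basis column survives:
X = [0.42 -0.42; -0.07 0.07; -0.07 0.07]
basis_columns(X)   # one true, one false
```

Whether the near-zero second diagonal falls above or below `rtol * d[1]` is exactly where roundoff in the demeaned columns could flip the answer between `[1, 1]` and `[0, 1]`.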

I also tried replicating this with alpaca in R, but alpaca complains when the left-hand side doesn't contain both trues and falses. After setting the first entry of Outcome in the data above to 1, alpaca produces estimates from the data (tried with between 10 and 30 rows).

jmboehm commented 3 years ago

Hi Nils,

Apologies for taking a while to get back to you on this. I've mostly encountered this issue in the context of statistical separation, see e.g. https://github.com/sergiocorreia/ppmlhdfe/blob/master/guides/separation_primer.md. For logit models with fixed effects that's relatively common. I've merged a commit that adds thresholds (on mu) to detect separation. Could you check whether the package (on master) correctly warns when you set, for example, separation_mu_lbound = 1e-15 and separation_mu_ubound = 1.0 - 1e-15? You could also check whether the estimation converges when you set separation = :mu.
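For concreteness, a sketch of the call with those options (assuming the keyword names above can be combined this way; I haven't run this against your data):

```julia
using GLFixedEffectModels

# Flag observations whose fitted mu is pinned at the boundary,
# using the threshold values suggested above.
nlreg(data, @formula(Outcome ~ Treatment + fe(Group)),
      Binomial(), LogitLink();
      separation = :mu,
      separation_mu_lbound = 1e-15,
      separation_mu_ubound = 1.0 - 1e-15)
```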

The package isn't very good at detecting separation yet; more work is needed here.

jmboehm commented 2 years ago

Thanks to @caibengbu's work on separation detection, this no longer errors in v.0.4.0 and gives:

julia> nlreg(data[1:12, :],             
                   @formula(Outcome ~ Treatment + fe(Group)),             
                   Binomial(), LogitLink())
[ Info: 1 observations detected as singletons. Dropping them ...
[ Info: Multicollinearity detected among columns of X and FixedEffects. Dropping regressors: Treatment: Old
                    Generalized Linear Fixed Effect Model                    
==============================================================================
Distribution:               "Binomial"  Link:                      "LogitLink"
Number of obs:                      11  Degrees of freedom:                  2
Deviance:                        0.000  Iterations:                         13
Converged:                        true  
==============================================================================
                   Estimate Std.Error     t value Pr(>|t|) Lower 95% Upper 95%
------------------------------------------------------------------------------
Treatment: New  -0.00417672   576.467 -7.24536e-6    1.000  -1129.86   1129.85
Treatment: Old          0.0       NaN         NaN      NaN       NaN       NaN
==============================================================================

I don't know whether that's what you expected, but I'm closing this unless you feel that the problem is not sufficiently addressed. Please feel free to re-open in that case.

caibengbu commented 2 years ago

To add to @jmboehm's comment, the problem was mainly due to collinearity. The full-sample regression nlreg(data, @formula(Outcome ~ Treatment + fe(Group)), Binomial(), LogitLink()) was not run properly either: the estimate was [0.1, 0.1], which is very likely just the initial point, i.e. the algorithm converged prematurely. This comes from the same failure to detect collinearity that made the sub-sample regression fail. (I'm not sure what determines whether the detection failure results in a DimensionMismatch error or in premature convergence.)

In version 0.3.0 this error does not appear if the string variable is replaced with a boolean variable, i.e. data.Treatment_dummy = data.Treatment .== "New". Doing this is equivalent to dropping one level of Treatment, and I think it gives a result closer to what you wanted.
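A self-contained sketch of why the boolean recode helps (toy vectors, stdlib only): expanding both Treatment levels into indicator columns gives columns that sum to one in every row, so after demeaning (absorbing the fixed effect) each column is the exact negative of the other, just like Xdemean above; a single dummy keeps full column rank.

```julia
using LinearAlgebra, Statistics

treatment = ["New", "Old", "Old", "New", "Old", "Old"]

# Both levels as indicators: the two columns sum to 1 in every row.
X_full = hcat(Float64.(treatment .== "New"), Float64.(treatment .== "Old"))
Xd = X_full .- mean(X_full, dims = 1)
rank(Xd)    # 1: two columns, rank-deficient after demeaning

# One dummy (the Treatment_dummy recode): full column rank.
X_dummy = reshape(Float64.(treatment .== "New"), :, 1)
Xd1 = X_dummy .- mean(X_dummy, dims = 1)
rank(Xd1)   # 1: one column, full column rank
```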

If we consider the problem of separation, the example dataset is completely separated by the fixed effects, because every outcome takes the same value (false).
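One way to see the complete separation numerically (plain Julia sketch, not package code): with every outcome false, the binomial deviance is strictly decreasing in the linear predictor, so it can be pushed arbitrarily close to zero by sending the group fixed effect toward minus infinity, and no finite MLE exists.

```julia
# Deviance of an intercept-only logit with n all-false outcomes:
# each observation contributes 2 * log(1 + exp(eta)).
n = 12
dev(eta) = 2n * log1p(exp(eta))

dev(-1.0)    # moderate deviance
dev(-10.0)   # already tiny
dev(-30.0)   # essentially zero: the optimizer chases eta -> -Inf
```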

(screenshot attached, 2022-05-20)