jmboehm / GLFixedEffectModels.jl

Fast estimation of generalized linear models with high dimensional categorical variables in Julia
Other
33 stars 6 forks source link

Bump compat bounds #19

Closed nilshg closed 3 years ago

nilshg commented 3 years ago

This wasn't quite as straightforward as I'd hoped unfortunately - it seems to me the main issues were around the usage of CategoricalArrays, which I think has changed quite a bit between DataFrames 0.21 and 1.1. Mainly, it's not re-exported by DataFrames anymore (https://github.com/JuliaData/DataFrames.jl/pull/2404), so I added a couple of explicit using CategoricalArrays in places.

I then got a few errors in testing where categorical() complained that the argument passed was already a CategoricalVector (in the tests for clustered vcov matrices), so I changed those tests to not make the columns in the test data categoricals upfront.

I did have to cheat a bit to get the tests to pass - a couple of the estiamted covariance matrices are now only approximately similar to the hardcoded results at atol = 1e-3, and one of them only at 1e-2. Unfortunately I don't understand the package well enough to tell whether this is due to the changes I made around categorical inputs into the model.

One other thing I changed was the small regression that's being run during precompilation, which triggered the "Weights are not finite" error - I changed it to a similar small regression that works fine, but it might be worth considering why the original example doesn't work anymore.

On a related note, I'm a bit confused by the re-exports - the precompilation run in GLFixedEffectModels.jl uses Binomial() but GLM.LogitLink(), despite using GLM. StatsModels is re-exported presumably for the @formula macro, but I guess every user wants to have Binomial(), LogitLink() etc. available when using this package, so might it make sense to re-export GLM?

It'd be great if you could take a quick look to see if this is good enough or if something has to change. Performance of the package is great, with this PR I'm seeing:

julia> using DataFrames, GLM, GLFixedEffectModels

julia> n = 10_000_000;

julia> df = DataFrame(y = rand(n), x = rand(n), id = rand(1:2, n), t = rand(1:10, n));

julia> @time nlreg(df, @formula(y ~ x + fe(id) + fe(t)), Binomial(), LogitLink())
 16.717639 seconds (30.47 M allocations: 6.127 GiB, 5.35% gc time, 0.45% compilation time)
               Generalized Linear Fixed Effect Model
====================================================================
Distribution:          "Binomial"  Link:                 "LogitLink"
Number of obs:           10000000  Degrees of freedom:            13
Deviance:             3862610.810  Iterations:                     2
Converged:                   true
====================================================================
         Estimate  Std.Error   t value Pr(>|t|)  Lower 95% Upper 95%
--------------------------------------------------------------------
x    -0.000392201 0.00219091 -0.179013    0.858 -0.0046863 0.0039019
====================================================================

julia> @time nlreg(df, @formula(y ~ x + fe(id) + fe(t)), Binomial(), LogitLink())
  6.484966 seconds (299.22 k allocations: 4.343 GiB, 8.38% gc time, 2.88% compilation time)
               Generalized Linear Fixed Effect Model
====================================================================
Distribution:          "Binomial"  Link:                 "LogitLink"
Number of obs:           10000000  Degrees of freedom:            13
Deviance:             3862610.810  Iterations:                     2
Converged:                   true
====================================================================
         Estimate  Std.Error   t value Pr(>|t|)  Lower 95% Upper 95%
--------------------------------------------------------------------
x    -0.000392201 0.00219091 -0.179013    0.858 -0.0046863 0.0039019
====================================================================
jmboehm commented 3 years ago

Thanks for taking a stab at this. I looked at this on Friday night and came up with very similar conclusions.

nilshg commented 3 years ago

Thanks for reviewing! I've now re-exported GLM and restricted compat to FixedEffects <2.0.4, which was indeed the cause of those test failures on the vcov matrices. I've changed the tests back to 1e-4 as they are now passing again with FixedEffects 2.0.3.

Would you consider registering this package in the General registry? I'm using it in a project currently which I'll have to share ultimately so it would make life slightly easier if GLFixedEffectModels was a normal dependency from General.

jmboehm commented 3 years ago

Thanks. Not sure what the issue with the tests on travis is; try bumping the Julia version to 1.6 in .travis.yml or removing the Manifest.toml. This usually helps for me...

The package is registered in the General registry, is it not?

nilshg commented 3 years ago

Sorry, it is indeed - I went by the README, which says to add from the GitHub repo.

Re Travis, the failure is indeed completely unrelated to the package and seems to occur during the startup phase of the job. Travis is shutting down next week anyway, maybe worth migrating to GitHub Actions?

jmboehm commented 3 years ago

Sounds good. I'll merge this now, then fix the readme and look into how to migrate to Github Actions.