iamlemec / fastreg

Fast sparse regressions with advanced formula syntax. OLS, GLM, Poisson, Maxlike, and more. High-dimensional fixed effects.
MIT License

Is fr.negbin, fr.poisson and fr.zinf_negbin implemented? #2

Open vincLohm opened 1 year ago

vincLohm commented 1 year ago

I don't see any module called negbin, poisson or zinf_negbin in this version. However, the README says these GLM methods are implemented? At the moment I'm using fr.design_matrices for the nice FE nomenclature and then going back to statsmodels. Totally fine if it's not implemented, just confused by the README :)

iamlemec commented 1 year ago

Ah yeah, you need to have jax installed to use the GLM routines (to get the derivatives and do gradient descent). If it's not installed, those functions won't show up. They're all implemented in general.py and are mostly just wrappers around the glm routine. Also, the "zinf" ones should be poisson_zinf and negbin_zinf, just fixed those in the README and added a note about installing jax. Thanks!
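(A quick way to confirm the jax side of this, independent of fastreg: the derivative machinery just needs jax and jaxlib to import cleanly. The snippet below is only a toy check, not fastreg code.)

```python
# toy check that jax is installed and can take derivatives
import jax
import jax.numpy as jnp

f = lambda b: jnp.sum((b - 2.0) ** 2)
print(jax.grad(f)(jnp.array([0.0, 1.0])))  # expect roughly [-4. -2.]
```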

vincLohm commented 1 year ago

Thanks a lot for the quick reply! I've now installed jax[cpu] and jaxlib via pip (not sure I want pip to handle the CUDA installation yet), and I'm importing jax and jaxlib before fastreg, but it still doesn't work. Sorry, I've never heard of jax (I have an economics and math background), and I apologise if I'm being super dumb here.

iamlemec commented 1 year ago

Hmm, interesting. What's the error message when you try accessing one of the GLM routines like fr.glm? Also what's the value of fr.HAS_JAX? That should be True when it's working.
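For reference, something like this is enough to check both from a Python session (names as used in this thread; exact attributes could differ by version):

```python
import fastreg as fr

print(fr.HAS_JAX)                  # should be True once jax/jaxlib import cleanly
print(hasattr(fr, 'poisson'))      # the GLM wrappers only show up when jax is found
print(hasattr(fr, 'negbin_zinf'))  # same for the zero-inflated variants
```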

vincLohm commented 1 year ago

Ok, my bad. I think I actually had the wrong old version of fastreg. I updated from fastreg 1.1 to 1.2. Did 1.1 not have any of the non-linear routines? I was just getting the error that fastreg does not have a module called negbin or HAS_JAX. After installing 1.2, I was missing the optax dependency, and installing that pulled in jax as well (I guess the CUDA version, since I should already have had the pure CPU version). Anyway, it works now, sorry for the trouble!

(Just a side note: do I see correctly that you changed the order of variable names and data in what fr.design_matrices returns?)

(Another side note: fr.negbin is giving me very different results from statsmodels.discrete.count_model.NegativeBinomialP. I will investigate further, but I assume JAX is using better starting values and/or negative log-likelihood minimisation algorithms? But maybe you've run some simulations yourself where you compare results?)

iamlemec commented 1 year ago

Thanks for the feedback! Good to know that it's working now. Version 1.1 also had some of the non-linear routines, but the way I was importing them then, they would just fail silently if any of the dependencies was missing. I'll add a note about optax to the README.

Side note 1: That is correct. Somehow I decided that names-first made a bit more sense.

Side note 2: Yeah, this is where things can get a bit murky. I can get things to line up to machine precision for the normal OLS stuff, but GLM is much trickier. I'm just using Adam with cosine decay, while statsmodels by default uses a Newton-like method, which will use Hessian information. I would assume the statsmodels results will be more accurate, but it can't really handle high dimensions. I've also run into situations where statsmodels is wrong too, and I can only get correct results from the sklearn routines (though this was for Poisson). What kind of data are you getting these sorts of outcomes for? I would be interested in testing it out and seeing if there is some way to improve it. Also, do you have a sense for which of fastreg and statsmodels is closer to correct?
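For concreteness, the optimizer setup is roughly the sketch below, written against plain jax/optax on a toy Poisson negative log-likelihood rather than fastreg's actual internals; the step count and learning rate are illustrative.

```python
# rough sketch of Adam with a cosine-decay schedule on a toy Poisson NLL
# (not fastreg's real code; just illustrates the optimizer choice)
import jax
import jax.numpy as jnp
import optax

def poisson_nll(beta, X, y):
    # negative log-likelihood up to a constant: sum(exp(X@b) - y*(X@b))
    eta = X @ beta
    return jnp.sum(jnp.exp(eta) - y * eta)

def fit_adam(X, y, steps=2000, lr=0.1):
    beta = jnp.zeros(X.shape[1])
    sched = optax.cosine_decay_schedule(init_value=lr, decay_steps=steps)
    opt = optax.adam(learning_rate=sched)
    state = opt.init(beta)
    grad_fn = jax.jit(jax.grad(poisson_nll))
    for _ in range(steps):
        grads = grad_fn(beta, X, y)
        updates, state = opt.update(grads, state)
        beta = optax.apply_updates(beta, updates)
    return beta
```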

vincLohm commented 1 year ago

Ok cool, thanks for the clarifications!

I was using it on some panel data with year and geographical-unit FEs. I don't really have enough observations to cleanly estimate the FEs at the moment. It's sort of just a detrending; maybe it would be better to manually remove the geographical-unit means. Since this is real data, I have no idea what the true parameters are.

The way you would check the accuracy of these methods is to simulate data from the assumed data-generating process and see whether the estimator can recover the true parameters. Ideally, if you run it often enough (which may be tough here because each estimation takes some time), you can even check whether the standard errors are correct too.
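Something like this minimal version for a Poisson DGP (swap in NB as needed; statsmodels here just stands in for whichever estimator is being checked):

```python
# simulate from a known Poisson DGP and check parameter recovery
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5000
beta_true = np.array([0.5, -1.0, 0.3])

X = sm.add_constant(rng.normal(size=(n, 2)))
y = rng.poisson(np.exp(X @ beta_true))

res = sm.Poisson(y, X).fit(disp=0)
print(res.params)  # should be close to beta_true
print(res.bse)     # compare against the spread of estimates over many replications
```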

I already ran some sims with NB and zero-inflated NB to be sure instrumental variables (control function estimation) would work. I also once compared statsmodels NB and zero-inflated NB output to an R package (I forget which one I used) and got the same estimates. I think before I use this extensively I'll do some sims comparing statsmodels, your fastreg, sklearn (as far as I can see that only has Poisson though :( ), and the R package. Most likely, discrepancies will show up when using high-dimensional FEs. Intuitively I would assume it gets very tough for gradient-descent methods when you have that many parameters. I'll probably write up the results in LaTeX, just for my own reference, and I could send it to you too if you're interested.