statsmodels / statsmodels

Statsmodels: statistical modeling and econometrics in Python
http://www.statsmodels.org/devel/
BSD 3-Clause "New" or "Revised" License

Question about the meaning of freq_weight in the file generalized_linear_model.py #3138

Open huaxiuyao opened 8 years ago

huaxiuyao commented 8 years ago

Dear author,

I want to ask about the meaning of the parameter 'freq_weights'. I am trying to apply a weighted method to Poisson regression and negative binomial regression, where the weight depends on each test sample. Similar to WLS vs. OLS, I want to add these weights. Can I achieve this just by setting the 'freq_weights' parameter? I think that is right, and I am opening this issue to confirm.

thequackdaddy commented 8 years ago

xref #2879 and #2849

I was the original author of this. I think what you are doing sounds right.

Note a point of confusion I had when doing this originally: there are many different types of "weights" you can consider. In R and another proprietary tool I used in the past, the glm function specifically refers to variance weights, meaning that the weight changes the observation's contribution to the variance. I think most of the GLM literature I've read uses "weight" in this sense as well. Freq weights essentially duplicate the record. The parameter estimates from R and from this should be equivalent (as far as I understand), although the covariance matrix (and everything that relies on it, like p-values, log-likelihoods, etc.) will differ.

As an aside, after some tinkering around with my particular GLM models and the like, I ended up realizing that exposure was more appropriate for many of my models.

I'm excited to see someone (besides me) is using this, so please do keep me informed on your progress.

As a side note, @josef-pkt is far more educated than me on this tool, so anything I say here that he contradicts, go with him.

josef-pkt commented 8 years ago

Similar to OLS and WLS, the interpretation of the weights doesn't matter for the parameter estimates themselves, but it affects the estimates of the scale and the covariance of the parameters, cov_params.

freq_weights are a shortcut for identical repeated observations. If the freq_weights are normalized to sum to the number of observations, nobs, then nobs is the correct count, but the scale estimate would still depend on which interpretation of the weights is intended.

#2879 was supposed to add the other types of weights, but I ran out of time before summer and the 0.8 release preparation.

huaxiuyao commented 8 years ago

@josef-pkt Thank you for your reply, but sorry, I don't really understand what you mean. Do you mean it is wrong to apply freq_weights to negative binomial regression, whose scale also depends on the weights?

josef-pkt commented 8 years ago

@letou662012 What is the interpretation of your weights?

What I wanted to say is that freq_weights are case weights, and are implemented and unit tested for that case. (I don't remember whether we tested negative binomial specifically; IIRC we didn't.)

If we have other kinds of weights, then freq_weights produce in some cases the same results as those other, not-yet-implemented weight types. However, that is true only for a limited set of results.

As @thequackdaddy mentioned, if the link function is log, then exposure behaves in a very similar way to weights, but it weights only the exog, which covers a similar use case. The issues that @thequackdaddy posted, and similar ones, were from when he and I tried to figure out the definitions and impact of the different weight definitions.

huaxiuyao commented 8 years ago

@josef-pkt Oh, sorry, maybe I misunderstood your meaning. I think my question has been solved. More specifically, here is some detail about my weights. OLS optimizes the cost L = (y - Xa)^2, where a is the parameter vector. WLS revises this to L_new = (y - Xa)^2 * w(u) for each test point u. What I want to do for GLM, which uses IRLS to optimize L = log-likelihood, is to weight it as L = (log-likelihood) * w(u) for each u.

Thank you

josef-pkt commented 8 years ago

just to summarize:

max (log-likelihood) * w(u) is what we are doing, and the estimated params depend only on w(u), independent of the interpretation. But for standard errors and inference we need a more precise definition of how w(u) should be interpreted.

To be more precise: WLS uses the interpretation of weights as var_weights, i.e. sd = 1 / sqrt(w(u)) and L_new = ((y - Xa) / sd)^2, which again differs from the above L_new = (y - Xa)^2 * w(u) only in some auxiliary results.

huaxiuyao commented 8 years ago

@josef-pkt Thank you for your patience. I only want to apply w(u) and then get the regression result for u, so this has been solved. Thank you again.