Better Integration with StatsBase.RegressionModel

gragusa / CovarianceMatrices.jl

Heteroskedasticity and Autocorrelation Consistent Covariance Matrix Estimation for Julia.

Other

53 stars 9 forks source link

Better Integration with StatsBase.RegressionModel #19

Open Nosferican opened 7 years ago

Nosferican commented 7 years ago

Hi @gragusa. I am finishing some of the decisions for my longitudinal analysis package. The structs will all be children of StatsBase.RegressionModel and inherit from StatsBase.StatisticalModel. All methods for both abstract types will be implemented such that StatsBase.residuals(model) will return the residuals and so on with coef, dof, dof_residual, etc. In addition, all other fields which do not have a StatsBase.RegressionModel method to access are fully accessible through Base.get(model, value::Symbol). For example, if one wants to get the MRSS one just needs to call get(model, :MRSS), etc. Let me know if this scheme will work for you.

nalimilan commented 7 years ago

I don't think the get approach is very Julian. Why not just define functions for these? Depending on how general they are, we could imagine defining some of them in StatsBase.

Nosferican commented 7 years ago

I have the get since seemed like a more flexible approach than getfield when the structure or names of the fields is undetermined. It works similar to the get for nullable which returns the value. It expands Base.get on a struct it is not defined for so it respects Base. Some of the fields I used get for where:

X: for the design matrix (in 2SLS models it refers to the exogenous variables and the predicted values of the endogenous variables)
Bread: inv(X' * X) used in many calculation and especially for the variance-covariance estimations. It is cached as to eliminate the need to inverse the matrix repeatedly.
MRSS: RSS / rdf (I believe StatsBase could have some for unbiased deviance measure or so.
N: Number of observations used in the estimation. It differs from nobs for certain models such as the between estimator or first-difference.
n: Number of panels (could be adapted to number of Primary Survey Units as well)
Estimator: The estimator used.
PID: Indicators of which observations belong to each panel
TID: Indicators of which observations belong to each time period
T: Length of panels
Idiosyncratic: Idiosyncratic error component (only random effects use this one)
Individual: Individual error component (only random effects use this one)
θ: Parameter for quasi-demeaning transformation (only random effects use this one)
z: Matrix of column wise endogenous variables
Z: Matrix of instruments Some refer to the DataFrames framework such as:
Formula: The DataFrames.Formula
IV: DataFrames.Formula (endogenous ~ instruments)
Varlist: Number of variables estimated (excludes those that were dropped due to multicolinearity)
Intercept: If the model has an intercept. Some estimators force the intercept even if it is requested in the formula to not to be used. It is not used in the intermediate steps, but added in the final model.

Some of these could potentially be added to StatsBase and I would delighted to have those. However, some are not shared for all StatsBase.RegressionModel and would not want to impose values on further development if they are more harmful than beneficial. Technically every regression model has a PID and TID which in unique measurements models is just the number of the observation for PID and ones for TID. In that framework T would just be ones for each panel/PSU.

Maybe some that could be added to StatsBase are: MRSS (unbiased deviance?), N (in order to have nobs better defined), X (design matrix?)... Bread could be something to consider since many implementation do cache this value and its quite useful for many applications. Estimator* can be generalized. For example, GLM (defines and estimator as a Family and Link).

Formula and Varlists are the elements kept from DataFrames. If using the DataFrames.DataFramesRegressionModel then those could potentially be functions of the DataFrames.ModelFrame and design matrix from DataFrames.ModelMatrix.

nalimilan commented 7 years ago

Thanks for the list. get is not a good solution since it will be type-unstable.

Some of the cases you listed could be variants of existing functions, e.g. n and N could be obtained via nobs with a symbol argument. intercept could be obtained using coef or params. Others would deserve new functions in the API, like model matrix, formula... Not sure about others.

nalimilan commented 7 years ago

Also, it looks like lots of these concepts are common with MixedModels.jl. Have you looked ta that package? Cc: @dmbates

gragusa commented 7 years ago

I agree with @nalimilan that get is not the best solution here. @Nosferican take a look at https://github.com/gragusa/InstrumentalVariables.jl. There I have a small IV model where only three new functions have to be defined to use CovarianceMatrices. I understand that more complex models have more complex requirements. The one that I find pressing is nobs that means different things in different context (weighted/unweighted, data size, estimation sample size, etc.).

Wouldn't be enough to have instead of nobs something wrknobs or effnobs which would return "Number of observations used in the estimation"?

More generally, I think I am not getting part of the discussion (totally my fault). Suppose you have a linear model type. I would define the type as follow

struct MyLinPredModel <: LinPredModel
end

struct MyModel 
 model::MyLinPredModel
 ...
end

You need then to define the following methods for MyLinPredModel

function ModelMatrix(::MyLinPredModel)
end

function wrkresid(::MyLinPredModel)
end

function wrkwts(::MyLinPredModel)
end

function nobs(::MyLinPredModel)
end

vcov(x::MyModel, k::RobustVariance) = vcov(x.model, k)

Then everything should work. Could this potentially work?

Nosferican commented 7 years ago

My implementation has a very naive way of making get type-stable, but I am sure there are better ways. My get implementation is:

abstract type ModelStatistic end
struct ModelStatistic_RSS
    value::Float64
end
struct Model <: StatsBase.RegressionModel
    model_statistic::Dict{Symbol,ModelStatistic}
end
function Base.get(obj::ModelStatistic)
    getfield(obj, :value)
end
function Base.get(obj::Model, value::Symbol)
    model_stats = getfield(obj, :model_stats)
    get(model_stats, value)
end

I came up with this solution which has its pros and drawbacks. For example, it is strict in the sense no longer I can make calculations between RSS and rdf (dof_residuals) since these are ModelStatistics. Ipso facto,

struct ModelStatistic_MRSS <: ModelStatistic
    value::Float64
    function ModelStatistic_MRSS(RSS::ModelStatistic_RSS, rdf::ModelStatistic_rdf)
        RSS = get(RSS)
        rdf = get(rdf)
        RSS / rdf
    end
end

The pros is that I am less likely to make an error, and can use the inner constructors to verify properties of the statistics. It also allows the model_stats dictionary to be more efficient than using {Symbol, Any}. Any thoughts on this approach are super welcomed since I think I have a very basic understanding of Julia in general (basically what I read from the manual).

Aye. MixedModels is very similar to panel econometrics. Some of the estimators are especial cases. The difference for those estimators is on how to estimate the error components. The statistics way use MLE such as in MixedModels while in econometrics we tend to use GLS framework since it requires less strict assumptions. the two approaches also differ quite a bit in unbalanced panels which are very common in economics. The MixedModels uses a modified Formula for including endogenous variables. I decided to keep them as two different DataFrames.Formula, but if a decision has been reached on expanding DataFrames.Formula or StatsModels.Formula I could adapt to follow it.

The individual, idiosyncratic, and thetas are bunched in a dictionary as ErrorComponent in R plm package. I could follow that strategy and implement the functions to access them only for the structures that have it.

Nosferican commented 7 years ago

@gragusa I think that's a good overall scheme. The only issues I can think of at least for HC estimators is for example with HC1 or HC4 which use uses the dof value and that one differs from the one implied by size(ModelMatrix, 2). However, for fixed effects, the HC estimators are not consistent so I only allow users to specify :OLS or :PID (clustered at panels); this decision is the one implemented in Stata since Stata 12 or 13 if not mistaken. For CRVE (HC0/HC1), the code could adjust the dof by adding the ncluster - 1 to it (this would change if I implement two-way errors or allow clustering at a higher dimension in the future so is not super robust). I am not familiar with HAC for panels since people usually care more about the independence assumption and clustering is heteroscedastic, autocorrelation, and cluster robust already. Still in the extreme case that independence is valid, but heteroscedasticity and autocorrelation are present why not offer a solution for that case.

How about adding named arguments for dof and cluster ids? Then it could work something like:

vcov(x::Matrix{Float64},
    k::RobustVariance,
    û::Vector{Float64},
    wrkwts::Vector{Float64};
    dof::Int64 = size(x,2),
    cl::Vector{UnitRange{Int64}} = map(elem -> elem:elem, 1:size(x, 1)))

Then I could make StatsBase.vcov(model::MyModel, variant = VCE) a wrapper around that? It would also work quite nicely with GLM by taking the values and then passing those to the same function.

gragusa commented 7 years ago

@Nosferican In HC4 I use _nobs(l::LinPredModel) = length(l.rr.y) you could define this for your model.

You do not need arguments if you can get dof and cl from types. As @nalimilan, this would be extremely un-julian thing to do.....

I can modify CRVE to use dof. Would this be enough?

For panel with fixed effect, HC is inconsistent, but could be corrected (and this can be done inside the API (obtain the HC and the apply the bias adjustment).

Nosferican commented 7 years ago

@gragusa dof cannot be implied from the model matrix for within estimators. The size of the model matrix will give you the incorrect model degree of freedom. The adjustment factor is the number of panels - 1 (to account for the degrees of freedom used for the group means). The clusters are also not implied from the model matrix. The alternative would be to dispatch different methods:

vcov(x::Matrix{Float64},
    k::RobustVariance,
    û::Vector{Float64},
    wrkwts::Vector{Float64})
vcov(x::Matrix{Float64},
    k::RobustVariance,
    û::Vector{Float64},
    wrkwts::Vector{Float64},
    dof::Int64)
vcov(x::Matrix{Float64},
    k::RobustVariance,
    û::Vector{Float64},
    wrkwts::Vector{Float64},
    cl::Vector{UnitRange{Int64}})
vcov(x::Matrix{Float64},
    k::RobustVariance,
    û::Vector{Float64},
    wrkwts::Vector{Float64},
    dof::Int64,
    cl::Vector{UnitRange{Int64}})

would this be the Julian-way? If such, I can adapt my code to eschew named arguments with default implied values and define different methods.

I think addressing the dof and cl should be sufficient to port the VCE estimates.

Nosferican commented 7 years ago

@nalimilan @gragusa... Considering other packages besides mine and maximum code re-use... what about

vcov(model::StatsBase.RegressionModel, variant::Symbol)
vcov(model::StatsBase.RegressionModel, variant::Symbol, cl::Vector{UnitRange{Int64}})

where

ModelMatrix = StatsBase.designmatrix(model) # or design_matrix, etc.
wrkresid = StatsBase.residuals(model)
wrkwts = StatsBase.weights(model) # or the weights naming convention used by StatsBase
_nobs = StatsBase.nobs(model)
dof = StatsBase.dof(model)

Nosferican commented 6 years ago

For Julia 0.7 the migration to a DataFrames and StatsModels ecosystem will be completed. As part of the changes coming is the shift from the DataFrames.DataFrameRegressionModel which GLM and this package rely to StatsBase.RegressionModel inheritance StatsModels#32. I have added a few methods to StatsBase which will help the transition like coefnames and model_matrix (Commit #301). I would be happy to help in the transition so more people can use the package.

@lbittarello if Microeconometrics inheritances from StatsBase.RegressionModel it should play well likewise.

gragusa commented 6 years ago

It would be great to have some help. I have not touch CovarianceMatrices in a long time and I have not followed StatsModels.jl evolution.

I will try to make a overhaul of the types dependency and also improve the API. But feel free to chime in ....

On Fri, Nov 10, 2017 at 5:16 PM José Bayoán Santiago Calderón (史志鼎) < notifications@github.com> wrote:

For Julia 0.7 the migration to a DataFrames and StatsModels ecosystem will be completed. As part of the changes coming is the shift from the DataFrames.DataFrameRegressionModel which GLM and this package rely to StatsBase.RegressionModel inheritance StatsModels#32 https://github.com/JuliaStats/StatsModels.jl/issues/32. I have added a few methods to StatsBase which will help the transition like coefnames and model_matrix (Commit #301 https://github.com/JuliaStats/StatsBase.jl/pull/301). I would be happy to help in the transition so more people can use the package.

@lbittarello https://github.com/lbittarello if Microeconometrics http://Microeconometrics.jl inheritances from StatsBase.RegressionModel it should play well likewise.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/gragusa/CovarianceMatrices.jl/issues/19#issuecomment-343516632, or mute the thread https://github.com/notifications/unsubscribe-auth/AAfRgsKB5l4vdCHWWnWx93uKiPTyId1sks5s1HbRgaJpZM4OabhV .

-- Giuseppe Ragusa

Nosferican commented 6 years ago

I can help with the API and HC / CRVE... HAC is outside my area of expertise so I rather let you handle everything there for now. I am envisioning something in the likes of:

mutable struct MyPkgStruct<: StatsBase.RegressionModel
    this has the various fields for each package, but has the following methods...
end

X = StatsBase.model_matrix(obj::StatsBase.RegressionModel)
û = StatsBase.residuals(obj::StatsBase.RegressionModel)
rdf = StatsBase.dof_residual(obj::StatsBase.RegressionModel)

vce = get_vce(obj::StatsBase.RegressionModel)
clusters = clusters(obj::StatsBase.RegressionModel)
weights = get_weights(obj::StatsBase.RegressionModel)
# Should the matrix already be weighted?
# These could be requested through the new StatsBase.params(obj::StatsBase.RegressionModel)

where

vce::CovarianceMatrices.RobustVariance # Maybe include OLS
X::AbstractMatrix
û::AbstractVector
rdf::Real
clusters::AbstractMatrix
weights::AbstractVector

Then we can provide a function for packages to use:

function estimate_vcov(obj::StatsBase.RegressionModel)
    X = StatsBase.model_matrix(obj)
    û = StatsBase.residuals(obj)
    rdf = StatsBase.dof_residual(obj)
    vce = get_vce(obj)
    clusters = clusters(obj)
    weights = get_weights(obj)
    return vcov(vce, X, û, weights, clusters, rdf)
end
function estimate_vcov!(obj::StatsBase.RegressionModel, fieldname::Symbol)
    setfield!(obj, fieldname, estimate_vcov(obj))
    return
end

which updates their struct with the vcov::AbstractMatrix.

What are your thoughts?

Nosferican commented 5 years ago

Would be nice to have before the transition in StatsModels.jl Terms 2.0 era.

Can you list the requirements you might need?

weights
informationmatrix
residuals
leverage
dof_residual
deviance Do the HAC estimators require a time indicator?

gragusa commented 5 years ago

Late reply —-

It depends on how we want to implement this.

The minimal interface would be for statsmodel to provide

pseudohessian
pseudoscore

Or something like this (this are the bread and meat of robust variances). Then packages should take care of implementing those.

Alternatively, the methods you propose are going to be useful if CovarianceMatrices has to implement variances for methods.

On Tue, 12 Mar 2019 at 22:10, José Bayoán Santiago Calderón (史志鼎) < notifications@github.com> wrote:

Would be nice to have before the transition in StatsModels.jl Terms 2.0 era.

Can you list the requirements you might need?

weights

informationmatrix

residuals

leverage

dof_residual

deviance Do the HAC estimators require a time indicator?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/gragusa/CovarianceMatrices.jl/issues/19#issuecomment-472182503, or mute the thread https://github.com/notifications/unsubscribe-auth/AAfRgvWO4_739VUv0jfvysU4CXy-KtLqks5vWBfLgaJpZM4OabhV .

-- Giuseppe Ragusa

Nosferican commented 5 years ago

The API has the methods, score(obj::StatisticalModel) and informationmatrix(model::StatisticalModel; expected::Bool = true). That should be the minimum interface. The additional accessors dof_residual, weights, and deviance should be more than enough for most cases (i.e., not including time or clusters).

gragusa commented 5 years ago

Does GLM implement those? If not we should write them and make a PR.

On Mon, 3 Jun 2019 at 17:03, José Bayoán Santiago Calderón < notifications@github.com> wrote:

The API has the methods, score(obj::StatisticalModel) and informationmatrix(model::StatisticalModel; expected::Bool = true). That should be the minimum interface. The additional accessors dof_residual, weights, deviance should be more than enough for most cases (e.g., not including time or clusters).

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/gragusa/CovarianceMatrices.jl/issues/19?email_source=notifications&email_token=AAD5DASXRZ46FTAGEGO7MYTPYUXDXA5CNFSM4DTJXBK2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODWZWCMQ#issuecomment-498295090, or mute the thread https://github.com/notifications/unsubscribe-auth/AAD5DAURXDYR7OTT6W5BHW3PYUXDXANCNFSM4DTJXBKQ .

-- Giuseppe Ragusa