Open Nosferican opened 7 years ago
I don't think the get
approach is very Julian. Why not just define functions for these? Depending on how general they are, we could imagine defining some of them in StatsBase.
I have the get
since seemed like a more flexible approach than getfield
when the structure or names of the fields is undetermined. It works similar to the get
for nullable
which returns the value. It expands Base.get
on a struct it is not defined for so it respects Base
. Some of the fields I used get
for where:
X
: for the design matrix (in 2SLS models it refers to the exogenous variables and the predicted values of the endogenous variables)Bread
: inv(X' * X)
used in many calculation and especially for the variance-covariance estimations. It is cached as to eliminate the need to inverse the matrix repeatedly.MRSS
: RSS / rdf (I believe StatsBase
could have some for unbiased deviance measure or so.N
: Number of observations used in the estimation. It differs from nobs
for certain models such as the between estimator or first-difference.n
: Number of panels (could be adapted to number of Primary Survey Units as well)Estimator
: The estimator used.PID
: Indicators of which observations belong to each panelTID
: Indicators of which observations belong to each time periodT
: Length of panelsIdiosyncratic
: Idiosyncratic error component (only random effects use this one)Individual
: Individual error component (only random effects use this one)θ
: Parameter for quasi-demeaning transformation (only random effects use this one)z
: Matrix of column wise endogenous variablesZ
: Matrix of instruments
Some refer to the DataFrames
framework such as:Formula
: The DataFrames.Formula
IV
: DataFrames.Formula
(endogenous ~ instruments)Varlist
: Number of variables estimated (excludes those that were dropped due to multicolinearity)Intercept
: If the model has an intercept. Some estimators force the intercept even if it is requested in the formula to not to be used. It is not used in the intermediate steps, but added in the final model.Some of these could potentially be added to StatsBase
and I would delighted to have those. However, some are not shared for all StatsBase.RegressionModel
and would not want to impose values on further development if they are more harmful than beneficial. Technically every regression model has a PID and TID which in unique measurements models is just the number of the observation for PID and ones for TID. In that framework T would just be ones for each panel/PSU.
Maybe some that could be added to StatsBase
are: MRSS (unbiased deviance?), N (in order to have nobs
better defined), X (design matrix?)... Bread could be something to consider since many implementation do cache this value and its quite useful for many applications. Estimator* can be generalized. For example, GLM (defines and estimator as a Family
and Link
).
Formula
and Varlists
are the elements kept from DataFrames
. If using the DataFrames.DataFramesRegressionModel
then those could potentially be functions of the DataFrames.ModelFrame
and design matrix from DataFrames.ModelMatrix
.
Thanks for the list. get
is not a good solution since it will be type-unstable.
Some of the cases you listed could be variants of existing functions, e.g. n
and N
could be obtained via nobs
with a symbol argument. intercept
could be obtained using coef
or params
. Others would deserve new functions in the API, like model matrix, formula... Not sure about others.
Also, it looks like lots of these concepts are common with MixedModels.jl. Have you looked ta that package? Cc: @dmbates
I agree with @nalimilan that get
is not the best solution here. @Nosferican take a look at https://github.com/gragusa/InstrumentalVariables.jl. There I have a small IV model where only three new functions have to be defined to use CovarianceMatrices
. I understand that more complex models have more complex requirements. The one that I find pressing is nobs
that means different things in different context (weighted/unweighted, data size, estimation sample size, etc.).
Wouldn't be enough to have instead of nobs
something wrknobs
or effnobs
which would return "Number of observations used in the estimation"?
More generally, I think I am not getting part of the discussion (totally my fault). Suppose you have a linear model type. I would define the type as follow
struct MyLinPredModel <: LinPredModel
end
struct MyModel
model::MyLinPredModel
...
end
You need then to define the following methods for MyLinPredModel
function ModelMatrix(::MyLinPredModel)
end
function wrkresid(::MyLinPredModel)
end
function wrkwts(::MyLinPredModel)
end
function nobs(::MyLinPredModel)
end
vcov(x::MyModel, k::RobustVariance) = vcov(x.model, k)
Then everything should work. Could this potentially work?
My implementation has a very naive way of making get type-stable, but I am sure there are better ways. My get implementation is:
abstract type ModelStatistic end
struct ModelStatistic_RSS
value::Float64
end
struct Model <: StatsBase.RegressionModel
model_statistic::Dict{Symbol,ModelStatistic}
end
function Base.get(obj::ModelStatistic)
getfield(obj, :value)
end
function Base.get(obj::Model, value::Symbol)
model_stats = getfield(obj, :model_stats)
get(model_stats, value)
end
I came up with this solution which has its pros and drawbacks. For example, it is strict in the sense no longer I can make calculations between RSS
and rdf
(dof_residuals) since these are ModelStatistics
. Ipso facto,
struct ModelStatistic_MRSS <: ModelStatistic
value::Float64
function ModelStatistic_MRSS(RSS::ModelStatistic_RSS, rdf::ModelStatistic_rdf)
RSS = get(RSS)
rdf = get(rdf)
RSS / rdf
end
end
The pros is that I am less likely to make an error, and can use the inner constructors to verify properties of the statistics. It also allows the model_stats
dictionary to be more efficient than using {Symbol, Any}
. Any thoughts on this approach are super welcomed since I think I have a very basic understanding of Julia in general (basically what I read from the manual).
Aye. MixedModels
is very similar to panel econometrics. Some of the estimators are especial cases. The difference for those estimators is on how to estimate the error components. The statistics way use MLE such as in MixedModels
while in econometrics we tend to use GLS framework since it requires less strict assumptions. the two approaches also differ quite a bit in unbalanced panels which are very common in economics.
The MixedModels
uses a modified Formula
for including endogenous variables. I decided to keep them as two different DataFrames.Formula
, but if a decision has been reached on expanding DataFrames.Formula
or StatsModels.Formula
I could adapt to follow it.
The individual, idiosyncratic, and thetas are bunched in a dictionary as ErrorComponent in R plm
package. I could follow that strategy and implement the functions to access them only for the structures that have it.
@gragusa I think that's a good overall scheme. The only issues I can think of at least for HC estimators is for example with HC1 or HC4 which use uses the dof
value and that one differs from the one implied by size(ModelMatrix, 2). However, for fixed effects, the HC estimators are not consistent so I only allow users to specify :OLS
or :PID
(clustered at panels); this decision is the one implemented in Stata since Stata 12 or 13 if not mistaken. For CRVE (HC0/HC1), the code could adjust the dof
by adding the ncluster - 1 to it (this would change if I implement two-way errors or allow clustering at a higher dimension in the future so is not super robust). I am not familiar with HAC for panels since people usually care more about the independence assumption and clustering is heteroscedastic, autocorrelation, and cluster robust already. Still in the extreme case that independence is valid, but heteroscedasticity and autocorrelation are present why not offer a solution for that case.
How about adding named arguments for dof and cluster ids? Then it could work something like:
vcov(x::Matrix{Float64},
k::RobustVariance,
û::Vector{Float64},
wrkwts::Vector{Float64};
dof::Int64 = size(x,2),
cl::Vector{UnitRange{Int64}} = map(elem -> elem:elem, 1:size(x, 1)))
Then I could make StatsBase.vcov(model::MyModel, variant = VCE)
a wrapper around that? It would also work quite nicely with GLM
by taking the values and then passing those to the same function.
@Nosferican In HC4
I use _nobs(l::LinPredModel) = length(l.rr.y)
you could define this for your model.
You do not need arguments if you can get dof
and cl
from types. As @nalimilan, this would be extremely un-julian thing to do.....
I can modify CRVE
to use dof
. Would this be enough?
For panel with fixed effect, HC is inconsistent, but could be corrected (and this can be done inside the API (obtain the HC and the apply the bias adjustment).
@gragusa dof
cannot be implied from the model matrix for within estimators. The size of the model matrix will give you the incorrect model degree of freedom. The adjustment factor is the number of panels - 1 (to account for the degrees of freedom used for the group means). The clusters are also not implied from the model matrix. The alternative would be to dispatch different methods:
vcov(x::Matrix{Float64},
k::RobustVariance,
û::Vector{Float64},
wrkwts::Vector{Float64})
vcov(x::Matrix{Float64},
k::RobustVariance,
û::Vector{Float64},
wrkwts::Vector{Float64},
dof::Int64)
vcov(x::Matrix{Float64},
k::RobustVariance,
û::Vector{Float64},
wrkwts::Vector{Float64},
cl::Vector{UnitRange{Int64}})
vcov(x::Matrix{Float64},
k::RobustVariance,
û::Vector{Float64},
wrkwts::Vector{Float64},
dof::Int64,
cl::Vector{UnitRange{Int64}})
would this be the Julian-way? If such, I can adapt my code to eschew named arguments with default implied values and define different methods.
I think addressing the dof
and cl
should be sufficient to port the VCE estimates.
@nalimilan @gragusa... Considering other packages besides mine and maximum code re-use... what about
vcov(model::StatsBase.RegressionModel, variant::Symbol)
vcov(model::StatsBase.RegressionModel, variant::Symbol, cl::Vector{UnitRange{Int64}})
where
ModelMatrix = StatsBase.designmatrix(model) # or design_matrix, etc.
wrkresid = StatsBase.residuals(model)
wrkwts = StatsBase.weights(model) # or the weights naming convention used by StatsBase
_nobs = StatsBase.nobs(model)
dof = StatsBase.dof(model)
For Julia 0.7 the migration to a DataFrames
and StatsModels
ecosystem will be completed. As part of the changes coming is the shift from the DataFrames.DataFrameRegressionModel
which GLM
and this package rely to StatsBase.RegressionModel
inheritance StatsModels#32. I have added a few methods to StatsBase
which will help the transition like coefnames
and model_matrix
(Commit #301). I would be happy to help in the transition so more people can use the package.
@lbittarello if Microeconometrics inheritances from StatsBase.RegressionModel
it should play well likewise.
It would be great to have some help. I have not touch CovarianceMatrices in a long time and I have not followed StatsModels.jl evolution.
I will try to make a overhaul of the types dependency and also improve the API. But feel free to chime in ....
On Fri, Nov 10, 2017 at 5:16 PM José Bayoán Santiago Calderón (史志鼎) < notifications@github.com> wrote:
For Julia 0.7 the migration to a DataFrames and StatsModels ecosystem will be completed. As part of the changes coming is the shift from the DataFrames.DataFrameRegressionModel which GLM and this package rely to StatsBase.RegressionModel inheritance StatsModels#32 https://github.com/JuliaStats/StatsModels.jl/issues/32. I have added a few methods to StatsBase which will help the transition like coefnames and model_matrix (Commit #301 https://github.com/JuliaStats/StatsBase.jl/pull/301). I would be happy to help in the transition so more people can use the package.
@lbittarello https://github.com/lbittarello if Microeconometrics http://Microeconometrics.jl inheritances from StatsBase.RegressionModel it should play well likewise.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/gragusa/CovarianceMatrices.jl/issues/19#issuecomment-343516632, or mute the thread https://github.com/notifications/unsubscribe-auth/AAfRgsKB5l4vdCHWWnWx93uKiPTyId1sks5s1HbRgaJpZM4OabhV .
-- Giuseppe Ragusa
I can help with the API and HC / CRVE... HAC is outside my area of expertise so I rather let you handle everything there for now. I am envisioning something in the likes of:
mutable struct MyPkgStruct<: StatsBase.RegressionModel
this has the various fields for each package, but has the following methods...
end
X = StatsBase.model_matrix(obj::StatsBase.RegressionModel)
û = StatsBase.residuals(obj::StatsBase.RegressionModel)
rdf = StatsBase.dof_residual(obj::StatsBase.RegressionModel)
vce = get_vce(obj::StatsBase.RegressionModel)
clusters = clusters(obj::StatsBase.RegressionModel)
weights = get_weights(obj::StatsBase.RegressionModel)
# Should the matrix already be weighted?
# These could be requested through the new StatsBase.params(obj::StatsBase.RegressionModel)
where
vce::CovarianceMatrices.RobustVariance # Maybe include OLS
X::AbstractMatrix
û::AbstractVector
rdf::Real
clusters::AbstractMatrix
weights::AbstractVector
Then we can provide a function for packages to use:
function estimate_vcov(obj::StatsBase.RegressionModel)
X = StatsBase.model_matrix(obj)
û = StatsBase.residuals(obj)
rdf = StatsBase.dof_residual(obj)
vce = get_vce(obj)
clusters = clusters(obj)
weights = get_weights(obj)
return vcov(vce, X, û, weights, clusters, rdf)
end
function estimate_vcov!(obj::StatsBase.RegressionModel, fieldname::Symbol)
setfield!(obj, fieldname, estimate_vcov(obj))
return
end
which updates their struct with the vcov::AbstractMatrix
.
What are your thoughts?
Would be nice to have before the transition in StatsModels.jl Terms 2.0 era.
Can you list the requirements you might need?
weights
informationmatrix
residuals
leverage
dof_residual
deviance
Do the HAC
estimators require a time indicator?Late reply —-
It depends on how we want to implement this.
The minimal interface would be for statsmodel to provide
Or something like this (this are the bread and meat of robust variances). Then packages should take care of implementing those.
Alternatively, the methods you propose are going to be useful if CovarianceMatrices has to implement variances for methods.
On Tue, 12 Mar 2019 at 22:10, José Bayoán Santiago Calderón (史志鼎) < notifications@github.com> wrote:
Would be nice to have before the transition in StatsModels.jl Terms 2.0 era.
Can you list the requirements you might need?
- weights
- informationmatrix
- residuals
- leverage
- dof_residual
- deviance Do the HAC estimators require a time indicator?
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/gragusa/CovarianceMatrices.jl/issues/19#issuecomment-472182503, or mute the thread https://github.com/notifications/unsubscribe-auth/AAfRgvWO4_739VUv0jfvysU4CXy-KtLqks5vWBfLgaJpZM4OabhV .
-- Giuseppe Ragusa
The API has the methods,
score(obj::StatisticalModel)
and informationmatrix(model::StatisticalModel; expected::Bool = true)
. That should be the minimum interface. The additional accessors dof_residual
, weights
, and deviance
should be more than enough for most cases (i.e., not including time or clusters).
Does GLM implement those? If not we should write them and make a PR.
On Mon, 3 Jun 2019 at 17:03, José Bayoán Santiago Calderón < notifications@github.com> wrote:
The API has the methods, score(obj::StatisticalModel) and informationmatrix(model::StatisticalModel; expected::Bool = true). That should be the minimum interface. The additional accessors dof_residual, weights, deviance should be more than enough for most cases (e.g., not including time or clusters).
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/gragusa/CovarianceMatrices.jl/issues/19?email_source=notifications&email_token=AAD5DASXRZ46FTAGEGO7MYTPYUXDXA5CNFSM4DTJXBK2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODWZWCMQ#issuecomment-498295090, or mute the thread https://github.com/notifications/unsubscribe-auth/AAD5DAURXDYR7OTT6W5BHW3PYUXDXANCNFSM4DTJXBKQ .
-- Giuseppe Ragusa
Hi @gragusa. I am finishing some of the decisions for my longitudinal analysis package. The structs will all be children of
StatsBase.RegressionModel
and inherit fromStatsBase.StatisticalModel
. All methods for both abstract types will be implemented such thatStatsBase.residuals(model)
will return the residuals and so on withcoef
,dof
,dof_residual
, etc. In addition, all other fields which do not have aStatsBase.RegressionModel
method to access are fully accessible throughBase.get(model, value::Symbol)
. For example, if one wants to get the MRSS one just needs to callget(model, :MRSS)
, etc. Let me know if this scheme will work for you.