jpfairbanks / SemanticModels.jl

A julia package for representing and manipulating model semantics

Semantic Constraints for Stats Modeling #183

Open jpfairbanks opened 5 years ago

jpfairbanks commented 5 years ago

@crherlihy and I discussed some "semantic constraints" we want to add to our framework. In building statistical models there are rules about what kinds of transformations are valid for different types of measurements and their roles in the model. The proposed implementation is similar to Unitful: we tag data with its modeling role in the type system and then use multiple dispatch to enforce some semantic rules during arithmetic. The modeling functions then check the semantic roles of their inputs and raise an error if they are not as expected.

Static Modeling

In the regression/classification setting you have a formula like y_i ~ f(x_i). The left-hand side (LHS) is the dependent variable and the right-hand side (RHS) holds the independent variables. You can't use a measured variable on both sides of the equation. This is easy to detect for simple enough code, but once feature engineering comes in, information from the LHS can leak into engineered features and then find its way onto the RHS of the regression model. We want to wrap data in types so that this can't happen.

Something like this should work:

mutable struct DepVar{T}
  value::T
end

mutable struct IndVar{T}
  value::T
end

const Var = Union{DepVar, IndVar}

# this loop is notional, but the behavior is governed by the results on these 16 cases.
for op in (:+, :-, :*, :/)
    for A in (DepVar, IndVar)
        for B in (DepVar, IndVar)
            @eval function Base.$op(a::$A, b::$B)
                T = tfunc($A, $B)  # tfunc decides the semantic role of the result
                return T($op(a.value, b.value))
            end
        end
    end
end

From there we should be able to implement linear regression with data stored as Xcolumn::Vector{IndVar{Float64}} and ycolumn::Vector{DepVar{Float64}}, and check that no matter how a user expresses feature engineering steps (limited to combinations of basic arithmetic), they can't accidentally build an invalid model.
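A minimal sketch of what this could look like, assuming the DepVar/IndVar wrappers above; tfunc and linreg here are illustrative names, not existing package API. The idea is that anything touching a DepVar stays a DepVar, so leaked information keeps its tag, and the fitting routine rejects it by dispatch:

# hedged sketch: any arithmetic involving a DepVar is tainted and stays a DepVar,
# so information leaked from the LHS cannot silently become a "feature"
tfunc(::Type{<:IndVar}, ::Type{<:IndVar}) = IndVar
tfunc(::Type{<:DepVar}, ::Type{<:IndVar}) = DepVar
tfunc(::Type{<:IndVar}, ::Type{<:DepVar}) = DepVar
tfunc(::Type{<:DepVar}, ::Type{<:DepVar}) = DepVar

# hedged sketch of a guarded fit: the signature only accepts untainted features,
# so a tainted column fails with a MethodError instead of fitting silently
function linreg(X::Vector{<:IndVar}, y::Vector{<:DepVar})
    A = hcat(ones(length(X)), map(x -> x.value, X))  # design matrix with intercept
    return A \ map(yi -> yi.value, y)                # ordinary least squares
end

With those definitions an engineered feature like Xcolumn .* ycolumn comes back tagged as DepVar, and passing it to linreg is a MethodError rather than a silent leak.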

Time Series Modeling

In time series this gets a little more complicated because you have the lags. So if I am computing the value y_t = f(x_t, y_{t-1}), that is fine, because everything is measured by prediction time. But y_t = f(x_t, y_t) is invalid, and y_t = f(x_{t+1}) is invalid because you can only regress on values from the past. So here we would want to do something like:

mutable struct DepVar{T,L}
  value::T
  lag::L
end

mutable struct IndVar{T,L}
  value::T
  lag::L
end

And now the arithmetic operations have to check both the "Dependent vs Independent variable" rules and the "you can only forecast based on history" rules.

A notional implementation of handling lag would look like:

# notional: lift an arbitrary function f over lagged variables by applying it to
# the unwrapped values and propagating the largest (oldest) lag
lift(f, args...) = DepVar(f(map(a -> a.value, args)...),
                          maximum(a -> a.lag, args))
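The "only forecast from history" rule could then be enforced wherever the model is assembled. A hedged sketch, assuming the lag field counts steps into the past (lag 1 means t-1, lag 0 means t, a negative lag means the future); check_history is an illustrative name, not package API:

# hedged sketch: reject RHS terms that are not available at prediction time
function check_history(rhs_vars...)
    for v in rhs_vars
        v.lag < 0 && error("cannot regress on a future value (negative lag)")
        v isa DepVar && v.lag < 1 && error("the dependent variable must be lagged on the RHS")
    end
    return true
end

check_history(IndVar(1.0, 0), DepVar(2.0, 1))  # ok: y_t = f(x_t, y_{t-1})
check_history(IndVar(1.0, 0), DepVar(2.0, 0))  # errors: y_t = f(x_t, y_t)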

The time series case is a lot more subtle, and I think there are multiple ways to do it.

jpfairbanks commented 5 years ago

Here is the Julia implementation of the R-style formula language:
Docs: https://github.com/JuliaStats/StatsModels.jl/blob/master/docs/src/formula.md
Src: https://github.com/JuliaStats/StatsModels.jl/blob/master/src/formula.jl

I think this DSL is a good example of a pre-hoc modeling framework: there is a language for representing models, and you express models in that language to leverage the software built around the DSL. In the long term we would want to hook into the @formula system (with multiple dispatch). Suppose I have a dataframe whose columns are tagged as independent/dependent variables according to the experiment that collected them, and a formula I want to fit to that data; can we validate that the @formula is consistent with those tags on the dataframe?
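A minimal sketch of that validation, assuming StatsModels' termvars for extracting the column symbols a term refers to; the roles dictionary is a hypothetical stand-in for whatever per-column metadata the experiment attaches to the dataframe:

using StatsModels, DataFrames

# hedged sketch: a formula is consistent with the tags if every LHS column is
# tagged :dependent and every RHS column is tagged :independent
function rolesvalid(f::FormulaTerm, roles::Dict{Symbol,Symbol})
    lhs = StatsModels.termvars(f.lhs)
    rhs = StatsModels.termvars(f.rhs)
    return all(v -> get(roles, v, :unknown) == :dependent, lhs) &&
           all(v -> get(roles, v, :unknown) == :independent, rhs)
end

df = DataFrame(y = randn(10), x = randn(10))
roles = Dict(:y => :dependent, :x => :independent)  # tags from the experiment
rolesvalid(@formula(y ~ x), roles)  # true
rolesvalid(@formula(x ~ y), roles)  # false: a measured dependent variable on the RHS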

jpfairbanks commented 5 years ago

I took a crack at implementing a formula-based approach (as opposed to the data tag approach in PR #186). I think we need both approaches: the formula-based approach is more of a pre-hoc "if you use my framework, then you can't make mistakes", and the data tag approach is more of a post-hoc "you just write code and we will inject these types to add capability later".

Just stashing the code on this issue until we have time to integrate it.

using ModelTools

# deptree maps each Variable in an expression to that variable's dependencies
function deptree(ex::Array{Expression,1})
    map(deptree, ex)
end

function deptree(ex::Expression)
    map(deptree, ex.args)
end

function deptree(ex::Operation)
    map(deptree, ex.args)
end

function deptree(ex::Variable)
    ex.dependents
end

function deptree(ex::Constant)
    ()
end

# vartree maps each Variable in an expression to the variable itself
function vartree(ex::Array{Expression,1})
    map(vartree, ex)
end

function vartree(ex::Expression)
    map(vartree, ex.args)
end

function vartree(ex::Operation)
    map(vartree, ex.args)
end

function vartree(ex::Variable)
    ex
end

function vartree(ex::Constant)
    ()
end

# treecollect! recursively flattens a nested tree, pushing every element of type T into out
function treecollect!(out::Vector{T}, t) where T
    if typeof(t) == T
        push!(out, t)
        return out
    else
        for i in t
            out = treecollect!(out, i)
        end
        return out
    end
end

@Unknown t x(t) y(t) z
f(a,b) = (a+b)^2/(a*b)
eq1 = z ~ f(x,y)
@show deptree(eq1.rhs)
Set(treecollect!(Variable[], deptree(eq1.rhs)))

@Param α β
g(a,b) = (α*a+β*b)/(α*β)
h(x) = x/hypot(y)
eq2 = z ~ g(x,y)
@show deptree(eq2.rhs)
Set(treecollect!(Variable[], deptree(eq2.rhs)))

depvars(ex) = Set(treecollect!(Variable[], deptree(ex)))  # variables the expression's variables depend on
varvars(ex) = Set(treecollect!(Variable[], vartree(ex)))  # variables appearing directly in the expression

# @Unknown t
eq3 = t ~ x+y/z
@show depvars(eq3.lhs) ∩ depvars(eq3.rhs)

"""    RegressionModel(eq)

A formula that should be interpreted as a regression model, e.g. for GLM.jl or StatsModels.jl.
"""
mutable struct RegressionModel
    eq::Equation
end

"""    valid(m::RegressionModel)

Returns true if the formula of the regression model is valid, otherwise false. Currently just checks that the LHS variable does not appear on the RHS, either directly or through the dependencies of the RHS variables.
"""
function valid(m::RegressionModel)
    eq = m.eq
    return !(eq.lhs in collect(depvars(eq.rhs))) && !(eq.lhs in collect(varvars(eq.rhs)))
end

using Test
@test valid(RegressionModel(z~x+y)) == true
@test valid(RegressionModel(t~x+y)) == false
@test valid(RegressionModel(t~α+β)) == true
@test valid(RegressionModel(t~α+β+t)) == false
@test valid(RegressionModel(y~α+β)) == true
@test valid(RegressionModel(y~x)) == true
@test valid(RegressionModel(y~x*y)) == false
@test valid(RegressionModel(y~x/sum(x))) == true
@test valid(RegressionModel(y~x/h(x))) == false
@test valid(RegressionModel(α~x/g(x,y))) == false