Define Ambiguous types - Githubissues

jpfairbanks commented 5 years ago

Scientific code is written with underspecified types. There are values of different semantic meanings that are represented with the same DataType in the code. For example there are often many different variables of type Vector{Float64} where the entries in the vectors represent dramatically different concepts. In a perfect world, every different modeling concept would have its own type with semantics defined so that the compiler could check that the program "follows the rules" of how those concepts should interact. For example, Unitful creates new types for unitful quantities and enables the compiler to enforce the semantics of dimensional analysis.

We discussed one heuristic for finding variables of the same type that should actually be different types. "If two variables take disjoint sets of values, then we should split that type into two more specific types." A more general form of this would be:

Let C be a class of models and f be an oracle that determines f(m) = true iff m in C. Define a function swap(m, a, b) that takes the model m and swaps all occurances of the value a with the value b. If f(swap(m,a,b)) = f(m) then a and b are interchangeable.

We want to define a program as well typed if for every pair of values a,b, a,b are interchangeable if and only if they are the same type. I think this would go a long way towards making the concept of "the type system represents the semantics of the model" into a rigorous mathematical concept.

jpfairbanks commented 5 years ago

This recovers dimensional analysis, and the linear algebra matrix shape rules enforced by most linear algebra systems.

For example in regular julia code that uses Array{T,N} the array type captures the information about the dimensionality, but not the sizes. When you use [StaticArrays] (http://juliaarrays.github.io/StaticArrays.jl/stable/pages/api.html) the types encode the sizes and you could statically reason about whether the code will raise a DimensionError

jpfairbanks commented 5 years ago

For example

function p(a::Matrix{T},b::Matrix{T})
    return a*b + a
end
x = rand(Float64, (5,2))
y = rand(Float64, (2,2))
p(x,y)

if you define C as the class of valid linear algebra models, f can be computed by running the code and making sure there were no DimensionMismatch exceptions.

If you apply the only available swap (swap!(m, x,y)), then you get

function p(a::Matrix{T},b::Matrix{T})
    return a*b + a
end
y = rand(Float64, (5,2))
x = rand(Float64, (2,2))
p(x,y)

which errors because a*b will be incompatible.

but if you then refine the types to use SMatrix, everything works.

function p2(a::SMatrix{T, Val{M1,N1}},b::SMatrix{T, Val{(M2,N2)}})
    return a*b + a
end
y = SMatrix(rand(Float64, (5,2)))
x = SMatrix(rand(Float64, (2,2)))
p(x,y)

there is nothing to swap and so there are no ambiguities in p2.

crherlihy commented 5 years ago

A basic workflow that this type of formalization can support would be:

For a given program $p_1$, identify/detect potentially ambiguous types.
- For all elements of a given data type, perform pairwise swaps, and observe whether oracle still returns true. Note that we assume here that we're able to filter the set of candidate data types, to avoid (in most cases) primitive-level swaps.
Transform $p_1$ to $p_2$ by introducing structs and/or more precise data types to resolve ambiguities.
- Most likely a human (expert)-in-the-loop process, where we programatically generate viable stubs, and the scientist fills in these stubs with implementations.
If we iteratively apply step one to $p_2$, the ambiguities we initially detected will have been resolved, and will not be identified as candidates for type disambiguation.

This issue touches on the validation task in that we should expect some swaps to result in alterations to programs that do not halt execution, but do meaningfully alter results. Ideally, we'd be able to flag inappropriate swaps of this nature by observing cases where the results represent marked deviations from previously observed values/outputs.

jpfairbanks commented 5 years ago

One thing we are going to run into here is values that get passed to kwargs and other "optioney" things, we would have to introduce a lot of types like Val(n::Int), Natural{Int}, Probability{R} where R <: Real in order to prevent negative numbers from going places they shouldn't

jpfairbanks commented 5 years ago

This paper gives a CT treatment of linear algebra. https://hal.inria.fr/hal-00919866/document

crherlihy commented 5 years ago

Going to tag this with punt because full implementation of the oracle approach we discussed won't happen for this release, but would be a natural next step for next cycle.

jpfairbanks / SemanticModels.jl

Define Ambiguous types #121