jpfairbanks / SemanticModels.jl

A julia package for representing and manipulating model semantics

Explain how model augmentation relates to the broader project goals #117

Closed jpfairbanks closed 5 years ago

jpfairbanks commented 5 years ago

We are doing model augmentation in order to build a data structure representation of changes to models. This enables us to lift the level of reasoning from changes to code that implements models to changes on this model structure, and allows algorithms to generate transformations on the models.

  1. How does Model Augmentation relate to the broader goals of AI for Science?
  2. How can we develop learning algorithms for choosing the right model?
infvie commented 5 years ago

I think the way for us to be thinking about this is that it adds a level of abstraction. A lot of procedures require significant a priori knowledge about the subjects. Regression is a good example of this: we know that A relates to B in some capacity, so we can go deep into the details to find the elements that map to one another. If you looked at sandpaper close up and at a sand dune from a distance, they'd look the same. When we abstract these models to a higher level, we take away the expectation of knowing all of this beforehand and get a better look at the whole picture. Our niche is going to be balancing between these macro and micro views of these problems.

I think in terms of algorithms we need some kind of fuzzy search to detect the largest shared subgraph between two programs; then we can compare the terminal nodes and use those to figure out which modifications are reasonable and which are not.
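
For example, something like this rough sketch in plain Julia (`subexprs` and `shared` are made-up names, and exact-equality intersection stands in for a real fuzzy subgraph search):

```julia
# Rough sketch: approximate the shared structure of two programs by
# intersecting the sets of subexpressions of their ASTs. A real
# implementation would need a fuzzy maximum-common-subgraph search
# rather than the exact equality used here.
subexprs(ex) = Any[ex]                 # literals and symbols are leaves
function subexprs(ex::Expr)
    out = Any[ex]
    for arg in ex.args
        append!(out, subexprs(arg))
    end
    out
end

# shared pieces, biggest first
function shared(a::Expr, b::Expr)
    common = intersect(Set(subexprs(a)), Set(subexprs(b)))
    sort(collect(common); by = x -> -length(string(x)))
end

prog1 = :(y = m * x + b)
prog2 = :(y = m * x^2 + b)
@show shared(prog1, prog2)
```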

jpfairbanks commented 5 years ago

Yeah these are good perspectives. I also think that we can summarize the benefits as

Abstraction

Modeling operations have similarities across domains, and we can build general model augmentations that let scientists translate operations from one domain to another. The code that defines transformations is also "general modeling code," so our abstraction is closed: transformations can themselves be the target of transformation.

Symbolification

The geometric perspective is really great for proving things about shapes, but developing algorithms requires adopting a symbolic perspective like algebra. Our example of polynomial regression connects here because we are able to write algorithms for model selection that leverage the symbolic nature of the transformations. In fact we can give examples of model selection in terms of ideals. The algebra of the transformation space is a place for algorithms on the model space.
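
As a minimal sketch of what that buys us (illustrative only, not the ModelTools API, and with AIC standing in for an ideal-theoretic criterion): because the degree-raising transformation indexes the polynomial class, model selection reduces to choosing how many times it was applied.

```julia
# Model selection over the polynomial class, parameterized by how many
# times the degree-raising transformation f(x) -> xf(x) has been applied.
# AIC is an illustrative stand-in for the ideal-based criterion above.
design(x, d) = hcat((x .^ k for k in 0:d)...)   # basis 1, x, ..., x^d

function select_degree(x, y; maxdeg = 5)
    best, bestd = Inf, 0
    for d in 0:maxdeg
        X = design(x, d)
        beta = X \ y                            # least-squares fit
        rss = sum(abs2, y .- X * beta)
        aic = length(y) * log(rss / length(y)) + 2 * (d + 1)
        aic < best && ((best, bestd) = (aic, d))
    end
    bestd
end

x = collect(-1:0.1:1)
y = 3 .* x .^ 2 .- x .+ 0.01 .* randn(length(x))
@show select_degree(x, y)                       # expect 2
```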

Metaprogramming for Science

Scientific models are so diverse that we need the full flexibility of code as input for our modeling framework. This is somewhat inherent to the scientific process: scientists who are pushing the field in modeling are often inventing or applying new algorithms capable of solving those models. Also, the first formulation of a new model is usually not the most elegant one, so we need to be able to operate on ad-hoc models before we understand the class of models well enough for an elegant formulation to be added to the modeling framework.

Metaprogramming is about writing programs that write programs, so it makes sense that metamodeling is about writing models that write models. In order to write models that can generate models, there needs to be a compact and parsimonious representation of the model for algorithms to manipulate. As we have seen in writing our post-hoc modeling framework, scientific models are diverse and hard to summarize; however, the transformations that can be applied to a model while preserving its validity within its class are often much more structured than the models themselves. This is why we think that metamodels will work on these transformations instead of on the models directly.

Again we look to our polynomial regression problem: with only two transformations you can generate the entire class of polynomial regression problems from a model that computes linear regression. Algorithms that work on the polynomial regression models directly would have to manage a lot of complexity around arguments, data flow, conditional logic, and I/O. But in the transformation space there are just f(x) -> xf(x) and f(x) -> f(x) + 1, which are simple transformations.
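
Concretely, here is a sketch of those two generators acting on a symbolic model body (plain Expr manipulation, not the actual ModelTools transformation types):

```julia
# The two generators from the text, acting on the symbolic body of f(x).
raise(ex) = :( x * ($ex) )    # f(x) -> xf(x): raises the degree by one
shift(ex) = :( ($ex) + 1 )    # f(x) -> f(x) + 1: adds a constant term

base = :( x )                 # the linear model's body

# words in the free monoid on {raise, shift} enumerate the model class
cubic = raise(raise(shift(base)))
@show cubic                   # :(x * (x * (x + 1)))

# compile the generated expression back into a runnable model
f = eval(:( x -> $cubic ))
@show f(2.0)                  # 12.0 == 2^3 + 2^2
```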

By representing complex models as transformations of a base model, under an algebra of transformations, we are able to make metaprogramming for science much easier.

Model Synthesis

One goal of the program is to get to the point where we can automatically infer how to combine models based on what they compute. The idea of model circuits based on signal flow graphs (see #137) is that you can statically connect models with a wiring diagram and then evaluate the diagram to compute the combined model. General DAGs are hard to compose and are typically written with either a declarative DAG language or an imperative DAG building library.

We think that the category theory approach, where you have a category of diagrams that can be combined with sum (aka disjoint union) and product (aka function composition), is the right approach. This approach leads to diagrams with precise semantics and a general purpose computation network. These networks will be specified with code that combines the sum and product operations in a hierarchical expression, just like regular code. Thus the code that makes the diagrams is a model that we can augment with our current ModelTools techniques.
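
A toy sketch of that algebra (a hypothetical Diagram type, not Catlab and not our ModelTools code):

```julia
# Toy model-circuit algebra: sum runs diagrams side by side,
# product wires one diagram's outputs into the next one's inputs.
struct Diagram
    nin::Int
    nout::Int
    run::Function              # Vector of inputs -> Vector of outputs
end

# sum aka disjoint union
dsum(a::Diagram, b::Diagram) = Diagram(a.nin + b.nin, a.nout + b.nout,
    x -> vcat(a.run(x[1:a.nin]), b.run(x[a.nin+1:end])))

# product aka composition: a's outputs become b's inputs
function dprod(a::Diagram, b::Diagram)
    a.nout == b.nin || error("port mismatch")
    Diagram(a.nin, b.nout, x -> b.run(a.run(x)))
end

# a hierarchical expression builds the circuit, just like regular code
double = Diagram(1, 1, x -> [2 * x[1]])
add    = Diagram(2, 1, x -> [x[1] + x[2]])
circuit = dprod(dsum(double, double), add)
@show circuit.run([1.0, 3.0])  # [8.0]
```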

These "model circuits" can thus be built out of code resulting from transformations on code that builds a base circuit, which gives us tools for creating high-level transformations on circuits. We can then define the input and output wires as the modeling concepts we know and the ones we want to know, and build algorithms for solving for the circuit that gets from the inputs to the outputs. We suspect that a dynamic programming approach that recursively brings the inputs and outputs closer together will solve this problem. The nature of "closer together" must mean something in the semantic space informed by the text of expository materials about the models.
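
As a strawman for that search (all names hypothetical, with breadth-first search standing in for the dynamic program): treat each known model as an edge from the concepts it consumes to the concepts it produces, and search for a chain from what we know to what we want to know.

```julia
# Strawman circuit synthesis: models are edges over sets of concepts.
struct ModelEdge
    name::Symbol
    needs::Set{Symbol}
    gives::Set{Symbol}
end

function synthesize(known::Set{Symbol}, want::Set{Symbol}, models::Vector{ModelEdge})
    frontier = [(known, Symbol[])]
    seen = Set{Set{Symbol}}([known])
    while !isempty(frontier)
        have, plan = popfirst!(frontier)
        want ⊆ have && return plan        # found a chain from inputs to outputs
        for m in models
            m.needs ⊆ have || continue
            next = union(have, m.gives)
            next in seen && continue
            push!(seen, next)
            push!(frontier, (next, vcat(plan, m.name)))
        end
    end
    nothing
end

models = [ModelEdge(:seir, Set([:population, :contact_rate]), Set([:infected])),
          ModelEdge(:cost, Set([:infected]), Set([:hospital_load]))]
@show synthesize(Set([:population, :contact_rate]), Set([:hospital_load]), models)
```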

crherlihy commented 5 years ago

To add some thoughts here:

How does Model Augmentation relate to the broader goals of AI for Science?

How can we develop learning algorithms for choosing the right model?