jpfairbanks commented 5 years ago

I think we might want to use AllenNLP on ASKE to parse the documentation and find the Verb-Argument trees in software documentation. For the Initial approaches we need a plan of how NLP on the text embedded in code/model descriptions/documentation can be used to pull out a knowledge graph of concepts and how they relate to each other.

I'm thinking that the knowledge graph will have

Vertex Types:

Code symbols
Functions
Variables
Symbols

Math/Science Concepts:
Math Expressions
Concept
Unit

Edge Types: SourceType -> DestinationType

IsCalledBy: function -> function
Co-occurs with: Any -> Any
IsComponentOf: {Symbol,Function,Variable} -> {Function, Expression}
IsMeasuredIn: Any -> Unit
Implements: Function -> {Math Expression, Concept}
VERB: Any -> Any
IsSubClassOf: SpecificConcept -> More General Concept

AllenNLP verb-argument trees can be used to find the verbs from the text and some of these relations.

If you can think of any other vertex types or relation types, add them to this list.

jpfairbanks commented 5 years ago

The AutoMATES team has a format they use for knowledge extracted from fortran codes. Called a grounded function network. We should consider integrating with their representation. https://delphi.readthedocs.io/en/master/grfn_spec.html#top-level-grfn-specification

crherlihy commented 5 years ago

Some thoughts on edge weights w/respect to path-finding algorithms: The end-user may define cost in different ways; some that come to mind are:

Computational efficiency (e.g., willing to accept a parameterized level of errror in exchange for improved runtime and/or storage requirements)
Accuracy of results relative to empirical/observational data (e.g. minimizing accumulated error and/or final error)
Explainability (this amounts to pruning the DAG; algorithm nodes can be assigned Boolean literals to indicate whether they are explainable or not; edges that connect unexplainable algorithms can be ignored)

jpfairbanks commented 5 years ago

We can pull out asymptotic runtime and accuracy guarantees. We need to extract arbitrary tags on nodes.

crherlihy commented 5 years ago

Just added IsSubClassOf: SpecificConcept -> More General Concept to the list above. Am trying to capture the OO idea of inheritance, which I think is related but distinct from the IsComponentOf: edge type. I think this can be used to capture information such as "independent variable -> Variable", which can be helpful down the road when we need information about which variables are more likely to be inputs vs.outputs of a given set of function(s).

scottagt commented 5 years ago

I've been looking at the literature surrounding the representation of programs as graphs and came across this recent paper from MSR Cambridge, called "Learning to represent programs with graphs", presented at ICLR 2018, that discusses one set of principles for doing it among other tasks they're focused on. Here's a link: https://arxiv.org/pdf/1711.00740.pdf

They define the notion of a _program _graph__ :

In addition to that, they look at two programs that use their graph data structure for various purposes:

one that tries to correct name variables based on contextual features
one that tries to learn the correct variable to use in a given program location

Those are interesting, if we define one or more learning problems to be solved as a part of particular meta-modeling objectives then perhaps they're solvable leveraging learning algorithms. For example, if we have a use-case to integrate parameters from disparate models into a single model for a new simulation, it could be interesting to determine where parameters go in augmenting one model from a change to be ported to from another by trying to predict the correct location using an approach like they mention in (2) above i.e. where should a parameter/variable of a certain type, for integration, be put from another model when integrating into a single (at whatever point its happening). I think the notion could also extend to functions as well that may be selected for integration / hybridization.

jpfairbanks commented 5 years ago

That paper is a good find, it looks like they have Syntax edges that map onto the programming language syntax and Dataflow edges that represent the flow of data between variables.

The Julia syntax is pretty simple, everything is in an Expr object which has a head which tells you what type of syntax it is and a list of args that are the subcomponents. Here is a list of what various Julia syntax forms turn into as an Expr https://docs.julialang.org/en/v1/devdocs/ast/#Surface-syntax-AST-1 that whole page is worth reading for sure.

The dataflow for julia is Single Static Assignement which can be accessed in the lowered code form in Cassette or Base.Meta.lower(quote <some code here> end)

What kind of tasks do you think we could apply to the code once we have extracted out a graph? Variable Misuse seems like a good tool for a debugger or static analysis tool, but what is the higher level goal that scientists need?

scottagt commented 5 years ago

Great question. I think to answer that will partially require outlining our initial meta-modeling uses cases. I just created a new issue #23 to start tracking ones we want to support, those that can't be supported, and the undecided. :) I think we could develop uses cases and see if there's learning tasks that would be helpful to accomplish to make them happen or build use cases based on kinds of learning tasks we think would be useful to employ, sorta top-down or bottom-up, or maybe a bit of both. I put up one potential case about merging model instances from same modeling family together where its just variables (and potential parameter settings) that are being integrated. Of course there's many more use cases of varying complexity. I think it would be useful for us to get a sense of what scientist in a particular field, in our case, Epidemiology, are wanting and/or doing when they read about new models published in the literature. Here's some questions we might want to answer to determine that (not sure where we might get data except from informal survey or reading about sociology of science in a particular discipline):

Does the scientist download a new paper and script and run it?
Do they then attempt to change the model code or integrate?
What kinds of modifications do they make? 3.1 Variables, 3.2 switching to different model families all together, keeping same data or other sim parameters? 3.3 Other things?

So, this is to say, I'm going to think about it a little more and reflect on more cases we come up with in #23 .

crherlihy commented 5 years ago

@scottagt interesting paper; thanks for posting! Am reading through now.

@jpfairbanks @scottagt re the question:

What kind of tasks do you think we could apply to the code once we have extracted out a graph?

To me it is interesting to think about graphs (derived from code) that may not be isomorphic at the lowest level (e.g., when we extract directly from code), but at higher levels of abstraction may be isomorphic. This can help us to identify families of functions, and/or when different subgraphs may be serving similar purposes (and can thus be used as component blocks for meta-modeling).

Will take a look at #23 as well.

jpfairbanks commented 5 years ago

@crherlihy I think those are good tasks on the code graphs.

When doing matching of "isomorphic" subgraphs, we will definitely have to either change level of abstraction or have a fuzzy definition of "isomorphic". I think there will be very few exact isomorphisms larger than 3-5 vertices.

jpfairbanks commented 5 years ago

I just merged #27 which forms a basis to build off of. What do you think?

jpfairbanks commented 5 years ago

We should add a specification of type constraints to the knowledge graph schema example.

Edge Relation, src type, dst type, value field, example
IsCalledBy, function, function, arguments, (integrate IsCalledBy solve! (function domain))
CoOccursWith, any, any, occurrence location, (differentialequations CoOccursWith mechanisticmodels doc1:sentence1)
IsComponentOf, concept, concept, any, (integrate IsComponentOf Quadrature.jl)
IsMeasuredIn, value, unit, nothing, (position IsMeasuredIn meters)
Implements, function, concept, nothing, (integrate Implements quadrature)
IsSubClassOf, type, type, type parameters, (SIRModel <: AbstractCompartmentModel)
Represents, type, concept, type parameters, (SIRModel Represents "disease model")
Verb, any, any, verbtoken, (quadrature Verb integration solves)

jpfairbanks commented 5 years ago

31 introduced the vertex types. We still need to write down the edge types.

jpfairbanks / SemanticModels.jl

Define Knowledge Graph Schema #2

31 introduced the vertex types. We still need to write down the edge types.