ropensci / ozunconf17

Website for 2017 rOpenSci Ozunconf
http://ozunconf17.ropensci.org/

special value entailment analysis #7

Open rgayler opened 7 years ago

rgayler commented 7 years ago

I tend to use data that comes from multiple points inside complex information-processing systems. The internal dynamics of these systems impose some dependencies between the variables that I can observe. These systems are generally opaque, so I can't infer the dependencies between the observable variables by looking at the mechanism that generates them.

There are often special values (e.g. missing values, sentinel values) in the observable variables. Special values often propagate between variables that have dependencies. For example, if one variable is a function of another, then the output variable will usually have a missing value if the input variable has a missing value. This provides an opportunity to try to infer the functional dependencies between observable variables from the relations between special values across variables.
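To make the propagation idea concrete, here is a minimal sketch in Python (not R, and not taken from any existing project code). The sentinel codes, the `is_special` helper, the toy `ratio` function and the `income`/`debt` data are all hypothetical and chosen only for illustration.

```python
import math

# Hypothetical sentinel codes; a real source system will have its own.
SENTINELS = {-999, "NA", ""}

def is_special(value, sentinels=SENTINELS):
    """Return True if value is missing (None/NaN) or a sentinel code."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return value in sentinels

def ratio(numerator, denominator):
    """Toy derived variable: missing whenever either input is special."""
    if is_special(numerator) or is_special(denominator):
        return None
    return numerator / denominator

# Made-up observations with a missing value and a sentinel mixed in.
income = [50000, None, 72000, -999]
debt = [10000, 5000, None, 2000]
debt_ratio = [ratio(d, i) for d, i in zip(debt, income)]

# Categorise every variable to special (True) vs ordinary (False).
special_flags = {
    "income": [is_special(v) for v in income],
    "debt": [is_special(v) for v in debt],
    "debt_ratio": [is_special(v) for v in debt_ratio],
}
```

In this toy data, every row where `income` or `debt` is special also has a special `debt_ratio`, which is the kind of pattern the analysis would try to detect without knowing how `debt_ratio` was computed.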

A cross-tab of variables (categorised to special vs ordinary values) can be interpreted as logical entailment. I.e. Pr(A=special | B=special) = 1 can be glossed as B=special entails A=special. These entailments can be calculated for all pairs of variables (actually, variable=value tuples) and represented as directed edges in a directed acyclic graph, which can then be logically simplified and displayed. This graph suggests internal data-flows/functional-dependencies in the opaque source system. Knowing these dependencies helps debug one's understanding of the source system and interpret the results of analyses of the observable variables.
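A rough sketch of the pairwise entailment step and one possible simplification, in Python and standard library only. It assumes each variable has already been reduced to a special/ordinary indicator. The function names, the toy `raw`/`cleaned`/`score` variables, and the choice of transitive reduction as the "logical simplification" are assumptions for illustration, not the existing project code.

```python
from itertools import permutations

def entailments(flags):
    """Return the set of edges (b, a) where b=special entails a=special.

    flags maps variable name -> list of booleans (True = special value).
    An edge is kept only when b is special at least once and a is special
    in every row where b is special, i.e. Pr(a=special | b=special) = 1.
    """
    edges = set()
    for b, a in permutations(flags, 2):
        rows = [fa for fa, fb in zip(flags[a], flags[b]) if fb]
        if rows and all(rows):
            edges.add((b, a))
    return edges

def transitive_reduction(edges):
    """Drop edge (u, w) when edges (u, v) and (v, w) already imply it.

    Assumes the edges are transitively closed and form a DAG, i.e. there
    are no mutual entailments; those would need to be merged first.
    """
    edges = set(edges)
    reduced = set(edges)
    for u, w in edges:
        for v in {y for (x, y) in edges if x == u}:
            if v != w and (v, w) in edges:
                reduced.discard((u, w))
    return reduced

# Made-up indicators: special values accumulate along raw -> cleaned -> score.
flags = {
    "raw":     [True, False, False, False],
    "cleaned": [True, True,  False, False],
    "score":   [True, True,  True,  False],
}
edges = entailments(flags)
simplified = transitive_reduction(edges)
```

Here the direct `raw` to `score` edge is dropped because it follows from `raw` to `cleaned` and `cleaned` to `score`, leaving a chain that hints at the order of processing. With real data, variables whose special values always coincide entail each other, so they would have to be collapsed into a single node before this reduction is applied.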

I have some code implementing this entailment analysis in a couple of projects. It would be nice to turn this into a package or add it to the naniar package.

Possible development directions/extensions:

stephstammel commented 6 years ago

Ross, this would be so useful! In my experience this is a huge issue when doing applied work in business contexts. Currently everything in my workflow gets put together on the fly - tools like this would be a really good start to what will always be a complex issue.