Open rgayler opened 7 years ago
Ross, this would be so useful! In my experience this is a huge issue when doing applied work in business contexts. Currently everything in my workflow gets put together on the fly; tools like this would be a really good start on what will always be a complex issue.
I tend to use data that comes from multiple points inside complex information-processing systems. The internal dynamics of these systems impose some dependencies between the variables that I can observe. These systems are generally opaque, so I can't infer the dependencies between the observable variables by looking at the mechanism that generates them.
There are often special values (e.g. missing values, sentinel values) in the observable variables. Special values often propagate between variables that have dependencies. For example, if one variable is a function of another, then the output variable will usually have a missing value if the input variable has a missing value. This provides an opportunity to try to infer the functional dependencies between observable variables from the relations between special values across variables.
A cross-tab of a pair of variables (each categorised into special vs ordinary values) can be interpreted as logical entailment, i.e. Pr(A=special | B=special) = 1 can be glossed as B=special entails A=special. These entailments can be calculated for all pairs of variables (actually, variable=value tuples) and represented as directed edges in a directed acyclic graph, which can then be logically simplified and displayed. This graph suggests internal data-flows/functional-dependencies in the opaque source system. Knowing these dependencies helps debug one's understanding of the source system and interpret the results of analyses of the observable variables.
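To make the idea concrete, here is a minimal sketch in plain Python (not my actual code, just an illustration). It assumes that `None` and a hypothetical sentinel of `-999` count as special, treats B=special as entailing A=special when A is special in every row where B is special, and uses transitive reduction as one plausible form of the "logical simplification". The function names and the toy data are made up for the example.

```python
from itertools import permutations

def categorise(value, sentinels=(-999,)):
    """Label a value 'special' (None or a sentinel) or 'ordinary'."""
    return "special" if value is None or value in sentinels else "ordinary"

def entailments(rows, sentinels=(-999,)):
    """Return the set of (B, A) pairs where B=special entails A=special.

    rows is a list of dicts mapping variable name -> value.
    """
    variables = list(rows[0])
    edges = set()
    for b, a in permutations(variables, 2):
        # The rows where B is special (one margin of the cross-tab).
        b_special = [r for r in rows if categorise(r[b], sentinels) == "special"]
        # Pr(A=special | B=special) = 1, requiring B to be special at least once.
        if b_special and all(
            categorise(r[a], sentinels) == "special" for r in b_special
        ):
            edges.add((b, a))
    return edges

def transitive_reduction(edges):
    """Drop edges (u, w) implied by u -> v -> w.

    Assumes the edge set has no cycles; mutually entailing variables
    would need to be merged into one node first.
    """
    nodes = {x for e in edges for x in e}
    reduced = set(edges)
    for u, w in edges:
        if any(v not in (u, w) and (u, v) in edges and (v, w) in edges
               for v in nodes):
            reduced.discard((u, w))
    return reduced

# Toy data: missing income propagates to tax, missing tax propagates to net.
rows = [
    {"income": 50, "tax": 10, "net": 40},
    {"income": None, "tax": None, "net": None},
    {"income": 70, "tax": None, "net": None},
    {"income": 30, "tax": 5, "net": None},
]
edges = entailments(rows)
simplified = transitive_reduction(edges)
print(sorted(edges))       # includes the redundant income -> net edge
print(sorted(simplified))  # income -> tax -> net chain only
```

On this toy data the simplified graph is the chain income → tax → net, which is the kind of data-flow hint I am after.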
I have some code implementing this entailment analysis in a couple of projects. It would be nice to turn this into a package or add it to the naniar package.
Possible development directions/extensions:
Make sure it plays nicely with naniar, tidyverse, and ggplot2.
I am currently using DiagrammeR to do the network graphics. I would prefer to use ggplot.
The current code has multiple package dependencies to do the entailment analysis. It might be better to minimise the dependencies.
The entailment analysis doesn't actually know about special values - it only expects that each variable is partitioned into a small number of categories. This would allow for multiple types of missing value, e.g. NA_not_applicable, NA_not_asked, NA_not_answered. It also allows for non-missing values to be categorised into broad ranges, e.g. a numeric variable could be categorised to +ve, 0, -ve. The entailment analysis doesn't know how to categorise each variable, so we need some helper functions to allow the analyst to specify what to do.
The function used to implement the entailment analysis isn't actually restricted to low-cardinality partitions of variable values. We could investigate whether it is able to detect undocumented entailments, e.g. where the source system uses some perfectly ordinary value of a variable to represent some special meaning.
The current code has a very primitive mechanism for excluding entailments that are likely to be chance occurrences. It would be good to have a more principled statistical filter of entailments.
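As a rough illustration of two of the directions above (analyst-supplied categorisation helpers and a filter for chance entailments), here is a small Python sketch. Everything in it is hypothetical: `sign_category`, `tuple_entailments`, and the `min_support` threshold are names invented for this example, and the minimum-count filter is only a crude stand-in, not the mechanism in the existing code and not the more principled statistical filter I would like to have.

```python
from collections import Counter
from itertools import product

def sign_category(x):
    """Hypothetical categoriser: keep NA-like codes, bin numbers by sign."""
    if isinstance(x, str):  # e.g. "NA_not_asked", "NA_not_answered"
        return x
    return "+ve" if x > 0 else "-ve" if x < 0 else "0"

def tuple_entailments(rows, categorisers, min_support=2):
    """Find entailments between (variable, category) tuples.

    Returns pairs (b, a) where every row with tuple b also has tuple a.
    An entailment is kept only if b occurs at least min_support times,
    a crude guard against chance occurrences.
    """
    # Apply each variable's categoriser to every row.
    coded = [{v: f(r[v]) for v, f in categorisers.items()} for r in rows]
    # How often each (variable, category) tuple occurs.
    single = Counter((v, c) for r in coded for v, c in r.items())
    # How often each ordered pair of tuples (different variables) co-occurs.
    joint = Counter(
        ((vb, r[vb]), (va, r[va]))
        for r in coded
        for vb, va in product(r, r)
        if vb != va
    )
    return {
        (b, a)
        for (b, a), n in joint.items()
        if n == single[b] and single[b] >= min_support
    }

rows = [
    {"q1": "NA_not_asked", "q2": "NA_not_asked"},
    {"q1": "NA_not_asked", "q2": "NA_not_asked"},
    {"q1": 3, "q2": "NA_not_answered"},
    {"q1": 2, "q2": 4},
    {"q1": -1, "q2": 0},
]
categorisers = {"q1": sign_category, "q2": sign_category}

unfiltered = tuple_entailments(rows, categorisers, min_support=1)
filtered = tuple_entailments(rows, categorisers, min_support=2)
print(len(unfiltered), len(filtered))
```

With `min_support=1` the single-row coincidences (e.g. q1=-ve entailing q2=0) show up as entailments; raising the threshold to 2 leaves only the NA_not_asked pair. Passing a categoriser per variable keeps the analysis itself ignorant of what counts as special, as described above.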