trevorstephens / gplearn

Genetic Programming in Python, with a scikit-learn inspired API
http://gplearn.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Heuristically Guided Algorithmic Information Mutations #235

Closed: jabowery closed this issue 2 years ago

jabowery commented 3 years ago

Is your feature request related to a problem? Please describe.

Ever since the discovery of Algorithmic Information Theory in the 1960s, the social sciences have been side-tracked by Shannon information (statistics). This was an absolute disaster, as it discarded the most principled model selection criterion: Algorithmic Information. It was a hyper-disaster because it came at the dawn of the computer age, just when algorithms became first-class objects and should have taken their rightful place in the application of computers to the social sciences. The most principled model selection criterion was discarded, for the high purpose of not utilizing the power of Moore's Law in deciding the most contentious issues of society.

This hyper-disaster has now metastasized into the forefront of computer applications, machine learning, with the result that over-fitting concepts (e.g., "bloat" in genetic programming, "spurious correlations" in statistics) and, in data science more generally, "algorithmic bias" are dealt with by ad hoc stopgaps, unhinged from the rigor of Algorithmic Information, which has provided the solution for over 50 years.

So now, after more than 15 years of attempting to get at least the machine learning community to wake up to this model selection criterion, and witnessing the signal being drowned out by the burgeoning mass hysteria in both machine learning and social science controversy, I'm faced with the problem of, at age 67, writing code to demonstrate the application of Algorithmic Information model selection in the social sciences to address both hysterias. While it is true that Nature has finally come around (enough that it bothered to publish a cutesy animation to try to illustrate the "revolution in machine intelligence" to the intelligentsia), the momentum of the mass hysterias is not being overcome by mere publications and animations, even from a platform as prestigious as Nature.

Describe the solution you'd like

Although the word "selection" in the phrase "model selection" is, presumably, accommodated by the "Advanced" features of this library, which permit custom fitness calculations that approximate measures of Algorithmic Information content, I looked and found nothing in this library (or in others) that would accommodate heuristically guided mutations, let alone mutations biased toward the parsimony favored by Algorithmic Information model selection.
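
For the fitness half, the existing make_fitness hook can already carry an MDL-flavored metric. A minimal sketch of what I have in mind, assuming gplearn's 0.4-style API; the Gaussian residual code and the 0.01 coefficient are my own guesses, and since gplearn's metric callbacks never see the evolved program itself, program length has to be penalized through parsimony_coefficient rather than inside the metric:

```python
import numpy as np
from gplearn.genetic import SymbolicRegressor
from gplearn.fitness import make_fitness

def _residual_bits(y, y_pred, w):
    # L(data | model): bits to encode the residuals under a Gaussian code.
    resid = y - y_pred
    var = np.average(resid ** 2, weights=w) + 1e-12  # guard against a perfect fit
    return 0.5 * len(y) * np.log2(2 * np.pi * np.e * var)

residual_bits = make_fitness(function=_residual_bits, greater_is_better=False)

est = SymbolicRegressor(
    metric=residual_bits,
    parsimony_coefficient=0.01,  # crude stand-in for the L(model) half
    p_crossover=0.7,
    p_hoist_mutation=0.2,        # hoist mutation can only shrink a tree
)
```

Raising p_hoist_mutation is the nearest built-in approximation to a parsimony-biased mutation, since hoist mutation can only shrink a program; a genuinely heuristically guided mutation operator would still require changes to gplearn's internals.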

As an example of the kind of problem I'm looking at, see "More Tractable Approach To Causal Path EDA Than Exhaustive Posterior Probability of Models?". There I describe a process of running statistical analyses on a wide variety of (sometimes even longitudinal) demographic variables and arriving at the maximum-parsimony causal graph explaining them all, in what is called a macrosocial model, and I ask for the heuristic techniques statistics uses for this. Note that the statisticians have no answer as to how to use heuristic priors (r^2, PPoM, etc.) to generate even a directed acyclic graph of social causality. Although, strictly speaking, Algorithmic Information ultimately requires Turing-complete models, hence (at least tail) recursion, a genetic program that provides merely a DAG can be the penultimate step toward the dynamical model of social causality selected by Algorithmic Information.

If there is little to no chance of this being incorporated into the library, or of my being guided in how to make the modifications, I suppose I'll just hack away in a manner unlikely to be acceptable as a pull request.

PS: Of some direct relevance to genetic programming is "Algorithmically probable mutations reproduce aspects of evolution, such as convergence rate, genetic memory and modularity", whose authors include the author of the Nature paper linked above.

jabowery commented 3 years ago

A baby step toward this admittedly grandiose enhancement would be to generalize the check_X_y style of interface to a check_X_Y style of interface. That is to say, permit outputs with multiple features. This would probably entail shifting from scikit-learn's interface to scikit-multi's interface, which, I believe, is incorporated into scikit-learn as sklearn.multioutput.
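
Concretely, the change at the validation boundary could be as small as one keyword argument, since scikit-learn's check_X_y already accepts 2-D targets. A sketch (the shapes here are arbitrary, and I am assuming fit validates through check_X_y or an equivalent):

```python
import numpy as np
from sklearn.utils.validation import check_X_y

X = np.random.rand(200, 6)  # input features
Y = np.random.rand(200, 3)  # multiple output features

# scikit-learn's validator already accepts a 2-D target when asked:
X, Y = check_X_y(X, Y, multi_output=True, y_numeric=True)
```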

This is a desirable generalization in any event, as the limitation to a single output feature was abandoned long ago by the mainstream of the machine learning community and is, really, just a relic of the bad old days of SPSS.

The reason this is a step toward Algorithmic Information model selection is that calls to fit could then accept the same feature vectors as both input and output, with the loss function driving evolution toward the expression whose complexity in algorithmic bits of information, plus its residual errors in correcting bits of information, is minimized. From such an expression, causal structure that controls for confounders is far more easily extracted than with conventional statistics, regardless of what you think of as "the" dependent variable.
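
To make that loss concrete, here is a hypothetical two-part score; the node-count complexity proxy and the bits_per_symbol constant are assumptions of mine, not anything in gplearn:

```python
import numpy as np

def two_part_mdl(program_length, residuals, bits_per_symbol=4.0):
    """Hypothetical two-part code length in bits: L(model) + L(data | model).

    program_length  -- node count of the evolved expression tree
    residuals       -- (Y - Y_hat) across all output features, flattened
    bits_per_symbol -- assumed cost of encoding one tree node
    """
    model_bits = program_length * bits_per_symbol
    var = np.mean(residuals ** 2) + 1e-12  # Gaussian residual code
    data_bits = 0.5 * residuals.size * np.log2(2 * np.pi * np.e * var)
    return model_bits + data_bits
```

Under such a score, a longer program survives selection only if it saves more residual bits than its own length costs, which is exactly the parsimony pressure I am after.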