russelljjarvis opened this issue 2 years ago
@russelljjarvis OK, so the main idea here is that you would like the SciUnit model being judged, and all the methods being called on it, to be a population of smaller, possibly heterogeneous models. In the concrete example of neuron simulations, the population is a population of neurons.
As you may know, you can already set `Test.prediction` directly, and then when calling `Test.judge(model, cached_prediction=True)` it will use that stored prediction rather than compute a new one. This could be made even simpler by making `Model.get_prediction` look up from (for example) `Model._prediction` directly, with getters/setters as you suggest. This is much more general than just having population models, as there are arbitrary reasons why one might want this functionality; also see related examples of caching here and through the end of that file.
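For concreteness, here is a minimal sketch of that cached-prediction pattern. The test class, observation, and prediction values are hypothetical; `ZScore.compute` is used only as an example scalar comparison:

```python
import sciunit
from sciunit.scores import ZScore

class RestingPotentialTest(sciunit.Test):
    """Hypothetical test comparing a resting potential to an observation."""
    score_type = ZScore

    def generate_prediction(self, model):
        # Normally this would run the model; with a cached prediction
        # it should be skipped entirely.
        raise NotImplementedError("expected the cached prediction to be used")

    def compute_score(self, observation, prediction):
        return ZScore.compute(observation, prediction)

test = RestingPotentialTest(observation={"mean": -65.0, "std": 2.0})

# Prediction produced outside of sciunit (e.g. by an external feature
# extraction tool), attached directly to the test:
test.prediction = {"mean": -63.5}

# judge() then reuses the stored prediction instead of computing one:
# score = test.judge(model, cached_prediction=True)
```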
Feature extraction could also be handled outside of SciUnit, and the `prediction` set by the external tool would include the extracted features rather than something raw like a simulated trace.
What I might be misunderstanding is whether there is any specific role for the notion of populations in SciUnit or not. Since in this scenario the only thing left for SciUnit to do is to compare predictions and observations, the corresponding `Test.compute_score` would just operate on arrays of predictions and observations rather than single observations and predictions, and return an array of scores (or a `ScoreArray` indexed by the individual models in the population). So this might require a `Test.compute_scores` which explicitly takes prediction and observation arrays and returns score arrays, and internally runs `map` or `dask.map` to vectorize that process.
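As a rough sketch, such a hypothetical `Test.compute_scores` method might look like the following, using the builtin `map` (a dask-based map could be substituted to parallelize):

```python
def compute_scores(self, observations, predictions):
    """Hypothetical vectorized counterpart to compute_score.

    Takes equal-length sequences of observations and predictions
    (one pair per model in the population) and returns one score
    per pair, by mapping the existing scalar compute_score.
    """
    # map() could be swapped for a dask/distributed map for parallelism;
    # the results could also be wrapped in a ScoreArray indexed by the
    # individual models in the population.
    return list(map(self.compute_score, observations, predictions))
```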
Would this do the trick?
"scenario the only thing left for SciUnit to do is to compare predictions and observations, then the corresponding Test.compute_score would just operate on arrays of predictions and observations rather than single observations and predictions, and return an array of scores (or a ScoreArray indexed by individual models in the population). So this might require a Test.compute_scores which explicitly takes prediction and observation arrays and returns score arrays, and internally runs map or dask.map to vectorize that process."
Yes! I totally agree with this. In the end I think I bypassed models completely and just did this (with observations and predictions settable, you can bypass models). Using `StaticModel`s + dask seemed cumbersome, and it meant more lines of code than necessary.
The only problem is that the sciunit philosophy has been very model-oriented up until now, in that I am not aware of any documentation where scores are obtained in the absence of models.
Below is how I used to do things, but I think it's probably not memory-efficient to create a heap of original SciUnit model objects and SciUnit score objects.
A lookup cache might work. As you suggest, generating a list of predictions/features outside sciunit (set via `._prediction` or `.set_prediction`) is often the more pragmatic choice. It's step 3 where the problem is.
The choices are:
(a) Make a low-cost list of `StaticModel`s; these are glorified containers for features (perhaps they could even be called "FeatureContainer"s?). Optionally, model attributes/parameters could be stored in the static models too. These are not runnable models; they just exist to satisfy NeuronUnit's appetite for running judge on models. (A rough sketch of this option follows the list below.)
(b) Put a lot of effort into engineering SciUnit runnable population models, where everything about the model that was scored is in some sense re-runnable and stored in the models.
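A rough sketch of option (a), assuming a minimal container model (the `FeatureContainer` name and the `extracted_features` variable are hypothetical; NeuronUnit's own `StaticModel` could play the same role):

```python
import sciunit

class FeatureContainer(sciunit.Model):
    """A glorified, non-runnable container for pre-computed features.

    It exists only so that judge() has a model object to act on;
    the features themselves come from outside sciunit.
    """
    def __init__(self, name, features):
        super().__init__(name=name)
        self.features = features  # e.g. {"mean": -63.5, ...}

# One cheap container per population member, built from features
# extracted externally (e.g. NEST/PyNN simulation + EFEL analysis):
population = [
    FeatureContainer(name="cell_%d" % i, features=f)
    for i, f in enumerate(extracted_features)  # extracted_features: external list
]
```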
I have used the 3a approach successfully in the past; it's flexible. The main problems are related to the half-merged PR (see https://github.com/scidash/sciunit/pull/173).
Although sciunit has great existing methods for scoring collections of models, when network simulators are applied to the task of single-neuron model optimization, three notable simulators (PyNN, NEST, SNN.jl [and Brian2]) make it more convenient to model a population of cells (N>1) than a single cell (N=1). Indeed, the noted simulators appear to be byte-code optimized for evaluating a cell population.
Note that there are pre-existing alternative implementations that solve this problem; for example, the Numba-JIT adaptive exponential and Izhikevich models mean that you can avoid using network-level simulators to run single-cell models.
In the context of genetic algorithm optimization, a NEST neuron population can conveniently map straight onto a GA chromosome population, so in theory NeuronUnit+GA should be well suited to exploit the efficiency of network simulators to evaluate whole populations.
Unfortunately, in practice the NEST and PyNN simulation platforms are a bad match for the current design pattern of SciUnit/NeuronUnit/BluePyOpt. To clarify, collecting models by putting them inside a sciunit model collection container is essentially a separate and different task from collecting models in a NEST/PyNN/SNN.jl population.
To confound matters, very many so-called Python neuron simulators are not Python-native. When trying to store collections of external-code simulator models inside sciunit containers, memory problems ensue, byte-code efficiency is poor, and parallel distributed methods are very unlikely to work.
The problem is that PyNN and NEST have their own preferred, safe ways to parallelize population model evaluations, but when you contain many single-cell models in a sciunit collection and distribute that code using dask, this is not what NEST expects. Relative to the external simulator, the sciunit container class seems foreign and unsuited, and distributing the sciunit container using `dask` in Python is not safe because of nested parallelization. PyNN and NEST don't really support modelling a single cell so much as creating cell populations of size N=1.

I propose a second design of sciunit model collections called SimPopulationModel. In this second design, a simulator population is simply contained by a new sciunit class, which inherits sciunit attributes (`RunnableModel`, etc.).
In the SimPopulationModel sciunit class, model predictions and observations are just regular getter/setter-based attributes that have been decoupled from the sciunit generate_prediction/extract_feature methods. This means simulator users are completely free to generate predictions/features independently of the sciunit infrastructure. This is again necessary to exploit NEST's efficiency at acting on populations, which means that features are obtained in parallel in a highly NEST/PyNN/SNN.jl-specific way.
By the time a SimPopulationModel is ready to be judged, all the computationally heavy work of feature extraction has been completed, and sciunit is free to do what it does best, as predictions (features) and observations have now been provided by other infrastructure.
This method expects users to have pre-existing code that performs something like judge/generate_prediction. The prediction is then just treated as a settable object attribute, which can be updated. Doing things this way is more amenable to high-throughput feature generation, e.g. if EFEL generates 100 new features per model.
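A minimal sketch of what such a SimPopulationModel could look like under this proposal, assuming sciunit's `RunnableModel` as the base class (the class name, feature shapes, and usage lines are illustrative only):

```python
from sciunit.models import RunnableModel

class SimPopulationModel(RunnableModel):
    """Wraps a whole simulator population (NEST/PyNN/SNN.jl) and exposes
    predictions as a plain settable attribute, decoupled from
    generate_prediction/extract_feature."""

    def __init__(self, name, backend=None):
        super().__init__(name, backend=backend)
        self._predictions = None  # one feature dict per cell, set externally

    @property
    def predictions(self):
        return self._predictions

    @predictions.setter
    def predictions(self, value):
        # e.g. a list of EFEL feature dicts computed in parallel by the
        # simulator-specific pipeline, outside of sciunit
        self._predictions = value

# Usage sketch: heavy feature extraction happens first, in simulator-native
# code; sciunit only sees the finished features.
# pop = SimPopulationModel("adexp_population", backend="NEST")  # hypothetical backend
# pop.predictions = features_from_nest  # e.g. list of EFEL feature dicts
```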
@all-contributors please add @russelljjarvis for ideas