popsim-consortium / stdpopsim

A library of standard population genetic models
GNU General Public License v3.0
124 stars 87 forks source link

Model subclasses are just functions #269

Closed grahamgower closed 4 years ago

grahamgower commented 4 years ago

stdpopsim/catalog/homsap.py has the following comment regarding Model subclasses:

# TODO we want to move away from defining these as classes which
# can be instantiated and rather to creating *instances* of the Model
# class which has this behaviour. However, it's not clear what form
# the refactored versions would have. Marking these classes with a
# __ to emphasise that they're not supposed to be used externally
# like this, but should be found via the species catalog.

The Model subclasses just define a few attributes and an __init__ function. So they're not really classes. Using PiecewiseConstantSize and GenericIM as examples, here's what the refactored versions could look like (untested!):

diff --git a/stdpopsim/models.py b/stdpopsim/models.py
index 1dff2df..9489624 100644
--- a/stdpopsim/models.py
+++ b/stdpopsim/models.py
@@ -3,6 +3,7 @@ Common infrastructure for specifying demographic models.
 """
 import sys

+import attr
 import msprime
 import numpy as np

@@ -141,6 +142,7 @@ class Population(object):
                 "sampling_time": self.sampling_time}

+@attr.s
 class Model(object):
     """
     Class representing a simulation model that can be run to output a tree sequence.
@@ -155,19 +157,16 @@ class Model(object):
         written text, e.g., "Three population Out-of-Africa"
     :vartype informal_name: str
     """
-    # TODO the infrastructure here is left over from a structure that
-    # rigidly used class definitions as a way to define population
-    # models. This contructor should take all the instance variables
-    # as parameteters, and we should use factory functions to define
-    # the model instances that are added to the catalog rather than
-    # subclasses.
-
-    def __init__(self):
-        self.population_configurations = []
-        self.demographic_events = []
-        # Defaults to a single population
-        self.migration_matrix = [[0]]
-        self.generation_time = None
+
+    id = attr.ib(type=str, kw_only=True)
+    name = attr.ib(type=str, kw_only=True)
+    description = attr.ib(type=str, kw_only=True)
+    generation_time = attr.ib(type=int, kw_only=True)
+    populations = attr.ib(type=list, kw_only=True)
+    population_configurations = attr.ib(default=[], kw_only=True)
+    demographic_events = attr.ib(default=[], kw_only=True)
+    migration_matrix = attr.ib(default=[[0]], kw_only=True)
+    citations = attr.ib(default=[], kw_only=True)

     @property
     def num_populations(self):
@@ -249,10 +248,9 @@ _popAnc = Population(name="popAnc", description="Generic ancestral population",
                      sampling_time=None)

-class PiecewiseConstantSize(Model):
+def PiecewiseConstantSize(N0, *args):
     """
-    Class representing a generic simulation model that can be run to output a
-    tree sequence. This is a piecewise constant size model, which allows for
+    Returns a piecewise constant size model, which allows for
     instantaneous population size change over multiple epochs in a single population.

     :ivar N0: The initial effective population size
@@ -267,32 +265,29 @@ class PiecewiseConstantSize(Model):
         model1 = stdpopsim.PiecewiseConstantSize(N0, (t1, N1)) # One change
         model2 = stdpopsim.PiecewiseConstantSize(N0, (t1, N1), (t2, N2)) # Two changes
     """
-
-    id = "constant"
-    name = "Piecewise constant size"
-    description = "Piecewise constant size population model over multiple epochs."
-    citations = []
     populations = [_pop0]
-    author = None
-    year = None
-    doi = None
-
-    def __init__(self, N0, *args):
-        self.population_configurations = [
-            msprime.PopulationConfiguration(
-                initial_size=N0, metadata=self.populations[0].asdict())
-        ]
-        self.migration_matrix = [[0]]
-        self.demographic_events = []
-        for t, N in args:
-            self.demographic_events.append(msprime.PopulationParametersChange(
-                time=t, initial_size=N, growth_rate=0, population_id=0))
-
-
-class GenericIM(Model):
+    population_configurations = [
+        msprime.PopulationConfiguration(
+            initial_size=N0, metadata=populations[0].asdict())
+    ]
+    demographic_events = []
+    for t, N in args:
+        demographic_events.append(msprime.PopulationParametersChange(
+            time=t, initial_size=N, growth_rate=0, population_id=0))
+
+    return Model(
+            id="constant",
+            name="Piecewise constant size",
+            description="Piecewise constant size population model over "
+                        "multiple epochs.",
+            populations=populations,
+            population_configurations=population_configurations,
+            demographic_events=demographic_events)
+
+
+def GenericIM(NA, N1, N2, T, M12, M21):
     """
-    Class representing a generic simulation model that can be run to output a tree
-    sequence. A generic isolation with migration model where a single ancestral
+    Returns a generic isolation with migration model where a single ancestral
     population of size NA splits into two populations of constant size N1
     and N2 time T generations ago, with migration rates M12 and M21 between
     the split populations. Sampling is disallowed in population index 0,
@@ -319,33 +314,28 @@ class GenericIM(Model):
         model1 = stdpopsim.GenericIM(NA, N1, N2, T, M12, M21)

     """
-    id = "IM"
-    name = "Isolation with migration"
-    description = """
-        A generic isolation with migration model where a single ancestral
-        population of size NA splits into two populations of constant size N1
-        and N2 time T generations ago, with migration rates M12 and M21 between
-        the split populations.
-        """
-    citations = []
     populations = [_pop0, _pop1, _popAnc]
-    author = None
-    year = None
-    doi = None
-
-    def __init__(self, NA, N1, N2, T, M12, M21):
-        self.population_configurations = [
-            msprime.PopulationConfiguration(
-                initial_size=N1, metadata=self.populations[0].asdict()),
-            msprime.PopulationConfiguration(
-                initial_size=N2, metadata=self.populations[1].asdict()),
-            msprime.PopulationConfiguration(
-                initial_size=NA, metadata=self.populations[2].asdict())
-        ]
-        self.migration_matrix = [[0, M12, 0], [M21, 0, 0], [0, 0, 0]]
-        self.demographic_events = [
-            msprime.MassMigration(
-                time=T, source=0, destination=2, proportion=1),
-            msprime.MassMigration(
-                time=T, source=1, destination=2, proportion=1)
-        ]
+    return Model(
+            id="IM",
+            name="Isolation with migration",
+            description="""
+                A generic isolation with migration model where a single ancestral
+                population of size NA splits into two populations of constant size N1
+                and N2 time T generations ago, with migration rates M12 and M21 between
+                the split populations.
+                """,
+            population_configurations=[
+                msprime.PopulationConfiguration(
+                    initial_size=N1, metadata=populations[0].asdict()),
+                msprime.PopulationConfiguration(
+                    initial_size=N2, metadata=populations[1].asdict()),
+                msprime.PopulationConfiguration(
+                    initial_size=NA, metadata=populations[2].asdict())
+            ],
+            migration_matrix=[[0, M12, 0], [M21, 0, 0], [0, 0, 0]],
+            demographic_events=[
+                msprime.MassMigration(
+                    time=T, source=0, destination=2, proportion=1),
+                msprime.MassMigration(
+                    time=T, source=1, destination=2, proportion=1)
+            ])

Obviously, all existing model definitions would need changing, plus documentation, ...

jeromekelleher commented 4 years ago

I like it @grahamgower, I think having a function that returns a Model instance would be a nice simple way setting things up.

I'm not totally convinced this is better for the generic models (which really do represent classes of generic models that can be instantiated, and are part of the module's public namespace), but I think it'd be much better for the concrete catalog models.

gtsambos commented 4 years ago

One situation in which it would make sense to keep the models as classes rather than just functions is when you wish to simulate under slightly different demographic parameters that are drawn from a prior that is specific to the model class. Atm, it seems like stdpopsim is focused on fixing the parameters using point estimates from papers (a v worthy and ambitious goal in itself!) -- but if this broader alternative was to ever be part of stdpopsim, it might be useful to keep things as they are. In this situation, each simulation would be generated from a particular demographic object that is instantiated from a broader model class.

Btw, I don't know whether doing this would be "too broad" for stdpopsim, but I think that this would be a useful thing to at least think about given the stated aims of the consortium re: doing powerful standardised benchmarking. It would allow users to perform "sensitivity analyses" more easily - ie., tests of how robust various methods are to slight uncertainties and misspecifications in their model.

grahamgower commented 4 years ago

I really like the idea of having these kinds of models in stdpopsim. It can already be done. Either as a function,

from scipy.stats import uniform
import stdpopsim

def my_model_generator(nreps):
    N0 = uniform(100, 500).rvs(size=nreps)
    N1 = uniform(1000, 10000).rvs(size=nreps)
    T = uniform(300, 3000).rvs(size=nreps)
    for n0, n1, t in zip(N0, N1, T):
        yield stdpopsim.PiecewiseConstantSize(n0, (t, n1))

def my_model_func():
    # just get one
    return next(my_model_generator(1))

Or as a class,

class MyModelClass(stdpopsim.PiecewiseConstantSize):
    N0 = uniform(100, 500)
    N1 = uniform(1000, 10000)
    T = uniform(300, 3000)

    def __init__(self):
        return next(MyModelClass.generator(1))

    @staticmethod
    def generator(cls, nreps):
        N0 = cls.N0.rvs(size=nreps)
        N1 = cls.N1.rvs(size=nreps)
        T = cls.T.rvs(size=nreps)
        for n0, n1, t in zip(N0, N1, T):
            yield super().__init__(n0, (t, n1))

Just a matter of taste.

gtsambos commented 4 years ago

Sure - but if only implemented as a function, it's harder to extract the exact demographic parameters used in a given sim. I think it would be more natural for these parameters to be attributes of a particular model instance that are simulated from methods in the model class. In your class code above for example, these parameters can be retrieved from myModelClass.N0, whereas in the function implementation, you'd have to return the parameters along with the simulation output.

gtsambos commented 4 years ago

Maybe a better way to explain it is in @jeromekelleher's comment:

I'm not totally convinced this is better for the generic models (which really do represent classes of generic models that can be instantiated, and are part of the module's public namespace), but I think it'd be much better for the concrete catalog models.

In a world where the catalog models have random rather than fixed parameter values, they basically just become another example of a generic model.

gtsambos commented 4 years ago

Oh, I somehow didn't see that big pull request @grahamgower. Sorry if this was unhelpful to hear only after doing all that work 😞

jeromekelleher commented 4 years ago

Sure - but if only implemented as a function, it's harder to extract the exact demographic parameters used in a given sim. I think it would be more natural for these parameters to be attributes of a particular model instance that are simulated from methods in the model class. In your class code above for example, these parameters can be retrieved from myModelClass.N0, whereas in the function implementation, you'd have to return the parameters along with the simulation output.

I don't think there's any difference here between getting the attributes from a super class or from an instance @gtsambos, it's the same variables that you'd be accessing. To do things properly, you'd need to think carefully about how to arrange the relevant variables to make them accessible for generating random values. These would look quite different to the classes that we currently have (which really are just functions returning a Model instance).

So, I think we should make the design we have as simple as possible to make things easy for model implementers for now, and think about how to generate random distributions around them later. Please do open up an issue for discussion around this though --- it's a really good application of these models.

gtsambos commented 4 years ago

I guess what I'm saying is that a class makes more sense to me when a number of different attributes or outputs might need to be bundled together in some way. In the catalog models as they currently stand, this isn't an issue -- the same demographic parameters hold regardless of the simulation run. This is also basically true for the generic models at the moment, where the users are inputting their desired parameter values. In a random parameter universe, each individual simulation is associated with its own, potentially unique set of parameters specific to that particular run -- so there are several outputs that the user needs to know about, and to me it seems like it would be cleanest to do this by keeping all of the parameters together in a class.

I totally understand the point about making things as easy as possible for developers, though.

Please do open up an issue for discussion around this though --- it's a really good application of these models.

Yes, I'd be interested in getting this into stdpopsim, and I think I'd also be quite handy for the development given some of my other recent work

jeromekelleher commented 4 years ago

The functions return an instance of one of these classes though, so it's really just a semantic difference. Generating models from prior distributions will need extra infrastructure and something that looks quite different to what we have now. We're deliberately not exposing the details of how catalog models are implemented internally so that we have the freedom to rearrange things as we like. This is just an internal implementation detail, and I'm sure it'll all be refactored and rearranged many times --- but none of this will affect end user. It's easy to refactor model definitions when they've been submitted and QC'd, but we have to make adding them as simple and streamlined as possible. The class infrastructure is just confusing boilerplate at the moment, so I say we get rid of it.

gtsambos commented 4 years ago

Okay, I'm coming around -- that makes sense! :)

grahamgower commented 4 years ago

If all models are classes, and parameters are required to be attributes (as for N0, N1, T, in MyModelClass below), then this sort of model abuse would be possible:

from scipy.stats import uniform
import stdpopsim

class MyModelClass(stdpopsim.PiecewiseConstantSize):
    N0 = 100
    N1 = 1000
    T = 600
    def __init__(self):
        super().__init__(self.N0, (self.T, self.N1))

    # this could be a classmethod of stdpopsim.Model
    @classmethod
    def frob(cls, **kwargs):
        attr = {}
        for param, distrib in kwargs.items():
            attr[param] = distrib.rvs(size=1)[0]
        newclass = type("Frobbed"+cls.__name__, (cls,), attr)
        return newclass()

m = MyModelClass.frob(T=uniform(300, 3000))
print(m.T)
print(MyModelClass.T)

Although this kind of approach makes it hard to have a bound for one random variable depend on the value drawn for another.