popsim-consortium / stdpopsim

A library of standard population genetic models
GNU General Public License v3.0
125 stars 87 forks source link

Separate demographic models from the stdpopsim python package #1232

Open grahamgower opened 2 years ago

grahamgower commented 2 years ago

We plan to transition stdpopsim's demographic models to the Demes YAML format (#962). This means that the demographic models will become programming-language agnostic, and thus useful outside of the scope of the stdpopsim Python package. So in principle, any simulator that supports Demes could make use of the stdpopsim catalog, without depending on the stdpopsim Python package (the CLI/API that one gets with "pip install stdpopsim"). In fact, such a simulator need not even be written in Python. However, making this easy for some external simulator would require us to reorganise the stdpopsim project a little.

I've opened this issue to see what the community thinks about such a reorganisation. We had some discussion of this in the context of Demes a while ago (https://github.com/popsim-consortium/demes-python/issues/203#issuecomment-813391254), and @apragsdale raised it again recently in (#962).

To summarise, the main reasons a Python simulator may not want to depend on the stdpopsim Python package just to get the catalog models:

It would be possible for us to split stdpopsim into two Python packages, say stdpopsim and stdpopsim-catalog, where the former depends on the latter. stdpopsim-catalog would consist of just the YAML files, and maybe a helper function or two to query them. This would have a liberal license and no dependencies. The files themselves would still live inside the stdpopsim git repository, from which both python packages would be built.

I'm not sure about how to package things for non-Python software, but perhaps @apragsdale or @molpopgen have suggestions in this regard. At the very least, a non-python project could include the stdpopsim git repository as a git submodule.

PS: @jeromekelleher has also suggested we might one day move to a YAML-based format for the species info in the catalog (chromosome lengths, mutation rates, recombination maps, etc.)---sorry I couldn't find where this was posted. Of course, this would also be useful outside the context of a stdpopsim CLI/API.

molpopgen commented 2 years ago

I think that having the YAML files or whatnot as a separate package under an MIT license (or similar) would make sense. Having that would be more useful for the community, IMO.

apragsdale commented 2 years ago

I think that's exactly right @grahamgower. There are a lot of advantages for moving the demographic model catalog to its own repo, which stdpopsim imports as a submodule. There wouldn't be any issue with non-Python software doing the same, since it would just be a collection of YAML data files with minimal dependencies (e.g., I have used demes-spec as a submodule in a Julia package to test the invalid and valid test cases).

This will require some effort, but I think with some up-front planning, it would be straightforward and the benefits are worth it.

andrewkern commented 2 years ago

i'm not sure about this idea but at least part of that is my unfamiliarity with the current demes spec.

A large portion of what makes stdpopsim work is associating species specific parameters with demographic models, and then QCing those implementations-- how would a stdpopsim-catalog module achieve this? demes models aren't carrying around params like mutation and recombination rates right?

I'm also concerned about what this would mean for the QC protocol that we've been operating under. It seems like we are barely there with QC as it is, and to now reinvent this process would be a real hit to future growth.

It also strikes me that while there is value in divorcing the collection of yaml models from the python code, there would be a ton of work to do to make it happen. At this point I'm not sure we have a large user base that would benefit from this effort-- surely folks could simply copy a yaml file over if interested in a specific model?

at any rate, just some first blush thoughts

molpopgen commented 2 years ago

At this point I'm not sure we have a large user base that would benefit from this effort-- surely folks could simply copy a yaml file over if interested in a specific model?

This is not so easy -- see @grahamgower's comment re: licensing. Putting YAML files under the GPL means that software consuming those files (by including them in a repo, using the repo as a submodule), inherit the GPL license for distribution. Licensing matters!

andrewkern commented 2 years ago

so would changing our licensing fix this?

molpopgen commented 2 years ago

so would changing our licensing fix this?

Not possible -- again, @grahamgower's comment addresses that. msprime is GPL b/c of dependency on GSL.

andrewkern commented 2 years ago

but msprime is outside of stdpopsim. what is the slim license?

ah i see also GPL. so all our dependencies are GPL

molpopgen commented 2 years ago

but msprime is outside of stdpopsim. what is the slim license?

But you use the API. Therefore stdpopsim depends on a GPL API, and is thus a derivative product, IIUC.

apragsdale commented 2 years ago

A large portion of what makes stdpopsim work is associating species specific parameters with demographic models, and then QCing those implementations-- how would a stdpopsim-catalog module achieve this? demes models aren't carrying around params like mutation and recombination rates right?

It's true that the demes spec itself doesn't have information about mutation and recombination rates, genome build, etc. However, there is a metadata field that can store such information. This would be something to discuss -- what's the best way to design such a standalone catalog that includes all the information relevant for stdpopsim simulations.

As far as the QC protocol goes, the current approach is reasonable but still prone to error. We've found errors that have slipped past both the initial implementation, QC, and review. Part of the problem is readability of msprime demographic models. Demes is a lot more readable and has built in validation methods, along with visualizing capabilities. Even though there is some up front work, I think switching to Demes would make QCing easier for us and provide better QC in the long run.

grahamgower commented 2 years ago

@andrewkern writes:

i'm not sure about this idea but at least part of that is my unfamiliarity with the current demes spec.

A large portion of what makes stdpopsim work is associating species specific parameters with demographic models, and then QCing those implementations-- how would a stdpopsim-catalog module achieve this? demes models aren't carrying around params like mutation and recombination rates right?

I'm also concerned about what this would mean for the QC protocol that we've been operating under. It seems like we are barely there with QC as it is, and to now reinvent this process would be a real hit to future growth.

It also strikes me that while there is value in divorcing the collection of yaml models from the python code, there would be a ton of work to do to make it happen. At this point I'm not sure we have a large user base that would benefit from this effort-- surely folks could simply copy a yaml file over if interested in a specific model?

at any rate, just some first blush thoughts

These are all fair points, and I definitely want to avoid doing anything that would harm the stdpopsim project or make it harder for contributors. I feel that one aspect of QC we've been struggling with in stdpopsim with is that we simply need people to volunteer their time to do QC. The amount of time it takes to QC something is somewhat open-ended, and it's not necessarily rewarding. Arguably, moving to the Demes YAML format will simplify the model-writing part of any QC process, but any change to the process is going to be disruptive. So we must ensure that the benefits of change outweight the cost.

grahamgower commented 2 years ago

Assuming that #962 is completed (I think the regular stdpopsim contributors agree this is a good idea, at least in principle?), then we'll have a bunch of YAML files in the git repository and new docs on how to do QC. So the minimal changes I'm proposing in the current issue would be to (re)license the Demes YAML files such that they are public domain, MIT, CC-BY, or similar.

DISCLAIMER: I'm not a lawyer, but here is my understanding of the GPL as it applies to our current situation...

When someone does "pip install stdpopsim", the bundle of software and associated files they get are bound by the terms of the GPL (because stdpopsim uses the msprime API, which is GPL). This says that we must provide the source code for the things in the bundle, and anyone distributing modifications to that code must distribute those modifications under the terms of the GPL (ie., they must promise to give the source code to anyone who asks, such that those asking are allowed to distribute their own modificiations under the terms of the GPL, etc. ad infinitum).

But individual components can have any (GPL-compatible) license. This means that we can have an MIT-licensed file in the repository, which is distributed with the pip-installable stdpopsim package, but the pip-installable bundle is still distributed according to the terms of the GPL. But if we also distribute our non-GPL components in such a way that those components don't depend on any GPL component, then someone is free to take such non-GPL parts and use those under the terms of whatever non-GPL license is attached (e.g. the MIT terms).

Thus what I am proposing is to organise things so we can distribute the Demes YAML components on their own, with a non-GPL license. In principle this ought to be possible because a YAML file doesn't depend on any external components such as the msprime API, and the YAML file is clearly useable independently of any GPL components in the stdpopsim distribution. To make this work in practice,

molpopgen commented 2 years ago

Yes, @grahamgower's intuitions about the GPL sync up with my own. In essence, we want to avoid a user copying a file (even an MIT-licensed file) from a GPL-licensed project, or from using such a project as a git submodule. Both actions would fall, I believe, under "distribution", and GPL terms would apply, IIUC.

jeromekelleher commented 2 years ago

Licencing is a headache, and discussions of licensing issues rarely produce anything of value, in my experience.

I suggest that we separate the discussion of converting the catalog models into Demes YAML format from what license they should/could be published under. The ultimate software license used is by far the least interesting aspect of this process.

So, goal (1) would be to convert the existing models into Demes. If/when this is completed we could discuss the tedious details of licensing.

I have gone further before and suggested that we should ultimately aim to make the entire catalog staticly declared in YAML, and to decouple it from Python (I can't seem to find the discussion, though). Then there is a natural split between the catalog information and the code required to run the front-ends, and that is where I would suggest the package split should be. (i.e., we make a stdpopsim-catalog repo, which we can then relicense, if we wish)

grahamgower commented 2 years ago

I have gone further before and suggested that we should ultimately aim to make the entire catalog staticly declared in YAML, and to decouple it from Python (I can't seem to find the discussion, though).

It was PR #1020. Lots of good discussion there too.

jeromekelleher commented 2 years ago

Aha, thanks @grahamgower. That really is much nicer than the current setup isn't it? Maybe if we could stabilise the code base we make the switch at some point.

apragsdale commented 2 years ago

I liked your proposal in that closed PR as well, @jeromekelleher.

For converting the demographic models, it can be pretty easily automated, doing something like this (note, needs demes >= 0.2.1 to have the metadata field):

import stdpopsim
import demes
import os

yaml_path = "./converted_catalog"
if not os.path.exists(yaml_path):
    os.system(f"mkdir {yaml_path}")

for s in stdpopsim.all_species():
    print("converting models for", s.id)
    species_path = os.path.join(yaml_path, s.id)
    if not os.path.exists(species_path):
        os.system(f"mkdir {species_path}")
    for model in s.demographic_models:
        graph = model.model.to_demes()
        graph.generation_time = model.generation_time
        graph.time_units = "years"
        graph.description = model.description
        graph.doi = [c.doi for c in model.citations]
        graph.metadata = {  # all the extra stuff is placed in metadata
            "citations": [
                {
                    "doi": c.doi,
                    "author": c.author,
                    "year": c.year,
                    "reasons": [r for r in c.reasons],
                }
                for c in model.citations
            ],
            "long_description": model.long_description,  # may need to be formatted for \n
            "mutation_rate": model.mutation_rate,
        }
        fname = os.path.join(species_path, model.id + ".yaml")
        demes.dump(graph, fname)

The conversion (of the demographic models at least) is easy. The work will be in reconfiguring stdpopsim to read from yamls instead.