popsim-consortium / stdpopsim

A library of standard population genetic models
GNU General Public License v3.0

ID scheme for population models #122

Closed jeromekelleher closed 4 years ago

jeromekelleher commented 5 years ago

After #110 was merged we have an ID scheme for population models. The idea is to have a unique (within species), human-memorable string that serves as shorthand for finding models in the CLI and from species.get_model() (we can always add other ways of querying based on other properties later).

This is leading on from #85.

What we have at the moment is:

homsap
    ooa_3
    ooa_2
    african   
    america
    ooa_archaic
    zigzag
dromel
    afr_3epoch
    ooa_2
aratha
    fixme

These are not consistent, well thought through, or anything like that; it's just a starting point.

The catalog and online CLI help should be enough to explain what these mean, so I think it's OK to have reasonably terse IDs.

Any ideas on how to do this well would be great!

andrewkern commented 4 years ago

each of the population models will be nested within an organism (with the exception of the Generics), so perhaps all we need is information about the number of populations and some ID for the model itself. Using first author names might work for the latter. e.g.

homsap
    3_Gut
    2_Fu
    1_Ten   
    4_Bro
    5_Rag
   ...

too minimal?

jeromekelleher commented 4 years ago

It's a good idea @andrewkern, but it doesn't give any hint about what the model is. I don't find the names of authors very helpful here in making the model IDs memorable. At least OOA is 'out-of-Africa'.

@dschride, @reedacartwright, you've had good ideas on this type of thing before? How do we come up with short, unique, memorable tags for population models?

dschride commented 4 years ago

I still like the idea of a prefix like "2pop" or "1pop", and agree that this should be followed by a name that is descriptive of the model rather than the author name. Author name could be a suffix, but then things are ballooning a bit.

reedacartwright commented 4 years ago

I think model names should be treated like variable names: make them as long as necessary to make it clear what they represent.

human:out_of_africa_3pop
human:out_of_africa_2pop
human:africa
human:america

If we want brevity, we can index the models like BLAST indexes genetic codes.

human:1 # Human model 1 is the out_of_africa_3pop model, etc.
human:2
human:3

dschride commented 4 years ago

> make them as long as necessary to make it clear what they represent

Sounds right to me.

petrelharp commented 4 years ago

Proposal (from the call):

andrewkern commented 4 years ago

from the call-- people agree that IDs should be descriptive, but the documentation is also very good on the CLI. proposal for a three part id scheme like:

somethingDescriptive_numberOfPops_3CharacterAuthorName

reedacartwright commented 4 years ago

Are we wedded to the six character species names? I would prefer to use common names if possible, like human instead of homsap.

reedacartwright commented 4 years ago

Naming Schemes

Here is a naming scheme with the following format: ${something_descriptive}-${number_of_populations}pop-${first_author_initial}${two_digit_date}

homsap
    0: default
    1: out_of_africa-3pop-G09
    2: out_of_africa-2pop-T12 (default)
    3: africa-1pop-T12
    4: american_admixture-4pop-B11
    5: archaic_admixture_and_out_of_africa-5pop-R19
    6: zigzag-1pop-S14
    7: ancient_eurasia-9pop-K19
dromel
    0: default
    1: africa_with_three_epochs-1pop-S16 (default)
    2: out_of_africa_three_epochs-2pop-L06
aratha
    0: default
    1: altas-1pop-D17
    2: africa-1pop-H18 (default)

I'm using dashes in this scheme to make it easier to separate out different components if necessary.
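As a rough illustration of why the dashes help, here is a minimal sketch that splits a name of this form back into its components. It follows the listed examples (e.g. `3pop`), and `ModelName` and `parse_dashed_name` are hypothetical names made up for this sketch, not part of stdpopsim.

```python
import re
from typing import NamedTuple


class ModelName(NamedTuple):
    """Hypothetical container for the three components of a dashed name."""
    description: str
    num_populations: int
    author_initial: str
    year: str  # two-digit year kept as a string so that "09" survives


# description-<N>pop-<initial><yy>, e.g. "out_of_africa-3pop-G09"
_DASHED_NAME = re.compile(r"^([a-z][a-z0-9_]*)-(\d+)pop-([A-Z])(\d{2})$")


def parse_dashed_name(name: str) -> ModelName:
    """Split a dashed model name into its components (assumed format)."""
    match = _DASHED_NAME.match(name)
    if match is None:
        raise ValueError(f"not a valid model name: {name!r}")
    description, npop, initial, year = match.groups()
    return ModelName(description, int(npop), initial, year)
```

Because underscores only appear inside the description and dashes only between components, the split is unambiguous even for long descriptions such as `archaic_admixture_and_out_of_africa-5pop-R19`.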

Index Scheme

Models can also be accessed by number. E.g. homsap model 1 is out_of_africa-3pop-G09. Model 0 is reserved to refer to the default model for a given species. The default can be changed between releases, but the other indexes cannot. Models can be retired but indexes cannot be reused, just as BLAST retires genetic codes. Since the default can change, it is important that the CLI reports/records the exact model used if someone requests the default.
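The rules above (index 0 means "the default", indexes are stable, retired indexes are never reused, and the concrete model behind the default is always reported) could be sketched roughly as follows. `ModelIndex` and its methods are hypothetical and exist only for this illustration; nothing here is stdpopsim API.

```python
class ModelIndex:
    """Hypothetical per-species model index; a sketch, not stdpopsim code."""

    def __init__(self):
        self._names = {}        # index -> model name; entries are never deleted
        self._retired = set()   # indexes that may no longer be used
        self._default = None    # index that model 0 currently points at
        self._next_index = 1    # index 0 is reserved for the default

    def add(self, name, default=False):
        """Register a model under the next unused index and return that index."""
        index = self._next_index
        self._next_index += 1
        self._names[index] = name
        if default:
            self._default = index
        return index

    def retire(self, index):
        """Retire a model; its index stays occupied so it cannot be reused."""
        if index not in self._names:
            raise LookupError(f"unknown model index {index}")
        if index == self._default:
            raise ValueError("choose a new default before retiring this model")
        self._retired.add(index)

    def resolve(self, index):
        """Return (concrete_index, name) so callers can record the exact model."""
        if index == 0:
            if self._default is None:
                raise LookupError("no default model set")
            index = self._default
        if index not in self._names:
            raise LookupError(f"unknown model index {index}")
        if index in self._retired:
            raise LookupError(f"model {index} has been retired")
        return index, self._names[index]
```

Returning the concrete index alongside the name from `resolve(0)` is one way to make sure a log records which model was actually run when the default was requested.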

reedacartwright commented 4 years ago

Open question: Can two models have the same description if they have different population numbers, citations, or species?

I say "no", "no", "yes".

dschride commented 4 years ago

Maybe we should have the first two letters of the author name rather than just one? At this point, what's one more character?

reedacartwright commented 4 years ago

@dschride I'm fine with that.

jeromekelleher commented 4 years ago

I like it a lot @reedacartwright, this is exactly the sort of thing we want. I'll think through the details...

jeromekelleher commented 4 years ago

> Are we wedded to the six character species names? I would prefer to use common names if possible, like human instead of homsap.

Yes, I'm quite wedded to it. It's concise, easily described, formulaic, reasonably secure from collisions and is really helpful in organising the code. We can add in synonyms for the common names for the purposes of the CLI later, but the 3+3 scheme is helpful at organising something very messy.
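To show how formulaic the 3+3 scheme is, here is a minimal sketch. It assumes that "3+3" means the first three letters of the genus followed by the first three letters of the species epithet, which matches homsap, dromel and aratha; `species_id` is a hypothetical helper, not a stdpopsim function.

```python
def species_id(binomial: str) -> str:
    """Build a six-character ID from a binomial name (assumed 3+3 rule)."""
    parts = binomial.split()
    if len(parts) < 2:
        raise ValueError(f"expected 'Genus species', got {binomial!r}")
    genus, epithet = parts[0], parts[1]
    # First three letters of each part, lower-cased as in the IDs listed above.
    return (genus[:3] + epithet[:3]).lower()


print(species_id("Homo sapiens"))             # homsap
print(species_id("Drosophila melanogaster"))  # dromel
print(species_id("Arabidopsis thaliana"))     # aratha
```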

jeromekelleher commented 4 years ago

Having hacked around with this a bit, I have a counter proposal:

 ${SomethingDescriptive}_${number_of_populations}${first_author_initial}${two_digit_date}

This gives IDs like OutOfAfrica_3G09 and AncientEurasia_9K19.

Here's some rationale:

  1. We want the IDs to be valid identifiers in programming language X, so using hyphens within the ID won't work. Since we have to use underscore to separate the sections, we have to use CamelCase to make the ID legible.

  2. The ID is composed of two parts. The first is the CamelCase name, following the usual good-practice rules for naming things (long enough to be descriptive, but not too long). The second part is to make similar models unique, and is the number of populations, first letter of first author name and two digit year. On the (presumably rare) occasions we have a collision with this, we can tweak the first bit to make it different. I don't see any point in separating the number of populations from the author or putting in a "pop" fixed string. Personally, I resent every unnecessary character when working with CLIs, and it really does help for visual rendering purposes to keep the IDs short.
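The two points above can be sketched in Python, taking Python as "programming language X" purely for illustration. `make_model_id` is a hypothetical helper, not stdpopsim code; it assembles an ID from its parts and checks that the result is a valid identifier.

```python
def make_model_id(description_words, num_populations, first_author, year):
    """Assemble an ID of the form <CamelCaseName>_<npop><initial><yy>."""
    # CamelCase name: capitalise each word and join without separators.
    name = "".join(word.capitalize() for word in description_words)
    # Suffix: population count, first author initial, two-digit year.
    suffix = f"{num_populations}{first_author[0].upper()}{year % 100:02d}"
    model_id = f"{name}_{suffix}"
    if not model_id.isidentifier():
        raise ValueError(f"{model_id!r} is not a valid identifier")
    return model_id


print(make_model_id(["out", "of", "africa"], 3, "Gutenkunst", 2009))
# OutOfAfrica_3G09

# A hyphenated name cannot be used as an identifier:
print("out_of_africa-3pop-G09".isidentifier())  # False
```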

We can also adopt a numbering system like @reedacartwright proposes above; this is a separate, additional thing to having good, descriptive, unique, long-term stable names.

Any thoughts? Perhaps the name part should be camelCase rather than CamelCase?

dschride commented 4 years ago

Overall I think that sounds great. Personally I prefer camelCase to CamelCase because I find it slightly easier to type but that's just me--both are fine. I still think adding the second letter from the author name could help with both readability and reducing collisions, but maybe this is not essential.

Also, going back to the six-character species names, can we make those camelCase (or CamelCase) for readability?

jeromekelleher commented 4 years ago

> I still think adding the second letter from the author name could help with both readability and reducing collisions, but maybe this is not essential.

I don't see how "3Gu09" is any more readable than "3G09" to be honest. How about we say, "if there are collisions, add more letters of the first author's surname"?
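The "add more letters on collision" rule could look roughly like this. It is only a sketch: `disambiguate` and the `taken` set are hypothetical names for illustration and are not part of stdpopsim.

```python
def disambiguate(name, num_populations, surname, year, taken):
    """Return the shortest ID not already in `taken`.

    Starts with the first letter of the first author's surname and adds
    one more letter each time the candidate ID collides with an existing one.
    """
    for n_letters in range(1, len(surname) + 1):
        author = surname[:n_letters].capitalize()
        candidate = f"{name}_{num_populations}{author}{year % 100:02d}"
        if candidate not in taken:
            return candidate
    raise ValueError("could not build a unique ID from the surname alone")


print(disambiguate("OutOfAfrica", 3, "Gutenkunst", 2009, taken=set()))
# OutOfAfrica_3G09
print(disambiguate("OutOfAfrica", 3, "Gutenkunst", 2009, taken={"OutOfAfrica_3G09"}))
# OutOfAfrica_3Gu09
```

With this rule the short form stays the common case, and the longer author prefix only appears for the later of two colliding models.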

> Also, going back to the six-character species names, can we make those camelCase (or CamelCase) for readability?

There's a good argument for making the species ids CamelCase for consistency all right if we go with this:

Demographic models:

Genetic maps:

Sorta works, doesn't it?

jeromekelleher commented 4 years ago

I've implemented this proposal in #310. Still open for discussion, please vent any/all opinions here!

andrewkern commented 4 years ago

@jeromekelleher I think this looks good. I'm strongly in favor of including the author initial as you have here.

dschride commented 4 years ago

All sounds good. I was just thinking that to me Gu hints at Gutenkunst, while G = ?? But adding more as needed to avoid collisions is fine.

jeromekelleher commented 4 years ago

@reedacartwright, any thoughts here?

reedacartwright commented 4 years ago

I like the proposal in #310, and I'm willing to go one step further: Remove the population number from the ID unless it is to avoid model collisions (say when a paper has multiple models).

This would give us the ids: OutOfAfrica_G09 and OutOfAfrica_T12.

Thoughts?

dschride commented 4 years ago

I would vote to leave this in, otherwise when you do have collisions you get some models with the number and one without and this is determined by the order in which they are added. I don't think we want our names affected by historical stuff like this.

jeromekelleher commented 4 years ago

I'm with @dschride here --- we already have multiple models from the same paper and it's anyway useful information being communicated.

jeromekelleher commented 4 years ago

We've agreed to go ahead with this scheme. I'm going to leave this open until we have documented the ID scheme in the developer docs.

andrewkern commented 4 years ago

going to assign myself to this. i'll take a stab at updating the docs today