Closed jeromekelleher closed 4 years ago
each of the population models will be nested within an organism (with the exception of the Generics), so perhaps all we need is information about the number of populations and some ID for the model itself. Using first author names might work for the latter. e.g.
homsap
3_Gut
2_Fu
1_Ten
4_Bro
5_Rag
...
too minimal?
It's a good idea @andrewkern, but it doesn't give any hint about what the model is. I don't find the names of authors very helpful here in making the models IDs memorable. At least OOA is 'out-of-Africa'.
@dschride, @reedacartwright, you've had good ideas on this type of thing before? How do we come up with short, unique, memorable tags for population models?
I still like the idea of a prefix like "2pop" or "1pop", and agree that this should be followed by a name that is descriptive of the model rather than the author name. Author name could be a suffix, but then things are ballooning a bit.
I think model names should be treated like variable names: make them as long as necessary to make it clear what they represent.
human:out_of_africa_3pop
human:out_of_africa_2pop
human:africa
human:america
If we want brevity, we can index the models like BLAST indexes genetic codes.
human:1 # Human model 1 is the out_of_africa_3pop model, etc.
human:2
human:3
make them as long as necessary to make it clear what they represent
Sounds right to me.
Proposal (from the call):
People agree that IDs should be descriptive, but the documentation on the CLI is also very good. The proposal is a three-part ID scheme like:
somethingDescriptive_numberOfPops_3CharacterAuthorName
Are we wedded to the six character species names? I would prefer to use common names if possible, like human instead of homsap.
Here is a naming scheme with the following format: ${something_descriptive}-pop${number_of_populations}-${first_author_initial}${two_digit_date}
homsap
0: default
1: out_of_africa-3pop-G09
2: out_of_africa-2pop-T12 (default)
3: africa-1pop-T12
4: american_admixture-4pop-B11
5: archaic_admixture_and_out_of_africa-5pop-R19
6: zigzag-1pop-S14
7: ancient_eurasia-9pop-K19
dromel
0: default
1: africa_with_three_epochs-1pop-S16 (default)
2: out_of_africa_three_epochs-2pop-L06
aratha
0: default
1: atlas-1pop-D17
2: africa-1pop-H18 (default)
I'm using dashes in this scheme to make it easier to separate out different components if necessary.
Models can also be accessed by number. E.g. homsap model 1 is out_of_africa-3pop-G09. Model 0 is reserved to refer to the default model for a given species. The default can be changed between releases, but the other indexes cannot. Models can be retired but indexes cannot be reused, just as BLAST retires genetic codes. Since the default can change, it is important that the CLI reports/records the exact model used if someone requests the default.
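To make the indexing rules concrete, here is a minimal sketch of how a per-species index could behave. The `ModelIndex` class and its method names are hypothetical, made up for illustration; nothing like this exists in the codebase yet. It assumes index 0 always resolves to the current default, indexes are handed out in order, and a retired index is never reused.

```python
class ModelIndex:
    """Hypothetical per-species model index (illustration only)."""

    def __init__(self):
        self._models = {}     # index -> model ID, or None once retired
        self._default = None  # index of the current default model
        self._next = 1        # index 0 is reserved for "the default"

    def add(self, model_id, default=False):
        """Register a model under the next unused index and return that index."""
        index = self._next
        self._next += 1
        self._models[index] = model_id
        if default:
            self._default = index
        return index

    def retire(self, index):
        """Retire a model; its index stays occupied so it is never reused."""
        if index not in self._models:
            raise KeyError(index)
        self._models[index] = None
        if self._default == index:
            self._default = None

    def resolve(self, index):
        """Return (actual_index, model_id) so callers can record the exact model."""
        if index == 0:
            if self._default is None:
                raise LookupError("no default model set")
            index = self._default
        model_id = self._models.get(index)
        if model_id is None:
            raise LookupError(f"model {index} is unknown or retired")
        return index, model_id


homsap = ModelIndex()
homsap.add("out_of_africa-3pop-G09")
homsap.add("out_of_africa-2pop-T12", default=True)
print(homsap.resolve(0))  # (2, 'out_of_africa-2pop-T12')
```

Returning the resolved index together with the ID is what lets the CLI log the exact model when someone asks for the default.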
Open question: Can two models have the same description if they have different population numbers, citations, or species?
I say "no", "no", "yes".
Maybe we should have the first two letters of the author name rather than just one? At this point, what's one more character?
@dschride I'm fine with that.
I like it a lot @reedacartwright, this is exactly the sort of thing we want. I'll think through the details...
Are we wedded to the six character species names? I would prefer to use common names if possible, like human instead of homsap.
Yes, I'm quite wedded to it. It's concise, easily described, formulaic, reasonably secure from collisions and is really helpful in organising the code. We can add in synonyms for the common names for the purposes of the CLI later, but the 3+3 scheme is helpful at organising something very messy.
Having hacked around with this a bit, I have a counter proposal:
${SomethingDescriptive}_${number_of_populations}${first_author_initial}${two_digit_date}
This gives IDs like OutOfAfrica_3G09 and AncientEurasia_9K19.
Here's some rationale:
We want the IDs to be valid identifiers in programming language X, so using hyphens within the ID won't work. Since we have to use underscore to separate the sections, we have to use CamelCase to make the ID legible.
The ID is composed of two parts. The first is the CamelCase name, following the usual good-practice rules for naming things (long enough to be descriptive, but not too long). The second part is there to make similar models unique, and consists of the number of populations, the first letter of the first author's name and the two-digit year. On the (presumably rare) occasions we have a collision with this, we can tweak the first bit to make it different. I don't see any point in separating the number of populations from the author or putting in a "pop" fixed string. Personally, I resent every unnecessary character when working with CLIs, and it really does help for visual rendering purposes to keep the IDs short.
We can also adopt a numbering system like @reedacartwright proposes above; this is a separate, additional thing to having good, descriptive, unique, long-term stable names.
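For anyone who wants to play with the format, here is a rough sketch of building and parsing IDs of this shape. The helper names `make_model_id` and `parse_model_id` and the regular expression are my own assumptions, not anything in the codebase, and Python's `str.isidentifier()` stands in for "valid identifier in programming language X".

```python
import re

# Assumed shape: CamelCaseName_<num pops><author initial(s)><two-digit year>
ID_PATTERN = re.compile(
    r"^(?P<name>[A-Z][A-Za-z]*)_(?P<pops>[0-9]+)(?P<author>[A-Z][a-z]*)(?P<year>[0-9]{2})$"
)


def make_model_id(name, num_pops, author_surname, year, author_letters=1):
    """Build an ID such as OutOfAfrica_3G09 (hypothetical helper)."""
    author = author_surname[:author_letters]
    model_id = f"{name}_{num_pops}{author}{year % 100:02d}"
    if not model_id.isidentifier():
        raise ValueError(f"{model_id!r} is not a valid identifier")
    return model_id


def parse_model_id(model_id):
    """Split an ID into (name, num_pops, author_initials, two_digit_year)."""
    match = ID_PATTERN.match(model_id)
    if match is None:
        raise ValueError(f"{model_id!r} does not follow the ID scheme")
    return match["name"], int(match["pops"]), match["author"], match["year"]


print(make_model_id("OutOfAfrica", 3, "Gutenkunst", 2009))  # OutOfAfrica_3G09
print(parse_model_id("AncientEurasia_9K19"))  # ('AncientEurasia', 9, 'K', '19')
print("out_of_africa-3pop-G09".isidentifier())  # False: hyphens are not allowed
```

The last line shows why the dash-separated scheme fails the identifier requirement, at least in Python.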
Any thoughts? Perhaps the name part should be camelCase rather than CamelCase?
Overall I think that sounds great. Personally I prefer camelCase to CamelCase because I find it slightly easier to type but that's just me--both are fine. I still think adding the second letter from the author name could help with both readability and reducing collisions, but maybe this is not essential.
Also, going back to the six-character species names, can we make those camelCase (or CamelCase) for readability?
I still think adding the second letter from the author name could help with both readability and reducing collisions, but maybe this is not essential.
I don't see how "3Gu09" is any more readable than "3G09" to be honest. How about we say, "if there are collisions, add more letters of the first author's surname"?
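As a sketch of what "add more letters on collision" could look like in practice (the function name `unique_model_id` and the `existing_ids` argument are hypothetical, just to illustrate the rule):

```python
def unique_model_id(name, num_pops, author_surname, year, existing_ids):
    """Use the shortest surname prefix that does not collide with existing IDs."""
    for n_letters in range(1, len(author_surname) + 1):
        prefix = author_surname[:n_letters]
        candidate = f"{name}_{num_pops}{prefix}{year % 100:02d}"
        if candidate not in existing_ids:
            return candidate
    raise ValueError("no unique ID found; tweak the descriptive name instead")


existing = {"OutOfAfrica_3G09"}
print(unique_model_id("OutOfAfrica", 3, "Gutenkunst", 2009, existing))
# OutOfAfrica_3Gu09
```

Note that the result depends on which IDs already exist, so the ID a model ends up with depends on the order in which models are added.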
Also, going back to the six-character species names, can we make those camelCase (or CamelCase) for readability?
There's a good argument for making the species ids CamelCase for consistency all right if we go with this: demographic models:
Genetic maps:
Sorta works, doesn't it?
I've implemented this proposal in #310. Still open for discussion, please vent any/all opinions here!
@jeromekelleher I think this looks good. I'm strongly in favor of including the author initial as you have here.
All sounds good. I was just thinking that to me Gu hints at Gutenkunst, while G = ?? But adding more as needed to avoid collisions is fine.
@reedacartwright, any thoughts here?
I like the proposal in #310, and I'm willing to go one step further: Remove the population number from the ID unless it is to avoid model collisions (say when a paper has multiple models).
This would give us the IDs OutOfAfrica_G09 and OutOfAfrica_T12.
Thoughts?
I would vote to leave this in, otherwise when you do have collisions you get some models with the number and one without and this is determined by the order in which they are added. I don't think we want our names affected by historical stuff like this.
I'm with @dschride here --- we already have multiple models from the same paper and it's anyway useful information being communicated.
We've agreed to go ahead with this scheme. I'm going to leave this open until we have documented the ID scheme in the developer docs.
going to assign myself to this. i'll take a stab at updating the docs today
After #110 was merged we have an ID scheme for population models. The idea is to have a unique (within species), human-memorable string that is used as the shorthand way of finding models in the CLI and from species.get_model() (we can always add other ways of querying based on other properties later). This is leading on from #85.
What we have at the moment is:
These are not consistent, well thought through, or anything like that; it's just a starting point.
The catalog and online CLI help should be enough to explain what these mean, so I think it's OK to have reasonably terse IDs.
Any ideas on how to do this well would be great!