tskit-dev / msprime

Simulate genealogical trees and genomic sequence data using population genetic models
GNU General Public License v3.0

[Docs] Update front matter #993

Closed jeromekelleher closed 3 years ago

jeromekelleher commented 4 years ago

The introduction ("reimplementation of ms") needs to be modified; maybe also think of what else should be there, as a landing page?

Once we have the right description, make sure this is propagated to the README.md, README.rst etc so that it shows up in the PyPI page as well.

jeromekelleher commented 3 years ago

Partially done in recent changes. New short description is "Simulate genealogical trees and genomic sequence data using retrospective population genetic models."

Technically I guess the mutation models aren't particularly retrospective :shrug: Hard to get across the idea of simulating trees and mutations. I guess we could drop the "retrospective" without loss of much info.

@tskit-dev/all Any thoughts on what the quick tagline for msprime should be?

hyanwong commented 3 years ago

To me "genealogical" means pedigrees. I have to have an extra word in there, like "gene genealogies" to trip the appropriate switch in my head.

I guess if all the people looking at this are popgen folk, then maybe that's not such an issue? But e.g. animal scientists might be expecting a pedigree generator?

agladstein commented 3 years ago

"retrospective" confuses me. I think "does that mean backwards?". We don't want the word "coalescent" in there somewhere? What about "genealogical trees of population history and resulting genomic sequence data"?

castedo commented 3 years ago

I'm betting that the motivation to mention retrospective and/or coalescent is to help visitors understand what makes this particular simulator special. I'm thinking there are four notable properties that make msprime awesome:

Do folks have an opinion on which of those four (or more) are the most important benefits to highlight? I'm guessing that the first 3 are the most important (and the coalescent grounding is good to mention secondarily)

Regarding WHAT gets simulated, is:

accurate as the two most important things getting simulated?

Interestingly, I never used the term "tree" or "genealogy" in above.

petrelharp commented 3 years ago

I vote for "Simulate genealogical trees and genomic sequence data using coalescent population genetic models." But I know some of us differ in the precise meaning of some of those words - "retrospective" is fine with me also.

The goal here is to be descriptive and not wrong - we can't be precise and still be understandable, because we don't have a commonly understood term in English that means just what we want, AFAIK. So, while it's true that "genealogical tree" can also mean other things, it's not wrong (we simulate trees that are part of the genealogy), and people who come looking for a pedigree simulator will figure out what we're actually doing very quickly.

jeromekelleher commented 3 years ago

I'm with @petrelharp - we can't be both precise and understandable to non-experts. I'm still a bit queasy about "coalescent" because we are also simulating mutations, which have nothing to do with the coalescent. So, what if we drop it, like,

Simulate genealogical trees and genomic sequence data using population genetic models.

jeromekelleher commented 3 years ago

Good points above @castedo, but I think the goal here is to state what msprime does, rather than "what is msprime better at than other things". I think the job of convincing someone to use msprime vs other simulators is done elsewhere.

mmatschiner commented 3 years ago

Suggesting a small modification so that it sounds less like a command: Simulating genealogical trees and genomic sequence data using population genetic models.

bhaller commented 3 years ago

> I'm with @petrelharp - we can't be both precise and understandable to non-experts. I'm still a bit queasy about "coalescent" because we are also simulating mutations, which have nothing to do with the coalescent. So, what if we drop it, like,
>
> Simulate genealogical trees and genomic sequence data using population genetic models.

But then that description would pretty much apply to SLiM, too, right? It's too generic, and doesn't make clear what is different about msprime. The fact that it's backwards-in-time is important and needs to be in there. How about "backwards-in-time" rather than "coalescent" or "retrospective"?

jeromekelleher commented 3 years ago

Good discussion, thanks all!

@bhaller, yes, it would also apply to SLiM or any other simulator, more or less. I guess I'm thinking about someone random who drops in on the package/GitHub repo and wants to know what the package does. Follow-up sentences can go into more detail about how it does these things and how it relates to other tools, but I'd like Generic GitHub User to have an idea of what the package is for by reading the first sentence.

I guess you could see it as a filtering process - first sentence gets rid of anyone who isn't interested in simulating DNA data or ancestral histories.

hyanwong commented 3 years ago

> > I'm still a bit queasy about "coalescent" because we are also simulating mutations, which have nothing to do with the coalescent. So, what if we drop it, like,
> >
> > Simulate genealogical trees and genomic sequence data using population genetic models.
>
> But then that description would pretty much apply to SLiM, too, right? It's too generic, and doesn't make clear what is different about msprime. The fact that it's backwards-in-time is important and needs to be in there. How about "backwards-in-time" rather than "coalescent" or "retrospective"?

Could we say "coalescent-based population genetic models"? Most of the demographic models use coalescence theory, right? But it's just that they aren't "the coalescent"? Perhaps not the selection ones, I guess?

hyanwong commented 3 years ago

> Suggesting a small modification so that it sounds less like a command: Simulating genealogical trees and genomic sequence data using population genetic models.

I actually prefer "simulate" - it's more direct and easier to read. You could also read it as "You can use this to ... simulate etc etc.". When I see "Simulating XXX" I feel it should be followed by a reason, e.g.

Simulating genealogical trees and genomic sequence data using population genetic models, for the greater good.

😀

castedo commented 3 years ago

Makes sense to me what @jeromekelleher said about filtering and focusing on the "what" rather than the "why-different-or-better". So filtering down to the "what" and throwing in something very different here:

"Simulate descent of DNA sequence mutations from common population ancestors"

I'm throwing that out there just to mix in something very different. The small variants of the original tag line sound fine to me.

jeromekelleher commented 3 years ago

We're forgetting about mutations here - that's why I don't want to say "coalescent" or whatever. We have really powerful mutation generation abilities, these shouldn't be an afterthought.

hyanwong commented 3 years ago

> We're forgetting about mutations here - that's why I don't want to say "coalescent" or whatever. We have really powerful mutation generation abilities, these shouldn't be an afterthought.

Is that an argument for explicitly stating this? So, for example, someone with some SLiM tree sequences would realise that they can go to msprime to overlay mutations. E.g.

"Simulate genealogical trees and genomic sequence data via coalescent-based population genetic models and flexible mutation models"

or does that make it too long for this purpose?

petrelharp commented 3 years ago

We say "DNA sequence data", which implies there's got to be mutations in there somewhere.

benjeffery commented 3 years ago

+1 for "Simulate genealogical trees and genomic sequence data using population genetic models."

I considered "Simulate genealogical trees and their genomic sequence data using population genetic models." to highlight the relationship, but it is already quite wordy.

jeromekelleher commented 3 years ago

I'm going to go with "Simulate genealogical trees and genomic sequence data using population genetic models." I'll clarify things as appropriate in the various contexts, but I think this is about as good as we'll do in 10 words (which is as much as is useful - this is going to be displayed as the short description in lots of limited-space contexts).

Thanks for the input, all!