tskit-dev / msprime-1.0-paper

Publication describing msprime 1.0

Finalise tree sequence definition section #142

Closed jeromekelleher closed 3 years ago

jeromekelleher commented 3 years ago

#134 and #138 tried to clarify the tree sequence definition section, which I tried to merge into #141. In particular, this added the idea of a "ploid", which we explain before going on to talk about nodes and edges. I think this is the right approach, but the current discussion of ploids is half-hearted.

Anyone want to make a pass (@sgravel @petrelharp @andrewkern)?

andrewkern commented 3 years ago

happy to take a pass but i'm not a fan of this ploid language! maybe someone else should go before me.

jeromekelleher commented 3 years ago

We need a word to define this thing - haplotype is no good because it implies allelic state. Open to other ideas!

gregorgorjanc commented 3 years ago

I think we already have a nice word: genome, a set of chromosomes an individual inherits from one parent.

haploid = one genome
diploid = two genomes
tetraploid = four genomes
etc.

So nodes in tree sequence are genomes and individuals are a collection of genomes. With this in mind I would talk about a "formation time" instead of a "birth time" (line 321).

An argument against "genome" could be that we sometimes use "genome" to denote a reference genome, but we know that each individual's genomes are quite unique anyway so there is no "one genome".

andrewkern commented 3 years ago

> So nodes in tree sequence are genomes and individuals are a collection of genomes. With this in mind I would talk about a "formation time" instead of a "birth time" (line 321).

true but while it is possible to simulate more than one unlinked chromosome using msprime, i feel like this is more confusing than it has to be. we should be aiming this part of the paper at the newcomer to tree sequences, so why not simply call nodes chromosomes?

sgravel commented 3 years ago

I think all these words have other meanings, and the overloading of meanings can become confusing. A "genome" to me refers primarily to either the shared genetic background of a species ("The human genome"), or to the full genetic material carried by an individual. I would typically talk about "a haploid copy of the genome" to disambiguate, but this gets cumbersome. "Chromosome" of course also has a distinct meaning, such that "How many chromosomes does a human individual have?" becomes ambiguous. "Haplotype" has the problems highlighted by Jerome. People also use gamete for this, which is also overloaded.

To be fair, the meaning is usually clear in the context, but I feel that we would benefit from having a word that specifically refers to one "haploid copy of the genome". I have grown fond of ploid, but I can see people not wanting to create more confusion by adding a new term with a stupid etymology.

petrelharp commented 3 years ago

Here's my previous best attempt at disambiguating the situation: https://tskit.dev/tskit/docs/stable/data-model.html#sec-nodes-or-individuals I also like "ploid" but am not really in favor of introducing it into this paper - this doesn't feel like an "introduce new terminology" paper.

I'm happy to take a pass at this, but there's enough cooks that I won't do it unless asked.

jeromekelleher commented 3 years ago

OK, sounds like "genome" is the least disliked option on balance. I'll update the text with an adaptation of Peter's version from the docs, and we can iterate from there. I think "genome" is fine once we are clear about what we mean by a genome up front (although I agree with @sgravel's point above).

jeromekelleher commented 3 years ago

I've updated the section to talk about "genomes" in #145. Does someone else volunteer to take a pass at it now, or maybe read through and comment on what they think? (Could move to Overleaf for this, since we should be converging?)

gregorgorjanc commented 3 years ago

Happy to do a pass on Overleaf - to converge ...

gregorgorjanc commented 3 years ago

Done. Reads well I think. I provided some tiny edits. I would also like to note that genome appears all through the manuscript and I think it reads/links well across the places.

jeromekelleher commented 3 years ago

Excellent, thanks @gregorgorjanc, changes committed

jeromekelleher commented 3 years ago

@petrelharp @sgravel @andrewkern - how are you feeling about this section now?

petrelharp commented 3 years ago

Looks good - I even tried to make some adjustments, but then couldn't actually think of any improvements!

jeromekelleher commented 3 years ago

No complaints here, so closing.