Closed jeromekelleher closed 3 years ago
happy to take a pass but i'm not a fan of this ploid language! maybe someone else should go before me.
We need a word to define this thing - haplotype is no good because it implies allelic state. Open to other ideas!
I think we already have a nice word: genome, a set of chromosomes an individual inherits from one parent.
haploid = one genome, diploid = two genomes, tetraploid = four genomes, etc.
So nodes in tree sequence are genomes and individuals are a collection of genomes. With this in mind I would talk about a "formation time" instead of a "birth time" (line 321).
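To make the "nodes are genomes, individuals are collections of genomes" picture concrete, here is a minimal sketch in plain Python. It is not the tskit API: the `Node` and `Individual` classes and the `add_individual` helper are hypothetical names made up for illustration, and the only thing it is meant to show is that ploidy is just the number of genomes (nodes) attached to an individual.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One genome: the set of chromosomes inherited from one parent (hypothetical class)."""
    time: float
    individual: Optional[int] = None  # index of the owning individual, if any


@dataclass
class Individual:
    """An organism, stored as a collection of genome (node) indices (hypothetical class)."""
    nodes: List[int] = field(default_factory=list)


def add_individual(individuals, nodes, ploidy, time=0.0):
    """Append one individual and `ploidy` genomes that point back to it."""
    ind_id = len(individuals)
    individuals.append(Individual())
    for _ in range(ploidy):
        nodes.append(Node(time=time, individual=ind_id))
        individuals[ind_id].nodes.append(len(nodes) - 1)
    return ind_id


individuals, nodes = [], []
haploid = add_individual(individuals, nodes, ploidy=1)
diploid = add_individual(individuals, nodes, ploidy=2)
tetraploid = add_individual(individuals, nodes, ploidy=4)

# Number of genomes carried by each individual
genome_counts = [len(ind.nodes) for ind in individuals]
print(genome_counts)
```

In this sketch the "time" lives on the genome rather than on the individual, which is why a "formation time" for a node reads more naturally than a "birth time".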
An argument against "genome" could be that we sometimes use "genome" to denote a reference genome, but we know that each individual's genomes are unique anyway, so there is no "one genome".
So nodes in tree sequence are genomes and individuals are a collection of genomes. With this in mind I would talk about a "formation time" instead of a "birth time" (line 321).
true, but while it is possible to simulate more than one unlinked chromosome using msprime, i feel like this is more confusing than it has to be. we should be aiming this part of the paper at the newcomer to tree sequences, so why not simply call nodes chromosomes?
I think all these words have other meanings, and the overloading of meanings can become confusing. A "genome" to me refers primarily to either the shared genetic background of a species ("The human genome"), or to the full genetic material carried by an individual. I would typically talk about "a haploid copy of the genome" to disambiguate, but this gets cumbersome. "Chromosome" of course also has a distinct meaning, such that "How many chromosomes does a human individual have?" becomes ambiguous. "Haplotype" has the problems highlighted by Jerome. People also use gamete for this, which is also overloaded.
To be fair, the meaning is usually clear in the context, but I feel that we would benefit from having a word that specifically refers to one "haploid copy of the genome". I have grown fond of ploid, but I can see people not wanting to create more confusion by adding a new term with a stupid etymology.
Here's my previous best attempt at disambiguating the situation: https://tskit.dev/tskit/docs/stable/data-model.html#sec-nodes-or-individuals I also like "ploid" but am not really in favor of introducing it into this paper - this doesn't feel like an "introduce new terminology" paper.
I'm happy to take a pass at this, but there are enough cooks that I won't do it unless asked.
OK, sounds like "genome" is the least disliked option on balance. I'll update the text with an adaptation of Peter's version from the docs, and we can iterate from there. I think "genome" is fine as long as we are clear up front about what we mean by a genome (although I agree with @sgravel's point above).
I've updated the section to talk about "genomes" in #145. Does someone else volunteer to take a pass at it now, or maybe read through and comment on what they think? (Could move to overleaf for this, since we should be converging?)
Happy to do a pass on Overleaf - to converge ...
Done. Reads well I think. I provided some tiny edits. I would also like to note that genome appears all through the manuscript and I think it reads/links well across the places.
Excellent, thanks @gregorgorjanc, changes committed.
@petrelharp @sgravel @andrewkern - how are you feeling about this section now?
Looks good - I even tried to make some adjustments, but then couldn't actually think of any improvements!
No complaints here, so closing.
#134 and #138 tried to clarify the tree sequence definition section, which I tried to merge into #141. In particular, this added the idea of a "ploid", which we explain before going on to talk about nodes and edges. I think this is the right approach, but the current discussion of ploids is half-hearted.
Anyone want to make a pass (@sgravel @petrelharp @andrewkern)?