tskit-dev / what-is-an-arg-paper

Manuscript and code for the "What is an ARG?" paper

Place extensive tskit history section into tskit paper #308

Closed hyanwong closed 1 year ago

hyanwong commented 1 year ago

This feels a bit out of place (especially the "In retrospect" para), given that we are not trying to "sell" tskit in this paper. We can simply write a couple of sentences with a forward ref to "in prep." or whatever.

I worry that having a lot of tskit marketing will put off users of other ARG software.

jeromekelleher commented 1 year ago

Which bit is this? The main reason for going into tskit history was to explain the source of the "tree sequence" terminology, which is a persistent source of confusion that needs to be addressed.

hyanwong commented 1 year ago

Here's what we currently have, in case we want to re-use any in the tskit paper:

The gARG encoding discussed and concretely defined in this manuscript leads to highly efficient storage and processing of ARG data, and has already been in use for several years. The succinct tree sequence data structure (usually known as a "tree sequence" for brevity, but see below for confusion around this point) is a concrete encoding of a gARG, as discussed in this article. It was originally developed as part of the \texttt{msprime} simulator \citep{kelleher2016efficient} and has subsequently been extended and applied to forward-time simulations \citep{kelleher2018efficient,haller2018tree} and to the calculation of population genetics statistics \citep{ralph2020efficiently}. The succinct tree sequence encoding extends the basic definition of a gARG provided here by stipulating a simple tabular representation of nodes and edges, and also by defining a concise and lossless representation of sequence variation using the "site" and "mutation" tables. The \texttt{tskit} library \citep{tskit2023tskit} is a liberally licensed open source toolkit that provides a comprehensive suite of tools for working with gARGs (encoded as a succinct tree sequence). Based on core functionality written in C, it provides interfaces in C, Python and Rust. Tskit is mature software, widely used in population genetics, and has been incorporated into several downstream applications \citep[e.g.,][]{haller2019slim,speidel2019method,adrion2020community,terasaki2021geonomics,baumdicker2021efficient,fan2022genealogical,korfmann2022weak,mahmoudi2022bayesian,petr2022slendr,rasmussen2022espalier,zhang2023biobank,nowbandegani2023extremely}.

% FIXME this needs more work and doesn't really fit into the
% narrative right now, but we do need to make this point.
The key insight that makes the succinct tree sequence encoding an efficient substrate for defining analysis algorithms is that it allows us to generate the local trees along the genome in a way that allows us to reason about the \emph{differences} between those trees. Sequentially generating the marginal trees along the genome is also fundamental, and is necessary whenever we need to perform calculations that are contingent on more than just the isolated properties of an edge. \cite{kelleher2016efficient} showed how all trees can be sequentially generated in constant time per tree transition in a fully simplified gARG. Furthermore, we can easily reason about how tree topologies change (and stay the same), leading to efficient algorithms for computing population genetic statistics \citep{kelleher2016efficient,ralph2020efficiently}, implementing the Li and Stephens model \citep{kelleher2019inferring,wohns2022unified} and likely many more.
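To make the "differences between trees" idea concrete, here is a minimal Python sketch. It is not tskit's implementation or API: the `Edge` tuple and the `local_trees` function are hypothetical names made up for illustration, and the edge table is reduced to four columns (left, right, parent, child). The sketch assumes every edge covers a half-open interval `[left, right)` within the sequence. It walks the genome left to right and, at each breakpoint, removes the edges that end there and inserts the edges that start there, so each local tree is obtained by patching the previous one.

```python
from collections import namedtuple

# Hypothetical minimal edge row (illustration only, not the tskit API):
# `child` inherits from `parent` over the half-open interval [left, right).
Edge = namedtuple("Edge", ["left", "right", "parent", "child"])


def local_trees(edges, sequence_length):
    """Yield (left, right, parent_map, edges_out, edges_in) per local tree."""
    by_left = sorted(edges, key=lambda e: e.left)    # insertion order
    by_right = sorted(edges, key=lambda e: e.right)  # removal order
    parent = {}  # child -> parent in the current local tree
    i = j = 0
    left = 0
    while left < sequence_length:
        # Remove edges whose interval ends at this breakpoint.
        edges_out = []
        while j < len(by_right) and by_right[j].right == left:
            edge = by_right[j]
            del parent[edge.child]
            edges_out.append(edge)
            j += 1
        # Insert edges whose interval starts at this breakpoint.
        edges_in = []
        while i < len(by_left) and by_left[i].left == left:
            edge = by_left[i]
            parent[edge.child] = edge.parent
            edges_in.append(edge)
            i += 1
        # The current tree is valid until the next insertion or removal.
        right = sequence_length
        if i < len(by_left):
            right = min(right, by_left[i].left)
        if j < len(by_right):
            right = min(right, by_right[j].right)
        # The copy is for convenience here; it costs time proportional to
        # the tree size, so a real implementation would update in place.
        yield left, right, dict(parent), edges_out, edges_in
        left = right


# Toy example: samples 0, 1, 2 on a genome of length 10. Node 3 joins
# samples 0 and 1 everywhere; the root is node 4 on [0, 5), node 5 on [5, 10).
edges = [
    Edge(0, 10, 3, 0),
    Edge(0, 10, 3, 1),
    Edge(0, 5, 4, 3),
    Edge(0, 5, 4, 2),
    Edge(5, 10, 5, 3),
    Edge(5, 10, 5, 2),
]
trees = list(local_trees(edges, 10))
for left, right, parent_map, edges_out, edges_in in trees:
    print(left, right, parent_map, len(edges_out), len(edges_in))
```

In this toy example the two edges shared by both trees (into node 3) are stored once and never touched at the breakpoint; only the two edges into the old root are removed and the two into the new root are added.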

In retrospect, the choice of terminology around the succinct tree sequence data structure was unfortunate, and has led to significant confusion. As first introduced by \cite{kelleher2016efficient}, a "tree sequence" was defined as the set of "coalescence records" that is output by Hudson's simulation algorithm. This was then generalised to use the tabular encoding described above and formally described as a "succinct tree sequence" \citep{kelleher2018efficient}. The "succinct" prefix here is intended to connect with the concept of succinct data structures \citep[e.g.,][]{gog2014theory}, which have near-optimal space usage but still support efficient retrieval and computation. The concrete data structures of the succinct tree sequence, precisely defined and with a particular focus on computational efficiency, did not have an obvious connection to the ARG definitions provided by Griffiths and colleagues \citep{griffiths1991two,ethier1990two,griffiths1996ancestral,griffiths1997ancestral}, and with the lack of any other precise definitions, it seemed best to avoid the term to prevent confusion. Unfortunately, however, the opposite has occurred. In describing the output of \tsinfer\ as a ``tree sequence'' \citep{kelleher2019inferring} many have---not unreasonably, but incorrectly---concluded that the output is a collection of \emph{independent}
% MORE?
% https://github.com/tskit-dev/what-is-an-arg-paper/issues/38
trees \citep[e.g.,][]{hejase2020summary,ignatieva2021kwarg}.
% Could say more here, but maybe we'll have a note about terminology
% in the discussion

hyanwong commented 1 year ago

Which bit is this? The main reason for going into tskit history was to explain the source of the "tree sequence" terminology, which is a persistent source of confusion that needs to be addressed

Yep, I think it just needs one or two sentences though. We could even put it in the "Equivalence of ARGs and the set of local trees" part?

jeromekelleher commented 1 year ago

We need to discuss and explain tskit some amount, otherwise we'll surely just confuse people even more? There's only three paragraphs here. I agree we can tone things down a bit, but we have to give some concrete description of what tskit is and how it relates to gARGs or we're going to be dealing with even more confused people.

hyanwong commented 1 year ago

We need to discuss and explain tskit some amount, otherwise we'll surely just confuse people even more?

Yep, I agree.

There's only three paragraphs here. I agree we can tone things down a bit,

That's what I think would be helpful. 3 paras is still quite a lot, in my mind (and I think @a-ignatieva and Jere agree)

jeromekelleher commented 1 year ago

Reading it again, I'm not sure what it is you want to cut here? We have to explain the background here, people will come away thinking that tskit doesn't do ARGs unless we are totally explicit about it.

jeromekelleher commented 1 year ago

I dropped the "in retrospect" para; hopefully we won't need to say that stuff as the rest of the paper will make everything totally clear and obvious. The rest will need a bit of updating as well.