tskit-dev / what-is-an-arg-paper

Manuscript and code for the "What is an ARG?" paper

Collect incorrect statements about tskit/tsinfer in the published record #38

Closed jeromekelleher closed 1 year ago

jeromekelleher commented 2 years ago

It's important to try and clear up some of the confusion. Here's one:

https://doi.org/10.1016/j.tig.2019.12.008

Notably, tsinfer does not explicitly infer an ARG but rather a sequence of local gene trees, described by their topologies only.

Let's collect others from published articles and preprints. We probably won't refer to the preprints in the article, but good to collect them here anyway for reference.

jeromekelleher commented 2 years ago

We note that ARGs contain more information than local trees, but there is no obvious way of comparing ARG topologies (and tsinfer only infers local trees, rather than full ARGs)

Here's one from the KwARG paper (doi: 10.1093/bioinformatics/btab351), section 3.2.2

awohns commented 2 years ago

The most computationally efficient approach, tsinfer (8), also scales to large datasets but assumes that frequency of an allele is correlated with its age. Since this assumption is violated at loci undergoing either admixture or selection, tsinfer is not well suited for ARG inference using genetic data from Neanderthals, Denisovans, and modern humans.

From the Sarge paper https://www.science.org/doi/10.1126/sciadv.abc0776

jeromekelleher commented 2 years ago

That one takes a bit more unpicking than the others, doesn't it @awohns? Also, we already have a pretty lengthy refutation on the topic :smile:

The goal in my mind here is to have a sentence like "There is significant confusion regarding the properties of the structure inferred by tsinfer, with several incorrect statements published [cite A, B, ...]"

awohns commented 2 years ago

A promising avenue of research is developing around new methods for approximately inferring ancestral recombination graphs (ARG) (Kelleher et al., 2019; Speidel et al., 2019), which have recently been extended to incorporate non-contemporaneous sampling (Wohns et al., 2021; Speidel et al., 2021). An ARG is a data structure which contains a detailed description of the genealogical relationships in a set of samples, including the full history of gene trees, ancestral haplotypes and recombination events which relate the samples to each other at every site in the genome (Griffiths and Marjoram, 1997).

Here's a subtle one: the outputs of tsinfer and Relate are described as ARGs, and the authors then go on to specify that an ARG includes recombination events, whereas in the tree sequence we see the consequence of the event. From "Quantitative Human Paleogenetics: what can ancient DNA tell us about complex trait evolution?"
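To make "we see the consequence of the event" concrete, here is a minimal sketch in plain Python. It is not tskit or tsinfer code: the `Edge` tuple, the toy `EDGES` table, and the `edge_diff` helper are all hypothetical names invented for illustration. It assumes an edge-table style encoding in which each edge records the genomic interval over which a child inherits from a parent. No recombination node is stored anywhere; the event shows up only as the set of edges that end and begin at a breakpoint.

```python
from collections import namedtuple

# Hypothetical encoding: `child` inherits from `parent` over [left, right).
Edge = namedtuple("Edge", ["left", "right", "parent", "child"])

# Toy data (an assumption, not inferred output): samples 0-3,
# ancestors 4-6, genome [0, 10) with one breakpoint at position 5.
EDGES = [
    Edge(0, 10, 4, 0),
    Edge(0, 10, 4, 1),
    Edge(0, 10, 5, 2),
    Edge(0, 5, 5, 3),
    Edge(0, 5, 6, 4),
    Edge(0, 10, 6, 5),
    Edge(5, 10, 5, 4),
    Edge(5, 10, 6, 3),
]


def edge_diff(edges, breakpoint):
    """Return (edges_out, edges_in) at a breakpoint between two local trees."""
    edges_out = [e for e in edges if e.right == breakpoint]
    edges_in = [e for e in edges if e.left == breakpoint]
    return edges_out, edges_in


edges_out, edges_in = edge_diff(EDGES, 5)
for e in edges_out:
    print(f"removed at 5: {e.parent} -> {e.child}")
for e in edges_in:
    print(f"added at 5:   {e.parent} -> {e.child}")
```

In this toy table, nodes 3 and 4 change parent at position 5. The table records that re-parenting, but nothing in it says when or on which lineage a recombination happened.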

awohns commented 2 years ago

Here the authors state that tsinfer infers "independent gene trees":

Modern population genetic approaches take advantage of hundreds of thousands of independent gene trees (obtained from whole-genome sequencing and genome-wide SNP genotyping), gaining several orders of magnitude in statistical power (Kelleher et al., 2019).

From Human origins in Southern African palaeo-wetlands? Strong claims from weak evidence

Other papers (for instance here and here) simply say that tsinfer infers "local genealogies"

jeromekelleher commented 2 years ago

Very good - these are subtle ones aren't they? I think the first one (Irving-Pease et al) can be put down to general confusion about what an ARG is. This would be good to cite as an example of people using the term "ARG" to mean "something ARG-like". We're essentially arguing that people are using the term ARG to refer to things that are definitely not Griffiths graphs but are ARG-like, so we're providing an updated definition that does encapsulate all of the "ARG-like" things that they want to discuss.

The "independent gene trees" thing is tricky. I guess any method does infer some independent gene trees (since the ones at either end of a chromosome arm will be basically independent), but that's not really what they mean I guess. It's hard to point out "local genealogies" as being actively wrong, since it is strictly true.

jeromekelleher commented 1 year ago

I'm going to close this as done, I think. Not much point in harping on about it; it's more important to get across the point that there exists a gradation of detail that can be inferred about recomb

hyanwong commented 1 year ago

I'm currently only citing hejase2020summary and ignatieva2021kwarg. Do we want to add more? Is it weird to cite Ana here?!

a common statement in the literature is that modern tools such as \tsinfer and \relate construct not an ARG but ``only'' a sequence of local trees \citep{hejase2020summary, ignatieva2021kwarg}

a-ignatieva commented 1 year ago

Why not cite all the ones above that say local genealogies? It's a commonly stated thing so maybe good to include plenty of examples?

jeromekelleher commented 1 year ago

SGTM. So long as we're fairly neutral about how we phrase, then giving lots of examples is good, probably

a-ignatieva commented 1 year ago

"Relate simplifies the problem of ARG inference by inferring marginal coalescence trees, instead of full ARGs" Brandt et al 2022 is another we could include

a-ignatieva commented 1 year ago

FWIW I don't think many of the statements in this issue are "incorrect" (under the various definitions of ARG in each paper) :-)

jeromekelleher commented 1 year ago

Sure, "incorrect" is certainly not neutral phrasing!

We could just say something like "there is significant confusion about the properties of [things] inferred by tsinfer and Relate", and cite a bunch of papers. Basically say something very bland, and then list a bunch of papers where they imply the outputs are "Just a Bunch of Trees"

hyanwong commented 1 year ago

FWIW I don't think many of the statements in this issue are "incorrect" (under the various definitions of ARG in each paper) :-)

I think some of them are potentially misleading, though. If you say "only infers local trees" then people do think this is not inferring a graph.

They also think that the method involves some sort of local tree-by-tree reconstruction algorithm, which is not true for tsinfer.

jeromekelleher commented 1 year ago

I tend to agree with Yan, but I also see the wisdom of not getting into a bunfight over arguable forms of wording... How about we do something like this:

[A sentence recapping that tsinfer and Relate definitely do infer ARGs, capturing significant correlation structure across trees via the graph structure]. Many discussions of these methods, however, lack this nuance, suggesting rather that they infer [disconnected, independent, unrelated] local trees [A bunch of citations].

And leave it at that? Saying "suggesting" lets us be quite broad in our interpretation, while not really pointing fingers at anyone or saying they're "wrong"

a-ignatieva commented 1 year ago

Is this in addition to the first para of the example inferred ARGs section? We tried to make that paragraph make this point.

jeromekelleher commented 1 year ago

I think we should probably delete the first para of that section, it's a bit repetitive with later content, and the section is a bit long currently.

Making the point that people have missed important subtlety after we've pointed out what the subtlety is, is better than starting off with a negative point.

a-ignatieva commented 1 year ago

I guess my problem is that I don't think when people say full ARG = trees + events, and that methods X infer only the trees, they necessarily imply the trees are disconnected/unrelated/independent (indeed some papers say "correlated" local trees). Is this not worth explaining? Only takes an extra sentence. (This is with reference to your suggested wording)

jeromekelleher commented 1 year ago

Sure, happy to discuss suggestions. I do think we have to separate what people might have meant from what they actually said, though.

jeromekelleher commented 1 year ago

There's a place in the current version of s9 where we could get into this: https://github.com/tskit-dev/what-is-an-arg-paper/blob/c5db73f6b131c4ed81ce4857e12ecc9e892db727/paper.tex#L1259

But we could also just not bother. Hopefully the figure and the rest of the section make it clear that it's much more nuanced than that, and we don't really need to go pointing out where people said stuff that was wrong?

hyanwong commented 1 year ago

Maybe we just need to (re)emphasise that they are "valid" graphs (i.e. networks joined by shared nodes), e.g.

Nonetheless, both methods produce estimates in which nodes and edges persist across multiple trees, creating fully valid inheritance graphs which fit seamlessly into the gARG formulation.
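As a rough illustration of what "nodes and edges persist across multiple trees" means, here is a small stdlib-only Python sketch. It is not the tskit API and not real inferred output: `Edge`, `EDGES`, `local_tree` and `persistent_edges` are hypothetical names, and the toy table is an assumption chosen so that two local trees differ in topology while still sharing most of their edges.

```python
from collections import namedtuple

# Hypothetical encoding: `child` inherits from `parent` over [left, right).
Edge = namedtuple("Edge", ["left", "right", "parent", "child"])

# Toy data (an assumption): samples 0-3, ancestors 4-6, genome [0, 10).
EDGES = [
    Edge(0, 10, 4, 0),
    Edge(0, 10, 4, 1),
    Edge(0, 10, 5, 2),
    Edge(0, 5, 5, 3),
    Edge(0, 5, 6, 4),
    Edge(0, 10, 6, 5),
    Edge(5, 10, 5, 4),
    Edge(5, 10, 6, 3),
]


def local_tree(edges, position):
    """Return the local tree at `position` as a child -> parent mapping."""
    return {e.child: e.parent for e in edges if e.left <= position < e.right}


def persistent_edges(edges):
    """Return the edges whose interval covers more than one local tree."""
    points = sorted({e.left for e in edges} | {e.right for e in edges})
    intervals = list(zip(points[:-1], points[1:]))
    result = []
    for e in edges:
        covered = sum(1 for lo, hi in intervals if e.left <= lo and hi <= e.right)
        if covered > 1:
            result.append(e)
    return result


print(local_tree(EDGES, 0))
print(local_tree(EDGES, 5))
print(len(persistent_edges(EDGES)), "of", len(EDGES), "edges span both trees")
```

The two local trees here have different topologies, yet they are read off a single graph: every node appears in both trees, and four of the eight edges span the breakpoint. A collection of disconnected trees would have no such shared structure.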

hyanwong commented 1 year ago

(p.s. I really like Jerome's phrasing "Neither method infers explicit recombination events, and therefore their outputs cannot be described using the classical eARG formalisms (Section~\ref{sec-eARG}).")

hyanwong commented 1 year ago

Comment placed in the Overleaf doc, so closing this.