Add "SARS-Cov-2 ARGs: a case study" section at the end

jeromekelleher commented 1 year ago

Currently we only mention sc2ts in passing in the Discussion, but I think we could/should be much more explicit about how much this reinforces the points we're trying to make. So, how about we add a new section, called "SARS-Cov-2 ARGs: a case study" (or something), which goes something like this:

Quick intro to SARS-CoV-2, the scale and type of data collected, and the importance of recombination.
Quick summary of the overall contribution of the preprint (demonstrating scale, etc.).
Discuss the XA example, where we could be quite precise about the recombination event. Contrast with another recombinant, where there are probably multiple events stacked together. Perhaps we could include a version for Figure 6, and use XD as the example where multiple recombinations occur?
Emphasise the point that even though we have exquisitely detailed and complete data (by far and away the most detailed picture of the ongoing evolution of any organism) we still can't be particularly precise about the details of recombination events, and that having a model-free representation of the history that allows us to directly express the uncertainty is very useful.
Finish up on the point that the current gARG/tskit encoding forces false precision about recombination breakpoint locations. For example, in XA we can only say that the recombination fell in some region of X kb of the genome. This is a fundamental limitation because the sequence diversity simply isn't there to resolve it further. Even in this case, with perfect data, we will never know exactly where this recombination occurred, and we are simply guessing if we put any more precise location on it. A data structure that allows us to systematically reason about the uncertainty in the genome location of breakpoints, as well as their temporal ordering, would be a useful contribution.

Link to preprint for reference: https://www.biorxiv.org/content/10.1101/2023.06.08.544212v1.full.pdf

gregorgorjanc commented 1 year ago

Emphasise the point that even though we have exquisitely detailed and complete data (by far and away the most detailed picture of the ongoing evolution of any organism) we still can't be particularly precise about the details of recombination events, and that having a model-free representation of the history that allows us to directly express the uncertainty is very useful.

Finish up on the point that the current gARG/tskit encoding forces false precision about recombination breakpoint locations. For example, in XA we can only say that the recombination fell in some region of X kb of the genome. This is a fundamental limitation because the sequence diversity simply isn't there to resolve it further. Even in this case, with perfect data, we will never know exactly where this recombination occurred, and we are simply guessing if we put any more precise location on it. A data structure that allows us to systematically reason about the uncertainty in the genome location of breakpoints, as well as their temporal ordering, would be a useful contribution.

How is gARG really better in this context than eARG? I am still struggling with eARG, so this might be a silly question. But the following question is more to the point: isn't the tree sequence (a possible gARG format) edge table effectively (and precisely, but likely wrongly) calling the recombination breakpoints by saying which pieces of DNA were copied from one parent/ancestor versus the other parent(s)/ancestor(s)? Or am I missing something fundamental?
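To make the question concrete, here is a minimal sketch of how edge-table-style rows pin a breakpoint to a single coordinate. It uses a plain Python tuple type rather than the tskit API, and the node IDs, the 30000 bp sequence length and the breakpoint at 12000 are all made up for illustration.

```python
from typing import NamedTuple


class Edge(NamedTuple):
    """One row of a hypothetical edge table: child inherits [left, right) from parent."""
    left: float
    right: float
    parent: int
    child: int


def breakpoints(edges: list[Edge], child: int) -> list[float]:
    """Return the interior coordinates at which the child's parent changes."""
    # NamedTuples sort field by field, so rows are ordered by left coordinate.
    rows = sorted(e for e in edges if e.child == child)
    return [b.left for a, b in zip(rows, rows[1:]) if a.parent != b.parent]


# Recombinant node 2 copies the left part from node 0 and the rest from node 1.
# The value 12000 is an arbitrary choice: the encoding demands a single number,
# even if the data only support "somewhere in a wider region".
edges = [Edge(0, 12000, 0, 2), Edge(12000, 30000, 1, 2)]
print(breakpoints(edges, child=2))
```

Nothing in the two rows records how wide the plausible region around 12000 is, which is the false precision being asked about.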

Just to add that even in pedigrees where we have WGS information on parents and progeny across multiple generations, it is quite often hard to call the recombination position precisely, because we might not have a sufficient number of polymorphic sites around the recombination. Imagine a diploid parent with haplotypes 0-0-0-0-0-0 and 1-0-0-0-0-1, and a progeny that inherited the recombined haplotype 0-0-0-0-0-1 from this parent: there was clearly a recombination (or mutation), but where exactly it happened is impossible to call.
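The pedigree example can be sketched in a few lines of Python. This is only an illustration under strong assumptions (exactly one crossover, no mutation, the progeny copying the first haplotype on the left and the second on the right); the function name `breakpoint_interval` is hypothetical.

```python
def breakpoint_interval(hap_a, hap_b, child):
    """Return (last_a, first_b): the site indices bracketing a single crossover.

    Assumes the child copies hap_a on the left and hap_b on the right, with
    exactly one crossover and no mutation. The crossover may lie anywhere
    between site index last_a and site index first_b.
    """
    # Only sites where the parental haplotypes differ carry any information.
    informative = [i for i, (a, b) in enumerate(zip(hap_a, hap_b)) if a != b]
    from_a = [i for i in informative if child[i] == hap_a[i]]
    from_b = [i for i in informative if child[i] == hap_b[i]]
    if not from_a or not from_b:
        raise ValueError("no evidence of recombination at informative sites")
    last_a, first_b = max(from_a), min(from_b)
    if last_a > first_b:
        raise ValueError("not explained by one crossover from hap_a to hap_b")
    return last_a, first_b


hap_a = [0, 0, 0, 0, 0, 0]
hap_b = [1, 0, 0, 0, 0, 1]
child = [0, 0, 0, 0, 0, 1]
print(breakpoint_interval(hap_a, hap_b, child))  # (0, 5)
```

Only the first and last sites differ between the parental haplotypes, so the crossover can be placed anywhere between them and the four sites in the middle cannot narrow it down.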


jeromekelleher commented 1 year ago

How is gARG really better in this context than eARG?

Essentially, the eARG forces a fully precise estimate of every recombination event, whereas the gARG lets us capture the uncertainty around the time ordering of multiple events. So, in an eARG we have to posit a unique event for every recombination, and we have to sort those events by time. Polytomies in phylogenies are directly analogous: an eARG is equivalent to forcing trees to always be binary, whereas a gARG is equivalent to allowing polytomies.
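To put a rough number on the polytomy analogy: a polytomy with k labelled children can be resolved into (2k - 3)!! different rooted binary trees, so forcing a binary tree means picking one of them without evidence. The sketch below just evaluates that standard count; it is a statement about trees only, and the function name `binary_resolutions` is hypothetical.

```python
def binary_resolutions(k: int) -> int:
    """Number of rooted binary trees on k labelled children: (2k - 3)!!."""
    if k < 2:
        raise ValueError("need at least two children")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2k - 3.
    for odd in range(3, 2 * k - 2, 2):
        count *= odd
    return count


for k in (3, 4, 5):
    print(k, binary_resolutions(k))
# 3 3
# 4 15
# 5 105
```

Allowing the polytomy (or, by analogy, the gARG) records "we don't know the order" once, instead of committing to one of these arbitrary resolutions.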

The precision of the breakpoint along the genome is a separate, really good point, and one we should discuss.

gregorgorjanc commented 1 year ago

Aha, that clarifies a lot! Indeed this explanation is perfect and should be added to the discussion;)

tskit-dev / what-is-an-arg-paper

Add "SARS-Cov-2 ARGs: a case study" section at the end #305