jeromekelleher / sc2ts

Infer a succinct tree sequence from SARS-COV-2 variation data
MIT License

"Recombinant" or "Recombination node" #115

Closed: hyanwong closed this issue 1 year ago

hyanwong commented 1 year ago

I'm going through the codebase and the manuscript, and I see we use the word "recombinant" in lots of places (e.g. class names, variable names). I suggest that most people will think of a recombinant as a sample node that has recombination in its ancestry, e.g. "I've been infected by a recombinant (virus)". So I wonder if we want to change the terminology in the code to refer to the nodes in which recombination events actually took place as "recombination nodes" instead? E.g. instead of the Recombinant class, we should have a RecombinationNode class.

It's a bit of a large change, but should be reasonably easy to make with auto search-and-replace I think. Thoughts?

szhan commented 1 year ago

I have a (very) slight preference for recombinant nodes, because these nodes represent recombinant sequences (sampled or reconstructed) or products of recombination events.

hyanwong commented 1 year ago

> I have a (very) slight preference for recombinant nodes, because these nodes represent recombinant sequences (sampled or reconstructed) or products of recombination events.

Isn't a "recombinant node" simply a sample that is the product of a recombination? Since a sample is a node in itself?

jeromekelleher commented 1 year ago

I don't think adding the suffix "node" clarifies anything here. A Recombinant is something that is the product of a recent recombination, i.e., something that we have inferred requires more than one parent in its LS copying path. If we reserve "recombinant" to mean a sequence that has recombination anywhere in its history, then pretty soon all sequences will be recombinants and the term isn't very useful.
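As a minimal sketch of this definition (not the actual sc2ts code): assume a copying path is a list of hypothetical `PathSegment` records, each saying which parent is copied over which interval. Under that assumption, a sequence counts as a recombinant when its path uses more than one parent. The class name, field names, and coordinates are all illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PathSegment:
    """Hypothetical record: copy from `parent` over the interval [left, right)."""

    left: int
    right: int
    parent: int


def is_recombinant(path):
    """Return True if the copying path uses more than one distinct parent."""
    return len({segment.parent for segment in path}) > 1


# Illustrative coordinates and node IDs only.
single_parent = [PathSegment(0, 100, parent=10)]
two_parents = [PathSegment(0, 60, parent=10), PathSegment(60, 100, parent=17)]

print(is_recombinant(single_parent), is_recombinant(two_parents))
```

Note that this check only looks at the immediate copying path, which matches the "recent recombination" reading: a sequence that copies from a single parent is not flagged, even if that parent is itself a recombinant.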

hyanwong commented 1 year ago

I'm still uneasy about this, TBH. I still associate "recombinant" with an actual sample, and "recombination node" with an inferred event in the ARG. E.g. in this plot:

[Image: image_720]

I think of the black node as a "recombination node", not a "recombinant". I think of the immediate children of the black node (especially the samples) as the recombinants instead. I also suspect that lots of researchers would think of all the purple and blue nodes as "recombinant viruses"? But I take your point about everything being a recombinant eventually, and I suspect that it's usually just reserved for recent descendants of a recombination event.

Could we see what other people on the project think, perhaps?

jeromekelleher commented 1 year ago

We're splitting hairs here: the recombination node represents the original recombinant sequence. The children of a recombinant aren't necessarily recombinants; they may be later descendants of that sequence. We only insert recombination nodes as a convenience, to make it easy to find what we think are unique recombinants. We may have multiple sequences from the same day which have the same copying path, and we therefore "path compress" these together into a single node, representing the unique recombinant origin.

jeromekelleher commented 1 year ago

There is also the distinction of the "causal" sequences for a recombinant/recombination node. These are the sequences that required a copying path with more than one parent, resulting in the creation of the recombination node, which represents the original recombinant sequence.
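The relationship between path compression and "causal" sequences can be sketched roughly as follows. This is an illustration under stated assumptions, not the sc2ts implementation: `group_recombinants` and the `(sample_id, date, path)` tuple layout are hypothetical, and a path is taken to be a tuple of `(left, right, parent)` segments. Samples from the same day with the same multi-parent path are grouped under one key, which stands in for a single recombination node. The grouped samples are that node's causal sequences.

```python
from collections import defaultdict


def group_recombinants(samples):
    """Group samples that share a date and a multi-parent copying path.

    `samples` is a hypothetical list of (sample_id, date, path) tuples, where
    `path` is a tuple of (left, right, parent) segments. Returns a dict mapping
    (date, path) to the list of sample IDs; each key stands in for one
    recombination node and each value lists its "causal" sequences.
    """
    groups = defaultdict(list)
    for sample_id, date, path in samples:
        parents = {parent for _, _, parent in path}
        if len(parents) > 1:  # more than one parent: treated as a recombinant
            groups[(date, path)].append(sample_id)
    return dict(groups)


# Illustrative data only.
two_parent_path = ((0, 50, 3), (50, 100, 7))
samples = [
    ("s1", "2021-06-01", two_parent_path),
    ("s2", "2021-06-01", two_parent_path),
    ("s3", "2021-06-01", ((0, 100, 3),)),  # single parent: not grouped
    ("s4", "2021-06-02", two_parent_path),  # same path, different day
]
nodes = group_recombinants(samples)
for (date, path), causal in nodes.items():
    print(date, path, causal)
```

In this toy data, `s1` and `s2` collapse into one group while `s4` gets its own, because it was sampled on a different day.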

jeromekelleher commented 1 year ago

Also, talking about "recombination nodes" is confusing because it ties to that specific ARG. We want to be able to refer to the original recombinant sequence that was the result of recombination between two strains within a host. Using "node" to refer to this sequence is misleading.

hyanwong commented 1 year ago

> Also, talking about "recombination nodes" is confusing because it ties to that specific ARG. We want to be able to refer to the original recombinant sequence that was the result of recombination between two strains within a host. Using "node" to refer to this sequence is misleading.

I agree that the "recombination node" is tied to a specific ARG, and a recombinant is a sequence that results directly from a recombination. But I often want to refer to the specific (essentially, inferred) node in the specific ARG when e.g. describing a plot like the one above. I think labelling the black dot a "recombination node" here is less confusing, as it is deliberately tied to this particular ARG.

So maybe we just need to be careful to use "recombinant" to mean the result of a recombination event, and "recombination node" to mean a specific (inferred, and possibly wrong) node in one of our two ARGs.