kfuku52 / csubst

Molecular convergence detection
BSD 3-Clause "New" or "Revised" License

Interpreting w_C = Inf values #34

Closed aleungplants closed 1 year ago

aleungplants commented 1 year ago

I am getting w_C = Inf in numerous branch pairs. So, dSC is zero. Are these real results? I don’t think I saw any values in the CSUBST paper that were infinite.

What I am understanding from the paper is that OCN is summed across sites at the branch level. At the site level, would it be correct to interpret OCN values as the probability of a non-synonymous substitution at both branches? Is it a posterior probability since the ancestral state is estimated?

Thank you for your help, and I am grateful for your time.

kfuku52 commented 1 year ago

Thank you for using CSUBST!

> I am getting w_C = Inf in numerous branch pairs. So, dSC is zero. Are these real results? I don’t think I saw any values in the CSUBST paper that were infinite.

It's hard to say without checking actual data, but I think those are real estimates rather than a bug. ωC=Inf is often observed in branch combinations involving one or more branches where only a small number of substitutions happened. Such branch combinations are usually excluded by an OCN cutoff, which we recommended applying together with an ωC cutoff in the paper.
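To make the arithmetic concrete, here is a minimal Python sketch of how an infinite ratio arises and how the two cutoffs interact. It assumes only what is stated in this thread, namely that ωC is the ratio of dNC to dSC and that it becomes Inf when dSC is zero. The helper names, the example branch pairs, and the cutoff values are hypothetical placeholders, not CSUBST's implementation or its recommended defaults.

```python
import math

def omega_c(d_nc, d_sc):
    """Return dNC / dSC (hypothetical helper, not CSUBST's code).

    A zero denominator gives inf when dNC is positive, and nan when
    both rates are zero, mirroring the Inf values discussed above.
    """
    if d_sc == 0:
        return math.inf if d_nc > 0 else math.nan
    return d_nc / d_sc

def passes_cutoffs(ocn, omega, min_ocn=1.0, min_omega=1.0):
    """Apply an OCN cutoff together with an omega_C cutoff.

    The thresholds are placeholder values chosen for illustration.
    """
    return ocn > min_ocn and omega > min_omega

# Made-up branch pairs: (label, OCN, dNC, dSC)
branch_pairs = [
    ("A-B", 0.4, 2.0, 0.0),  # few substitutions: omega_C = inf, low OCN
    ("A-C", 3.2, 6.0, 1.5),  # finite omega_C
    ("B-C", 2.5, 4.0, 0.0),  # omega_C = inf, but OCN is above the cutoff
]

selected = [
    label
    for label, ocn, d_nc, d_sc in branch_pairs
    if passes_cutoffs(ocn, omega_c(d_nc, d_sc))
]
print(selected)
```

In this toy data the OCN cutoff drops the low-support pair with ωC=Inf, while a pair with ωC=Inf and enough nonsynonymous convergence still passes, so infinite values can legitimately remain after filtering.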

> What I am understanding from the paper is that OCN is summed across sites at the branch level. At the site level, would it be correct to interpret OCN values as the probability of a non-synonymous substitution at both branches? Is it a posterior probability since the ancestral state is estimated?

Yes, OCN is a summed probability of combinatorial substitutions. I just realized that we might have confused you because we also used the label OCN in csubst_site.tsv, which provides site-wise probabilities. That labeling wasn't quite right: OCN in csubst_site.tsv is the site-wise combinatorial substitution probability, which is referred to as Pl(SC|D,θ) in the paper.
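The relationship between the site-wise values and the branch-pair OCN can be sketched as a plain sum over sites. The probabilities below are made up for illustration, and the function name is hypothetical; this is not how CSUBST reads or stores its tables.

```python
import math

def branch_pair_ocn(site_probabilities):
    """Sum site-wise combinatorial substitution probabilities.

    Hypothetical helper: each input value is a per-site probability
    for one branch pair, and the sum is the branch-pair-level OCN.
    """
    for p in site_probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"not a probability: {p}")
    return math.fsum(site_probabilities)

# Made-up site-wise probabilities for a single branch pair
site_probs = [0.02, 0.91, 0.00, 0.47, 0.10]

ocn = branch_pair_ocn(site_probs)
print(f"OCN for this branch pair: {ocn:.2f}")
```

Because the branch-pair value is a sum of probabilities, it can exceed 1 even though every site-wise value stays between 0 and 1.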

aleungplants commented 1 year ago

Thank you for the clarification! I was concerned that I was using the software incorrectly somehow, but it now appears to make sense in the context of my data. Right now I'm using an OCN > 1 cutoff, so you're right that I'm looking at branches with very few substitutions.

Background info on this gene: relative to the size of the gene, there are very few substitutions in general (though it is under strong positive selection; ω is about 10 at most of them).

So it makes sense that I'm getting ωC=Inf.

kfuku52 commented 1 year ago

I see. Under an extremely high ω, many branch combinations may have near-zero synonymous convergence, which leads to ωC=Inf.