Interpreting w_C = Inf values

aleungplants commented 1 year ago

I am getting w_C = Inf in numerous branch pairs. So, dSC is zero. Are these real results? I don’t think I saw any values in the CSUBST paper that were infinite.

What I am understanding from the paper is that OCN is summed across sites at the branch level. At the site level, would it be correct to interpret OCN values as the probability of a non-synonymous substitution at both branches? Is it a posterior probability since the ancestral state is estimated?

Thank you for your help, and I am grateful for your time.

kfuku52 commented 1 year ago

Thank you for using CSUBST!

> I am getting w_C = Inf in numerous branch pairs. So, dSC is zero. Are these real results? I don’t think I saw any values in the CSUBST paper that were infinite.

It's hard to say without checking actual data, but I think those are real estimates rather than a bug. ω_C=Inf is often observed in branch combinations involving one or more branches where only a small number of substitutions happened. Such branch combinations are usually excluded by an O_C^N cutoff, which we recommended applying together with an ω_C cutoff in the paper.
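As a rough illustration of applying the two cutoffs together, here is a minimal Python sketch. The field names (`branch_ids`, `OCN`, `omegaC`), the toy values, and the thresholds are all hypothetical placeholders, not the actual CSUBST output headers or the thresholds recommended in the paper; substitute the ones from your own table.

```python
import math

# Made-up toy rows, one per branch combination (not real CSUBST output).
rows = [
    {"branch_ids": "1,5", "OCN": 0.3, "omegaC": math.inf},  # few substitutions
    {"branch_ids": "2,7", "OCN": 4.2, "omegaC": 6.1},
    {"branch_ids": "3,9", "OCN": 2.5, "omegaC": math.inf},
    {"branch_ids": "4,8", "OCN": 5.0, "omegaC": 0.8},
]


def filter_combinations(rows, min_ocn, min_omega):
    """Keep rows that pass both the O_C^N cutoff and the omega_C cutoff.

    math.inf compares greater than any finite threshold, so an infinite
    omega_C survives the omega cutoff and is only removed when its
    O_C^N is too small.
    """
    return [
        r for r in rows
        if r["OCN"] >= min_ocn and r["omegaC"] >= min_omega
    ]


# Hypothetical thresholds chosen only for this example.
kept = filter_combinations(rows, min_ocn=1.0, min_omega=3.0)
kept_ids = [r["branch_ids"] for r in kept]
print(kept_ids)
```

In this toy data, the infinite value backed by a very small O_C^N is dropped, while the infinite value with a larger O_C^N is kept for closer inspection.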

> What I am understanding from the paper is that OCN is summed across sites at the branch level. At the site level, would it be correct to interpret OCN values as the probability of a non-synonymous substitution at both branches? Is it a posterior probability since the ancestral state is estimated?

Yes, O_C^N is a summed probability of combinatorial substitutions. I just realized that we might have confused you because we also used the name O_C^N in csubst_site.tsv, which provides site-wise probabilities. Strictly speaking, that labeling wasn't quite right: O_C^N in csubst_site.tsv is the site-wise combinatorial substitution probability, which is referred to as P_l(S_C|D,θ) in the paper.

aleungplants commented 1 year ago

Thank you for the clarification! I was concerned that I was using the software incorrectly somehow, but it now appears to make sense in the context of my data. Right now I'm using an OCN > 1 cutoff, so you're right that I'm looking at branches with very few substitutions.

Background info on this gene: relative to the size of the gene, there are very few substitutions in general (though there is strong positive selection; ω is about 10 at most of them).

So it makes sense that I'm getting ωC=Inf.

kfuku52 commented 1 year ago

I see. Under an extremely high ω, many branch combinations may have near-zero synonymous convergence, which leads to ω_C=Inf.
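To make the arithmetic concrete, here is a small Python sketch, assuming ω_C is the ratio dN_C / dS_C as implied earlier in this thread (dSC of zero giving Inf). The function name `omega_c` and the input values are hypothetical, and returning NaN for the 0/0 case is an assumption of this sketch rather than documented CSUBST behavior.

```python
import math


def omega_c(dn_c, ds_c):
    """Return dN_C / dS_C, treating a zero denominator explicitly."""
    if ds_c == 0:
        # Assumption for this sketch: positive/0 -> inf, 0/0 -> NaN.
        return math.inf if dn_c > 0 else math.nan
    return dn_c / ds_c


# With dN_C held fixed, the ratio grows without bound as dS_C shrinks.
for ds in (0.5, 0.05, 0.005, 0.0):
    print(ds, omega_c(1.0, ds))
```

The same nonsynonymous signal therefore yields an ever larger ratio as synonymous convergence vanishes, which is why an additional O_C^N cutoff is useful for judging whether an infinite value is well supported.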

kfuku52 / csubst

Interpreting w_C = Inf values #34