tothuhien / lsd-0.3beta

GNU General Public License v3.0
12 stars 3 forks source link

Underestimation of substitution rates #11

Closed marcbennedbaek closed 4 years ago

marcbennedbaek commented 4 years ago

Hi tothuhien,

I am experiencing a high underestimation of substitution rates when using LSD (2-3 fold under the expected range). I have tried to estimate the substitution rate in the LSD web application and TempEst with the same input data. The LSD web application and TempEst gives reasonable estimates, but command line LSD seem to underestimate by a lot? I have sent you an email with further details, and thank you for the assistance.

Best regards, Marc

tothuhien commented 4 years ago

Hi Marc,

Thanks for your email and details about your data.

As you've shown in the email that without using variance seems to have better results than using variance; I've redone the simulation testing the effect of variances on the performance of LSD. We indeed observed a biais when using default variance with relaxed trees. The reason is, when the tree follows a relaxed clock, the variance of each branche may be not linear to the branch length (as in our variance formula). When the variances are bad estimated, using version without variance could be more robust. However, using variance still helps to reduce errors from long branch length (if your tree as any). So we can reduce the effect of bad estimated variances by increasing the constant b in our variance formula: var_i = (b_i + b/seq_length) where b_i is the length of branch i. Previously, b is the number of sites, and set by default to be 10. With the default sequence length is 1000, it makes the proprotion of 1%. But when your sequence length is longer (it's the case of your data), you should also increase b. To simplify that, in the new version I've set b as the proportion itself and don't need the sequence length any more (but still sequence length is needed if you want the confidence intervals). I've also set b=10 (percent) by default (value from 1 to 100), following our new simulation tests this setting works better for relaxed tree and as good for strict tree. The bigger b is, the less effect of branch length on variances we have (which could helps for relaxed trees). Small b works better for strict clock trees; but for relaxed trees, don't hesitate to adjust this parameter.

For LSD web, it used an old version of lsd that may have a bug; I've also have a strange result on that with another data set. Frédéric will update the web on January next year.

Please let me know if you have concerns about this. The new updated version is on http://github.com/tothuhien/lsd2

Cheers,

Hien

marcbennedbaek commented 4 years ago

Hi Hien,

Thank you for work in fixing this issue. The adjustments in the variances has fixed this issue, as the sequence length no longer affects the variance correction. As you detail I also found that increasing the parameter og variance (-b) gave better estimates, due to my trees being relaxed and with long sequences. The addition of these changes has completely fixed the issue I have experienced. Thank you for helping with this.

Cheers, Marc