Closed NataliiaHue closed 3 years ago
Hi,
You mentioned that you got different rates. Could you rerun your analysis and check that you get the same rates? One possibility is that your likelihood surface has many local optima (something that is more likely when datasets have missing data). This means that each time you run an analysis you could get slightly different results. And with different likelihood estimates you'll end up with different results downstream of the analysis.
If you do find different rate estimates when you rerun the analysis, you can use the nstarts argument in corHMM to increase the number of random restarts and hopefully improve the consistency of the results (unfortunately there aren't many better options in this case - it's just the nature of the data). However, if you find that you always get the same rate estimates and likelihood, it's probably something to do with the corHMM update and I'll take a closer look at what's changed in the code.
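For instance, a run with multiple random restarts might look like the sketch below (`tree` and `dat` are placeholders for your phylogeny and trait data, not objects from this thread):

```r
library(corHMM)

# Increase nstarts so the optimizer restarts from several random
# initial rate values, reducing the chance of settling in a local optimum.
fit <- corHMM(phy = tree, data = dat, rate.cat = 1, nstarts = 10)

# Compare these across repeated runs; consistent log-likelihoods and
# rate matrices suggest the search is converging on the same optimum.
fit$loglik
fit$solution
```

Restarts multiply the runtime, so you may also want to set the n.cores argument to run them in parallel.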
Best, James
Hi James,
Thank you for the prompt response! The likelihoods and the rates were completely identical on Oct 22 and Oct 29, but different from those from 2020. I would expect slight variation between the results back then and now, but no overlap at all seems suspicious to me. I can't say which rates are more likely to be "true". Do you have an idea of why I get a correlation between the proportion of missing data and rate?
Thank you!
Best wishes, Nataliia
Hi Nataliia,
A correlation between missing data and rate has not been reported before, so far as I know. But it could make sense: missing data is treated as any/all of the possible observed states, which could inflate rate estimates because transitions between states may appear more common.
Would you be able to send me sample code and your dataset (jboyko@uark[.]edu)? I'd like to look more into the discrepancy between the 2020 and 2021 versions.
Best, James
Hi James,
It turns out I had a bug in my code, which means that the issue was indeed solved on Oct 29th by replacing "-" with "&" for NA. Thank you for your help! We can close the issue now.
Best, Nataliia
I work on language data and have a dataset of 171 binary features coded for 60 languages. Last year (approx. July 2020) I used the package to estimate the rates of feature gain and loss, with NA marked as "-" in my data. I did not get a correlation between missing data and rates (or only a negligible one). On Oct 22nd 2021 I reran the analysis, but got completely different rates and a correlation between the proportion of missing data and rate (0.49 for q10 and 0.23 for q01). Following what I read about missing data in the commits to the package, I changed the marking of missing data to "?" and reran the analysis again on Oct 29th, but that doesn't seem to have solved the issue: the results look the same as those from Oct 22nd. This means I cannot reproduce the study I did in 2020, and I get high correlations with missing data, which prevents me from interpreting the rates in a sensible way. What changes to the package might have led to all of this? What can I do to fix this issue (or is it beyond what I can do)?