nspope / radish

Fast gradient-based optimization of resistance surfaces
BSD 3-Clause "New" or "Revised" License

mlpe vs generalized_wishart #4

Open sjordi opened 2 years ago

sjordi commented 2 years ago

Dear Nathaniel, I was wondering if and how I can compare the mlpe and generalized_wishart (gwis) output of my analyses. I am trying both for each of the tested models, and mlpe always shows a much lower fit (loglik and AIC) than gwis. Is it statistically sound to use the loglik and AIC to compare these? Thanks a lot in advance.

Jordi

nspope commented 2 years ago

Hi Jordi, that's a good question that I don't have a clear answer to. The generalized Wishart uses a hyperparameter nu, representing the number of loci, that effectively concentrates or flattens the loglikelihood to reflect the quantity of data used to calculate the pairwise genetic distances. This has to be provided by the user; it's not identifiable from the data alone. There's no analogue in the MLPE model, where any error in the estimates of genetic distance is ignored (error that might be minimal if distances are calculated from genome-wide SNPs, but will definitely be non-trivial for things like microsat datasets). So, I don't think the two are comparable via loglikelihood.
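To see why a nu-scaled loglikelihood breaks AIC comparisons, here's a toy sketch (this is not radish's actual likelihood; the per-locus value and parameter count are made up). Because the generalized Wishart loglikelihood scales roughly with nu, the same quality of fit yields wildly different AIC values depending on the nu the user supplies:

```python
def aic(loglik, n_params):
    # standard AIC: 2k - 2*loglik
    return 2 * n_params - 2 * loglik

# hypothetical average per-locus loglikelihood contribution (made up)
per_locus = -1.2

# identical fit per locus, but AIC swings by an order of magnitude with nu
for nu in (100, 1000):
    print(nu, aic(nu * per_locus, n_params=3))
```

Since MLPE's loglikelihood has no such data-quantity scaling, its AIC sits on an entirely different scale, and the comparison Jordi describes isn't meaningful.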

To do model selection, I'd use something like cross-validation: split your samples or populations into 70/30 training/test sets; fit both models to the training data; generate predicted genetic distances for the test data; then compare predicted vs. actual genetic distances to see which model performs better out of sample. I think cross-validation is the way to go here because MLPE, the generalized Wishart, or any other loss function will always be a crude, heuristic approximation to the actual stochastic process generating the data, which includes noise from random sample genealogies, mutations, and sequencing errors. So, IMO the best we can do is ask how well these different loss functions work for out-of-sample prediction.
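The cross-validation loop above can be sketched generically. Everything below is a simulated placeholder: the distance matrix is random, the two "model predictions" are just noisy copies, and the actual fitting step (with radish or anything else) is omitted. It only illustrates the 70/30 sample split and the out-of-sample comparison on pairs involving held-out samples:

```python
import math
import random

def holdout_rmse(dist, pred, test_ids):
    """RMSE of predicted vs observed distances, restricted to pairs
    where at least one sample is in the held-out test set."""
    errs = [(dist[i][j] - pred[i][j]) ** 2
            for i in range(len(dist)) for j in range(i)
            if i in test_ids or j in test_ids]
    return math.sqrt(sum(errs) / len(errs))

random.seed(1)
n = 10  # toy number of samples

# toy symmetric "observed genetic distance" matrix
dist = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(i):
        dist[i][j] = dist[j][i] = random.random()

# placeholder predictions from two hypothetical fitted models
# (in practice these come from models fit to the training samples only)
pred_a = [[d + random.gauss(0, 0.05) for d in row] for row in dist]
pred_b = [[d + random.gauss(0, 0.20) for d in row] for row in dist]

# 70/30 split: pairs touching the 30% test set are out-of-sample
test_ids = set(random.sample(range(n), k=3))
rmse_a = holdout_rmse(dist, pred_a, test_ids)
rmse_b = holdout_rmse(dist, pred_b, test_ids)
print(rmse_a, rmse_b)  # prefer the model with lower out-of-sample error
```

In a real analysis you'd repeat the split several times and average, so the comparison doesn't hinge on one arbitrary partition.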

As a side note, the generalized Wishart implementation in radish needs more testing before it can be fully trusted. When I wrote this package (a while back), the other implementations of the generalized Wishart that I could find differed somewhat from the formula in the McCullagh paper introducing the distribution. I went with the form in the paper, which seemed to give reasonable answers on simulated data, but I need to revisit it.