rajanil / fastStructure

A variational framework for inferring population structure from SNP genotype data.
MIT License
134 stars 50 forks source link

Estimating the variational parameters with multiple restarts #57

Open ericopolo opened 4 years ago

ericopolo commented 4 years ago

In the fastStructure paper (Raj et al 2014) we read:

When population structure is difficult to resolve, imposing a logistic prior and estimating its parameters using the data are likely to increase the power to detect weak structure. However, estimation of the hierarchical prior parameters by maximizing the approximate marginal likelihood also makes the model susceptible to overfitting by encouraging a small set of samples to be randomly, and often confidently, assigned to unnecessary components of the model. To correct for this, when using the logistic prior, we suggest estimating the variational parameters with multiple random restarts and using the mean of the parameters corresponding to the top five values of LLBO. To ensure consistent population labels when computing the mean, we permuted the labels for each set of variational parameter estimates to find the permutation with the lowest pairwise Jensen–Shannon divergence between admixture proportions among pairs of restarts.

I suggest the inclusion of a detailed explanation on documentation about how to accomplish that, in practice.