jianyangqt / gcta

GCTA software
GNU General Public License v3.0

Merged file prior to REML solver #41

Closed dadekale closed 1 year ago

dadekale commented 1 year ago

I'm running a GWAS for a few populations and I'm running into the error below:

'''
GRM for 28752 individuals are included from [.../plink.grm.bin].
2607 individuals are in common in these files.
1 quantitative variable(s) included as covariate(s).
4 discrete variable(s) included as covariate(s).

Performing MLM association analyses (including the candidate SNP) ...

Performing REML analysis ... (Note: may take hours depending on sample size).
2607 observations, 354 fixed effect(s), and 2 variance component(s)(including residual variance).
Calculating prior values of variance components by EM-REML ...
Updated prior values: 3831.44 3450.76
logL: -10989.6
Running AI-REML algorithm ...
Iter. logL V(G) V(e)
Error: the X^t V^-1 X matrix is not invertible. Please check the covariate(s) and/or the environmental factor(s).
An error occurs, please check the options or data
'''

Usually, this is associated with collinearity in the covariates. I have checked my covariates, and I cannot find any columns that are linearly dependent on one another.

Context: I am running a GWAS that uses a GRM spanning multiple populations, together with a phenotype file, a covar file, and a qcovar file, each of which contains missing values.

Since I am trying to reproduce the non-invertible error in a Python environment, it would be nice to know which animals are retained in the final merged files. My attempt to reproduce the merged files (by dropping missing values and merging all data frames) does not match exactly: the GCTA log file says 2607 animals were retained, while my merge retains 2623 animals.

The correlation of the variables in my covar files is shown below.

[image: correlation matrix of the variables in the covar files]

Although the numbers of animals in the two merged files are not equal (they are close), none of my variables seems to be correlated strongly enough to make the matrix non-invertible.

I would appreciate a method to get a subset of the animals that go into the REML solver prior to the process terminating.
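For reference, the merge described above can be sketched roughly as follows. This is only an illustration of intersecting complete-case IDs across the input files, not GCTA's actual implementation. It assumes headerless, whitespace-delimited files whose first two columns are FID and IID, and it assumes that "NA" and "-9" are the missing-value codes; both are assumptions to check against your own files. The helper names and the toy data are hypothetical.

```python
import io
import pandas as pd

def read_gcta_table(source, value_names):
    """Read a headerless, whitespace-delimited table: FID, IID, then data columns.

    Everything is kept as a string so that missing-value codes are compared
    literally (assumption: "NA" must not be silently converted to NaN).
    """
    return pd.read_csv(source, sep=r"\s+", header=None,
                       names=["FID", "IID", *value_names],
                       dtype=str, keep_default_na=False)

def complete_ids(df, missing=("NA", "-9")):
    """Return the set of (FID, IID) pairs with no missing value in any data column."""
    values = df.drop(columns=["FID", "IID"])
    has_missing = (values.isin(list(missing)) | values.isna()).any(axis=1)
    return set(map(tuple, df.loc[~has_missing, ["FID", "IID"]].to_numpy()))

# Toy stand-ins for the GRM ID list, phenotype, covar and qcovar files.
grm_ids = read_gcta_table(io.StringIO("F1 A1\nF1 A2\nF1 A3\nF1 A4\n"), [])
phen = read_gcta_table(io.StringIO("F1 A1 10.5\nF1 A2 NA\nF1 A3 12.0\nF1 A4 9.1\n"), ["trait"])
covar = read_gcta_table(io.StringIO("F1 A1 M h1\nF1 A3 F h2\nF1 A4 -9 h1\n"), ["sex", "herd"])
qcovar = read_gcta_table(io.StringIO("F1 A1 3.2\nF1 A3 4.0\nF1 A4 2.2\n"), ["age"])

# Individuals present, with complete data, in every file.
common = sorted(set.intersection(*(complete_ids(t) for t in (grm_ids, phen, covar, qcovar))))
print(len(common), common)
```

A difference such as 2623 versus 2607 could come from a missing-value code or an ID mismatch that the merge handles differently from GCTA, so it may help to compare the per-file sets from `complete_ids` one pair at a time.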

longmanz commented 1 year ago

Hi, given your log file, you have 4 discrete covariates but "354 fixed effects" in your final model matrix. This indicates that one or more of your discrete covariates has a large number of levels (each level leads to one dummy variable). To replicate this issue, you will also need to generate the same dummy variables from your discrete covariates. In R, you will need to do something like this to get the model matrix and the correlation matrix of your 354 fixed effects:

    model_mat = model.matrix(trait ~ factor(sex) + factor(bs365) + factor(mehrling))
    cor_mat = cor(model_mat, use = "complete.obs")
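Since the original question mentions a Python environment, here is a rough Python counterpart of the same idea. It is a sketch under assumptions: the dummy coding (intercept plus one column per level, with the first level dropped) is a common convention and may not match how GCTA builds its fixed-effect matrix, and the column names `herd`, `region` and `age` are hypothetical. It also checks the rank of the matrix, because an exact linear dependence among several dummy columns makes the matrix singular even when no pairwise correlation is close to 1.

```python
import numpy as np
import pandas as pd

def fixed_effect_matrix(df, discrete, quantitative):
    """Build intercept + quantitative columns + dummy columns (first level dropped)."""
    complete = df.dropna(subset=[*discrete, *quantitative])
    dummies = pd.get_dummies(complete[discrete].astype(str), columns=discrete,
                             drop_first=True, dtype=float)
    X = pd.concat([complete[quantitative].astype(float), dummies], axis=1)
    X.insert(0, "intercept", 1.0)
    return X

def redundant_columns(X):
    """Greedily list columns that add no rank given the columns kept before them."""
    kept, redundant = [], []
    for name in X.columns:
        trial = X[kept + [name]].to_numpy()
        if np.linalg.matrix_rank(trial) > len(kept):
            kept.append(name)
        else:
            redundant.append(name)
    return redundant

# Toy data: every herd lies in exactly one region, so the region dummy is a
# sum of herd dummies and the design matrix is rank deficient.
toy = pd.DataFrame({
    "herd":   ["h1", "h1", "h2", "h2", "h3", "h3", "h4", "h4"],
    "region": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "age":    [1.0, 2.5, 3.0, 4.2, 5.1, 6.3, 7.7, 8.0],
})
X = fixed_effect_matrix(toy, discrete=["herd", "region"], quantitative=["age"])
rank = int(np.linalg.matrix_rank(X.to_numpy()))
print(f"rank {rank} of {X.shape[1]} columns; redundant: {redundant_columns(X)}")
```

In the toy data the `region_B` column equals `herd_h3 + herd_h4`, so it is flagged as redundant although its correlation with either herd column alone is well below 1. Running the same check on the real 354-column matrix should point to the covariate levels that cause the singularity, provided the coding matches what GCTA uses.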