privefl / bigstatsr

R package for statistical tools with big matrices stored on disk.
https://privefl.github.io/bigstatsr/
179 stars 30 forks source link

Repeated measures (mixed model) GWAS #159

Closed chrisraynerr closed 1 year ago

chrisraynerr commented 1 year ago

Hi Florian -- thanks so much for writing these packages. I am trying to run a repeated measures GWAS in a sample including related individuals. I have generated loco-predictors in regenie (to control for relatedness) and I am aiming to run a two step analysis - where covariates are regressed from phenotype and genotype (as done in regenie), before fitting the mixed model Yresids ~ SNPresids + (1|IID). Do you have a work around for adjusting genotype calls for PCs and storing the residuals as an FBM? I have looked at using big_univLinReg -- but looks like the FBM can only be included as X, not Y... an no option for storing resids. Otherwise, is there a way to run a linear mixed model using including all covariates using big_apply? I have tried a bunch of ways with no success -- I am working on a remote desktop and I am unable to paste code, but happy to copy it out if you don't have a previous workaround. Many thanks!!

privefl commented 1 year ago

I'm not sure I follow exactly what you want to do. Could you maybe provide some (inefficient) R code that works on a very small example?

chrisraynerr commented 1 year ago

I have a dataset with repeated outcome measurements (7 time-points). I've transformed the data to long format, and I want to run a GW-LMM, with IID as the random effect (currently neither plink nor regenie can deal with repeated IIDs). I'm trying to figure out the fastest way to do this, and was hoping to use bigstatsr/bigsnpr to load the geno data and perfrom the regression using big_parallelize (or something similar). Below is a quickly simulated dataset to show the structure of the data I'm using and a summary of the model I'm trying to use. Im wondering if I would gain speed by splitting it into steps -- step1 residualising all SNPs for PCs and using snp_save to save the resids as an FBM? then in step2 running the mixed model using resids and big_apply? Anyway.. here's an example input and the output I'm looking for... which would then be combine across blocks of SNPs. Many thanks for you're help with this!

library(ggplot2)
library(dplyr)
library(tidyr)
library(faux)  
library(GGally)

# simulating PCs from genotype data
pc <- 
  rnorm_multi(
    n    = 10000,
    mu   = 0,
    sd   = 1,
    r    = 0,
    varnames = c(paste0("PC",1:10)),
    empirical = F
  ) %>%
  mutate(IID  = row_number())

# simulating longitudinal outcome data with age covariate
df <- 
  rnorm_multi(
    n    = 10000,
    mu   = 50,
    sd   = 10,
    r    = 0.4,
    varnames = c(paste0("Y_Q0",1:7)),
    empirical = F
  ) %>%
  mutate(
    IID  = row_number(),
    Age_Q01 = rnorm(10000, mean=35, sd=5) 
  ) %>%
  mutate(
    Age_Q02 = Age_Q01 + 0.5,
    Age_Q03 = Age_Q01 + 1,
    Age_Q04 = Age_Q01 + 3,
    Age_Q05 = Age_Q01 + 5,
    Age_Q06 = Age_Q01 + 7,
    Age_Q07 = Age_Q01 + 8
  ) 

# adjusting phenotype for genotype PCs (this 2 step process is done in regenie)
yRes <-
  df %>% 
  full_join(pc, "IID") %>%
  mutate(
    across(matches("Y"), 
    ~ rstandard(lm(.x~PC1+PC2+PC3+PC4+PC5+PC6+PC7+PC8+PC9+PC10,df))
    )) %>%
  select(IID, matches("Y_|Age"))%>% 
  pivot_longer(
    cols = !IID, 
    names_to = c(".value", "Wave"), 
    names_sep = "_"
  ) 

# adjusting genotypes for PCs (again... this 2 step process is done in regenie)
gRes <- 
  rnorm_multi(
    n    = 10000,
    mu   = 0,
    sd   = 1,
    r    = 0,
    varnames = c(paste0("PC",1:10)),
    empirical = F
  ) %>%
  mutate(
    IID  = row_number(),
    SNP  = rbinom(10000, 2, 0.5)
  ) %>%
  mutate(
    SNPRes = rstandard(lm(as.formula(paste0("SNP ~ ",paste0("PC",1:10, collapse="+"))), data = .))
  ) %>%
  select(IID,SNPRes)

df <-  yRes %>% full_join(gRes)

# run mixed model... for each SNP I want the main effect of SNP and its interaction with age 
output <- 
  summary(
    lme4::lmer(Y ~ SNPRes + SNPRes*Age + (1|IID), data=df)
    )$coefficients
privefl commented 1 year ago

If you want to get a residualized version of the genotype matrix, that shouldn't be too hard. You can implement this using Rcpp or big_apply() using the linear algebra trick I'm using in big_univLinReg().

If you want to implement your own mixed model, I have no experience with this, so won't be of much help.

chrisraynerr commented 1 year ago

Ok thanks!

privefl commented 1 year ago

If you need help with this, feel free to comment and reopen.