Using dosages from text file or VCF

dylkot commented 2 years ago

Hi there, thanks for this great tool! I am trying to use dosage information in my analysis. If I create a text file of the dosages, it seems I can't use parallelization (getting an error message saying parallelization isn't implemented for text files). Further, I am getting segfaults when I run the analysis:

glmm.score(model0, infile = genotypefn, outfile=outfn, ncores=1)

gives


 *** caught segfault ***
address (nil), cause 'memory not mapped'

Traceback:
 1: glmm.score(model0, infile = genotypefn, outfile = outfn, ncores = ncores)
An irrecoverable exception occurred. R is aborting now ...
Segmentation fault (core dumped)

This is for 100 variants X 2000 samples. Alternatively if I try to use a VCF file like so:

SeqArray::seqVCF2GDS(vcffn, gdsfn, parallel=TRUE, fmt.import="DS")
glmm.score(model0, infile = gdsfn, outfile=outfn, ncores=7)

it completes but doesn't seem to be using the dosage information. Any idea if it is possible to use the dosage information from the VCF file and if there is an obvious problem with the text file? Or any other work arounds to use dosage information, ideally with parallelization?

Thanks!

hanchenphd commented 2 years ago

Thank you for your interest in GMMAT! Please try adding is.dosage = TRUE in glmm.score when you read the dosage from a GDS file.

Best, Han

dylkot commented 2 years ago

I think that fixes it for the VCF file approach. It runs smoothly and returns results that are subtly different from the hard thresholded results and with signs flipped. The text file approach just runs really slowly and hasn't returned anything after an hour or two which doesn't really make sense since it is only about 150 variants (it completed in about 10 seconds for the VCF version). The first few columns/lines of the file are like below (and it is tab delimited):

1:723918[b37]G,A        G       A       2.0     2.0     2       1.0     1       2.0     2.0     2       2.0     2.0     2.0     2.0     2.0     2.0     1.0     2       2.0
1:724169[b37]A,G        A       G       2.0     2.0     2       2.0     2       2.0     2.0     2       2.0     2.0     2.0     2.0     2.0     2.0     2.0     2       2.0
1:725286[b37]G,A        G       A       1.0     2.0     2       2.0     2       2.0     2.0     2       2.0     2.0     2.0     2.0     2.0     2.0     2.0     2       1.0
1:729940[b37]C,G        C       G       1.0     2.0     1       2.0     2       2.0     2.0     1       1.0     2.0     2.0     2.0     2.0     2.0     1.0     2       1.15
1:730232[b37]C,T        C       T       2.0     2.0     2       2.0     2       2.0     2.0     2       1.0     2.0     2.0     2.0     2.0     2.0     1.0     2       2.0

Does that seem right to you?

Apologies for an unrelated add-on question, but it isn't really a github issue worth opening an issue for. But do you still recommend using principle components to correct for population stratification like you did in the GMMAT paper? I've seen a few other mixed linear models analyses where they don't include PCs assuming that that should be captured in the random effects, as long as you use the full set of variants to calculate the relatedness matrix rather than just a pruned subset.

Thanks so much!

hanchenphd commented 2 years ago

You might need to skip the first three columns using infile.ncol.skip = 3. See ?glmm.score for details.

Regarding your second question, see Tucker, Price and Berger 2014. If there is strong confounding by population structure, I would always include top PCs as fixed-effects covariates (it probably does not matter if confounding is weak).

Best, Han

dylkot commented 2 years ago

Got it, thank you!! Apologies, that wasn't very clear form section 4.3 here but I see it clearly in ?glmm.score

hanchenphd commented 2 years ago

Got it, thank you!! Apologies, that wasn't very clear form section 4.3 here but I see it clearly in ?glmm.score

Also see an example of using glmm.score with a plain text genotype file in Section 5.2.1 of the user manual.

Best, Han

dylkot commented 2 years ago

Thank you, yes I totally missed that!

Best, Dylan

hanchenphd / GMMAT

Using dosages from text file or VCF #39