stephenslab / susieR

R package for "sum of single effects" regression.
https://stephenslab.github.io/susieR

Can correlation matrix be calculated from plink -r? #87

Closed xinyixinyijiang closed 5 years ago

xinyixinyijiang commented 5 years ago

Hi SuSie!

I am applying SuSiE to fine-map a genomic region for a quantitative trait. I calculated z scores from summary data (about 8000 SNPs) and generated the LD matrix with PLINK (--r) from a bfile, hoping to run susie_z(z_scores, R = R, L = 5).

But I got an error: Error in check_r_matrix(R, length(z), r_tol) : R is not a positive semidefinite matrix. I checked the eigenvalues and more than 2000 of them are negative.

Sorry, I do not know much about positive semidefinite matrices. I am wondering whether this is because I handled the correlation matrix incorrectly.

Many thanks, Xinyi

gaow commented 5 years ago

@VivianJiangxinyi Yes. R is the correlation matrix (not the squared correlation). I assume the PLINK output should be usable and should be positive semidefinite, but there might be numerical issues. What if you set r_tol in the susie_z function to a larger value, say 1E-4?
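One kind of numerical issue that can break positive semidefiniteness is rounding of the correlation entries. The sketch below is only an illustration with a made-up 3x3 matrix; the thread does not establish that rounding is what happened here. It builds a valid but singular correlation matrix and shows that rounding its entries to two decimals makes the determinant negative. For a 3x3 symmetric matrix with positive trace, a negative determinant means at least one eigenvalue is negative.

```python
import math

def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def corr3(r12, r13, r23):
    """Assemble a symmetric 3x3 correlation matrix from its off-diagonal entries."""
    return [[1.0, r12, r13],
            [r12, 1.0, r23],
            [r13, r23, 1.0]]

# Hypothetical example: three unit vectors lying in one plane, 30 degrees apart.
# Their correlations form a valid correlation matrix that is exactly singular.
r12 = r23 = math.cos(math.radians(30))
r13 = math.cos(math.radians(60))

det_exact = det3(corr3(r12, r13, r23))

# Rounding each entry to two decimals, as a text export might do.
det_rounded = det3(corr3(round(r12, 2), round(r13, 2), round(r23, 2)))

print("determinant before rounding:", det_exact)
print("determinant after rounding: ", det_rounded)
```

The determinant before rounding is zero up to floating-point error, and after rounding it is slightly negative, so the rounded matrix is no longer positive semidefinite.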

zouyuxin commented 5 years ago

@VivianJiangxinyi We updated the susie_z part recently, but we haven't fully tested it. I suggest updating the package, or waiting a while until we have finalized it.

xinyixinyijiang commented 5 years ago

Okay! Thank you for the prompt reply! I have reset r_tol to 1e-4, 1e-3, and 1e-2, but the error still occurs. BTW, there are some duplicated SNPs (about 60+) when I calculate the matrix, because these SNPs have more than one possible alternative allele passing quality control (MAF, imputation quality, HWE, etc.). Can this be the problem?

gaow commented 5 years ago

> I have reset r_tol to 1e-4, 1e-3, 1e-2 but errors still came out.

Okay, I was assuming the error you ran into was due to very small negative eigenvalues (whose absolute values are smaller than r_tol). Then could you try what @zouyuxin just suggested (update susieR) to use an experimental new implementation of susie_z?

> there are some duplicated SNPs (about 60+) when I calculate the matrix, because these SNPs have more than one possible alternative allele passing the quality control (MAF, imputation quality, HWE, etc.). Can this be the problem?

Are you talking about cases with third alleles in the reference panel? And is your handling of a third allele to treat it as two variants, each evaluated for the dosage of one of the two alternative alleles? In that case the variants are not duplicates, right? I do not think this can be a problem. But it is very important that a reference panel processed like this matches the summary statistics data by both SNP position and alternative allele!
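A minimal sketch of matching by both position and alternative allele, so that two variants at the same position with different alternative alleles stay distinct. The tuple layout (chrom, pos, ref, alt), the function names, and the data are all hypothetical. The sketch does not handle swapped reference and alternative alleles, which would also require flipping the sign of the z score.

```python
def variant_key(chrom, pos, ref, alt):
    """Normalize one variant to a hashable key (hypothetical layout)."""
    return (str(chrom), int(pos), ref.upper(), alt.upper())

def match_variants(sumstats, panel):
    """Return (sumstats_index, panel_index) pairs for variants present in both.

    The panel index tells you which row/column of the LD matrix corresponds
    to each summary-statistics entry.
    """
    panel_index = {variant_key(*v): j for j, v in enumerate(panel)}
    matches = []
    for i, v in enumerate(sumstats):
        j = panel_index.get(variant_key(*v))
        if j is not None:
            matches.append((i, j))
    return matches

# Made-up data: position 2000 is multi-allelic (C>T and C>G) in the panel.
panel = [("1", 1000, "A", "G"), ("1", 2000, "C", "T"), ("1", 2000, "C", "G")]
sumstats = [("1", 2000, "C", "G"), ("1", 1000, "A", "G"), ("1", 3000, "T", "C")]

pairs = match_variants(sumstats, panel)
print(pairs)  # [(0, 2), (1, 0)]
```

Matching on position alone would have paired the C>G summary statistic with whichever of the two panel variants at position 2000 came first.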

xinyixinyijiang commented 5 years ago

That makes sense! And yes, the summary statistics data matches. I reran with the updated susie_z with r_tol set to 1e-4, and this time there were no errors! Many thanks!

stephens999 commented 5 years ago

Are there missing data in the genotypes? Otherwise I am a bit surprised to see so many quite negative values....
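To illustrate why missing genotypes matter: if each correlation is computed only over the samples where both variants are observed, different entries of the matrix come from different subsets of samples, and the result need not be positive semidefinite. The sketch below uses a deliberately extreme, made-up genotype table, and it assumes pairwise-complete computation; whether the tool used in this thread handles missing data this way is not established here.

```python
import math

NA = None  # marker for a missing genotype

def pairwise_corr(x, y):
    """Pearson correlation over samples where both x and y are observed."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    return sxy / math.sqrt(sxx * syy)

# Made-up genotype dosages for 9 samples at 3 variants.
g1 = [0, 1, 2, 0, 1, 2, NA, NA, NA]
g2 = [0, 1, 2, NA, NA, NA, 0, 1, 2]
g3 = [NA, NA, NA, 0, 1, 2, 2, 1, 0]

cols = [g1, g2, g3]
R = [[pairwise_corr(a, b) for b in cols] for a in cols]

# A positive semidefinite R would give v' R v >= 0 for every vector v.
v = [1.0, -1.0, -1.0]
quad = sum(v[i] * R[i][j] * v[j] for i in range(3) for j in range(3))

for row in R:
    print(row)
print("v' R v =", quad)
```

Here the pairwise correlations are 1, 1, and -1, a combination no complete data set could produce, and the quadratic form is negative, so R has a negative eigenvalue.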

VitorAguiar commented 1 year ago

Hello,

I also get the warning that the matrix R is not positive semidefinite when computing R from the European set of 1000 Genomes (N = 503) using plink --r square, with no missing values in the genotype matrix (bcftools view --genotype ^miss). I'm using susieR v0.12.35.

For example, in a 500 kb region of chr1 containing 843 variants, the first 581 eigenvalues are positive and the last 262 are negative, ranging from -2e-05 to -3.8e-20.

If I set r_tol = 1e-04 the warning goes away. Is that expected or should I be concerned? I mean, I know that the reference panel is small, but I just want to test the analysis until I get access to larger datasets.

Thank you!

pcarbo commented 1 year ago

@VitorAguiar This is expected when you compute R from a small number of (possibly very correlated) SNPs, and I would not be concerned.
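A short sketch of why the numbers above are unsurprising. A correlation matrix computed from N samples has rank at most N - 1, so with 503 samples and 843 variants at least 341 eigenvalues are exactly zero in exact arithmetic, and floating-point error can push them to either side of zero. The check below assumes the tolerance is applied as "smallest eigenvalue is no less than -r_tol", which is how the earlier comments describe it; the actual susieR implementation may differ, and the stricter tolerance is a hypothetical value chosen for illustration.

```python
def passes_psd_check(eigenvalues, tol):
    """Assumed form of the check: no eigenvalue falls below -tol."""
    return min(eigenvalues) >= -tol

# Only the two extremes of the reported negative eigenvalues are known.
reported_negative = [-2e-05, -3.8e-20]

strict_tol = 1e-8  # hypothetical stricter tolerance, for illustration only
loose_tol = 1e-4   # the value used in this thread

strict = passes_psd_check(reported_negative, strict_tol)
loose = passes_psd_check(reported_negative, loose_tol)

# Rank bound for a correlation matrix from n_samples samples.
n_samples, n_variants = 503, 843
max_rank = n_samples - 1
min_zero_eigenvalues = n_variants - max_rank

print("passes with strict tolerance:", strict)
print("passes with r_tol = 1e-4:   ", loose)
print("eigenvalues that are zero in exact arithmetic: at least", min_zero_eigenvalues)
```

Under this assumption, the most negative reported eigenvalue (-2e-05) fails the stricter tolerance but sits inside 1e-4, which is consistent with the warning disappearing at r_tol = 1e-04.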