stephenslab / ldshrink

ldshrink: a one-stop R package for shrinkage estimation of linkage disequilibrium

Return SNP info along with LD matrix #10

Open xiangzhu opened 6 years ago

xiangzhu commented 6 years ago

It seems the main function only returns an estimated LD matrix at this point? https://github.com/stephenslab/LDshrink/blob/32b4ad3942f7cb429f23c529b86ab72cfbb1b257/R/LDshrink.R#L6

Ideally we would have some basic SNP info available (e.g. position, alleles), which is essential when combining LD with GWAS summary statistics.

I think the emeraLD package gives us a good example: https://github.com/statgen/emeraLD

> source('emeraLD2R.r');
Loading required package: data.table
data.table 1.11.4  Latest news: http://r-datatable.com
emeraLD v0.1 (c) 2018 corbin quick (corbinq@gmail.com)

reading from m3vcf file...

processed genotype data for 5008 haplotypes...

calculating LD for 60 SNPs...

done!! thanks for using emeraLD

> names(ld_data)
[1] "Sigma" "info"

> head(ld_data$info)
   chr   pos          id ref alt
1:  20 83061 rs549711487   C   T
2:  20 83196  rs62190472   A   T
3:  20 83252   rs6137896   G   C
4:  20 83570   rs6048967   T   G
5:  20 83611 rs114000219   C   A
6:  20 83792 rs529518485   A   G

> head(ld_data$Sigma[, 1:5], 5)
         [,1]     [,2]     [,3]     [,4]     [,5]
[1,]  1.00000 -0.00602  0.03989 -0.00824 -0.00331
[2,] -0.00602  1.00000 -0.14013 -0.03102 -0.01245
[3,]  0.03989 -0.14013  1.00000 -0.05714 -0.04400
[4,] -0.00824 -0.03102 -0.05714  1.00000 -0.01704
[5,] -0.00331 -0.01245 -0.04400 -0.01704  1.00000

CreRecombinase commented 6 years ago

I agree that returning some kind of info about SNPs would be useful, but I don't think it's useful or necessary to enforce it. One quick and easy option would be to add colnames and rownames to the LD matrix that match the colnames of the input SNP matrix; that way the user has the option of getting back SNP information, but doesn't need to make up fake SNP information if they don't have any (which comes up pretty often).
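A minimal sketch of the dimnames idea, for illustration only. `estimate_ld` is a hypothetical stand-in, not the package's function, and it uses plain sample correlation from base R `cor()` as a placeholder for the shrinkage estimator. The toy genotype matrix is simulated.

```r
# Hypothetical stand-in for the LD estimator (assumption: NOT the LDshrink API).
# Plain sample correlation is used here as a placeholder for shrinkage estimation.
estimate_ld <- function(genotypes) {
  ld <- cor(genotypes)
  # Copy the input column names onto both dimensions of the LD matrix,
  # but only if the user supplied any.
  snp_ids <- colnames(genotypes)
  if (!is.null(snp_ids)) {
    dimnames(ld) <- list(snp_ids, snp_ids)
  }
  ld
}

set.seed(1)
# Simulated toy data: 50 individuals x 3 SNPs, genotypes coded 0/1/2
geno <- matrix(rbinom(150, size = 2, prob = 0.3), nrow = 50,
               dimnames = list(NULL, c("rs549711487", "rs62190472", "rs6137896")))

ld_named   <- estimate_ld(geno)          # rows and columns labelled by SNP id
ld_unnamed <- estimate_ld(unname(geno))  # no names in, no names out
```

With this design the SNP labels travel with the matrix when they exist, and nothing has to be invented when the input has no column names.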

xiangzhu commented 6 years ago

@CreRecombinase yes, colnames and rownames seem to be sufficient in most cases.

Totally agree with the following:

that way the user has the option of getting back SNP information, but doesn't need to make up fake SNP information if they don't have any (which comes up pretty often)

LDshrink doesn't have to return snp_info when users don't have any.

There is one use case where having snp_info seems necessary. Suppose an analyst needs to analyze GWAS summary data of two traits together with LD estimates. For many SNPs, the ALT and REF alleles are swapped between the two traits. To properly flip the sign of betahat and/or LD estimates, we need the ALT and REF info.
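The sign flip described above could be sketched roughly as follows. `align_to_ld` is a hypothetical helper, and the column names (`id`, `ref`, `alt`, `betahat`) and the toy values are assumptions made for illustration. It only handles exact matches and REF/ALT swaps; strand flips and other mismatches are deliberately rejected.

```r
# Hypothetical helper (assumption: not part of LDshrink or emeraLD).
# Reorders the GWAS table to the LD SNP order and flips the sign of betahat
# wherever REF and ALT are swapped relative to the LD reference.
align_to_ld <- function(gwas, ld_info) {
  idx <- match(ld_info$id, gwas$id)
  stopifnot(!anyNA(idx))           # every LD SNP must appear in the GWAS table
  gwas <- gwas[idx, , drop = FALSE]
  same    <- gwas$ref == ld_info$ref & gwas$alt == ld_info$alt
  swapped <- gwas$ref == ld_info$alt & gwas$alt == ld_info$ref
  stopifnot(all(same | swapped))   # strand flips etc. are not handled here
  gwas$betahat <- ifelse(swapped, -gwas$betahat, gwas$betahat)
  gwas$ref <- ld_info$ref
  gwas$alt <- ld_info$alt
  gwas
}

# Made-up toy inputs
ld_info <- data.frame(id = c("rs1", "rs2", "rs3"),
                      ref = c("C", "A", "G"), alt = c("T", "T", "A"),
                      stringsAsFactors = FALSE)
gwas <- data.frame(id = c("rs3", "rs1", "rs2"),
                   ref = c("G", "T", "A"), alt = c("A", "C", "T"),
                   betahat = c(0.10, 0.25, -0.40),
                   stringsAsFactors = FALSE)

aligned <- align_to_ld(gwas, ld_info)  # rs1 has swapped alleles, so its betahat is negated
```

The alternative is to leave betahat alone and flip the LD estimates instead, by negating the row and column of the LD matrix for each swapped SNP. Either way the allele columns are what make the flip possible.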

However, this won't be necessary if the analyst has already unified the ALT and REF of all GWAS summary data files before using LDshrink.

Finally, I think emeraLD can easily pull out snp_info because it uses vcf as input, and vcf already contains snp_info.