privefl / bigsnpr

R package for the analysis of massive SNP arrays.
https://privefl.github.io/bigsnpr/

LD score regression, variable number of ld_size #456

Closed by HerefordGuy 5 months ago

HerefordGuy commented 11 months ago

On line 76 of ldsc.R, the code requires that ld_size be a single number. That only holds if the window is defined as a fixed number of neighboring SNPs. If size in snp_ld_scores() is instead a physical or genetic distance, the number of SNPs used to calculate the LD score will vary from SNP to SNP. Could ld_size be a vector of the same length as ld_score?

privefl commented 11 months ago

No, that's really the total number of variants considered.

HerefordGuy commented 11 months ago

The total number in the GWAS, correct? Thanks for clarifying!

privefl commented 11 months ago

No, ld_size is the number of variants that were used to compute LD scores. Isn't that what it says in the documentation?

HerefordGuy commented 11 months ago

The number of variants used to compute an LD score varies from SNP to SNP. For example, in LD scores computed with GCTA, the number of SNPs used to calculate an LD score varied from 5 to 2805. That is why GCTA reports a "snp_num" column (see "GCTA-LDS: calculating LD score for each SNP"). bigsnpr doesn't seem to be able to handle a variable number of variants used to calculate LD scores.

privefl commented 11 months ago

It could easily, but I think it is not supposed to be used like that; where did you see that?

HerefordGuy commented 11 months ago

If you set a genetic distance or physical distance threshold, then the number of SNPs used to calculate the LD score is going to vary. If you set a fixed number of SNPs (leave infos.pos = NULL), then the size of the window used to calculate the LD score is going to vary greatly from SNP to SNP. Bulik-Sullivan et al. 2015 used a 1-cM window: https://doi.org/10.1038/ng.3211. I am probably misunderstanding something. My apologies.
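To make the contrast concrete, here is a minimal Python sketch with made-up positions. The helper names `window_counts` and `window_spans` are hypothetical and are not bigsnpr or GCTA functions; the sketch only counts neighbors and does not compute any LD.

```python
from bisect import bisect_left, bisect_right

def window_counts(pos, half_width):
    """Number of SNPs within +/- half_width of each SNP (pos must be sorted)."""
    return [
        bisect_right(pos, p + half_width) - bisect_left(pos, p - half_width)
        for p in pos
    ]

def window_spans(pos, k):
    """Distance spanned by each SNP together with its k neighbors on each side."""
    n = len(pos)
    return [pos[min(i + k, n - 1)] - pos[max(i - k, 0)] for i in range(n)]

# Hypothetical, unevenly spaced positions (arbitrary units).
pos = [100, 150, 160, 170, 900, 2000, 2050]

counts = window_counts(pos, half_width=100)  # distance-based window
spans = window_spans(pos, k=1)               # fixed-number-of-SNPs window
M = len(pos)                                 # total number of SNPs with a score

print(counts)
print(spans)
print(M)
```

With these positions, the distance-based window gives per-SNP counts of `[4, 4, 4, 4, 1, 2, 2]`, while the fixed-count window gives spans of `[50, 60, 20, 740, 1830, 1150, 50]`. The total number of SNPs is 7 in both cases.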

privefl commented 11 months ago

In equation (1), my understanding was that M is the total number of variants for which you computed the LD scores, not the number used to compute the LD scores in each window.
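For reference, equation (1) of Bulik-Sullivan et al. 2015, as I read it, can be written as follows (notation follows the paper):

```latex
\mathbb{E}\left[\chi^2_j \mid \ell_j\right] = \frac{N h^2}{M}\,\ell_j + N a + 1,
\qquad
\ell_j = \sum_{k} r^2_{jk}
```

Here $N$ is the sample size, $M$ is the number of SNPs (so that $h^2 / M$ is the average heritability explained per SNP), $a$ captures confounding, and $\ell_j$ is the LD score of variant $j$. The sum over $k$ in $\ell_j$ is where the per-variant window enters; $M$ is a single scalar that has no index $j$.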

privefl commented 7 months ago

Any update on this?