LD Metrics Using 1000 Genomes and Beta Values Returning NA

privefl / bigsnpr

R package for the analysis of massive SNP arrays.

https://privefl.github.io/bigsnpr/

Open Taewoong-Ha opened 1 month ago

Taewoong-Ha commented 1 month ago

Hi,

I am currently building population-specific PRS using LDpred2, and I have a couple of questions:

1. It is recommended to use at least 2,000 individuals to build LD matrices. I am using the 1000 Genomes Project populations (EUR, EAS, SAS, AMR, AFR) to build LD matrices for different ancestries. I have seen some papers follow a similar approach, but each population has only around 500 individuals on average. Is it okay to proceed this way, or should I use the LD matrices provided with LDpred2, such as HM3 and HM3+, regardless of ancestry?
2. I am using the LDpred2 grid model, but when the parameter "p" is low, all the beta values come out as NA, and consequently the PRS is also NA. I saw a similar issue where the answer was that this can happen when "p" is low. Is this really fine? Could the small sample size used to build the LD matrices be contributing to this? Would using return_sampling_betas = TRUE help?

Thank you for your help!

privefl commented 1 month ago

The sample sizes per (sub-continental) population in 1kGP are very low (between 100 and 200); I would recommend using the UK Biobank instead.
I guess LDpred2 needs more regularization when the LD is computed from a small N. That means using smaller h2 and larger p values in LDpred2-grid, and a smaller shrink_corr in LDpred2-auto.
Parameter return_sampling_betas is used for something different.
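
To make the suggestion above concrete, here is a minimal, language-neutral sketch in Python (not bigsnpr code) of a more conservative LDpred2-grid parameter set: h2 values at or below the estimate, and p values bounded away from very small numbers. The heritability estimate `h2_est`, the multipliers, the lower bound of 1e-3 on p, and the helper names `log_spaced` and `usable_models` are all assumptions for illustration, not recommended values. The second helper shows one way to skip grid models whose effect sizes came back as NA before comparing models.

```python
import itertools
import math


def log_spaced(start, stop, num):
    """Return `num` values evenly spaced on a log scale from start to stop."""
    step = (math.log(stop) - math.log(start)) / (num - 1)
    return [math.exp(math.log(start) + i * step) for i in range(num)]


# Hypothetical heritability estimate (for example from LD score regression).
h2_est = 0.3

# Assumed "more regularized" choices: h2 no larger than the estimate,
# and p no smaller than 1e-3. These cutoffs are illustrative only.
h2_seq = [round(h2_est * m, 4) for m in (0.1, 0.3, 0.7, 1.0)]
p_seq = [float(f"{p:.2g}") for p in log_spaced(1e-3, 1.0, 10)]

# One dictionary per grid model, mirroring a table with columns p, h2, sparse.
grid = [
    {"p": p, "h2": h2, "sparse": sparse}
    for sparse, h2, p in itertools.product((False, True), h2_seq, p_seq)
]


def usable_models(beta_columns):
    """Indices of grid models whose effect sizes contain no NaN."""
    return [
        j
        for j, column in enumerate(beta_columns)
        if not any(math.isnan(b) for b in column)
    ]


print(len(grid), "grid models; p from", p_seq[0], "to", p_seq[-1])
```

In R, the equivalent table would typically be built with `expand.grid()` and passed as `grid_param` to `snp_ldpred2_grid()`; columns of the returned beta matrix that are entirely NA can then be excluded when choosing the best model.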

Taewoong-Ha commented 1 month ago

I am using the 1000 Genomes Project with the following populations: EUR (n=503), EAS (n=504), SAS (n=489), AFR (n=661), and AMR (n=347). Are these sample sizes considered too small? Since I plan to use the UK Biobank for this analysis, it seems challenging to calculate LD by ancestry. Do you have any alternative methods in mind?
You mentioned that when using a small sample size, more regularization is needed for computing LD. What methods would you recommend for this?

Thank you for your assistance!

privefl commented 1 month ago

The EUR populations of 1000G are basically 4 different sub-continental populations. Which population do you need the LD from?
I think more regularization is needed in LDpred2 when a small sample size is used to compute the LD.
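
As a rough illustration of why a small reference panel gives noisier LD, the sketch below uses the standard large-sample approximation for the standard error of a Pearson correlation, SE(r) ≈ (1 - ρ²) / sqrt(N - 1). This is a textbook approximation that assumes roughly bivariate-normal data, so it is only indicative for genotypes; it is not something LDpred2 computes. The function name `corr_standard_error` is hypothetical, and the sample sizes are the ones quoted in this thread.

```python
import math


def corr_standard_error(rho, n):
    """Approximate standard error of a sample correlation for true value rho."""
    return (1.0 - rho**2) / math.sqrt(n - 1)


# Sample sizes mentioned in this thread (1000 Genomes super-populations)
# plus the 2,000 individuals quoted as the recommended minimum.
sizes = {
    "AMR": 347,
    "SAS": 489,
    "EUR": 503,
    "EAS": 504,
    "AFR": 661,
    "recommended minimum": 2000,
}

for label, n in sizes.items():
    se = corr_standard_error(0.0, n)
    print(f"{label:>20s}  N={n:5d}  SE(r | rho=0) ~ {se:.3f}")
```

Under this approximation, going from about 500 to 2,000 individuals roughly halves the noise on each pairwise correlation between uncorrelated variants, which is consistent with needing stronger regularization at small N.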

Taewoong-Ha commented 1 month ago

When computing LD with a small sample size, the advice for LDpred2-grid is to use smaller h2 and larger p values. Are there recommended h2 and p values for different sample sizes?
I am currently using five major populations (EUR, EAS, SAS, AFR, AMR) without subdividing into sub-continental groups. Please refer to the diagram below for more details.
Additionally, I will attach the paper I have been referencing. https://doi.org/10.1038/s41591-023-02429-x