privefl / bigsnpr

R package for the analysis of massive SNP arrays.
https://privefl.github.io/bigsnpr/

Running LDpred2 without internet server access #474

Closed uki-uiu closed 1 month ago

uki-uiu commented 5 months ago

I am trying to run LDpred2 offline on hg38 genome-wide data. However, the script does not let me select the latest genetic map coordinates for this purpose: POS2 <- snp_asGeneticPos(CHR, POS, dir = "/filepath/genetic_map_hg38_withX.txt.gz").

This step only works with internet access and repeatedly tries to access the OMNI files from the hg19 build (separated by chromosome). Is there a workaround for this step?

I will be using an LD reference based on the local dataset, so I do not need the HapMap data, but I will still need to calculate the correlation values from the matrix in this step.

privefl commented 5 months ago

You need to ask for those files to be put on your server. This has been discussed in other issues here. BTW, dir is for the directory, not the full path.

Otherwise, you can use something like a 3 Mb window, identify nearly independent LD blocks from that, and then recompute all the values within the LD blocks (by using something like POS2 <- block_id and size = 1e-4); this will probably give you the best LD matrix.

uki-uiu commented 4 months ago

Hello! Thank you for getting back to me!

I decided to limit my analysis to only the HapMap SNPs, as described in your tutorial, and to use the genetic distances available from https://github.com/joepickrell/1000-genomes-genetic-maps/tree/master/interpolated_from_hapmap to avoid using the snp_asGeneticPos function.

I input the genetic distances directly (DIST = genetic positions derived from the link above, and df_beta <- info.pos), and used this within the loop:

pos2_table <- df_beta[df_beta$chr == chr, ]
POS2 <- sort(pos2_table$DIST)

and

corr0 <- snp_cor(G, ind.col = ind.ch2, size = 3 / 1000, infos.pos = POS2, ncores = NCORES)

This generates a 15826 x 15826 correlation matrix without any errors or warnings.

The rest of the script runs smoothly, except that at the end, when I perform the scoring, all the participants end up with "NA" scores (pred_auto contains only NA values, so I cannot create a model at the end). I have checked the other LDpred issues on GitHub but cannot find one that describes a situation similar to mine.

Do you see any reason I am getting this error?

Please advise.

privefl commented 4 months ago

There are several issues like this here. Basically, if you have NAs in the polygenic scores, it means you either have NAs in the effects you get from LDpred2-auto, or you have NAs in the genotype matrix that you use to compute the polygenic scores.
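
As a rough illustration of those two cases, here is a minimal base R sketch using a small made-up genotype matrix and effect vector. The names `geno` and `beta` are hypothetical stand-ins for the real genotype data and the effects returned by LDpred2-auto; this is not bigsnpr code. It only shows how a single NA on either side propagates into the scores and how one might locate it.

```r
# Toy data (hypothetical): 4 individuals x 3 variants, genotypes coded 0/1/2.
geno <- matrix(c(0, 1,  2,
                 1, NA, 0,
                 2, 2,  1,
                 0, 0,  1), nrow = 4, byrow = TRUE)
beta <- c(0.10, -0.20, 0.05)

# A polygenic score is the genotype matrix multiplied by the effect vector.
pgs <- drop(geno %*% beta)

# Case 1: NAs in the genotypes. Only individuals with a missing genotype
# among the scored variants get an NA score.
ind_missing <- which(rowSums(is.na(geno)) > 0)  # individuals with missing data
var_missing <- which(colSums(is.na(geno)) > 0)  # variants with missing data
print(ind_missing)
print(is.na(pgs))

# Case 2: NAs in the effects. A single NA effect makes every score NA.
beta_bad <- beta
beta_bad[3] <- NA
pgs_bad <- drop(geno %*% beta_bad)
print(anyNA(beta_bad))
print(all(is.na(pgs_bad)))
```

In this toy setting, scores that are NA for every individual point either to NA effects or to every individual having at least one missing genotype among the scored variants, so checking `anyNA()` on both inputs is a quick first step.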

uki-uiu commented 4 months ago

Thank you, Dr. Privé, I will look into it!

privefl commented 2 months ago

Any update on this?

uki-uiu commented 2 months ago

I managed to solve the issue after going through the other issues and solutions posted. Thank you.

privefl commented 2 months ago

Could you quickly summarize your solution for others?

And then close the issue, if there is nothing else on this.

uki-uiu commented 1 month ago

I ended up using the pred_grid option and imputed any missing genotypes: G2 <- snp_fastImputeSimple(G, method = "mean2")

To avoid the issues I was facing when using the script to score the test set, I extracted the SNPs from the PGS (with the best parameters after tuning with the validation cohort) created in the earlier steps. I then used another tool to score the participants in the test set.
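
For readers unfamiliar with the imputation step mentioned above, the following base R sketch shows the general idea of replacing each missing genotype with the mean of its variant, rounded to two decimal places. It uses a made-up matrix and a hypothetical helper, `impute_mean`; it is an illustration under those assumptions, not the bigsnpr implementation, which operates on file-backed genotype matrices.

```r
# Hypothetical helper: replace NAs in each column (variant) with the column
# mean of the observed genotypes, rounded to `digits` decimal places.
# Assumes every variant has at least one observed genotype.
impute_mean <- function(geno, digits = 2) {
  for (j in seq_len(ncol(geno))) {
    miss <- is.na(geno[, j])
    if (any(miss)) {
      geno[miss, j] <- round(mean(geno[!miss, j]), digits)
    }
  }
  geno
}

# Toy data (hypothetical): 4 individuals x 3 variants, genotypes coded 0/1/2.
geno <- matrix(c(0, 1,  2,
                 1, NA, 0,
                 2, 2,  1,
                 0, 0,  NA), nrow = 4, byrow = TRUE)

geno2 <- impute_mean(geno)
print(anyNA(geno2))  # no missing values remain
print(geno2)
```

After imputation, the product of the genotype matrix and the effect vector no longer contains NAs caused by missing genotypes, although NAs in the effects themselves would still need to be handled separately.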