privefl / bigsnpr

R package for the analysis of massive SNP arrays.
https://privefl.github.io/bigsnpr/

Extract betas from lassosum2 #320

Closed Fnyasimi closed 2 years ago

Fnyasimi commented 2 years ago

Hi @privefl, thank you for providing the lassosum2 reimplementation. I have been using lassosum to run my analyses and would like to try lassosum2 to check for improved performance.

I have looked at this tutorial and I would like to confirm:

  1. Does best_grid_lassosum2 contain the betas for all the SNPs in df_beta, and in the same order? I would love to extract all the non-zero betas with their rsid or chr:pos and allele info for further use downstream.

  2. Is there an equivalent of the pseudovalidate method re-implemented in lassosum2 for scenarios where we don't have the observed phenotype?

privefl commented 2 years ago
  1. Yes, same order as in df_beta.

  2. No, pseudovalidation has not been reimplemented since I have not found it to be robust enough (cf. preprint).
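Because the betas are in the same order as df_beta, pulling out the non-zero effects amounts to a row subset. The following is only a minimal sketch on toy data: the values are made up, and the column names (rsid, chr, pos, a0, a1) are assumed to follow what the tutorial's df_beta looks like, so adjust them to your own data.

```r
# Toy stand-ins (hypothetical values); in the tutorial, df_beta comes from
# snp_match() and best_grid_lassosum2 is one column of the lassosum2 grid.
df_beta <- data.frame(
  chr  = c(1L, 1L, 2L, 2L),
  pos  = c(1050L, 2300L, 480L, 9100L),
  a0   = c("A", "C", "G", "T"),
  a1   = c("G", "T", "A", "C"),
  rsid = c("rs1", "rs2", "rs3", "rs4"),
  stringsAsFactors = FALSE
)
best_grid_lassosum2 <- c(0, 0.12, 0, -0.05)

# Same length and same order, so indexing by position is enough.
stopifnot(length(best_grid_lassosum2) == nrow(df_beta))
nonzero <- which(best_grid_lassosum2 != 0)  # which() also drops any NA

res <- cbind(
  df_beta[nonzero, c("rsid", "chr", "pos", "a0", "a1")],
  beta_lassosum2 = best_grid_lassosum2[nonzero]
)
print(res)
```

The length check is there to catch the case where the betas were computed on a different subset of variants than the one in df_beta.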

Fnyasimi commented 2 years ago

Thank you for the response.

I would like to use the precomputed LD reference and am not sure whether I should match the summary stats to the LD reference, to the validation set, or to both. Do I use the approach described here when using the precomputed LD reference?

privefl commented 2 years ago

Yes, you should match both to the LD reference. You can do a quick filter to keep only the variants that you also have in the validation/test set (in_test) for deriving the PGS, and then do the second snp_match() later.
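The quick filter mentioned above can be sketched in base R as follows. This is an illustration on toy data, not the tutorial's code: the data frames map_ldref and map_test and their values are hypothetical, and the filter only compares chromosome and position. It does not check alleles, which is what the later snp_match() call is for.

```r
# Toy variant maps (hypothetical values): LD reference vs. validation/test set.
map_ldref <- data.frame(chr = c(1L, 1L, 2L, 2L), pos = c(100L, 200L, 100L, 300L))
map_test  <- data.frame(chr = c(1L, 2L, 2L),     pos = c(200L, 100L, 999L))

# Build a "chr:pos" key so that position 100 on chr 1 and chr 2 stay distinct.
key <- function(df) paste(df$chr, df$pos, sep = ":")
in_test <- key(map_ldref) %in% key(map_test)

# Keep only LD-reference variants that also exist in the test set;
# the summary statistics would then be matched against this subset.
map_ldref_sub <- map_ldref[in_test, ]
print(map_ldref_sub)
```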

Fnyasimi commented 2 years ago

Matching has been an issue for me and I am not sure what I am doing wrong. My GWAS summary stats, LD reference and test set contain an overlapping set of SNPs, but when I try to get pred_grid I end up with NAs. I am not sure if this is the result of a mismatch in the _NUM_ID_ between the LD reference and the test set or of a bug. I have also tried the approach explained in #318 but I am still getting NAs. What could be the issue?

privefl commented 2 years ago

NAs in the effect sizes produced by lassosum2 and LDpred2-grid correspond to models that completely diverged.

But if you have NAs in the predictions you get with LDpred2-auto, it must be that you have some NAs in the genotype matrix of your test set.
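One way to tell these cases apart is to inspect the grid of effect sizes column by column before predicting. The sketch below uses a toy matrix; beta_grid here is a made-up stand-in for the matrix of effect sizes, with one column per model.

```r
# Toy grid of effect sizes (hypothetical): 3 variants x 3 models,
# where the second model diverged (NA) and the third is fully sparse.
beta_grid <- cbind(c(0.1, 0, -0.2), c(NA, NA, NA), c(0, 0, 0))

# Models with any NA effect size are treated as diverged.
diverged <- apply(beta_grid, 2, anyNA)

# Models that converged but shrank every effect to zero.
all_zero <- colSums(beta_grid != 0, na.rm = TRUE) == 0 & !diverged

print(which(diverged))  # columns to drop before scoring
print(which(all_zero))  # valid models, but they give a constant prediction
```

All-zero columns are not an error by themselves; they simply produce the same score for everyone.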

Fnyasimi commented 2 years ago

Thanks for the feedback. I have done a few QC steps.

I imputed the genotypes using this function: G2 <- snp_fastImputeSimple(G, method = "mean2", ncores = nb_cores())

I checked the betas for lassosum2 and LDpred2-grid; they don't contain NAs, though some columns have only 0s.

When I run the prediction using the big_prodMat function, I end up with NAs in my predictions. But when I run the same analysis with the input data limited to chromosomes 1 and 2, I get results. I am not sure what happens when I scale it up to all chromosomes. Any ideas on things I could look out for?

Also, a quick question: what mapping do you use to convert the genomic coordinates from pos to cM?

privefl commented 2 years ago
Fnyasimi commented 2 years ago

I am using bigsnpr v1.9.11

Yes I am using the G2 in big_prodMat()

privefl commented 2 years ago

If you have NAs out, you must have NAs in. I don't see any other explanation. You can check anyNA(beta) and counts <- big_counts(G2); sum(counts[4, ]).
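The "NAs in, NAs out" point can be reproduced on a small example. This sketch uses a plain base R matrix product as a stand-in for big_prodMat(), and a simple column-mean fill as a stand-in for imputation; both are assumptions made for illustration, and all values are made up.

```r
# Toy genotype matrix (hypothetical): 3 individuals x 3 variants, one missing call.
G_toy <- matrix(c(0, 1,  2,
                  1, NA, 0,
                  2, 2,  1), nrow = 3, byrow = TRUE)
beta <- c(0.5, 0, -0.1)

# The missing genotype propagates even though its effect size is 0 (NA * 0 is NA).
pred <- drop(G_toy %*% beta)
print(pred)

# The two checks: NAs in the effect sizes, and NAs per variant in the genotypes.
print(anyNA(beta))
print(colSums(is.na(G_toy)))

# After filling the missing call with the column mean, the prediction is complete.
G_imp <- G_toy
for (j in seq_len(ncol(G_imp))) {
  miss <- is.na(G_imp[, j])
  G_imp[miss, j] <- mean(G_imp[, j], na.rm = TRUE)
}
pred2 <- drop(G_imp %*% beta)
print(pred2)
```

Note that a zero effect size does not hide a missing genotype, so a single NA anywhere in the matrix actually passed to the product is enough to give an NA score for that individual.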

privefl commented 2 years ago

Any update on this?

Fnyasimi commented 2 years ago

No updates so far. I will get back to it later on and try to find the problem. Feel free to close the issue; if I get an update I can comment on it. Thanks!