privefl / bigsnpr

R package for the analysis of massive SNP arrays.
https://privefl.github.io/bigsnpr/

Error in rowMeans(sapply(multi_auto[keep], function(auto) auto$beta_est)) : 'x' must be an array of at least two dimensions #416

Closed xuxwen closed 1 year ago

xuxwen commented 1 year ago

Hi Florian!

I am attempting to compute some PRS weights using LDpred2-auto with the snp_ldpred2_auto() function. I often get this error when I execute the following code:

```r
multi_auto <- snp_ldpred2_auto(
  corr, df_beta, h2_init = h2_est,
  vec_p_init = seq_log(1e-4, 0.2, length.out = 30),
  ncores = NCORES, burn_in = 500, num_iter = 500, report_step = 20,
  use_MLE = FALSE,  # uncomment if you have convergence issues (need v1.11.9)
  allow_jump_sign = FALSE, shrink_corr = 0.95)

range <- sapply(multi_auto, function(auto) diff(range(auto$corr_est)))
keep <- which(range > (0.95 * quantile(range, 0.95, na.rm = TRUE)) & range < 3)
beta_auto <- rowMeans(sapply(multi_auto[keep], function(auto) auto$beta_est))
```

[screenshot: `range` output] If you check my `range`, you will see that the values are all NA! What could be the problem?

privefl commented 1 year ago

There are some divergence issues. Did you perform some proper QC on the sumstats?

If yes, then try use_MLE = FALSE, and decreasing shrink_corr (e.g. 0.5-0.8).
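A minimal sketch of the adjusted call based on this advice (all other parameter values are kept from the original post; the exact shrink_corr value to use is a guess within the suggested range):

```r
# Re-run LDpred2-auto with the more robust settings suggested above:
# use_MLE = FALSE (needs bigsnpr >= v1.11.9) and a stronger shrinkage
# of the LD correlation matrix.
multi_auto <- snp_ldpred2_auto(
  corr, df_beta, h2_init = h2_est,
  vec_p_init = seq_log(1e-4, 0.2, length.out = 30),
  burn_in = 500, num_iter = 500, report_step = 20,
  use_MLE = FALSE,
  allow_jump_sign = FALSE,
  shrink_corr = 0.7,   # try values in 0.5-0.8
  ncores = NCORES)
```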

xuxwen commented 1 year ago

Yes, the eQTL summary statistics I used only include SNPs with MAF > 1%.

xuxwen commented 1 year ago

> There are some divergence issues. Did you perform some proper QC on the sumstats?
>
> If yes, then try use_MLE = FALSE, and decreasing shrink_corr (e.g. 0.5-0.8).

But I would still like to know which specific steps of QC you are referring to on your side?

privefl commented 1 year ago

https://privefl.github.io/bigsnpr/articles/LDpred2.html#quality-control-of-gwas-summary-statistics
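The QC step in that vignette compares the per-variant standard deviations implied by the sumstats with those derived from reference allele frequencies; a rough sketch (column names such as af_UKBB, beta_se and n_eff follow the vignette and may differ in your data):

```r
# sd of each variant from the reference allele frequencies
sd_ref <- with(map_ldref, sqrt(2 * af_UKBB * (1 - af_UKBB)))

# sd implied by the summary statistics (case-control / logistic GWAS)
sd_ss <- with(df_beta, 2 / sqrt(n_eff * beta_se^2 + beta^2))

# flag variants whose two sd estimates disagree too much
is_bad <- sd_ss < (0.5 * sd_ref) | sd_ss > (sd_ref + 0.1) |
  sd_ss < 0.1 | sd_ref < 0.05

df_beta <- df_beta[!is_bad, ]
```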

xuxwen commented 1 year ago

Thanks, Florian! Yes, our sumstats have all been through QC. But now there is a new problem: we are using OncoArray GWAS data (41,633 individuals) to predict individual outcomes. In the pred_auto and pred_auto_sparse results, many individuals were not predicted as 0 (which they should be if their results are meaningless), and many null values appeared. The following image shows the pred_auto_sparse results: [screenshot: pred_auto_sparse output]

privefl commented 1 year ago

0s or NAs? Do you have missing values in your data (genotypes) that appear when you calculate the polygenic scores (by the product of genotypes x effect sizes)?
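Computing the scores is indeed the product genotypes x effect sizes; with bigstatsr that is typically done with big_prodVec(), e.g. (a sketch; the `_NUM_ID_` column for ind.col is assumed from the LDpred2 vignette):

```r
# polygenic score = genotype matrix %*% effect sizes,
# restricted to the variants used in LDpred2
pred_auto <- big_prodVec(G, beta_auto,
                         ind.col = df_beta[["_NUM_ID_"]])

# any NA genotype among these variants propagates to an
# NA score for that individual
```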

xuxwen commented 1 year ago

> 0s or NAs? Do you have missing values in your data (genotypes) that appear when you calculate the polygenic scores (by the product of genotypes x effect sizes)?

Our OncoArray data was already imputed with TOPMed, so I'll have to take a closer look to see if there are any missing values, and I'll get back to you later!

xuxwen commented 1 year ago

> 0s or NAs? Do you have missing values in your data (genotypes) that appear when you calculate the polygenic scores (by the product of genotypes x effect sizes)?

The OncoArray genotype data has no missing values! I would also like to ask whether multiallelic loci could affect the results? [screenshot]

privefl commented 1 year ago

Could you check in R (before writing to the file)?

xuxwen commented 1 year ago

> Could you check in R (before writing to the file)?

There should be no problem. A fellow researcher built a PRS from methylation sites using the same code and the same auto model, and their output pred_auto and pred_auto_sparse show predicted values for every individual. I just can't understand it! [screenshot]

privefl commented 1 year ago

Without more information, I can't help you more.

xuxwen commented 1 year ago

> Without more information, I can't help you more.

What more information is needed? I'll provide as much as I can; I'm very keen to resolve this!

privefl commented 1 year ago

I suspect these blank lines in the file are missing values, not 0s. Can you check in R in the vector that you write to the file?

Then, if these are missing values, you need to check whether these come from the PGS effect sizes (not likely), or the genotype matrix (most likely).

xuxwen commented 1 year ago

> I suspect these blank lines in the file are missing values, not 0s. Can you check in R in the vector that you write to the file?
>
> Then, if these are missing values, you need to check whether these come from the PGS effect sizes (not likely), or the genotype matrix (most likely).

I have read the file back into R and confirmed that the blank lines are missing values (NA), not 0s. I have also checked the genotype matrix and confirmed that there are no missing values!

privefl commented 1 year ago

If you have missing values when you compute the PGS, it must be that you have missing values in the genotype matrix. How did you check there was no missing value? You can e.g. use big_counts(). No missing values in the effect sizes as well?

xuxwen commented 1 year ago

> If you have missing values when you compute the PGS, it must be that you have missing values in the genotype matrix. How did you check there was no missing value? You can e.g. use big_counts(). No missing values in the effect sizes as well?

Thanks again! I followed your advice and ran sum(big_counts(G)[4, ]); the return value is 514272579. What does this step do, and what does the return value represent?

privefl commented 1 year ago

big_counts(G)[4, ] should give you a vector with the number of missing values for each variant.
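So a quick way to locate the problematic variants is (a sketch; for a genotype FBM, big_counts() returns one column per variant with counts of the genotypes 0, 1, 2 and, in the last row, of missing values):

```r
counts <- big_counts(G)      # rows: counts of 0, 1, 2, NA per variant
n_missing <- counts[4, ]     # number of missing genotypes per variant
sum(n_missing)               # total missing calls across the matrix
head(which(n_missing > 0))   # indices of variants with missing data
```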

xuxwen commented 1 year ago

> big_counts(G)[4, ] should give you a vector with the number of missing values for each variant.

Thank you very much! We looked again carefully and there were indeed missing values: out of a total of 18,610,000 OncoArray variants, at a missingness threshold of 0.05 about 30,000 variants had missing data! This is most likely because we were too conservative when we first ran the TOPMed imputation, considering that OncoArray was a merger of two samples. Thanks again for your careful answers! I'd like to make a final enquiry about how long it would take to impute 18.6 million variants using G2 <- snp_fastImputeSimple(G, method = "mean2"). Thanks again! I wish you all the best!

privefl commented 1 year ago

It depends on the number of samples I guess, but it might take a while. Count on maybe 1 hour per 100 GB of data.

Note that you only need the variants that were used in LDpred2, so maybe you could have read only those to the bigSNP format.

Make sure these variants do not have too many missing values (< 10%) before using snp_fastImputeSimple().
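A sketch combining these checks (ind_ldpred2, the indices of the variants used in LDpred2, is an assumed name here; the < 10% threshold is from the comment above):

```r
# missingness rate per variant, restricted to the LDpred2 variants
n_missing <- big_counts(G, ind.col = ind_ldpred2)[4, ]
miss_rate <- n_missing / nrow(G)

# make sure none of these variants has too many missing values
stopifnot(all(miss_rate < 0.10))

# simple imputation by the rounded mean genotype
G2 <- snp_fastImputeSimple(G, method = "mean2", ncores = NCORES)
```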

privefl commented 1 year ago

Any update on this?