How to deal with "missing value error" on imputed genotype data

privefl / bigsnpr

R package for the analysis of massive SNP arrays.

https://privefl.github.io/bigsnpr/


How to deal with "missing value error" on imputed genotype data #511

Open panqinglzmc opened 3 months ago

panqinglzmc commented 3 months ago

I am using the demonstration data "penncath" from the tutorial of the R package bigsnpr. I have verified that there are no missing values in the imputed obj$geno_imputed data, either by SNP or by sample. However, when I use the big_SVD function for principal component analysis, I still encounter the error "You can't have missing values in 'X'". I would be very grateful if someone could help me identify where I might have made a mistake.
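For readers who want to repeat the check described above, here is a minimal sketch of one way to count missing values in a genotype matrix before and after imputation. It is not the code used in this thread. It assumes the small example file "example-missing.bed" shipped with bigsnpr rather than penncath, and it assumes the standard genotype coding in which the fourth row of the big_counts() output holds the NA counts.

```r
library(bigsnpr)  # also attaches bigstatsr

# Assumption: small example data shipped with bigsnpr, not the penncath data
obj <- snp_attachExtdata("example-missing.bed")
G <- obj$genotypes

# Counts of 0 / 1 / 2 / NA per variant; the 4th row holds the NA counts
counts_raw <- big_counts(G)
n_na_raw <- sum(counts_raw[4, ])

# Impute, then count again on the returned (imputed) matrix
G_imp <- snp_fastImputeSimple(G, method = "mode")
counts_imp <- big_counts(G_imp)
n_na_imp <- sum(counts_imp[4, ])

c(before = n_na_raw, after = n_na_imp)
```

Note that the counts must be taken on the object returned by snp_fastImputeSimple(): the original matrix keeps its original code and still decodes the imputed entries as missing.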

privefl commented 3 months ago

> using the demonstration data "penncath" from the tutorial of the R package bigsnpr

Not sure which data you're talking about??
Could you show snp_stats[, 1:5] please? (including the rownames)
Not related to this issue, but I would recommend using snp_autoSVD() instead of big_SVD() for genotype data.

panqinglzmc commented 3 months ago

Thanks Florian for the quick turnaround! And I must apologize for not explaining clearly. I am using the imputed genotype data from the GWAS tutorial "Imputation" (available at this link). The data was generated using the following code:

The snp_stats[, 1:5] output is as follows, identical to what is shown in the tutorial.

However, passing this obj$geno_imputed data to the big_SVD function in the subsequent code of the GWAS tutorial "Population structure" (available here) results in the error mentioned above.

### But I am very happy to say that following your suggestion to use snp_autoSVD() instead of big_SVD() for genotype data, the problem has been solved, and the code now runs smoothly.

Thank you so much, Florian. Your suggestion has been incredibly helpful.
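As an illustration of the workaround described in this comment, the following is a minimal sketch of a snp_autoSVD() call. It is not the poster's actual code (which was not preserved here). It assumes the default example data shipped with bigsnpr instead of penncath; for the thread's data, the matrix would presumably be penncath$geno_imputed, with the chromosome and position columns taken from penncath$map.

```r
library(bigsnpr)

# Assumption: default example data shipped with bigsnpr (no missing values)
obj <- snp_attachExtdata()
G <- obj$genotypes

# snp_autoSVD() prunes variants in LD and removes long-range LD regions
# before computing a truncated SVD
svd <- snp_autoSVD(G,
                   infos.chr = obj$map$chromosome,
                   infos.pos = obj$map$physical.pos,
                   k = 10,
                   ncores = 1)

pcs <- predict(svd)           # PC scores, one row per individual
dim(pcs)
length(attr(svd, "subset"))   # number of variants kept for the final SVD
```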

privefl commented 3 months ago

It's good that you found a workaround, but the initial issue is not really fixed. I have some idea of what's going on. Could you please confirm your packageVersion("bigstatsr")?

panqinglzmc commented 3 months ago

My ‘bigstatsr’ version is 1.5.12.

PS: My 'bigsnpr' version is 1.12.2. I loaded the bigsnpr package, and the bigstatsr package was automatically loaded along with it. Subsequently, I used the following two functions: snp_fastImputeSimple {bigsnpr} and snp_autoSVD {bigsnpr}.

privefl commented 3 months ago

What do you get if you run this reproducible code?

```r
zip <- runonce::download_file(
  "https://d1ypx1ckp5bo16.cloudfront.net/penncath/penncath.zip",
  dir = "tmp-data")
unzip(zip, exdir = "tmp-data", overwrite = FALSE)

library(bigsnpr)
snp_readBed("tmp-data/data/penncath.bed")
penncath <- snp_attach("tmp-data/data/penncath.rds")
penncath$geno_imputed <- snp_fastImputeSimple(Gna = penncath$genotypes,
                                              method = "mode",
                                              ncores = nb_cores())

big_SVD(penncath$geno_imputed, big_scale(), k = 10)
```

For me, it runs forever because there are some variables with no variation that prevent convergence (which now errors with v1.5.14). But I don't get the error about missing values (with both v1.5.12 and v1.5.15).
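The convergence problem mentioned above comes from variables with no variation. One possible way to sidestep it, sketched here under assumptions, is to exclude zero-variance columns before calling big_SVD(). This is an illustration only, not a fix proposed in the thread, and it is not expected to address the missing-value error itself. The sketch uses the default example data shipped with bigsnpr; `ind_keep` is a name introduced for illustration.

```r
library(bigsnpr)

# Assumption: default example data shipped with bigsnpr, not penncath
obj <- snp_attachExtdata()
G <- obj$genotypes

# Per-column sums and variances, computed without loading G into memory
stats <- big_colstats(G)

# Keep only the variants that actually vary
ind_keep <- which(stats$var > 0)

svd <- big_SVD(G, fun.scaling = big_scale(), ind.col = ind_keep, k = 10)
length(svd$d)   # number of singular values returned
```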

panqinglzmc commented 3 months ago

I ran the code and got the same error.

privefl commented 3 months ago

I cannot reproduce the issue, and I have no idea what's going on :/ Is this the only function where you have this issue? (e.g. if you also try running big_univLinReg(penncath$geno_imputed, rnorm(1401), ind.col = 1:100))

PS: You should try not to change the working directory; use RStudio projects and stick with the working directory of the project.
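The diagnostic suggested in this comment (running another bigstatsr function on the same matrix) can be tried on a small dataset first. The sketch below assumes the default example data shipped with bigsnpr, so the phenotype length is taken from the matrix rather than hard-coded to 1401 as for penncath.

```r
library(bigsnpr)

# Assumption: default example data shipped with bigsnpr, not penncath
obj <- snp_attachExtdata()
G <- obj$genotypes

set.seed(1)
y <- rnorm(nrow(G))   # random phenotype, one value per individual

# Univariate linear regression on the first 100 variants only
gwas <- big_univLinReg(G, y, ind.col = 1:100)
head(gwas)
```

If this call also fails with the missing-value error on penncath$geno_imputed, the problem is probably not specific to big_SVD().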