privefl / bigsnpr

R package for the analysis of massive SNP arrays.
https://privefl.github.io/bigsnpr/
183 stars 43 forks source link

Error in step #3 of projecting PC scores on new data #496

Open ozandikilitas opened 2 months ago

ozandikilitas commented 2 months ago

Hello, I receive the following error upon trying to project a new dataset onto 1KG reference panel.

> obj.bed.new
A 'bed' object with 12096 samples and 128122270 variants.
> test_pca <- bed_projectPCA(
+   obj.bed,
+   obj.bed.new,
+   k = 20,
+   ind.row.new = rows_along(obj.bed.new),
+   ind.row.ref = ind.row,
+   ind.col.ref = ind.col,
+   strand_flip = TRUE,
+   join_by_pos = TRUE,
+   match.min.prop = 0.5,
+   verbose = TRUE,
+   ncores = nb_cores()
+ ) 

[Step 1/3] Matching variants of reference with target data.. 1,233,111 variants to be matched. 6 ambiguous SNPs have been removed. 1,228,268 variants have been matched; 0 were flipped and 317,288 were reversed.

[Step 2/3] Computing (auto) SVD of reference..

Phase of clumping (on MAC) at r^2 > 0.2.. keep 302930 variants. Discarding 0 variant with MAC < 10.

Iteration 1: Computing SVD.. 0 outlier variant detected..

Converged!

[Step 3/3] Projecting PC scores on new data.. Error in { : task 1 failed - "'...' used in an incorrect context"

Have you encountered this issue before (or the specific error I am receiving at the end)? Thank you for your help!

ozandikilitas commented 2 months ago

Here's the line by line debugging, problem seems to stem from big_parallelize(). What would you like me to provide next?

 aric_test_pca <- bed_projectPCA(
+   obj.bed,
+   obj.bed.aric,
+   k = 20,
+   ind.row.new = rows_along(obj.bed.aric),
+ #  ind.row.ref = ind.row,
+ #  ind.col.ref = ind.col,
+   strand_flip = TRUE,
+   join_by_pos = TRUE,
+ #  match.min.prop = 0.5,
+   verbose = TRUE,
+   ncores = nb_cores()
+ ) 
debugging in: bed_projectPCA(obj.bed, obj.bed.aric, k = 20, ind.row.new = rows_along(obj.bed.aric), 
    strand_flip = TRUE, join_by_pos = TRUE, verbose = TRUE, ncores = nb_cores())
debug: {
    printf2 <- function(...) if (verbose) 
        printf(...)
    printf2("\n[Step 1/3] Matching variants of reference with target data..\n")
    map.ref <- setNames(obj.bed.ref$map[-3], c("chr", "rsid", 
        "pos", "a1", "a0"))
    map.new <- setNames(obj.bed.new$map[-3], c("chr", "rsid", 
        "pos", "a1", "a0"))
    if (join_by_pos && (build.new != build.ref)) {
        if (is.null(liftOver)) {
            stop2("Please provide path to liftOver executable.")
        }
        else if (substr(liftOver, 1, 1) != ".") {
            liftOver <- file.path(".", liftOver)
        }
        map.new <- snp_modifyBuild(map.new, liftOver = liftOver, 
            from = build.new, to = build.ref)
    }
    info_snp <- snp_match(cbind(map.ref, beta = 1), map.new, 
        strand_flip = strand_flip, join_by_pos = join_by_pos, 
        match.min.prop = match.min.prop)
    printf2("\n[Step 2/3] Computing (auto) SVD of reference..\n")
    obj.svd <- bed_autoSVD(obj.bed.ref, ind.row = ind.row.ref, 
        ind.col = intersect(ind.col.ref, info_snp$`_NUM_ID_.ss`), 
        k = k, ncores = ncores, verbose = verbose, ...)
    printf2("\n[Step 3/3] Projecting PC scores on new data..\n")
    keep <- match(attr(obj.svd, "subset"), info_snp$`_NUM_ID_.ss`)
    X_norm <- FBM(length(ind.row.new), 1, init = 0)
    XV <- FBM(length(ind.row.new), k, init = 0)
    big_parallelize(obj.bed.new$light, p.FUN = part_prod, ind = seq_along(keep), 
        ncores = ncores, ind.row = ind.row.new, ind.col = info_snp$`_NUM_ID_`[keep], 
        center = (obj.svd$center - 1) * info_snp$beta[keep] + 
            1, scale = obj.svd$scale * info_snp$beta[keep], V = obj.svd$v, 
        XV = XV, X_norm = X_norm)
    list(obj.svd.ref = obj.svd, simple_proj = XV[, , drop = FALSE], 
        OADP_proj = OADP_proj(XV, X_norm, obj.svd$d, ncores = ncores))
}
Browse[2]> n
debug: printf2 <- function(...) if (verbose) printf(...)
Browse[2]> n
debug: printf2("\n[Step 1/3] Matching variants of reference with target data..\n")
Browse[2]> n

[Step 1/3] Matching variants of reference with target data..
debug: map.ref <- setNames(obj.bed.ref$map[-3], c("chr", "rsid", "pos", 
    "a1", "a0"))
Browse[2]> n
debug: map.new <- setNames(obj.bed.new$map[-3], c("chr", "rsid", "pos", 
    "a1", "a0"))
Browse[2]> n
debug: if (join_by_pos && (build.new != build.ref)) {
    if (is.null(liftOver)) {
        stop2("Please provide path to liftOver executable.")
    }
    else if (substr(liftOver, 1, 1) != ".") {
        liftOver <- file.path(".", liftOver)
    }
    map.new <- snp_modifyBuild(map.new, liftOver = liftOver, 
        from = build.new, to = build.ref)
}
Browse[2]> n
debug: info_snp <- snp_match(cbind(map.ref, beta = 1), map.new, strand_flip = strand_flip, 
    join_by_pos = join_by_pos, match.min.prop = match.min.prop)
Browse[2]> n
1,233,111 variants to be matched.
6 ambiguous SNPs have been removed.
1,228,268 variants have been matched; 0 were flipped and 317,288 were reversed.
debug: printf2("\n[Step 2/3] Computing (auto) SVD of reference..\n")
Browse[2]> n

[Step 2/3] Computing (auto) SVD of reference..
debug: obj.svd <- bed_autoSVD(obj.bed.ref, ind.row = ind.row.ref, ind.col = intersect(ind.col.ref, 
    info_snp$`_NUM_ID_.ss`), k = k, ncores = ncores, verbose = verbose, 
    ...)
Browse[2]> n

Phase of clumping (on MAC) at r^2 > 0.2.. keep 304498 variants.
Discarding 0 variant with MAC < 10.

Iteration 1:
Computing SVD..
364 outlier variants detected..
5 long-range LD regions detected..

Iteration 2:
Computing SVD..
0 outlier variant detected..

Converged!
debug: printf2("\n[Step 3/3] Projecting PC scores on new data..\n")
Browse[2]> n

[Step 3/3] Projecting PC scores on new data..
debug: keep <- match(attr(obj.svd, "subset"), info_snp$`_NUM_ID_.ss`)
Browse[2]> n
debug: X_norm <- FBM(length(ind.row.new), 1, init = 0)
Browse[2]> n
debug: XV <- FBM(length(ind.row.new), k, init = 0)
Browse[2]> n
debug: big_parallelize(obj.bed.new$light, p.FUN = part_prod, ind = seq_along(keep), 
    ncores = ncores, ind.row = ind.row.new, ind.col = info_snp$`_NUM_ID_`[keep], 
    center = (obj.svd$center - 1) * info_snp$beta[keep] + 1, 
    scale = obj.svd$scale * info_snp$beta[keep], V = obj.svd$v, 
    XV = XV, X_norm = X_norm)
Browse[2]> n
Error in { : task 1 failed - "'...' used in an incorrect context"
> traceback()
6: stop(simpleError(msg, call = expr))
5: e$fun(obj, substitute(ex), parent.frame(), e$data)
4: foreach(ic = rows_along(intervals)) %dopar% {
       ind.part <- ind[seq(intervals[ic, "lower"], intervals[ic, 
           "upper"])]
       FUN(ind = ind.part, ...)
   }
3: bigparallelr::split_parapply(p.FUN, ind, X, ..., .combine = p.combine, 
       ncores = ncores, opts_cluster = list(type = getOption("bigstatsr.cluster.type")))
2: big_parallelize(obj.bed.new$light, p.FUN = part_prod, ind = seq_along(keep), 
       ncores = ncores, ind.row = ind.row.new, ind.col = info_snp$`_NUM_ID_`[keep], 
       center = (obj.svd$center - 1) * info_snp$beta[keep] + 1, 
       scale = obj.svd$scale * info_snp$beta[keep], V = obj.svd$v, 
       XV = XV, X_norm = X_norm)
1: bed_projectPCA(obj.bed, obj.bed.aric, k = 20, ind.row.new = rows_along(obj.bed.aric), 
       strand_flip = TRUE, join_by_pos = TRUE, verbose = TRUE, ncores = nb_cores())
> sessionInfo()
R version 4.2.2 (2022-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux 8.8 (Ootpa)

Matrix products: default
BLAS:   /usr/lib64/libblas.so.3.8.0
LAPACK: /usr/lib64/liblapack.so.3.8.0

locale:
 [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C         LC_TIME=C           
 [4] LC_COLLATE=C         LC_MONETARY=C        LC_MESSAGES=C       
 [7] LC_PAPER=C           LC_NAME=C            LC_ADDRESS=C        
[10] LC_TELEPHONE=C       LC_MEASUREMENT=C     LC_IDENTIFICATION=C 

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] bigsnpr_1.12.2    bigstatsr_1.5.12  patchwork_1.2.0   ggplot2_3.5.0    
 [5] flashpcaR_2.1     Hmisc_5.1-2       tidyr_1.3.1       tibble_3.2.1     
 [9] data.table_1.15.4 dplyr_1.1.4      

loaded via a namespace (and not attached):
 [1] runonce_0.2.3      foreach_1.5.2      carData_3.0-5      Formula_1.2-5     
 [5] doRNG_1.8.6        robustbase_0.99-2  pillar_1.9.0       backports_1.4.1   
 [9] lattice_0.20-45    glue_1.7.0         digest_0.6.35      ggsignif_0.6.4    
[13] checkmate_2.3.1    colorspace_2.1-1   cowplot_1.1.3      htmltools_0.5.8.1 
[17] Matrix_1.6-5       pkgconfig_2.0.3    broom_1.0.5        bigparallelr_0.3.2
[21] purrr_1.0.2        rmio_0.4.0         scales_1.3.0       RSpectra_0.16-1   
[25] ff_4.0.12          htmlTable_2.4.2    generics_0.1.3     car_3.1-2         
[29] ggpubr_0.6.0       withr_3.0.0        nnet_7.3-18        cli_3.6.2         
[33] magrittr_2.0.3     evaluate_0.23      ps_1.7.6           bigassertr_0.1.6  
[37] fansi_1.0.6        parallelly_1.37.1  doParallel_1.0.17  rstatix_0.7.2     
[41] foreign_0.8-86     tools_4.2.2        bigutilsr_0.3.4    lifecycle_1.0.4   
[45] stringr_1.5.1      munsell_0.5.1      bigsparser_0.6.1   cluster_2.1.4     
[49] rngtools_1.5.2     compiler_4.2.2     rlang_1.1.3        grid_4.2.2        
[53] iterators_1.0.14   rstudioapi_0.16.0  htmlwidgets_1.6.4  base64enc_0.1-3   
[57] rmarkdown_2.26     gtable_0.3.4       codetools_0.2-18   flock_0.7         
[61] abind_1.4-5        R6_2.5.1           gridExtra_2.3      knitr_1.46        
[65] bit_4.0.5          fastmap_1.1.1      utf8_1.2.4         bigreadr_0.2.5    
[69] stringi_1.8.3      parallel_4.2.2     Rcpp_1.0.12        vctrs_0.6.5       
[73] rpart_4.1.19       DEoptimR_1.1-3     tidyselect_1.2.1   xfun_0.43 
privefl commented 2 months ago

Thanks for reporting.

Does this work for you?

library(bigsnpr)

bedfile <- download_1000G("tmp-data")
obj.bed.aric <- obj.bed <- bed(bedfile)

bed_projectPCA(obj.bed, obj.bed.aric, k = 20, ind.row.new = rows_along(obj.bed.aric), 
               strand_flip = TRUE, join_by_pos = TRUE, verbose = TRUE, ncores = nb_cores())
ozandikilitas commented 2 months ago

It does! With the following object as a result:

str(test)

List of 3
 $ obj.svd.ref:List of 7
  ..$ d     : num [1:20] 8238 5427 2966 2589 1288 ...
  ..$ u     : num [1:2490, 1:20] -0.013 -0.0129 -0.013 -0.0128 -0.0129 ...
  ..$ v     : num [1:378519, 1:20] 2.57e-03 -2.84e-05 2.64e-03 -1.46e-03 -1.19e-03 ...
  ..$ niter : num 7
  ..$ nops  : num 234
  ..$ center: num [1:378519] 0.0691 0.0538 0.0703 0.3827 0.1522 ...
  ..$ scale : num [1:378519] 0.258 0.229 0.26 0.556 0.375 ...
  ..- attr(*, "class")= chr "big_SVD"
  ..- attr(*, "subset")= int [1:378519] 1 3 8 9 10 12 14 16 18 19 ...
  ..- attr(*, "lrldr")='data.frame':    5 obs. of  3 variables:
  .. ..$ Chr  : num [1:5] 6 6 6 15 17
  .. ..$ Start: num [1:5] 30468791 33260215 33684313 25677831 1863804
  .. ..$ Stop : num [1:5] 32800427 33512731 36288421 25823561 2359952
 $ simple_proj: num [1:2490, 1:20] -107 -106 -107 -105 -106 ...
 $ OADP_proj  : num [1:2490, 1:20] -107 -107 -108 -106 -107 ...

Any suggestions on next steps?

privefl commented 2 months ago

What do you get for class(obj.bed) and class(obj.bed.aric) with your data?

ozandikilitas commented 2 months ago

Here it is > class(obj.bed) [1] "bed" attr(,"package") [1] "bigsnpr" > class(obj.bed.aric) [1] "bed" attr(,"package") [1] "bigsnpr"

ozandikilitas commented 2 months ago

FYI, when I also do the following with my obj.bed.aric (after following QC, removing related, restrict to HapMap3 etc.), it works with no issues. Is the issue something to do with subsetting datasets to the common SNPs between the reference and target?

bed_projectPCA(obj.bed.aric, obj.bed.aric, k = 20, ind.row.new = rows_along(obj.bed.aric), strand_flip = TRUE, join_by_pos = TRUE, verbose = TRUE, ncores = nb_cores())

privefl commented 2 months ago

You're using obj.bed.aric twice in the latest code you presented; is a test or just a typo? Is it working when you do so?

I'm not sure I understood everything you tried to do (QC, etc). Could you try only one of those at once to see which one solves the problem? We could then have a better guess on finding what's going on (I currently have no idea).

ozandikilitas commented 2 months ago

No, it was just a test. I still receive the original error if I use 1000G as the reference and ARIC as the target dataset. QC stuff I mentioned was just going through what you have done on 1000G on your PCA vignette, nothing more than that. In any case, I still receive the error when the reference and target datasets are different.

privefl commented 1 month ago

But you don't when using only ARIC? And you don't when using some QC?

ozandikilitas commented 1 month ago

Yes, if I use either 1000G OR ARIC for both reference and target dataset within bed_projectPCA() (i.e., bed_projectPCA(obj.bed.aric, obj.bed.aric, ...) OR bed_projectPCA(obj.bed.1KG, obj.bed.1KG, ...), the function executes without any errors. However, if I put 1000G as the reference and ARIC as the target dataset (i.e., bed_projectPCA(obj.bed.1KG, obj.bed.ARIC, ...)), I get the error I posted in my first message. Forget about the QC stuff I mentioned, that has not impact on the error message whatsoever, I just unnecessarily confused you with that.

privefl commented 1 month ago

Would you be able to share the bim file (infos on the variants) from the ARIC dataset with me?

ozandikilitas commented 1 month ago

After you requested the bim file, I figured out the problem. Reference dataset bim file had variant names in the format chr.pos whereas target dataset (i.e., ARIC) had variant names in the format chr:pos:ref:alt. Even though I had join_by_pos=TRUE in bed_projectPCA, I guess having different variant names created an issue. Once I made the variant names in the exact same format across both datasets, bed_projectPCA runs without any errors. Thank you for all your help! Happy to provide more information if you need it!

privefl commented 1 month ago

Variant names should not matter. For example, this works:

library(bigsnpr)

bedfile <- download_1000G("tmp-data")
obj.bed <- bed(bedfile)

bigsnp <- snp_attach(snp_readBed2(bedfile, tempfile(), ind.col = 1:1000))
bigsnp$map$marker.ID <- 
  with(bigsnp$map, paste(chromosome, physical.pos, allele1, allele2, sep = ":"))
obj.bed.aric <- bed(snp_writeBed(bigsnp, tempfile(fileext = ".bed")))

bed_projectPCA(obj.bed, obj.bed.aric, k = 20, ind.row.new = rows_along(obj.bed.aric), 
               strand_flip = TRUE, join_by_pos = TRUE, verbose = TRUE, ncores = nb_cores())
ozandikilitas commented 1 month ago

Interesting, that is the only thing I changed. I will look into it a bit more to see whether I am missing something.

privefl commented 1 month ago

If we can understand what was going on, that would be great.