privefl / bigsnpr

R package for the analysis of massive SNP arrays.
https://privefl.github.io/bigsnpr/
186 stars 44 forks source link

snp_fastImpute xgboost error #386

Closed sounkou-bioinfo closed 1 year ago

sounkou-bioinfo commented 1 year ago

Hello Dr Prive,

The subject of this issue is a question relating to an error I got when trying to test the snp_fastImpute method from the bigsnpr package. To explain the context, I try to impute the missing genotypes for 115 genotyped individuals with the GSA v3 chip. But I get the following error code after ~1 hour. It should be noted that the methods works when i subset (using snp_subset) the dataset to 10 000 SNPs from chromosome 1.

Error code :

 Error in xgb.iter.update(bst$handle, dtrain, iteration - 1, obj) :
  [18:37:26] amalgamation/../src/objective/./regression_loss.h:94: Check failed: base_score > 0.0f && base_score < 1.0f: base_score must be in (0,1) for logistic loss, got: nan
Stack trace:
  [bt] (0) /home/cbenhamda/R/x86_64-pc-linux-gnu-library/4.0/xgboost/libs/xgboost.so(+0x645b6) [0x7ff6513885b6]
  [bt] (1) /home/cbenhamda/R/x86_64-pc-linux-gnu-library/4.0/xgboost/libs/xgboost.so(+0x17b801) [0x7ff65149f801]
  [bt] (2) /home/cbenhamda/R/x86_64-pc-linux-gnu-library/4.0/xgboost/libs/xgboost.so(+0x1de172) [0x7ff651502172]
  [bt] (3) /home/cbenhamda/R/x86_64-pc-linux-gnu-library/4.0/xgboost/libs/xgboost.so(+0x1dd483) [0x7ff651501483]
  [bt] (4) /home/cbenhamda/R/x86_64-pc-linux-gnu-library/4.0/xgboost/libs/xgboost.so(XGBoosterUpdateOneIter+0x64) [0x7ff6513af504]
  [bt] (5) /home/cbenhamda/R/x86_64-pc-linux-gnu-library/4.0/xgboost/libs/xgboost.so(XGBoosterUpdateOneIter_R+0x3d) [0x7ff65138231d]
  [bt] (6) /usr/lib/R/lib/libR.so(+0xfb8c0) [0x7ff6565058c0]
  [bt] (7) /usr/lib/R/lib/
Calls: snp_fastImpute -> imputeChr -> <Anonymous> -> xgb.iter.update
In addition: Warning message:
NA or NaN values in the resulting correlation matrix.
Execution halted

The code is the following:

### bed obtained using plink --chr 1-22 --merge-list of individual sample beds
### individual beds obtained from using plink --vcf  of sample specific genotype calls 

bedfile <- "/home/cbenhamda/GWASDB/TestList.autosome.bed" 
tmpfile <- tempfile(pattern=basename(sub_bed(bedfile)), 
                tmpdir=dirname(sub_bed(bedfile)),
                fileext="")
### load big snp
snp_readBed(bedfile, backingfile = tmpfile)
obj.bigSNP <- snp_attach(paste0(tmpfile, ".rds"))
str(obj.bigSNP, max.level = 2, strict.width = "cut")

# Get aliases for useful slots
G   <- obj.bigSNP$genotypes
CHR <- obj.bigSNP$map$chromosome
POS <- obj.bigSNP$map$physical.pos
big_counts(G, ind.col = 1:10)
# Half of the cores you have on your computer
NCORES <- nb_cores()
#ImputedG <- snp_fastImputeSimple(G)#CHR,ncores=NCORES,seed=0)
print("==== imputation2")
Imputed_xgboost <- snp_fastImpute(G,CHR,ncores=NCORES,seed=0)

Session info :

R version 4.0.4 (2021-02-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 11 (bullseye)

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] xgboost_1.6.0.1  bigsnpr_1.11.4   bigstatsr_1.5.12

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.8.3       bigsparser_0.6.1   pillar_1.7.0       compiler_4.0.4    
 [5] bigparallelr_0.3.2 rngtools_1.5.2     iterators_1.0.14   digest_0.6.29     
 [9] jsonlite_1.8.0     lifecycle_1.0.1    tibble_3.1.6       gtable_0.3.0      
[13] lattice_0.20-41    doRNG_1.8.2        pkgconfig_2.0.3    rlang_1.0.2       
[17] Matrix_1.3-2       foreach_1.5.2      DBI_1.1.2          cli_3.2.0         
[21] flock_0.7          parallel_4.0.4     bigassertr_0.1.5   dplyr_1.0.8       
[25] generics_0.1.2     vctrs_0.3.8        grid_4.0.4         tidyselect_1.1.2  
[29] cowplot_1.1.1      data.table_1.14.4  glue_1.6.2         R6_2.5.1          
[33] fansi_1.0.2        ggplot2_3.3.5      purrr_0.3.4        magrittr_2.0.2    
[37] scales_1.1.1       codetools_0.2-18   ellipsis_0.3.2     assertthat_0.2.1  
[41] colorspace_2.0-3   utf8_1.2.2         munsell_0.5.0      doParallel_1.0.17 
[45] crayon_1.5.0   
privefl commented 1 year ago

You probably have some variants with almost only missing values. You need to perform some QC beforehand to remove all variants with e.g. more than 10% or 20% of missing values.

sounkou-bioinfo commented 1 year ago

Hi florian It was indeed a missing variant issue. Thank you. I have another related question. Would you recommend using the method for imputing say hap map 3 (or the SNP list in the package) with 1000 G or other external panels or using standard imputation workflows is more efficient for that task ?

privefl commented 1 year ago

It has not been tested with external ref panel to impute completely missing variants, and the code would have to be modified, so I would just recommend standard imputation methods with large ref panels such as HRC.