privefl / bigsnpr

R package for the analysis of massive SNP arrays.
https://privefl.github.io/bigsnpr/
186 stars 44 forks source link

Integer overflow warning in snp_cor() #434

Closed fmorgante closed 1 year ago

fmorgante commented 1 year ago

Hi Florian, I have gotten the following warning when running snp_cor to compute the per-chromosome sparse LD matrix for chromosomes 1-3,5,6 from ~155K UKB individuals from imputed data.

Warning message:
In (function (Gna, ind.row = rows_along(Gna), ind.col = cols_along(Gna),  :
  integer overflow in 'cumsum'; use 'cumsum(as.numeric(.))'

Is this something you have encountered?

privefl commented 1 year ago

How many variants on e.g. chr1?

privefl commented 1 year ago

Have you looked at https://github.com/privefl/bigsnpr/issues?q=is%3Aissue+integer+overflow+in+%27cumsum%27?

fmorgante commented 1 year ago
> library(bigsnpr)
Loading required package: bigstatsr
> geno <- snp_attach("/scratch1/fabiom/ukb_geno_imp_HM3_tiezzi_for_LD.rds")
> str(geno)
List of 2
 $ genotypes:Reference class 'FBM.code256' [package "bigstatsr"] with 16 fields
  ..$ extptr      :<externalptr> 
  ..$ extptr_rw   :<externalptr> 
  ..$ nrow        : int 152434
  ..$ ncol        : int 1054330
  ..$ type        : Named int 1
  .. ..- attr(*, "names")= chr "unsigned char"
  ..$ backingfile : chr "/scratch1/fabiom/ukb_geno_imp_HM3_tiezzi_for_LD.bk"
  ..$ is_read_only: logi FALSE
  ..$ address     :<externalptr> 
  ..$ address_rw  :<externalptr> 
  ..$ bk          : chr "/scratch1/fabiom/ukb_geno_imp_HM3_tiezzi_for_LD.bk"
  ..$ rds         : chr "/scratch1/fabiom/ukb_geno_imp_HM3_tiezzi_for_LD.rds"
  ..$ is_saved    : logi TRUE
  ..$ type_chr    : chr "unsigned char"
  ..$ type_size   : int 1
  ..$ file_size   : num 1.61e+11
  ..$ code256     : num [1:256] 0 1 2 NA 0 1 2 0 0.01 0.02 ...
  ..and 26 methods, of which 12 are  possibly relevant:
  ..  add_columns, as.FBM, bm, bm.desc, check_dimensions, check_write_permissions, copy#envRefClass, initialize,
  ..  initialize#FBM, save, show#envRefClass, show#FBM
 $ map      : tibble [1,054,330 × 8] (S3: tbl_df/tbl/data.frame)
  ..$ chromosome  : chr [1:1054330] "01" "01" "01" "01" ...
  ..$ marker.ID   : chr [1:1054330] "rs3131972" "1:754182_A_G" "1:760912_C_T" "rs12562034" ...
  ..$ rsid        : chr [1:1054330] "rs3131972" "rs3131969" "rs1048488" "rs12562034" ...
  ..$ physical.pos: int [1:1054330] 752721 754182 760912 768448 779322 838555 846808 853954 854250 864938 ...
  ..$ allele1     : chr [1:1054330] "A" "A" "C" "G" ...
  ..$ allele2     : chr [1:1054330] "G" "G" "T" "A" ...
  ..$ freq        : num [1:1054330] 0.841 0.87 0.84 0.105 0.128 ...
  ..$ info        : num [1:1054330] 1 0.993 0.991 1 1 ...
 - attr(*, "class")= chr "bigSNP"
> table(geno$map$chromosome)

   01    02    03    04    05    06    07    08    09    10    11    12    13    14    15    16    17    18    19 
87145 87521 73414 65490 66330 71046 58036 57084 48475 56549 53965 51492 40130 35093 31768 32431 28771 31497 19803 
   20    21    22 
27728 15113 15449 

That's the genotype data I am using as input to snp_cor.

inds <- which(geno$map$chromosome==chr)

POS <- snp_asGeneticPos(as.integer(geno$map$chromosome), geno$map$physical.pos, 
                            dir=genetic_map, 
                            ncores=ncores)

LD <- snp_cor(Gna=geno$genotypes, ind.col=inds, size=0.003, infos.pos=POS[inds], thr_r2=0, ncores=ncores)

where genetic_map points to 1000-genomes-genetic-maps/interpolated_OMNI.

privefl commented 1 year ago

In inds, do you have all indices, or do you do it per chromosome?

fmorgante commented 1 year ago

I do it per-chromosome, so inds includes the indices of the chosen chromosome -- see second block of code.

privefl commented 1 year ago

Can you share POS[inds], for e.g. chr 1?

fmorgante commented 1 year ago

Here it is.

POS_chr1.rds.zip

privefl commented 1 year ago

That's the same of what I have, so I don't understand what's the problem.

What do you have for packageVersion("bigsnpr"), packageVersion("Matrix"), and R.version.string?

fmorgante commented 1 year ago
> packageVersion("bigsnpr")
[1] ‘1.12.2’
> packageVersion("Matrix")
[1] ‘1.5.3’
> R.version.string
[1] "R version 4.2.3 (2023-03-15)"

> sessionInfo()
R version 4.2.3 (2023-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Rocky Linux 8.5 (Green Obsidian)

Matrix products: default
BLAS/LAPACK: /opt/intel/oneapi/mkl/2023.0.0/lib/intel64/libmkl_gf_lp64.so.2

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] bigsnpr_1.12.2   bigstatsr_1.5.12

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.10        magrittr_2.0.3     cowplot_1.1.1      tidyselect_1.2.0
 [5] munsell_0.5.0      doParallel_1.0.17  lattice_0.20-45    colorspace_2.1-0
 [9] R6_2.5.1           doRNG_1.8.6        rlang_1.1.0        foreach_1.5.2
[13] fansi_1.0.4        dplyr_1.1.1        rngtools_1.5.2     parallel_4.2.3
[17] grid_4.2.3         data.table_1.14.8  gtable_0.3.3       utf8_1.2.3
[21] cli_3.6.1          bigparallelr_0.3.2 iterators_1.0.14   digest_0.6.31
[25] tibble_3.2.1       lifecycle_1.0.3    Matrix_1.5-3       bigsparser_0.6.1
[29] bigassertr_0.1.6   ggplot2_3.4.2      flock_0.7          vctrs_0.6.1
[33] codetools_0.2-19   glue_1.6.2         compiler_4.2.3     pillar_1.9.0
[37] generics_0.1.3     scales_1.2.1       pkgconfig_2.0.3
privefl commented 1 year ago

Do you know how to use debugonce("snp_cor") to run some code inside a function line by line? The issue probably occurs here: https://github.com/privefl/bigsnpr/blob/master/R/corr.R#L46.

If you could try this quickly, you can subset to e.g. ind.row = 1:100 in snp_cor(). If you could then store ind_val before the error occurs.

fmorgante commented 1 year ago

What I see is that there are some NaN in the x item in ind_val. See below

> dat[[87145]]$x
   [1]  0.0753193974  0.0690836139 -0.0670640191  0.0714285714  0.0714285714  0.0714285714 -0.1128871071
   [8] -0.1128406900 -0.0585521277 -0.1128406900  0.1071428571 -0.0393307573 -0.0421262732 -0.0421832327
  [15] -0.0421262732  0.1017963261  0.0391930901 -0.0488717735 -0.0226573547  0.0409036671  0.0486646444
  [22]  0.0786256905 -0.0740093319 -0.0736542924 -0.0736542924  0.0287236491 -0.0740364356 -0.0736542924
  [29]  0.0958314847  0.0512211772  0.0219884653 -0.1217832406 -0.1242946914 -0.1217832406 -0.1103159422
  [36] -0.0638876565  0.0651660981 -0.0736542924 -0.1262201803  0.0648005598 -0.0670732786  0.0663791536
  [43] -0.1116644500 -0.0736542924  0.0648005598 -0.1117537049 -0.0670732786 -0.0736542924  0.0651791384
  [50]  0.0672673472 -0.0739316898  0.0472944423 -0.1005037815 -0.0401305533           NaN           NaN
  [57] -0.1231577430 -0.0327038552 -0.0327038552           NaN  0.0189315591 -0.0161143666 -0.0335683050

Is that the problem?

privefl commented 1 year ago

No, I think that's just the variants with no variation, which can happen if you subsetted to only 100 individuals to compute correlations.

What is the sum of all $x lengths?

Then try to run the next lines https://github.com/privefl/bigsnpr/blob/master/R/corr.R#L43-L47

fmorgante commented 1 year ago

98457596

I have actually found out that I do not get that warning in this smaller example with 100 individuals -- those lines run without any problem.

Any idea about why I have issues with the larger n data?

As far as I can tell, the length of the $x depends only on the number of variants.

privefl commented 1 year ago

This number (of non-zero x values) is normal and very far from the max allowed (.Machine$integer.max), so there is no problem here.

And yes, this should not depend on the number of individuals used. Are you really sure this is the code you used for the previous large-N? You can also try with e.g. 5K indiv first, to see if you get something similar to with 100, if you do not want to wait for the computation with the full data (which should take a few hours).

fmorgante commented 1 year ago

A weird thing is happening. If I run the code interactively, I do not get any warning, no matter the sample size. If I run the code in batch mode with Rscript, I get the warning. Not sure why that happens, but I do not think it is a problem of bigsnpr.

Any ideas?

privefl commented 1 year ago

I have no idea what's going on.. :-(

fmorgante commented 1 year ago

I am going ahead and close the issue. I will reopen it in case I find out what's going on.

privefl commented 1 year ago

Thanks, and sorry for not being able to help more.