Closed fmorgante closed 1 year ago
How many variants on e.g. chr1?
> library(bigsnpr)
Loading required package: bigstatsr
> geno <- snp_attach("/scratch1/fabiom/ukb_geno_imp_HM3_tiezzi_for_LD.rds")
> str(geno)
List of 2
$ genotypes:Reference class 'FBM.code256' [package "bigstatsr"] with 16 fields
..$ extptr :<externalptr>
..$ extptr_rw :<externalptr>
..$ nrow : int 152434
..$ ncol : int 1054330
..$ type : Named int 1
.. ..- attr(*, "names")= chr "unsigned char"
..$ backingfile : chr "/scratch1/fabiom/ukb_geno_imp_HM3_tiezzi_for_LD.bk"
..$ is_read_only: logi FALSE
..$ address :<externalptr>
..$ address_rw :<externalptr>
..$ bk : chr "/scratch1/fabiom/ukb_geno_imp_HM3_tiezzi_for_LD.bk"
..$ rds : chr "/scratch1/fabiom/ukb_geno_imp_HM3_tiezzi_for_LD.rds"
..$ is_saved : logi TRUE
..$ type_chr : chr "unsigned char"
..$ type_size : int 1
..$ file_size : num 1.61e+11
..$ code256 : num [1:256] 0 1 2 NA 0 1 2 0 0.01 0.02 ...
..and 26 methods, of which 12 are possibly relevant:
.. add_columns, as.FBM, bm, bm.desc, check_dimensions, check_write_permissions, copy#envRefClass, initialize,
.. initialize#FBM, save, show#envRefClass, show#FBM
$ map : tibble [1,054,330 × 8] (S3: tbl_df/tbl/data.frame)
..$ chromosome : chr [1:1054330] "01" "01" "01" "01" ...
..$ marker.ID : chr [1:1054330] "rs3131972" "1:754182_A_G" "1:760912_C_T" "rs12562034" ...
..$ rsid : chr [1:1054330] "rs3131972" "rs3131969" "rs1048488" "rs12562034" ...
..$ physical.pos: int [1:1054330] 752721 754182 760912 768448 779322 838555 846808 853954 854250 864938 ...
..$ allele1 : chr [1:1054330] "A" "A" "C" "G" ...
..$ allele2 : chr [1:1054330] "G" "G" "T" "A" ...
..$ freq : num [1:1054330] 0.841 0.87 0.84 0.105 0.128 ...
..$ info : num [1:1054330] 1 0.993 0.991 1 1 ...
- attr(*, "class")= chr "bigSNP"
> table(geno$map$chromosome)
01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19
87145 87521 73414 65490 66330 71046 58036 57084 48475 56549 53965 51492 40130 35093 31768 32431 28771 31497 19803
20 21 22
27728 15113 15449
That's the genotype data I am using as input to snp_cor
.
inds <- which(geno$map$chromosome==chr)
POS <- snp_asGeneticPos(as.integer(geno$map$chromosome), geno$map$physical.pos,
dir=genetic_map,
ncores=ncores)
LD <- snp_cor(Gna=geno$genotypes, ind.col=inds, size=0.003, infos.pos=POS[inds], thr_r2=0, ncores=ncores)
where genetic_map
points to 1000-genomes-genetic-maps/interpolated_OMNI
.
In inds
, do you have all indices, or do you do it per chromosome?
I do it per-chromosome, so inds
includes the indices of the chosen chromosome -- see second block of code.
Can you share POS[inds]
, for e.g. chr 1?
Here it is.
That's the same of what I have, so I don't understand what's the problem.
What do you have for packageVersion("bigsnpr")
, packageVersion("Matrix")
, and R.version.string
?
> packageVersion("bigsnpr")
[1] ‘1.12.2’
> packageVersion("Matrix")
[1] ‘1.5.3’
> R.version.string
[1] "R version 4.2.3 (2023-03-15)"
> sessionInfo()
R version 4.2.3 (2023-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Rocky Linux 8.5 (Green Obsidian)
Matrix products: default
BLAS/LAPACK: /opt/intel/oneapi/mkl/2023.0.0/lib/intel64/libmkl_gf_lp64.so.2
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] bigsnpr_1.12.2 bigstatsr_1.5.12
loaded via a namespace (and not attached):
[1] Rcpp_1.0.10 magrittr_2.0.3 cowplot_1.1.1 tidyselect_1.2.0
[5] munsell_0.5.0 doParallel_1.0.17 lattice_0.20-45 colorspace_2.1-0
[9] R6_2.5.1 doRNG_1.8.6 rlang_1.1.0 foreach_1.5.2
[13] fansi_1.0.4 dplyr_1.1.1 rngtools_1.5.2 parallel_4.2.3
[17] grid_4.2.3 data.table_1.14.8 gtable_0.3.3 utf8_1.2.3
[21] cli_3.6.1 bigparallelr_0.3.2 iterators_1.0.14 digest_0.6.31
[25] tibble_3.2.1 lifecycle_1.0.3 Matrix_1.5-3 bigsparser_0.6.1
[29] bigassertr_0.1.6 ggplot2_3.4.2 flock_0.7 vctrs_0.6.1
[33] codetools_0.2-19 glue_1.6.2 compiler_4.2.3 pillar_1.9.0
[37] generics_0.1.3 scales_1.2.1 pkgconfig_2.0.3
Do you know how to use debugonce("snp_cor")
to run some code inside a function line by line?
The issue probably occurs here: https://github.com/privefl/bigsnpr/blob/master/R/corr.R#L46.
If you could try this quickly, you can subset to e.g. ind.row = 1:100
in snp_cor()
.
If you could then store ind_val
before the error occurs.
What I see is that there are some NaN in the x
item in ind_val
. See below
> dat[[87145]]$x
[1] 0.0753193974 0.0690836139 -0.0670640191 0.0714285714 0.0714285714 0.0714285714 -0.1128871071
[8] -0.1128406900 -0.0585521277 -0.1128406900 0.1071428571 -0.0393307573 -0.0421262732 -0.0421832327
[15] -0.0421262732 0.1017963261 0.0391930901 -0.0488717735 -0.0226573547 0.0409036671 0.0486646444
[22] 0.0786256905 -0.0740093319 -0.0736542924 -0.0736542924 0.0287236491 -0.0740364356 -0.0736542924
[29] 0.0958314847 0.0512211772 0.0219884653 -0.1217832406 -0.1242946914 -0.1217832406 -0.1103159422
[36] -0.0638876565 0.0651660981 -0.0736542924 -0.1262201803 0.0648005598 -0.0670732786 0.0663791536
[43] -0.1116644500 -0.0736542924 0.0648005598 -0.1117537049 -0.0670732786 -0.0736542924 0.0651791384
[50] 0.0672673472 -0.0739316898 0.0472944423 -0.1005037815 -0.0401305533 NaN NaN
[57] -0.1231577430 -0.0327038552 -0.0327038552 NaN 0.0189315591 -0.0161143666 -0.0335683050
Is that the problem?
No, I think that's just the variants with no variation, which can happen if you subsetted to only 100 individuals to compute correlations.
What is the sum of all $x
lengths?
Then try to run the next lines https://github.com/privefl/bigsnpr/blob/master/R/corr.R#L43-L47
98457596
I have actually found out that I do not get that warning in this smaller example with 100 individuals -- those lines run without any problem.
Any idea about why I have issues with the larger n data?
As far as I can tell, the length of the $x
depends only on the number of variants.
This number (of non-zero x
values) is normal and very far from the max allowed (.Machine$integer.max
), so there is no problem here.
And yes, this should not depend on the number of individuals used. Are you really sure this is the code you used for the previous large-N? You can also try with e.g. 5K indiv first, to see if you get something similar to with 100, if you do not want to wait for the computation with the full data (which should take a few hours).
A weird thing is happening. If I run the code interactively, I do not get any warning, no matter the sample size. If I run the code in batch mode with Rscript
, I get the warning. Not sure why that happens, but I do not think it is a problem of bigsnpr
.
Any ideas?
I have no idea what's going on.. :-(
I am going ahead and close the issue. I will reopen it in case I find out what's going on.
Thanks, and sorry for not being able to help more.
Hi Florian, I have gotten the following warning when running
snp_cor
to compute the per-chromosome sparse LD matrix for chromosomes 1-3,5,6 from ~155K UKB individuals from imputed data.Is this something you have encountered?