privefl / bigsnpr

R package for the analysis of massive SNP arrays.
https://privefl.github.io/bigsnpr/
186 stars 44 forks source link

`what(): Positions should be positive.` #212

Closed caimiao0714 closed 3 years ago

caimiao0714 commented 3 years ago

I’ve been trying to use the bigsnpr package to read the UK Biobank genetics data, but it seems that I can’t really replicate your code. I’ve been had “positions should be positive” error although (I think) I’m using the same code as you did. Here are my code:

pacman::p_load(doParallel, benchmarkme, bigreadr)

(n_cores = benchmarkme::get_cpu()$no_of_cores)

# ==================== Read in SNP IDs ====================
registerDoParallel(cl <- makeCluster(n_cores))
start_time = Sys.time()
system.time({
  list_snp_id <- foreach(chr = 1:22) %dopar% {
    # cat("Processing chromosome", chr, "..\n")
    mfi <- paste0("Data/ukb_imp_mfi/ukb_mfi_chr", chr, "_v3.txt")
    infos_chr <- bigreadr::fread2(mfi, showProgress = FALSE)
    infos_chr_sub <- subset(infos_chr, V6 > 0.01)  ## MAF > 1%
    bim <- paste0("Data/ukb_snp_bim/ukb_snp_chr", chr, "_v2.bim")
    map_chr <- bigreadr::fread2(bim)
    joined <- dplyr::inner_join(
      infos_chr_sub, map_chr,
      by = c("V3" = "V4", "V4" = "V5", "V5" = "V6")
    )
    with(joined, paste(chr, V3, V4, V5, sep = "_"))
  }
})
Sys.time() - start_time # Total amount of time: 30 sec
stopCluster(cl)

# ==================== Read in SNPs ====================
rds = bigsnpr::snp_readBGEN(
    bgenfiles = glue::glue("Data/imp/ukb_imp_chr{chr}_v3.bgen", chr = 1:2),
    list_snp_id = list_snp_id,
    backingfile = "Data/saved_tmp_1_2",
    #ind_row = ind.indiv[sub],
    bgi_dir = "Data/ukb_imp_bgi",
    ncores = benchmarkme::get_cpu()$no_of_cores
  )

The error message is:

    terminate called after throwing an instance of ‘Rcpp::exception’
           What(): Positions should be positive.
    Aborted (core dumped)

My questions is: I’m using the SNP id from the original mfi files and everything looks fine, I’m not sure why the position is not positive. Any comments/suggestions would be extremely useful and appreciated!

Thank you, Miao

privefl commented 3 years ago

Seems like an Rcpp issue that I have never seen before.

Which packageVersion("bigsnpr") do you have? I have recently added more checks before going into Rcpp to try catch possible errors before.

privefl commented 3 years ago

In fact, it seems to be one of my own assertions, which I have never encountered before. https://github.com/privefl/bigsnpr/blob/2768100433a5487aec7cf3832ee399bc8b738a31/src/read-bgen.cpp#L32

What is the range of the positions you're trying to read? Are there any 0s maybe?

caimiao0714 commented 3 years ago

Thank you! I’m sorry but what does these assertions do? I’m not sure why I have non-positive pos. I assume these SNP ids are correct, and I did not specify the ind_row in the function.

caimiao0714 commented 3 years ago

In fact, it seems to be one of my own assertions, which I have never encountered before. https://github.com/privefl/bigsnpr/blob/2768100433a5487aec7cf3832ee399bc8b738a31/src/read-bgen.cpp#L32

What is the range of the positions you're trying to read? Are there any 0s maybe?

How can I check the range of the positions? I did not specify the ind_row, so assumingly all the positions will be read.

privefl commented 3 years ago

When you read the MFI files.

caimiao0714 commented 3 years ago

I tried to validate if there is any non-positive positions, and it seems that my mfi files do NOT include any non-positive pos. I added a condition to force V3 > 0 in the subset sentence (see my code below).

system.time({
  list_snp_id <- foreach(chr = 1:22) %dopar% {
    # cat("Processing chromosome", chr, "..\n")
    mfi <- paste0("Data/ukb_imp_mfi/ukb_mfi_chr", chr, "_v3.txt")
    infos_chr <- bigreadr::fread2(mfi, showProgress = FALSE)
    infos_chr_sub <- subset(infos_chr, V6 > 0.01 & V3 > 0)  ## <------ I added V3 > 0 to make sure all pos are positive
    bim <- paste0("Data/ukb_snp_bim/ukb_snp_chr", chr, "_v2.bim")
    map_chr <- bigreadr::fread2(bim)
    joined <- dplyr::inner_join(
      infos_chr_sub, map_chr,
      by = c("V3" = "V4", "V4" = "V5", "V5" = "V6")
    )
    with(joined, paste(chr, V3, V4, V5, sep = "_"))
  }
}) 

I'm using R 3.6.1 on CentOS Linux 7, bigsnpr 1.6.1. Below is my session info:

> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /export/home/caim29/software/anaconda3/envs/r3.2/lib/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] bigsnpr_1.6.1     bigstatsr_1.5.1   bigreadr_0.2.4    benchmarkme_1.0.7 doParallel_1.0.16 iterators_1.0.13 
[7] foreach_1.5.1    

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6            benchmarkmeData_1.0.4 bigsparser_0.4.4      bigparallelr_0.3.1   
 [5] pillar_1.6.0          compiler_3.6.1        tools_3.6.1           gtable_0.3.0         
 [9] lifecycle_1.0.0       tibble_3.1.1          lattice_0.20-44       pkgconfig_2.0.3      
[13] rlang_0.4.11          Matrix_1.3-3          DBI_1.1.1             flock_0.7            
[17] yaml_2.2.1            bigassertr_0.1.4      dplyr_1.0.6           httr_1.4.2           
[21] generics_0.1.0        vctrs_0.3.8           cowplot_1.1.1         grid_3.6.1           
[25] tidyselect_1.1.1      data.table_1.14.0     glue_1.4.2            R6_2.5.0             
[29] fansi_0.4.2           parallelly_1.25.0     pacman_0.5.1          purrr_0.3.4          
[33] ggplot2_3.3.3         magrittr_2.0.1        scales_1.1.1          codetools_0.2-18     
[37] ellipsis_0.3.2        assertthat_0.2.1      colorspace_2.0-1      utf8_1.2.1           
[41] munsell_0.5.0         crayon_1.4.1     

I'm suspecting pos not positive is not the real issue here (based on my code above). Let me know if you have any thoughts/suggestions.

Thank you, Miao

privefl commented 3 years ago

BTW, what is sum(lengths(list_snp_id))?

caimiao0714 commented 3 years ago

BTW, what is sum(lengths(list_snp_id))?

I think this is just checking how many SNP IDs were included.

privefl commented 3 years ago

Yes, forget about that. I'll try to reproduce what you did. Do you get this error already when restricting to 1 chromosome? Do you get this error when using ncores = 1?

caimiao0714 commented 3 years ago

Yes, forget about that. I'll try to reproduce what you did. Do you get this error already when restricting to 1 chromosome? Do you get this error when using ncores = 1?

I spent a whole day testing all different combinations of chr index and the number of cores. It seems that I encounter the same exact error message (see below) whenever chr 1 is included, even if I only use 1 core. But the program can work perfectly well when I don't include chr 1, regardless of how many scores I use.

terminate called after throwing an instance of ‘Rcpp::exception’
           What(): Positions should be positive.
    Aborted (core dumped)
privefl commented 3 years ago

So, the problem is with chromosome 1 only? It is possible that your file is corrupted of some kind?

privefl commented 3 years ago

You can compare tools::md5sum("Data/imp/ukb_imp_chr1_v3.bgen") with https://biobank.ctsu.ox.ac.uk/crystal/refer.cgi?id=997.

caimiao0714 commented 3 years ago

You can compare tools::md5sum("Data/imp/ukb_imp_chr1_v3.bgen") with https://biobank.ctsu.ox.ac.uk/crystal/refer.cgi?id=997.

Thank you! I checked and it does seem to be a data issue. The md5 checksum of chr 1 in my data set does not match that of the uk biobank website, while chr 2 to 22 do match. I'll try to re-download the data.