tpook92 / HaploBlocker

R-package: Calculation of haplotype blocks and libraries
GNU General Public License v3.0

parallel window, window overlap - subscript out of bounds error #8

Closed matt-shenton closed 3 years ago

matt-shenton commented 3 years ago

Hi there, thanks for this great tool.

I'm trying to calculate blocks for a chromosome with variant count: 775274, column count: 629.

I'm running it using Rscript

The command is as below

```r
blocklist <- block_calculation(paste0(invcf, ".vcf"),
                               bp_map = mybpvector,
                               adaptive_mode = TRUE,
                               big_output = TRUE,
                               overlap_remove = FALSE,
                               window_cores = 12,
                               parallel_window = pwn,
                               window_overlap = won)
```

If I use parallel_window = 5000 and window_overlap = 1000 I get the subscript out of bounds error:

```
Error in dhm[1:(parallel_window + window_overlap) + (index - 1) * parallel_window,  :
  subscript out of bounds
Calls: block_calculation
Execution halted
```

(Sorry, this is a Japanese Linux server, so the errors are originally in Japanese; they are translated above.)

I don't know the size of the blocks in advance, so it is a little bit trial and error to choose the parameters.

If I choose parallel_window = 10000 and window_overlap = 1000 I don't get any errors. However, for some chromosomes I run out of memory when using very large windows.

Should I expect this error if the overlap is larger than the largest block? Is there a better way to choose the parallel window size and overlap?

Many thanks indeed for your help.

Best wishes

Matt Shenton

tpook92 commented 3 years ago

Dear Matt,

So far I have not been able to reproduce the issue with a dataset and the same window size / overlap. Would it be possible to share the vcf-file so that I can have a deeper look into the problem? Ideally, of course, a reduced panel that still leads to the issue.

Regarding the memory issues, I would assume that a lower number of parallel sections should reduce memory needs by quite a bit. This part of the code also does not seem to be efficiently coded when working with 200 separate windows - with some work on my part that should be reducible by at least 30%.

Best regards

Torsten

matt-shenton commented 3 years ago

Dear Torsten,

Many thanks for your quick response.

I tried again using the first 11 samples, and got the same error using parallel_window=5000 and window_overlap=1000.

```
Error in dhm[1:(parallel_window + window_overlap) + (index - 1) * parallel_window,  :
  subscript out of bounds
Calls: block_calculation
Execution halted
```

(Again, the original error message is in Japanese; I am pretty sure it means "subscript out of bounds" in English.)

I think you can access the vcf file and the bpmap file at the links below:

https://drive.google.com/file/d/1O0dvwTcpLfiqOZT2naz6AHYkTQGfHSrK/view?usp=sharing
https://drive.google.com/file/d/1j0CAK4NPru_yWjD7HzvW5ZhKuN3m6EqL/view?usp=sharing

Here is the script I used to run the program

```r
args = commandArgs(trailingOnly = TRUE)
library(vcfR)
library(HaploBlocker)

invcf <- args[1]
pw <- args[2]
wo <- args[3]
mybpmap <- read.table(paste0(invcf, ".bpmap"), header = FALSE)
mybpvector <- mybpmap[,1]
pwn <- as.integer(pw)
won <- as.integer(wo)
blocklist <- block_calculation(paste0(invcf, ".vcf"), bp_map = mybpvector, adaptive_mode = TRUE,
                               big_output = TRUE, overlap_remove = FALSE, window_cores = 12,
                               parallel_window = pwn, window_overlap = won)

saver <- paste0(invcf, "_", pw, "_", wo, "_overlap.rds")
saveRDS(object = blocklist, file = saver)

png(paste0(invcf, "_", pw, "_", wo, "_overlap.png"), width = 4800, height = 2400)
plot_block(blocklist[[1]], add_sort = FALSE)
dev.off()
```

I call it using qsub on the Univa Grid Engine system with the following:

```bash
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -jc M.c12

conda activate haplotype_blocker

invcf=$1
pw=$2
wo=$3

/lustre/home/mshenton/anaconda3/envs/haplotype_blocker/bin/Rscript haploblock_no_remove_pw_wo_test.R ${invcf} ${pw} ${wo}
```

The conda environment export is attached here: haplotype_blocker_env.yml.txt

Thanks again for your help.

Best wishes

Matt

tpook92 commented 3 years ago

I just uploaded a new version of the package (1.6.00) that should resolve the issue.

Furthermore, the new version will be more memory efficient when importing files from ped/vcf, automatically detect bp positions if they are available, not require unzipping, and no longer depend on vcfR.

By default, the two haplotypes of a line are used separately; therefore I introduced a new parameter (inbred in block_calculation()) that leads to only one haplotype per line being considered.
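
As an illustration only (not a command from the thread), a call combining these new options might look like the sketch below, assuming bp positions are picked up from the file when bp_map is omitted and reusing the parameters from the earlier command:

```r
# Hypothetical usage of the 1.6.00 features described above:
# gz input without unzipping, automatic bp detection, and the new `inbred` switch.
blocklist <- block_calculation(paste0(invcf, ".vcf.gz"),
                               adaptive_mode = TRUE,
                               big_output = TRUE,
                               overlap_remove = FALSE,
                               window_cores = 12,
                               parallel_window = pwn,
                               window_overlap = won,
                               inbred = TRUE)
```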

This new version of the package requires RandomFieldsUtils 0.6.6 or higher, as the internal storage structure was unified with the storage structure of the miraculix package (which is used in my package MoBPS). There will be another new RandomFieldsUtils version soon, but that will only affect the MoBPS part; the HaploBlocker parts should all be set.
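
As a side note (not from the thread), one quick way to check the installed RandomFieldsUtils against that 0.6.6 requirement before updating is:

```r
# Check the RandomFieldsUtils version requirement mentioned above (>= 0.6.6).
if (!requireNamespace("RandomFieldsUtils", quietly = TRUE) ||
    utils::packageVersion("RandomFieldsUtils") < "0.6.6") {
  install.packages("RandomFieldsUtils")
}
```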

Thanks for reporting this issue and helping me solve it!

matt-shenton commented 3 years ago

Dear Torsten,

That is impressively quick!

I installed the new version and the problem is solved.

There are a couple of queries: it did not behave well with a vcf.gz file, and it still seems to depend on vcfR.

Is that a typo in data_import.R?:

```r
if(data_type==".vcf" || "f.gz"){
  if(verbose) cat("Data input identified as vcf-file - extract genomic information. \n")
  if(requireNamespace("vcfR", quietly = TRUE)){
```

One haplotype per line for inbred genomes sounds good - these are rice lines I am using. We still get some heterozygous calls in the inbred lines, presumably due to duplicated sequences etc. Does HaploBlocker just ignore the heterozygous sites in a vcf file?

Many thanks again for your superquick response.

Best wishes

Matt

matt-shenton commented 3 years ago

Dear Torsten,

Sorry, one more query.

Should I be concerned about these messages on loading HaploBlocker?

library("HaploBlocker") 要求されたパッケージ RandomFieldsUtils をロード中です chol (own), 1 cores chol (R) 'RandomFieldsUtils' sees OMP, SSE2, but not AVX2, AVX. Without appropriate SIMD instruction set, the calculations might be slow. Consider recompiling 'RandomFieldsUtils' and 'RandomFieldsUtils' with flags e.g.,

install.packages("RandomFieldsUtils", configure.args="CXX_FLAGS='-march=native -Xpreprocessor -fopenmp -pthread' LIB_FLAGS='-lomp -pthread'")

install.packages("RandomFieldsUtils", configure.args="CXX_FLAGS='-mavx2 -Xpreprocessor -fopenmp -pthread' LIB_FLAGS='-lomp -pthread'")

install.packages("RandomFieldsUtils", configure.args="CXX_FLAGS='-mavx -Xpreprocessor -fopenmp -pthread' LIB_FLAGS='-lomp -pthread'")

Alternatively consider installing 'RandomFieldsUtils' from https://github.com/schlather/RandomFieldsUtils, i.e., install.packages("devtools") library(devtools) devtools::install_github("schlather/RandomFieldsUtils/pkg")

tpook92 commented 3 years ago

The current way of handling it is to just take the first of the two haplotypes of an individual. I would not be too concerned about the heterozygous calls, as HaploBlocker should be robust enough to handle minor deviations. However, if you want to apply your own filtering / select which of the two variants to choose, you can also import the dataset via a matrix input of dhm (a SNP x haplotypes matrix). Example code on how to do that should be in import_data.R.
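
A rough sketch of such a matrix import, assuming vcfR is available; the 0/1 coding and the exact form block_calculation() expects for dhm are assumptions here - import_data.R has the authoritative example:

```r
library(vcfR)
library(HaploBlocker)

# Read the vcf and extract the genotype strings (rows: SNPs, columns: samples).
vcf <- read.vcfR("test8.vcf.gz")
gt  <- extract.gt(vcf, element = "GT")

# Example of applying your own rule for heterozygous calls:
# keep only the first allele of each call ("0|1" -> 0); missing calls become NA.
dhm_matrix <- apply(gt, 2, function(x) suppressWarnings(as.numeric(substr(x, 1, 1))))

# Pass the SNP x haplotype matrix directly instead of a file path.
blocklist <- block_calculation(dhm_matrix,
                               adaptive_mode = TRUE,
                               big_output = TRUE,
                               overlap_remove = FALSE)
```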

Could you be more specific on the import issues? vcfR is still the default to use when available, and that import path is still not 100% general (e.g. it would fail when working with more than 10 variants) - it is more of an additional service for when people can't use a matrix input in dhm.

The messages on loading the library stem from the RandomFieldsUtils package. This package contains a variety of utility functions for various R-packages (RandomFields, MoBPS, HaploBlocker). This includes a Cholesky decomposition - to ensure maximum performance depending on BLAS etc., we currently run some background checks when loading the package. As stated in the new commit, this is still somewhat of a work in progress and an update for RandomFieldsUtils will follow soon. This does not affect any HaploBlocker-related code, as no Cholesky decomposition is needed there.

matt-shenton commented 3 years ago

Dear Torsten,

Sorry for the lack of detail.

My mistake - it can work without vcfR; I was rushing and didn't check properly on that point. Apologies for misleading you on that one.

For reading the gz-compressed file, the line

```r
if(data_type==".vcf" || "f.gz"){
```

seems to be causing the problem.

The error (originally in Japanese; roughly translated) is:

```
Error in data_type == ".vcf" || "f.gz" : the type of || in 'x 'y' y' is invalid
```

Now that I look at it, I am wondering if this is some kind of locale-specific problem? The error above came from a job submitted via Univa Grid Engine.

At the R command line it gives a slightly different error:

```
> dhm <- "test8.vcf.gz"
> data_type <- substr(dhm, start = nchar(dhm)-3, stop = nchar(dhm))
> data_type
[1] "f.gz"
> if ((data_type == ".vcf") || (data_type == "f.gz")) {print("yes")} else {print("no")}
[1] "yes"
> if (data_type == ".vcf" || "f.gz") {print("yes")} else {print("no")}
Error in data_type == ".vcf" || "f.gz" : the type of 'y' in 'x || y' is invalid
```

(Again translated from the Japanese message.)

So it seems the error is slightly different in the environment using Rscript launched by qsub in a shell script, compared with the command line. It's a little mysterious to me, but I am not an expert in R at all. Perhaps it is something related to different quotation marks on this Japanese machine.

The actual file reading part seems fine:

```
> dhm <- "test8.vcf.gz"
> data_type <- substr(dhm, start = nchar(dhm)-3, stop = nchar(dhm))
> data_type
[1] "f.gz"
> vcf_file <- utils::read.table(dhm)
> dim(vcf_file)
[1] 770751     20
```

On the point about heterozygous sites, thanks a lot for the info. Skipping the phasing part would certainly speed up the process.

Also thanks for the reassurance about the messages on loading.

Thanks again Matt

Another experiment:

```
> dhm <- "test8.vcf.gz"
> types <- c(".vcf", "f.gz")
> data_type <- substr(dhm, start = nchar(dhm)-3, stop = nchar(dhm))
> data_type
[1] "f.gz"
> data_type %in% types
[1] TRUE
> if (data_type %in% types) {print("yes")} else {print("no")}
[1] "yes"
```

tpook92 commented 3 years ago

Dumb typo on my part, sorry. It helps to test code before pushing it to master...
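
For completeness, a sketch of a corrected condition - either comparing data_type against both suffixes or using %in% as in the experiment above; the exact fix that landed in the repository may differ:

```r
# `||` needs single logical values on both sides; a bare string as the
# right-hand operand gives "invalid 'y' type in 'x || y'".
if (data_type == ".vcf" || data_type == "f.gz") {
  # ... vcf import path ...
}

# Equivalent check, as in the experiment above:
if (data_type %in% c(".vcf", "f.gz")) {
  # ... vcf import path ...
}
```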