How much memory do your worker machines have?
16 nodes with between 128 GB and 256 GB each.
The script is running fine on the smaller chromosome 19 to 22 BGEN files so far. However, I noticed each job was using just 24 cores, even though we have 16 nodes x 16 cores available on the cluster.
index_bgen can't be parallelized -- it needs to do a linear scan through the files to find variant byte offsets.
The rest should be parallelized just fine. Do you see only 24 cores working on the write?
I'd also note that BGEN is probably a much faster format than MT for representing the data in your script -- the MT will probably be ~10x bigger, since you're going to realize a bunch of stuff that can be computed from the BGEN dosages.
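As a rough sketch of that trade-off (the paths below are placeholders, not from this thread), an analysis that only needs dosages can import just that entry field and skip the write, leaving the BGEN itself as the on-disk representation:

import hail as hl

hl.init()

# Placeholder paths for illustration only.
bgen = "/path/to/ukb_imp_chr10_v3.bgen"
sample = "/path/to/ukb_imp_chr10_v3.sample"

# Requesting only the entry fields the analysis actually needs avoids
# realizing (and later storing) GT and GP alongside the dosages.
mt = hl.import_bgen(bgen, entry_fields=['dosage'], sample_file=sample)
# ...analyze mt directly rather than calling mt.write(...)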
I see what is happening.
The Hail cluster install instructions specify the following for a Spark cluster:
export PYSPARK_SUBMIT_ARGS="\
  --jars $HAIL_HOME/build/libs/hail-all-spark.jar \
  --conf spark.driver.extraClassPath=\"$HAIL_HOME/build/libs/hail-all-spark.jar\" \
  --conf spark.executor.extraClassPath=./hail-all-spark.jar \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.kryo.registrator=is.hail.kryo.HailKryoRegistrator pyspark-shell"
On our cluster, this runs as a local job. It needs "--master yarn" as an argument. Running locally is probably related to the out-of-memory error and the limited cores. I will rerun this with the --master yarn argument.
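Roughly, the adjusted submit args might look like this (the same flags as the install snippet above, with the master specified explicitly; exact paths and values depend on the cluster):

export PYSPARK_SUBMIT_ARGS="\
  --master yarn \
  --jars $HAIL_HOME/build/libs/hail-all-spark.jar \
  --conf spark.driver.extraClassPath=\"$HAIL_HOME/build/libs/hail-all-spark.jar\" \
  --conf spark.executor.extraClassPath=./hail-all-spark.jar \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.kryo.registrator=is.hail.kryo.HailKryoRegistrator pyspark-shell"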
Regarding the BGEN file versus the matrix table: are you suggesting it would be faster to run an analysis such as a logistic regression starting from the BGEN file instead of the imported BGEN MT? The phenotypes would need to be annotated onto the imported BGEN MT every time. Just trying to understand the trade-offs.
The phenotypes would need to be annotated onto the imported BGEN MT every time
This is very cheap, especially compared to the extra IO/decoding burden.
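To make that concrete, here is a minimal sketch (the phenotype file path, column names, and the regression call are illustrative assumptions, and the exact regression function name varies across Hail 0.2 releases). Annotating phenotypes is a join on the column (sample) key, so repeating it before each analysis costs very little:

import hail as hl

hl.init()

# Placeholder paths and column names for illustration only.
mt = hl.import_bgen("/path/to/ukb_imp_chr10_v3.bgen",
                    entry_fields=['dosage'],
                    sample_file="/path/to/ukb_imp_chr10_v3.sample")

# Re-annotating phenotypes only touches the column metadata, not the
# genotype data, so doing it per analysis is inexpensive.
pheno = hl.import_table("/path/to/phenotypes.tsv",
                        key='sample_id',
                        types={'is_case': hl.tbool})
mt = mt.annotate_cols(pheno=pheno[mt.s])

# Example downstream analysis run directly on the dosages.
result = hl.logistic_regression_rows(test='wald',
                                     y=mt.pheno.is_case,
                                     x=mt.dosage,
                                     covariates=[1.0])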
I should note, though, that in the next year we'll start to develop new file encodings that should let us represent this data as efficiently as BGEN, but faster (using a faster compression codec than zlib).
To get around this, I increased the Spark memory requested with PYSPARK_SUBMIT_ARGS:
  --conf spark.driver.memory=5G \
  --conf spark.executor.memory=30G \
Hail version:
version 0.2-721af83bc30a
What you did:
Import UK Biobank bgen chr10
import hail as hl
import sys

hl.init()

chr = sys.argv[1]
bgen = "/project/ukbiobank/imp/uk.v3/bgen/ukb_imp_chr" + chr + "_v3.bgen"
sample = "/project/ukbiobank/imp/uk.v3/bgen/ukb19416_imp_chr" + chr + "_v3_s487327.sample"
mt = "/project/ukbiobank/imp/uk.v3/mt/ukbb_imp_chr" + chr + "_v3_s487327.mt"

hl.index_bgen(bgen)
hl.import_bgen(bgen, sample_file=sample, entry_fields=['GT', 'GP', 'dosage']).write(mt)
What went wrong (all error messages here, including the full java stack trace):