danking opened this issue 1 year ago (status: Open)
I installed the latest hail (0.2.120-f00f916faf78), gnomad (commit e6f042a74c91e462b77fca24d070c815e02f6f5b), and gnomad_qc (commit 0c52cf47e48fa5b503d874e96482ea4286474c71), and cloned the repo in question:
```bash
pip3 uninstall hail gnomad gnomad_qc
pip3 install -U \
    hail \
    git+https://github.com/broadinstitute/gnomad_methods.git \
    git+https://github.com/broadinstitute/gnomad_qc.git
git clone git@github.com:broadinstitute/gnomad-readviz.git
```
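A quick way to confirm which Hail build is actually picked up after the reinstall (for reference):

```python
# Sanity check that the expected Hail build is the one being imported.
import hail as hl

print(hl.version())  # expect 0.2.120-f00f916faf78 given the install above
```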
I applied this patch:
```diff
diff --git a/step1__select_samples.py b/step1__select_samples.py
index c159207..9ba1812 100644
--- a/step1__select_samples.py
+++ b/step1__select_samples.py
@@ -38,14 +38,7 @@ def hemi_expr(mt):
 def main(args):
-    hl.init(log="/select_samples", default_reference="GRCh38", idempotent=True, tmp_dir=args.temp_bucket)
-    meta_ht = hl.import_table(args.sample_metadata_tsv, force_bgz=True)
-    meta_ht = meta_ht.key_by("s")
-    meta_ht = meta_ht.filter(hl.is_defined(meta_ht.cram_path) & hl.is_defined(meta_ht.crai_path), keep=True)
-    meta_ht = meta_ht.repartition(1000)
-    meta_ht = meta_ht.checkpoint(
-        re.sub(".tsv(.b?gz)?", "", args.sample_metadata_tsv) + ".ht", overwrite=True, _read_if_exists=True)
-
+    hl.init(log="/tmp/select_samples", default_reference="GRCh38", idempotent=True, tmp_dir=args.temp_bucket)
     vds = gnomad_v4_genotypes.vds()
     # see https://github.com/broadinstitute/ukbb_qc/pull/227/files
@@ -55,19 +48,8 @@ def main(args):
     v4_qc_meta_ht = meta.ht()
-    mt = vds.variant_data
-    #mt = vds.variant_data._filter_partitions([41229])
-
-    mt = mt.filter_cols(v4_qc_meta_ht[mt.s].release)
-
-    meta_join = meta_ht[mt.s]
-    mt = mt.annotate_cols(
-        meta=hl.struct(
-            sex_karyotype=meta_join.sex_karyotype,
-            cram=meta_join.cram_path,
-            crai=meta_join.crai_path,
-        )
-    )
+    #mt = vds.variant_data
+    mt = vds.variant_data._filter_partitions([41229])
     logger.info("Adjusting samples' sex ploidy")
     lgt_expr = hl.if_else(
@@ -88,9 +70,9 @@
     logger.info("Filter variants with at least one non-ref GT")
     mt = mt.filter_rows(hl.agg.any(mt.GT.is_non_ref()))
-    #logger.info(f"Saving checkpoint")
-    #mt = mt.checkpoint(os.path.join(args.temp_bucket, "readviz_select_samples_checkpoint1.vds"),
-    #                   overwrite=True, _read_if_exists=True)
+    logger.info(f"Saving checkpoint")
+    mt = mt.checkpoint("readviz_select_samples_checkpoint1.vds",
+                       overwrite=True, _read_if_exists=True)
     def sample_ordering_expr(mt):
         """For variants that are present in more than 10 samples (or whatever number args.num_samples is set to),
```
Then I tried running the problematic step:
```bash
python3 step1__select_samples.py
```
I was able to get past the checkpoint:
```
INFO (Readviz_prep 73): Saving checkpoint
[Stage 0:>                                                          (0 + 1) / 1]
2023-09-01 18:10:29.262 Hail: INFO: wrote matrix table with 11450 rows and 955359 columns in 1 partition to readviz_select_samples_checkpoint1.vds
```
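Despite the `.vds` extension, the log shows the checkpoint was written as a matrix table, so it can be read back for a quick sanity check (a sketch):

```python
# Read back the checkpoint written above and confirm its dimensions.
import hail as hl

mt = hl.read_matrix_table("readviz_select_samples_checkpoint1.vds")
print(mt.count())         # expect (11450, 955359) per the log above
print(mt.n_partitions())  # expect 1
```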
@bw2 are you still encountering this issue? Did my diff oversimplify it? Do you suspect the issue arises after the checkpoint?
Cross-linking to an issue on the same partition that the gnomAD team discovered when running a different pipeline: https://github.com/hail-is/hail/issues/13584
What happened?
[reporter's note: IIRC, exit code 137 indicates that the "container" in which the worker JVM was executing exceeded its memory limits. It seems likely that whole-stage codegen either (1) changed memory management in a way that uses more memory, or (2) newly lowers code that exposes a latent memory-management issue which uses too much (or leaks) memory.]
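If the whole-stage-codegen hypothesis is right, one diagnostic worth trying is to disable that lowering pass via Hail's internal flags and re-run the partition. This is only a sketch; the exact flag name is my assumption and should be checked against the installed version.

```python
import hail as hl

hl.init(default_reference="GRCh38")
# Assumed flag name; verify it against the installed Hail version before relying on it.
hl._set_flags(no_whole_stage_codegen="1")
# ...then re-run the partition-filtered pipeline and compare memory behaviour.
```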
Reported by Ben Weisburd and Julia Goodrich.
[Ben is] running the first step of readviz for gnomAD v4 and we are hitting a 137 error on a partition that includes a site that has 27374 alleles.
His code is here
I was testing his code on just that failing partition (I added `mt = vds.variant_data._filter_partitions([41229])`) and was able to recreate the error using Hail 0.2.119 (the version Ben was using when he hit the error on the full dataset). However, the first time I tried to recreate the error I was accidentally using a different version of Hail, and it ran with no memory error. It seems that 0.2.117 runs without error, but 0.2.118 and 0.2.119 both hit the 137 error.
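One way to confirm the extreme site inside that partition (a sketch; `mt` here is the partition-filtered matrix table from the line above):

```python
# Find the largest allele count at any site in the suspect partition.
max_alleles = mt.aggregate_rows(hl.agg.max(hl.len(mt.alleles)))
print(max_alleles)  # the report above puts this at 27374
```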
I am currently rerunning these tests so I can get logs:
- Test with Hail 0.2.118:
  - Cluster:
  - Command:
- Test with Hail 0.2.117:
  - Cluster:
  - Command:
I will update here with the logs when I have them, but in the meantime, do you see any problems with reverting back to 0.2.117 for this run?
Thanks!
Version
0.2.119
Relevant log output
No response