hail-is / hail

Cloud-native genomic dataframes and batch computing
https://hail.is
MIT License

[query] whole stage codegen fails with exit code 137 on partitions with lots of alleles #13248

Open danking opened 1 year ago

danking commented 1 year ago

What happened?

[reporter's note: IIRC, exit code 137 indicates that the "container" in which the worker JVM was executing exceeded its memory limit and was killed. It seems likely that whole stage codegen has either (1) changed memory management in a way that uses more memory, or (2) newly lowers code that exposes a latent memory-management issue that uses too much (or leaks) memory.]
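
For reference, 137 is the conventional exit status of a process killed with SIGKILL (128 + 9), which is what the kernel OOM killer or a resource manager such as YARN delivers when a container exceeds its memory limit. A minimal illustration of the arithmetic in Python:

import signal

# Exit statuses above 128 encode "killed by signal (status - 128)";
# SIGKILL is signal 9, so an OOM-killed worker surfaces as exit code 137.
print(128 + int(signal.SIGKILL))  # 137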

Reported by Ben Weisburd and Julia Goodrich.

[Ben is] running the first step of readviz for gnomAD v4 and we are hitting a 137 error on a partition that includes a site that has 27374 alleles.
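
As an aside, the allele count per site can be inspected directly from the variant-data rows; a minimal sketch, assuming mt is the gnomAD v4 vds.variant_data MatrixTable (whose row fields include alleles):

import hail as hl

# Annotate each row with its allele count and surface the highly multi-allelic sites;
# `mt` is assumed to be vds.variant_data from the gnomAD v4 VariantDataset.
rows = mt.rows()
rows = rows.annotate(n_alleles=hl.len(rows.alleles))
rows.filter(rows.n_alleles > 10000).select('n_alleles').show()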

His code is here

I tested his code on just that failing partition (by adding mt = vds.variant_data._filter_partitions([41229])) and was able to reproduce the error with Hail 0.2.119 (the version Ben was using when he hit the error on the full dataset). However, the first time I tried to reproduce it I was accidentally using a different version of Hail, and it ran with no memory error. It seems that 0.2.117 runs without error, but 0.2.118 and 0.2.119 both hit the 137 error.
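
For completeness, the single-partition reproduction boils down to something like the following sketch, assuming vds is the gnomAD v4 VariantDataset handle opened as in Ben's script (_filter_partitions is a private Hail method, used here only to isolate the failing partition):

# `vds` is assumed to come from gnomad_v4_genotypes.vds() as in the script.
# Keep only the partition containing the 27374-allele site and force evaluation.
mt = vds.variant_data._filter_partitions([41229])
print(mt.count_rows())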

I am currently rerunning these tests so I can get logs:

Test with Hail 0.2.118:

Cluster:

hailctl dataproc start readviz-118 \
    --requester-pays-allow-all \
    --packages="git+https://github.com/broadinstitute/gnomad_methods.git@main","git+https://github.com/broadinstitute/gnomad_qc.git@main"  \
    --autoscaling-policy=max-20 \
    --master-machine-type n1-highmem-16 \
    --no-off-heap-memory \
    --worker-machine-type n1-highmem-8 \
    --max-idle 560m \
    --labels gnomad_release=gnomad_v4,gnomad_v4_testing=readviz_test_118

Command:

hailctl dataproc submit readviz-118 /Users/jgoodric/PycharmProjects/gnomad-readviz/step1__select_samples.py --sample-metadata-tsv gs://gnomad-readviz/v4.0/gnomad.exomes.v4.0.metadata.tsv.gz --output-ht-path gs://gnomad-tmp/julia/readviz/gnomad.exomes.v4.0.readviz_crams.part_41229.hail_118.ht
Job Link: https://console.cloud.google.com/dataproc/jobs/4db24eb6f93b491f8f07babc25c0d9c9/monitoring?region=us-central1&project=broad-mpg-gnomad

Test with Hail 0.2.117:

Cluster:

hailctl dataproc start readviz-117 \
    --requester-pays-allow-all \
    --packages="git+https://github.com/broadinstitute/gnomad_methods.git@main","git+https://github.com/broadinstitute/gnomad_qc.git@main"  \
    --autoscaling-policy=max-20 \
    --master-machine-type n1-highmem-16 \
    --no-off-heap-memory \
    --worker-machine-type n1-highmem-8 \
    --max-idle 560m \
    --labels gnomad_release=gnomad_v4,gnomad_v4_testing=readviz_test_117

Command:

hailctl dataproc submit readviz-117 /Users/jgoodric/PycharmProjects/gnomad-readviz/step1__select_samples.py --sample-metadata-tsv gs://gnomad-readviz/v4.0/gnomad.exomes.v4.0.metadata.tsv.gz --output-ht-path gs://gnomad-tmp/julia/readviz/gnomad.exomes.v4.0.readviz_crams.part_41229.hail_117.ht
Job Link: https://console.cloud.google.com/dataproc/jobs/7d89abedcfad44d4b831986806a4e248/monitoring?region=us-central1&project=broad-mpg-gnomad

I will update here with the logs when I have them, but in the meantime, do you see any problems with reverting to 0.2.117 for this run?

Thanks!

Version

0.2.119

Relevant log output

No response

danking commented 1 year ago

Originally reported on Zulip

danking commented 1 year ago

I installed the latest hail (0.2.120-f00f916faf78), gnomad (e6f042a74c91e462b77fca24d070c815e02f6f5b), and gnomad_qc (0c52cf47e48fa5b503d874e96482ea4286474c71), and cloned the repo in question:

pip3 uninstall hail gnomad gnomad_qc

pip3 install -U \
    hail \
    git+https://github.com/broadinstitute/gnomad_methods.git \
    git+https://github.com/broadinstitute/gnomad_qc.git

git clone git@github.com:broadinstitute/gnomad-readviz.git
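
As a quick sanity check after the reinstall, the Hail build string can be confirmed from Python (hl.version() returns the full build identifier; the expected value here is the 0.2.120-f00f916faf78 build noted above):

import hail as hl

# Should print the build installed above, e.g. 0.2.120-f00f916faf78.
print(hl.version())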

I applied this patch:

diff --git a/step1__select_samples.py b/step1__select_samples.py
index c159207..9ba1812 100644
--- a/step1__select_samples.py
+++ b/step1__select_samples.py
@@ -38,14 +38,7 @@ def hemi_expr(mt):

 def main(args):

-    hl.init(log="/select_samples", default_reference="GRCh38", idempotent=True, tmp_dir=args.temp_bucket)
-    meta_ht = hl.import_table(args.sample_metadata_tsv, force_bgz=True)
-    meta_ht = meta_ht.key_by("s")
-    meta_ht = meta_ht.filter(hl.is_defined(meta_ht.cram_path) & hl.is_defined(meta_ht.crai_path), keep=True)
-    meta_ht = meta_ht.repartition(1000)
-    meta_ht = meta_ht.checkpoint(
-        re.sub(".tsv(.b?gz)?", "", args.sample_metadata_tsv) + ".ht", overwrite=True, _read_if_exists=True)
-
+    hl.init(log="/tmp/select_samples", default_reference="GRCh38", idempotent=True, tmp_dir=args.temp_bucket)
     vds = gnomad_v4_genotypes.vds()

     # see https://github.com/broadinstitute/ukbb_qc/pull/227/files
@@ -55,19 +48,8 @@ def main(args):

     v4_qc_meta_ht = meta.ht()

-    mt = vds.variant_data
-    #mt = vds.variant_data._filter_partitions([41229])
-
-    mt = mt.filter_cols(v4_qc_meta_ht[mt.s].release)
-
-    meta_join = meta_ht[mt.s]
-    mt = mt.annotate_cols(
-        meta=hl.struct(
-            sex_karyotype=meta_join.sex_karyotype,
-            cram=meta_join.cram_path,
-            crai=meta_join.crai_path,
-        )
-    )
+    #mt = vds.variant_data
+    mt = vds.variant_data._filter_partitions([41229])

     logger.info("Adjusting samples' sex ploidy")
     lgt_expr = hl.if_else(
@@ -88,9 +70,9 @@ def main(args):
     logger.info("Filter variants with at least one non-ref GT")
     mt = mt.filter_rows(hl.agg.any(mt.GT.is_non_ref()))

-    #logger.info(f"Saving checkpoint")
-    #mt = mt.checkpoint(os.path.join(args.temp_bucket, "readviz_select_samples_checkpoint1.vds"),
-    #                   overwrite=True, _read_if_exists=True)
+    logger.info(f"Saving checkpoint")
+    mt = mt.checkpoint("readviz_select_samples_checkpoint1.vds",
+                       overwrite=True, _read_if_exists=True)

     def sample_ordering_expr(mt):
         """For variants that are present in more than 10 samples (or whatever number args.num_samples is set to),

And tried running the bad step:

python3 step1__select_samples.py

I was able to get past the checkpoint:

INFO (Readviz_prep 73): Saving checkpoint
[Stage 0:>                                                          (0 + 1) / 1]

2023-09-01 18:10:29.262 Hail: INFO: wrote matrix table with 11450 rows and 955359 columns in 1 partition to readviz_select_samples_checkpoint1.vds
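
For what it's worth, the checkpoint written above can be read back and its dimensions confirmed with a short snippet (path as in the patch; despite the .vds suffix it is an ordinary MatrixTable checkpoint):

import hail as hl

# Read the checkpoint written by the patched script and confirm its shape;
# count() returns (n_rows, n_cols), expected (11450, 955359) per the log line above.
mt = hl.read_matrix_table("readviz_select_samples_checkpoint1.vds")
print(mt.count())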

@bw2 are you still encountering this issue? Did my diff oversimplify it? Do you suspect the issue occurs after the checkpoint?

danking commented 1 year ago

Cross linking to an issue on the same partition that the gnomAD team discovered when running a different pipeline: https://github.com/hail-is/hail/issues/13584