hail-is / hail

Cloud-native genomic dataframes and batch computing
https://hail.is
MIT License

[query] Zip length mismatch error on join #13486

Open chrisvittal opened 11 months ago

chrisvittal commented 11 months ago

from: https://discuss.hail.is/t/zip-length-mismatch-error/3548

> 2023-08-09 15:56:05.872 Hail: INFO: wrote table with 104 rows in 104 partitions to gs://gnomad-tmp-4day/__iruid_845-wg8w0gBbPsilRDfRegbVW3
> INFO (constraint_pipeline 463): Copying log to logging bucket...
> 2023-08-09 15:56:33.276 Hail: INFO: copying log to 'gs://gnomad-tmp/gnomad_v2.1.1_testing/constraint/logging/constraint_pipeline.log'...
Traceback (most recent call last):
  File "/tmp/17e6dae80e9548a9a94895fd07e6dee0/constraint_pipeline.py", line 785, in <module>
    main(args)
  File "/tmp/17e6dae80e9548a9a94895fd07e6dee0/constraint_pipeline.py", line 412, in main
    oe_ht = apply_models(
  File "/tmp/17e6dae80e9548a9a94895fd07e6dee0/pyscripts_ge6ozu1m.zip/gnomad_constraint/utils/constraint.py", line 396, in apply_models
  File "/tmp/17e6dae80e9548a9a94895fd07e6dee0/pyscripts_ge6ozu1m.zip/gnomad_constraint/utils/constraint.py", line 279, in create_observed_and_possible_ht
  File "<decorator-gen-1118>", line 2, in checkpoint
  File "/opt/conda/default/lib/python3.10/site-packages/hail/typecheck/check.py", line 584, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/opt/conda/default/lib/python3.10/site-packages/hail/table.py", line 1331, in checkpoint
    self.write(output=output, overwrite=overwrite, stage_locally=stage_locally, _codec_spec=_codec_spec)
  File "<decorator-gen-1120>", line 2, in write
  File "/opt/conda/default/lib/python3.10/site-packages/hail/typecheck/check.py", line 584, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/opt/conda/default/lib/python3.10/site-packages/hail/table.py", line 1377, in write
    Env.backend().execute(ir.TableWrite(self._tir, ir.TableNativeWriter(output, overwrite, stage_locally, _codec_spec)))
  File "/opt/conda/default/lib/python3.10/site-packages/hail/backend/py4j_backend.py", line 82, in execute
    raise e.maybe_user_error(ir) from None
  File "/opt/conda/default/lib/python3.10/site-packages/hail/backend/py4j_backend.py", line 76, in execute
    result_tuple = self._jbackend.executeEncode(jir, stream_codec, timed)
  File "/usr/lib/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
  File "/opt/conda/default/lib/python3.10/site-packages/hail/backend/py4j_backend.py", line 35, in deco
    raise fatal_error_from_java_error_triplet(deepest, full, error_id) from None
hail.utils.java.FatalError: HailException: zip: length mismatch: 62164, 104

Java stack trace:
is.hail.utils.HailException: zip: length mismatch: 62164, 104
    at __C8160Compiled.__m8201split_ToArray(Emit.scala)
    at __C8160Compiled.__m8169split_CollectDistributedArray(Emit.scala)
    at __C8160Compiled.__m8164split_Let(Emit.scala)
    at __C8160Compiled.apply(Emit.scala)
    at is.hail.expr.ir.CompileAndEvaluate$.$anonfun$_apply$4(CompileAndEvaluate.scala:61)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
    at is.hail.expr.ir.CompileAndEvaluate$.$anonfun$_apply$2(CompileAndEvaluate.scala:61)
    at is.hail.expr.ir.CompileAndEvaluate$.$anonfun$_apply$2$adapted(CompileAndEvaluate.scala:59)
    at is.hail.backend.ExecuteContext.$anonfun$scopedExecution$1(ExecuteContext.scala:140)
    at is.hail.utils.package$.using(package.scala:635)
    at is.hail.backend.ExecuteContext.scopedExecution(ExecuteContext.scala:140)
    at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:59)
    at is.hail.expr.ir.CompileAndEvaluate$.evalToIR(CompileAndEvaluate.scala:33)
    at is.hail.expr.ir.LowerOrInterpretNonCompilable$.evaluate$1(LowerOrInterpretNonCompilable.scala:30)
    at is.hail.expr.ir.LowerOrInterpretNonCompilable$.rewrite$1(LowerOrInterpretNonCompilable.scala:58)
    at is.hail.expr.ir.LowerOrInterpretNonCompilable$.apply(LowerOrInterpretNonCompilable.scala:63)
    at is.hail.expr.ir.lowering.LowerOrInterpretNonCompilablePass$.transform(LoweringPass.scala:67)
    at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$3(LoweringPass.scala:16)
    at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
    at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$1(LoweringPass.scala:16)
    at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
    at is.hail.expr.ir.lowering.LoweringPass.apply(LoweringPass.scala:14)
    at is.hail.expr.ir.lowering.LoweringPass.apply$(LoweringPass.scala:13)
    at is.hail.expr.ir.lowering.LowerOrInterpretNonCompilablePass$.apply(LoweringPass.scala:62)
    at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1(LoweringPipeline.scala:22)
    at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1$adapted(LoweringPipeline.scala:20)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at is.hail.expr.ir.lowering.LoweringPipeline.apply(LoweringPipeline.scala:20)
    at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:50)
    at is.hail.backend.spark.SparkBackend._execute(SparkBackend.scala:462)
    at is.hail.backend.spark.SparkBackend.$anonfun$executeEncode$2(SparkBackend.scala:498)
    at is.hail.backend.ExecuteContext$.$anonfun$scoped$3(ExecuteContext.scala:75)
    at is.hail.utils.package$.using(package.scala:635)
    at is.hail.backend.ExecuteContext$.$anonfun$scoped$2(ExecuteContext.scala:75)
    at is.hail.utils.package$.using(package.scala:635)
    at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:17)
    at is.hail.backend.ExecuteContext$.scoped(ExecuteContext.scala:63)
    at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:350)
    at is.hail.backend.spark.SparkBackend.$anonfun$executeEncode$1(SparkBackend.scala:495)
    at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
    at is.hail.backend.spark.SparkBackend.executeEncode(SparkBackend.scala:494)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
    at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
    at java.base/java.lang.Thread.run(Thread.java:829)

Hail version: 0.2.120-f00f916faf78
Error summary: HailException: zip: length mismatch: 62164, 104

One of the input tables (context_ht) has 8061724269 rows in 62164 partitions, and the other input table (mutation_ht) has 104 rows in 104 partitions. If I add `mutation_ht = mutation_ht.repartition(1)` at line 37, the script runs fine. If I instead add `mutation_ht = mutation_ht.naive_coalesce(1)`, it gives the same error.
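For reference, the two variants from the report side by side (a minimal sketch assuming `mutation_ht` has already been read):

# Full-shuffle repartition down to one partition: the script then runs fine.
mutation_ht = mutation_ht.repartition(1)

# Shuffle-free coalesce to one partition: reproduces the same zip error.
# mutation_ht = mutation_ht.naive_coalesce(1)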

I've obtained a log from a failed run and do see us zipping the two contexts together without making sure they're the same length.
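To illustrate the failure mode, a conceptual sketch in plain Python (not Hail's lowering code; the one-context-per-partition framing is an inference from the stack trace):

# Hypothetical: each input contributes one context per partition, and the
# lowered join zips the two sequences strictly.
left_contexts = [f"left_{i}" for i in range(62164)]   # context_ht: 62164 partitions
right_contexts = [f"right_{i}" for i in range(104)]   # mutation_ht: 104 partitions

if len(left_contexts) != len(right_contexts):
    # Mirrors: HailException: zip: length mismatch: 62164, 104
    raise ValueError(f"zip: length mismatch: {len(left_contexts)}, {len(right_contexts)}")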

chrisvittal commented 11 months ago

This seems to be the point in the pipeline where the join occurs.

import hail as hl


def annotate_with_mu(
    ht: hl.Table,
    mutation_ht: hl.Table,
    mu_annotation: str = "mu_snp",
) -> hl.Table:
    """
    Annotate SNP mutation rate for the input Table.

    .. note::

        Function expects that `ht` includes `mutation_ht`'s key fields. Note that these
        annotations don't need to be the keys of `ht`.

    :param ht: Input Table to annotate.
    :param mutation_ht: Mutation rate Table.
    :param mu_annotation: The name of mutation rate annotation in `mutation_ht`.
        Default is 'mu_snp'.
    :return: Table with mutational rate annotation added.
    """
    mu = mutation_ht.index(*[ht[k] for k in mutation_ht.key])[mu_annotation]
    return ht.annotate(
        **{mu_annotation: hl.case().when(hl.is_defined(mu), mu).or_error("Missing mu")}
    )
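
A minimal, hypothetical call matching the reported shapes (the table variables and the checkpoint path are illustrative, not taken from the pipeline; it assumes `context_ht` and `mutation_ht` are already read):

# context_ht: ~8.06e9 rows in 62164 partitions; mutation_ht: 104 rows in 104 partitions.
ht = annotate_with_mu(context_ht, mutation_ht)

# Per the traceback above, the failure surfaces when the result is written:
ht = ht.checkpoint("gs://bucket/observed_and_possible.ht", overwrite=True)
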
chrisvittal commented 10 months ago

Failed log here: constraint_pipeline.log

danking commented 9 months ago

@chrisvittal , is this replicable or transient?

danking commented 9 months ago

Waiting until after ASHG to pick this up again. Talk to Kristin to confirm it's replicable.

danking commented 6 months ago

Another very simple pipeline was reported at https://hail.zulipchat.com/#narrow/stream/123010-Hail-Query-0.2E2-support/topic/zip.3A.20length.20mismatch . We can get access to these files via Sam B.

import hail as hl

context_mis_freq_ht = hl.read_table("gs://epi25/misc-data/gnomAD_v4/grch38_context_vep_annotated.v105.prefiltered.missense_freq_ensp.ht")
ensp2uniprot_ht = hl.import_table("gs://epi-mis-3d/misc/ensp2uniprot_mart_export.ensp2uniprot.txt")

context_mis_freq_ht = context_mis_freq_ht.key_by("ensp")
ensp2uniprot_ht = ensp2uniprot_ht.key_by("ensp")

context_mis_freq_ht = context_mis_freq_ht.annotate(
    uniprot = ensp2uniprot_ht[context_mis_freq_ht.ensp].uniprot)

Notice that the error goes away if you instead use:

context_mis_freq_ht = hl.read_table("gs://epi25/misc-data/gnomAD_v4/grch38_context_vep_annotated.v105.prefiltered.missense_freq_ensp.ht")
ensp2uniprot_ht = hl.import_table("gs://epi-mis-3d/misc/ensp2uniprot_mart_export.ensp2uniprot.txt")

context_mis_freq_ht = context_mis_freq_ht.key_by("ensp")
ensp2uniprot_ht = ensp2uniprot_ht.key_by("ensp")

context_mis_freq_ht = context_mis_freq_ht.join(ensp2uniprot_ht, 'left')
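
Both snippets express the same left join on `ensp`: the first goes through the index/annotate path (`ensp2uniprot_ht[context_mis_freq_ht.ensp]`), the second through `Table.join` directly. Since only the first form errors, the mismatch looks specific to how the indexed join is lowered, though that hasn't been confirmed here.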