projectglow / glow

An open-source toolkit for large-scale genomic analysis
https://projectglow.io
Apache License 2.0

glow.normalize_variant fails with NullPointerException in NormalizeVariantExpr.scala:55 #536

Open nickorka opened 9 months ago

nickorka commented 9 months ago

I'm trying to use the variant normalization function, calling it within a DataFrame like this:

.withColumn(
    'normalizationResult',
    F.when(
        (F.length(F.col('ss_other_allele')) > 1)
        & ((F.length(F.col('trim_ref')) > 0) | (F.length(F.col('trim_alt')) > 0)),
        glow.normalize_variant(
            "contigName", "start", "end",
            "referenceAllele", "alternateAlleles",
            ref_path
        )
    ).otherwise(None)
)
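
Before calling it, I verify the inputs with a null check along these lines (a minimal sketch; df stands in for the pipeline DataFrame, which isn't shown here):

# Count NULLs in each input column passed to normalize_variant;
# `df` is a stand-in name for the pipeline DataFrame.
from pyspark.sql import functions as F

fields = ["contigName", "start", "end", "referenceAllele", "alternateAlleles"]
df.select(
    [F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in fields]
).show()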

I prepare the "contigName", "start", "end", "referenceAllele", and "alternateAlleles" fields before the call, and a check like the one above confirms there are no NULL values in any of them. When a Spark action runs, I get this error:

23/10/12 00:33:15 ERROR TaskContextImpl: Error in TaskCompletionListener
java.lang.NullPointerException: null
    at io.projectglow.sql.expressions.NormalizeVariantExpr$.$anonfun$doVariantNormalization$1(NormalizeVariantExpr.scala:55) ~[io.projectglow_glow-spark3_2.12-1.2.1.jar:1.2.1]
    at io.projectglow.sql.expressions.NormalizeVariantExpr$.$anonfun$doVariantNormalization$1$adapted(NormalizeVariantExpr.scala:54) ~[io.projectglow_glow-spark3_2.12-1.2.1.jar:1.2.1]
    at org.apache.spark.TaskContext$$anon$1.onTaskCompletion(TaskContext.scala:132) ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
    at org.apache.spark.TaskContextImpl.$anonfun$invokeTaskCompletionListeners$1(TaskContextImpl.scala:144) ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
    at org.apache.spark.TaskContextImpl.$anonfun$invokeTaskCompletionListeners$1$adapted(TaskContextImpl.scala:144) ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
    at org.apache.spark.TaskContextImpl.invokeListeners(TaskContextImpl.scala:199) ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
    at org.apache.spark.TaskContextImpl.invokeTaskCompletionListeners(TaskContextImpl.scala:144) ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
    at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:137) ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
    at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:180) ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
    at org.apache.spark.scheduler.Task.run(Task.scala:141) ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554) ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1541) ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557) ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_382]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_382]
    at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_382]
23/10/12 00:33:15 ERROR Executor: Exception in task 3.0 in stage 14.0 (TID 88)
org.apache.spark.util.TaskCompletionListenerException: null
    at org.apache.spark.TaskContextImpl.invokeListeners(TaskContextImpl.scala:254) ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
    at org.apache.spark.TaskContextImpl.invokeTaskCompletionListeners(TaskContextImpl.scala:144) ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
    at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:137) ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
    at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:180) ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
    at org.apache.spark.scheduler.Task.run(Task.scala:141) ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554) ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1541) ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557) ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_382]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_382]
    at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_382]
    Suppressed: java.lang.NullPointerException
        at io.projectglow.sql.expressions.NormalizeVariantExpr$.$anonfun$doVariantNormalization$1(NormalizeVariantExpr.scala:55) ~[io.projectglow_glow-spark3_2.12-1.2.1.jar:1.2.1]
        at io.projectglow.sql.expressions.NormalizeVariantExpr$.$anonfun$doVariantNormalization$1$adapted(NormalizeVariantExpr.scala:54) ~[io.projectglow_glow-spark3_2.12-1.2.1.jar:1.2.1]
        at org.apache.spark.TaskContext$$anon$1.onTaskCompletion(TaskContext.scala:132) ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
        at org.apache.spark.TaskContextImpl.$anonfun$invokeTaskCompletionListeners$1(TaskContextImpl.scala:144) ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
        at org.apache.spark.TaskContextImpl.$anonfun$invokeTaskCompletionListeners$1$adapted(TaskContextImpl.scala:144) ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
        at org.apache.spark.TaskContextImpl.invokeListeners(TaskContextImpl.scala:199) ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
        at org.apache.spark.TaskContextImpl.invokeTaskCompletionListeners(TaskContextImpl.scala:144) ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
        at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:137) ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
        at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:180) ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
        at org.apache.spark.scheduler.Task.run(Task.scala:141) ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554) ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1541) ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557) ~[spark-core_2.12-3.4.1-amzn-1.jar:3.4.1-amzn-1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_382]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_382]
        at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_382]

I've tried running just this part of the DataFrame manually from a PySpark session, and there were no errors. But when I run the whole pipeline with all the joins, it fails at exactly this step on multiple containers. The executor stats are shown in the attached screenshot (glow_error).

I'm running this on Spark 3.4.1 with 6G executors and a 3G driver.
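
For reference, a session configured with that sizing would look roughly like this (a sketch; the actual submit flags used by the job aren't shown in this issue):

from pyspark.sql import SparkSession

# Hypothetical session setup matching the sizing described above.
spark = (
    SparkSession.builder
    .config("spark.executor.memory", "6g")
    .config("spark.driver.memory", "3g")
    .getOrCreate()
)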

From the stack trace, it looks like a task-completion listener registered by Glow throws a NullPointerException when the task finishes. Can you help me with this, please?

henrydavidge commented 3 months ago

That's quite strange. Do you still see this error with the new version of Glow?

nickorka commented 3 months ago

I don't know; the problem may still be there. I found a workaround: I dump the whole DataFrame to a Parquet file on HDFS and continue the step from that Parquet file instead of dealing with the long query.
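
Roughly, the workaround looks like this (a sketch; the HDFS path is a placeholder):

# Materialize the DataFrame to Parquet to cut the long query lineage,
# then continue the normalization step from the re-read data.
checkpoint_path = "hdfs:///tmp/variants_checkpoint.parquet"  # placeholder path
df.write.mode("overwrite").parquet(checkpoint_path)
df = spark.read.parquet(checkpoint_path)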

nickorka commented 3 months ago

By the way, the new version 2.0.0 does not even initialize. It fails on import glow with a NumPy compatibility error. There is no backward compatibility at all.

henrydavidge commented 3 months ago

Try pip install -U glow.py