projectglow / glow

An open-source toolkit for large-scale genomic analysis
https://projectglow.io
Apache License 2.0

Sample Quality Control notebook broken #425

Closed dberma15 closed 2 years ago

dberma15 commented 2 years ago

The Sample Quality Control notebook breaks when run on Databricks with runtime version 9.0 (Spark 3.1.2).

Here's the notebook in question: https://glow.readthedocs.io/en/latest/_static/notebooks/etl/sample-qc-demo.html
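
For context, the qc DataFrame in that notebook comes from Glow's sample_call_summary_stats SQL function; a rough sketch of the earlier cell (paraphrased from memory, not copied verbatim; referenceAllele and alternateAlleles are the column names in Glow's VCF schema):

# Sketch of how qc is produced earlier in the notebook (see the linked notebook for the exact cell)
qc = df.selectExpr("sample_call_summary_stats(genotypes, referenceAllele, alternateAlleles) as qc")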

The error occurs on the fourth command:

display(qc.selectExpr("explode(qc) as per_sample_qc").selectExpr("expand_struct(per_sample_qc)"))

The following is the error message:

java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.expressions.Alias.<init>(Lorg/apache/spark/sql/catalyst/expressions/Expression;Ljava/lang/String;Lorg/apache/spark/sql/catalyst/expressions/ExprId;Lscala/collection/Seq;Lscala/Option;)V
---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<command-3319344438369069> in <module>
----> 1 display(qc.selectExpr("explode(qc) as per_sample_qc").selectExpr("expand_struct(per_sample_qc)"))

/databricks/spark/python/pyspark/sql/dataframe.py in selectExpr(self, *expr)
   1707         if len(expr) == 1 and isinstance(expr[0], list):
   1708             expr = expr[0]
-> 1709         jdf = self._jdf.selectExpr(self._jseq(expr))
   1710         return DataFrame(jdf, self.sql_ctx)
   1711 

/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1302 
   1303         answer = self.gateway_client.send_command(command)
-> 1304         return_value = get_return_value(
   1305             answer, self.gateway_client, self.target_id, self.name)
   1306 

/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
    115     def deco(*a, **kw):
    116         try:
--> 117             return f(*a, **kw)
    118         except py4j.protocol.Py4JJavaError as e:
    119             converted = convert_exception(e.java_exception)

/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    324             value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
    325             if answer[1] == REFERENCE_TYPE:
--> 326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
    328                     format(target_id, ".", name), value)

Py4JJavaError: An error occurred while calling o337.selectExpr.
: java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.expressions.Alias.<init>(Lorg/apache/spark/sql/catalyst/expressions/Expression;Ljava/lang/String;Lorg/apache/spark/sql/catalyst/expressions/ExprId;Lscala/collection/Seq;Lscala/Option;)V
    at io.projectglow.sql.expressions.ExpandStruct.$anonfun$expand$1(glueExpressions.scala:45)
    at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
    at scala.collection.immutable.List.foreach(List.scala:392)
    at scala.collection.TraversableLike.map(TraversableLike.scala:238)
    at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
    at scala.collection.immutable.List.map(List.scala:298)
    at io.projectglow.sql.expressions.ExpandStruct.expand(glueExpressions.scala:43)
    at io.projectglow.sql.optimizer.ResolveExpandStructRule$.$anonfun$expandExprs$1(hlsOptimizerRules.scala:82)
    at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
    at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
    at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
    at io.projectglow.sql.optimizer.ResolveExpandStructRule$.io$projectglow$sql$optimizer$ResolveExpandStructRule$$expandExprs(hlsOptimizerRules.scala:81)
    at io.projectglow.sql.optimizer.ResolveExpandStructRule$$anonfun$apply$3.applyOrElse(hlsOptimizerRules.scala:67)
    at io.projectglow.sql.optimizer.ResolveExpandStructRule$$anonfun$apply$3.applyOrElse(hlsOptimizerRules.scala:65)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsUpWithPruning$3(AnalysisHelper.scala:137)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:86)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsUpWithPruning$1(AnalysisHelper.scala:137)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:330)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsUpWithPruning(AnalysisHelper.scala:133)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsUpWithPruning$(AnalysisHelper.scala:129)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsUpWithPruning(LogicalPlan.scala:30)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsUp(AnalysisHelper.scala:110)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsUp$(AnalysisHelper.scala:109)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsUp(LogicalPlan.scala:30)
    at io.projectglow.sql.optimizer.ResolveExpandStructRule$.apply(hlsOptimizerRules.scala:65)
    at io.projectglow.sql.optimizer.ResolveExpandStructRule$.apply(hlsOptimizerRules.scala:63)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$3(RuleExecutor.scala:221)
    at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:80)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:221)
    at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
    at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
    at scala.collection.immutable.List.foldLeft(List.scala:89)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:218)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:210)
    at scala.collection.immutable.List.foreach(List.scala:392)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:210)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:285)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:278)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:224)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:188)
    at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:109)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:188)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:260)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:337)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:259)
    at org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:96)
    at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:80)
    at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:134)
    at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:180)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:852)
    at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:180)
    at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:97)
    at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:94)
    at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:86)
    at org.apache.spark.sql.Dataset$.$anonfun$ofRows$1(Dataset.scala:94)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:852)
    at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:92)
    at org.apache.spark.sql.Dataset.withPlan(Dataset.scala:3849)
    at org.apache.spark.sql.Dataset.select(Dataset.scala:1489)
    at org.apache.spark.sql.Dataset.selectExpr(Dataset.scala:1523)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
    at py4j.Gateway.invoke(Gateway.java:295)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:251)
    at java.lang.Thread.run(Thread.java:748)

williambrandler commented 2 years ago

Hey @dberma15, what version of Glow are you using?

Please try Glow v1.1.0 (not 1.0.1 or 1.0.0) on Databricks Runtime 9.1

You can use either the prepackaged Docker container, projectglow/databricks-glow:9.1,

Or attach the PyPI package and Maven coordinates to the cluster; see the screenshots below.

[Screenshots: attaching the Glow PyPI package and Maven coordinates to the cluster via the Libraries UI]
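
If you go the library route, a minimal setup sketch (assuming the v1.1.0 artifacts are the Maven coordinate io.projectglow:glow-spark3_2.12:1.1.0 and the glow.py package on PyPI; check the docs for the exact coordinates):

# Cluster libraries on Databricks Runtime 9.1:
#   Maven: io.projectglow:glow-spark3_2.12:1.1.0
#   PyPI:  glow.py
# Then register Glow's SQL functions (expand_struct, sample_call_summary_stats, etc.) in the notebook:
import glow

spark = glow.register(spark)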

dberma15 commented 2 years ago

I was using 1.0.0, I think. I created a new cluster running 9.1 with Glow 1.1.0, but now I get an error on the second-to-last cell:

stats_df = df.groupBy("INFO_SVTYPE")\
  .agg(expr("""aggregate_by_index(
    genotypes,
    0,
    (nonref, g) -> if(exists(g.calls, call -> call != -1 and call != 0), nonref + 1, nonref),
    (nonref1, nonref2) -> nonref1 + nonref2) as count_non_ref"""))
display(stats_df)
AnalysisException: Invalid call to dataType on unresolved object, tree: 'if('exists(lambda 'g.calls, lambdafunction((NOT (lambda 'call = -1) AND NOT (lambda 'call = 0)), lambda 'call, false)), (lambda 'nonref + 1), lambda 'nonref)
---------------------------------------------------------------------------
AnalysisException                         Traceback (most recent call last)
<command-3319344438369073> in <module>
----> 1 stats_df = df.groupBy("INFO_SVTYPE")\
      2   .agg(expr("""aggregate_by_index(
      3     genotypes,
      4     0,
      5     (nonref, g) -> if(exists(g.calls, call -> call != -1 and call != 0), nonref + 1, nonref),

/databricks/spark/python/pyspark/sql/group.py in agg(self, *exprs)
    116             # Columns
    117             assert all(isinstance(c, Column) for c in exprs), "all exprs should be Column"
--> 118             jdf = self._jgd.agg(exprs[0]._jc,
    119                                 _to_seq(self.sql_ctx._sc, [c._jc for c in exprs[1:]]))
    120         return DataFrame(jdf, self.sql_ctx)

/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1302 
   1303         answer = self.gateway_client.send_command(command)
-> 1304         return_value = get_return_value(
   1305             answer, self.gateway_client, self.target_id, self.name)
   1306 

/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
    121                 # Hide where the exception came from that shows a non-Pythonic
    122                 # JVM exception message.
--> 123                 raise converted from None
    124             else:
    125                 raise

AnalysisException: Invalid call to dataType on unresolved object, tree: 'if('exists(lambda 'g.calls, lambdafunction((NOT (lambda 'call = -1) AND NOT (lambda 'call = 0)), lambda 'call, false)), (lambda 'nonref + 1), lambda 'nonref)

williambrandler commented 2 years ago

Hey @dberma15, this cell relies on a private function in Spark that was removed from the Databricks Runtime with Spark 3.1. I have a ticket open with engineering tracking this.

So for now you cannot use this cell, but everything else in the notebook examples does work (as verified by Glow's nightly tests).
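
If you need a stopgap until that's fixed, here is an untested workaround sketch that sidesteps aggregate_by_index by using posexplode to count non-ref genotypes per sample index (sample_idx is just an illustrative alias; the output is one row per SVTYPE and sample index rather than an array per SVTYPE):

from pyspark.sql.functions import expr

# Count non-ref genotypes per (SVTYPE, sample index) without aggregate_by_index
per_sample = (
    df.selectExpr("INFO_SVTYPE", "posexplode(genotypes) as (sample_idx, g)")
      .withColumn("is_non_ref",
                  expr("exists(g.calls, call -> call != -1 and call != 0)").cast("int"))
      .groupBy("INFO_SVTYPE", "sample_idx")
      .agg(expr("sum(is_non_ref) as count_non_ref"))
)
display(per_sample)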

williambrandler commented 2 years ago

I removed the offending cell, and the notebooks should now work with the latest version of Glow.

Please download the updated notebook from here:

https://github.com/projectglow/glow/blob/master/docs/source/_static/notebooks/etl/sample-qc-demo.html