projectglow / glow

An open-source toolkit for large-scale genomic analysis
https://projectglow.io
Apache License 2.0

Error while calculating call_summary_stats #429

Closed · jatin-sandhuria closed 2 years ago

jatin-sandhuria commented 2 years ago

samples_filtered_genotype_df.describe()

Out[81]: DataFrame[summary: string, contigName: string, start: string, end: string, referenceAllele: string, INFO_variant_id: string, INFO_rsq: string, INFO_chromosome: string, INFO_new_variant_id: string, INFO_AN: string, INFO_call_rate: string, INFO_min_af: string, INFO_min_ac: string]

genotype_df_call_stats = samples_filtered_genotype_df.select("*", glow.expand_struct(glow.call_summary_stats("genotypes")))

java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.expressions.Alias.<init>(Lorg/apache/spark/sql/catalyst/expressions/Expression;Ljava/lang/String;Lorg/apache/spark/sql/catalyst/expressions/ExprId;Lscala/collection/Seq;Lscala/Option;)V

/databricks/spark/python/pyspark/sql/dataframe.py in select(self, *cols)
   1690         [Row(name='Alice', age=12), Row(name='Bob', age=15)]
   1691         """
-> 1692         jdf = self._jdf.select(self._jcols(*cols))
   1693         return DataFrame(jdf, self.sql_ctx)
   1694

/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1302
   1303         answer = self.gateway_client.send_command(command)
-> 1304         return_value = get_return_value(
   1305             answer, self.gateway_client, self.target_id, self.name)
   1306

/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
    115     def deco(*a, **kw):
    116         try:
--> 117             return f(*a, **kw)
    118         except py4j.protocol.Py4JJavaError as e:
    119             converted = convert_exception(e.java_exception)

/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    324             value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
    325             if answer[1] == REFERENCE_TYPE:
--> 326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
    328                     format(target_id, ".", name), value)

Py4JJavaError: An error occurred while calling o9520.select.
: java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.expressions.Alias.<init>(Lorg/apache/spark/sql/catalyst/expressions/Expression;Ljava/lang/String;Lorg/apache/spark/sql/catalyst/expressions/ExprId;Lscala/collection/Seq;Lscala/Option;)V
  at io.projectglow.sql.expressions.ExpandStruct.$anonfun$expand$1(glueExpressions.scala:45)
  at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
  at scala.collection.immutable.List.foreach(List.scala:392)
  at scala.collection.TraversableLike.map(TraversableLike.scala:238)
  at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
  at scala.collection.immutable.List.map(List.scala:298)
  at io.projectglow.sql.expressions.ExpandStruct.expand(glueExpressions.scala:43)
  at io.projectglow.sql.optimizer.ResolveExpandStructRule$.$anonfun$expandExprs$1(hlsOptimizerRules.scala:83)
  at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
  at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
  at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
  at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
  at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
  at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
  at io.projectglow.sql.optimizer.ResolveExpandStructRule$.io$projectglow$sql$optimizer$ResolveExpandStructRule$$expandExprs(hlsOptimizerRules.scala:81)
  at io.projectglow.sql.optimizer.ResolveExpandStructRule$$anonfun$apply$3.applyOrElse(hlsOptimizerRules.scala:67)
  at io.projectglow.sql.optimizer.ResolveExpandStructRule$$anonfun$apply$3.applyOrElse(hlsOptimizerRules.scala:65)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsUpWithPruning$3(AnalysisHelper.scala:137)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:86)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsUpWithPruning$1(AnalysisHelper.scala:137)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:340)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsUpWithPruning(AnalysisHelper.scala:133)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsUpWithPruning$(AnalysisHelper.scala:129)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsUpWithPruning(LogicalPlan.scala:30)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsUp(AnalysisHelper.scala:110)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsUp$(AnalysisHelper.scala:109)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsUp(LogicalPlan.scala:30)
  at io.projectglow.sql.optimizer.ResolveExpandStructRule$.apply(hlsOptimizerRules.scala:65)
  at io.projectglow.sql.optimizer.ResolveExpandStructRule$.apply(hlsOptimizerRules.scala:63)
  at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$3(RuleExecutor.scala:221)
  at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:80)
  at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:221)
  at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
  at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
  at scala.collection.immutable.List.foldLeft(List.scala:89)
  at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:218)
  at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:210)
  at scala.collection.immutable.List.foreach(List.scala:392)
  at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:210)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:285)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:278)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:224)
  at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:188)
  at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:109)
  at org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:188)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:260)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:347)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:259)
  at org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:96)
  at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:80)
  at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:134)
  at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:180)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:852)
  at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:180)
  at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:97)
  at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:94)
  at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:86)
  at org.apache.spark.sql.Dataset$.$anonfun$ofRows$1(Dataset.scala:94)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:852)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:92)
  at org.apache.spark.sql.Dataset.withPlan(Dataset.scala:3849)
  at org.apache.spark.sql.Dataset.select(Dataset.scala:1489)
  at sun.reflect.GeneratedMethodAccessor626.invoke(Unknown Source)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
  at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
  at py4j.Gateway.invoke(Gateway.java:295)
  at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
  at py4j.commands.CallCommand.execute(CallCommand.java:79)
  at py4j.GatewayConnection.run(GatewayConnection.java:251)
  at java.lang.Thread.run(Thread.java:748)
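For context, the failing call itself is standard Glow usage. A minimal sketch, assuming a Glow-enabled session; the input path is hypothetical, and any DataFrame with a genotypes array-of-structs column works the same way:

import glow
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark = glow.register(spark)  # register Glow's SQL functions on the session

# Hypothetical VCF path; Glow's vcf data source produces the genotypes column.
df = spark.read.format("vcf").load("/data/sample.vcf")

# call_summary_stats returns one struct per variant row; expand_struct
# flattens the struct's fields into top-level columns.
stats_df = df.select("*", glow.expand_struct(glow.call_summary_stats("genotypes")))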

williambrandler commented 2 years ago

Thanks Jatin, what version of Glow are you using and what version of Spark (/ the Databricks Runtime)?

This function is tested nightly by pulling from this Docker container on Databricks Runtime 9.1 (Spark 3.1.2):

projectglow/databricks-glow:9.1

jatin-sandhuria commented 2 years ago

Hi William, I am using Databricks Runtime 9.0 (includes Apache Spark 3.1.2, Scala 2.12) with glow.py==1.1.0.

williambrandler commented 2 years ago

Thanks, I ran this notebook: variant-qc-demo.html

I tried it two ways: using the Docker container on Databricks Runtime 9.1, and manually installing Glow v1.1.0 on Databricks Runtime 9.0.

Both worked.

Do you have other libraries installed on the cluster, such as Hail?

williambrandler commented 2 years ago

Oh, also: did you drop the genotypes column? I can't see it in the samples_filtered_genotype_df schema.

jatin-sandhuria commented 2 years ago

Apparently describe() doesn't show genotypes; not sure why.

genotype_df = spark.read.format('delta').load(genotype_delta_path)

contigName: string
start: long
end: long
names: array<string>
referenceAllele: string
alternateAlleles: array<string>
INFO_variant_id: string
INFO_rsq: double
INFO_chromosome: string
INFO_new_variant_id: string
INFO_AC: array
INFO_AF: array<double>
INFO_AN: integer
INFO_homozygote_count: array
INFO_call_rate: double
genotypes: array<struct>
    sampleId: string
    calls: array<integer>
    phased: boolean
    posteriorProbabilities: array<double>
    dosage: double
    old_GT: string
    old_dosage: double
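A likely explanation, not stated in the thread: DataFrame.describe() computes summary statistics only for numeric and string columns, so a complex column such as genotypes (an array of structs) is silently omitted from its output. A quick way to confirm the column is still present:

genotype_df.printSchema()                  # prints the full schema, including genotypes
print("genotypes" in genotype_df.columns)  # True: the column was never dropped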

genotype_df.select("*", glow.expand_struct(glow.call_summary_stats("genotypes")))

java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.expressions.Alias.<init>(Lorg/apache/spark/sql/catalyst/expressions/Expression;Ljava/lang/String;Lorg/apache/spark/sql/catalyst/expressions/ExprId;Lscala/collection/Seq;Lscala/Option;)V

williambrandler commented 2 years ago

OK, and do you have any configurations other than just Glow v1.1.0 and DBR 9.0?

As mentioned, I ran the variant-qc-demo.html notebook both using the Docker container on Databricks Runtime 9.1 and manually installing Glow v1.1.0 on Databricks Runtime 9.0, and both worked.

jatin-sandhuria commented 2 years ago

Hi William, we removed Hail from our cluster, but the error persists.

java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.expressions.Alias.<init>(Lorg/apache/spark/sql/catalyst/expressions/Expression;Ljava/lang/String;Lorg/apache/spark/sql/catalyst/expressions/ExprId;Lscala/collection/Seq;Lscala/Option;)V
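An aside not from the thread: a NoSuchMethodError on a Spark-internal constructor such as Alias.<init> generally indicates a binary mismatch, i.e. the Glow jar on the classpath was compiled against a different Spark/Catalyst release than the cluster is running, rather than a usage error. One hedged way to check which jars the two classes in the stack trace were actually loaded from, using standard Java reflection driven through py4j (spark is the notebook's SparkSession):

def jar_of(class_name):
    # Ask the driver JVM where it loaded a class from, to spot stale or
    # mismatched jars on the classpath.
    cls = spark._jvm.java.lang.Class.forName(class_name)
    src = cls.getProtectionDomain().getCodeSource()
    return src.getLocation().toString() if src else "<bootstrap>"

print(jar_of("org.apache.spark.sql.catalyst.expressions.Alias"))
print(jar_of("io.projectglow.sql.expressions.ExpandStruct"))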

jatin-sandhuria commented 2 years ago

These are the libraries we have installed: [screenshot]

Databricks Runtime version: [screenshot]

Spark config: [screenshot]

williambrandler commented 2 years ago

Ah, there is a mismatch between the Maven and PyPI versions of Glow. Please bump the Maven version from 1.0.1 to 1.1.0.
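For readers hitting the same mismatch outside Databricks, a minimal sketch of pinning a matched pair; the coordinates and the BGZF codec config follow Glow's getting-started docs for the 1.1.0 release, so adjust the version pair as needed:

# First: pip install glow.py==1.1.0 (same release as the Scala artifact below;
# mismatched pairs cause binary errors like the Alias.<init> one in this issue)
from pyspark.sql import SparkSession
import glow

spark = (
    SparkSession.builder
    .config("spark.jars.packages", "io.projectglow:glow-spark3_2.12:1.1.0")
    .config("spark.hadoop.io.compression.codecs", "io.projectglow.sql.util.BGZFCodec")
    .getOrCreate()
)
spark = glow.register(spark)  # register only after the matching jar is on the classpath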

Closing for now; please reopen if this does not solve the problem.

jatin-sandhuria commented 2 years ago

Thanks William. Changing the version resolved it.