projectglow / glow

An open-source toolkit for large-scale genomic analysis
https://projectglow.io
Apache License 2.0

aggregate_by_index fails with "Invalid call to dataType on unresolved object" in spark 3.1 #422

Closed by williambrandler 2 years ago

williambrandler commented 2 years ago

When using aggregate_by_index to compute an aggregation array in Spark 3.1 with Glow 1.1.0, the following code fails:

    stats_df = df.groupBy("INFO_SVTYPE")\
      .agg(expr("""aggregate_by_index(
        genotypes,
        0,
        (nonref, g) -> if(exists(g.calls, call -> call != -1 and call != 0), nonref + 1, nonref),
        (nonref1, nonref2) -> nonref1 + nonref2) as count_non_ref"""))
    display(stats_df)

AnalysisException: Invalid call to dataType on unresolved object, tree: 'if('exists(lambda 'g.calls, lambdafunction((NOT (lambda 'call = -1) AND NOT (lambda 'call = 0)), lambda 'call, false)), (lambda 'nonref + 1), lambda 'nonref)

Any ideas on how to resolve this, @henrydavidge? Should we delete this example or rewrite it with higher-order functions?
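For anyone checking a rewrite against the original example, here is a rough pure-Python sketch of what the expression above is meant to compute for a single group: for each sample position in the genotypes array, the number of rows where that sample has at least one call that is neither missing (-1) nor reference (0). The function name aggregate_by_index_reference and the toy rows are hypothetical and only mimic the update step; the merge lambda is not needed here because everything runs in one pass. This is not Glow's implementation.

```python
from typing import Callable, Dict, Iterable, List, Sequence

Genotype = Dict[str, List[int]]


def aggregate_by_index_reference(
    rows: Iterable[Sequence[Genotype]],
    zero: int,
    update: Callable[[int, Genotype], int],
) -> List[int]:
    """Hypothetical single-pass stand-in: keep one running state per array index."""
    states = None
    for genotypes in rows:
        if states is None:
            # One state per sample, each starting from the zero value.
            states = [zero] * len(genotypes)
        # Apply the update function position by position.
        states = [update(state, g) for state, g in zip(states, genotypes)]
    return states if states is not None else []


def count_non_ref(nonref: int, g: Genotype) -> int:
    # Mirrors: if(exists(g.calls, call -> call != -1 and call != 0), nonref + 1, nonref)
    has_non_ref = any(call != -1 and call != 0 for call in g["calls"])
    return nonref + 1 if has_non_ref else nonref


# Toy data (assumed): two variant rows, three samples each.
rows = [
    [{"calls": [0, 1]}, {"calls": [0, 0]}, {"calls": [-1, -1]}],
    [{"calls": [1, 1]}, {"calls": [0, 2]}, {"calls": [0, 0]}],
]

counts = aggregate_by_index_reference(rows, 0, count_non_ref)
print(counts)  # [2, 1, 0]
```

In the original query this would be computed once per INFO_SVTYPE group, so any replacement built on higher-order functions should yield the same per-sample array for each group.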

williambrandler commented 2 years ago

It seems this is related to the Databricks Runtime and not to open-source Spark, so closing.