projectglow / glow

An open-source toolkit for large-scale genomic analysis
https://projectglow.io
Apache License 2.0

mean_substitute minimal example in API fails with Invalid call to dataType on unresolved object, tree: lambda '3[sum] #369

Open williambrandler opened 3 years ago

williambrandler commented 3 years ago

The mean_substitute function works on real data, but the minimal example in the API documentation fails:

from pyspark.sql import Row
import glow
df = spark.createDataFrame([Row(unsubstituted_values=[0, 1, 2, 3, -1, None])])
df.select(glow.mean_substitute('unsubstituted_values').alias('substituted_values')).collect()

AnalysisException: Invalid call to dataType on unresolved object, tree: lambda '3[sum]

`---------------------------------------------------------------------------
AnalysisException                         Traceback (most recent call last)
in
      1 df = spark.createDataFrame([Row(unsubstituted_values=[float('nan'), None, 0.0, 1.0, 2.0, 3.0, 4.0])])
----> 2 df.select(glow.mean_substitute('unsubstituted_values', lit(0.0)).alias('substituted_values')).collect()
/databricks/spark/python/pyspark/sql/dataframe.py in select(self, *cols)
   1437         [Row(name=u'Alice', age=12), Row(name=u'Bob', age=15)]
   1438         """
-> 1439         jdf = self._jdf.select(self._jcols(*cols))
   1440         return DataFrame(jdf, self.sql_ctx)
   1441
/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1303         answer = self.gateway_client.send_command(command)
   1304         return_value = get_return_value(
-> 1305             answer, self.gateway_client, self.target_id, self.name)
   1306
   1307         for temp_arg in temp_args:
/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
    131                 # Hide where the exception came from that shows a non-Pythonic
    132                 # JVM exception message.
--> 133                 raise_from(converted)
    134             else:
    135                 raise
/databricks/spark/python/pyspark/sql/utils.py in raise_from(e)
AnalysisException: Invalid call to dataType on unresolved object, tree: lambda '3[sum]`
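
For reference, here is a plain-Python sketch of what the two failing calls would be expected to return. It assumes the documented semantics as I understand them: values that are null, NaN, or equal to the missing-value argument (default -1) are replaced by the mean of the remaining values. The helper name `mean_substitute_reference` is hypothetical and is not part of Glow. The handling of an array with no usable values is an assumption of this sketch only.

```python
import math


def mean_substitute_reference(values, missing_value=-1):
    """Hypothetical pure-Python reference, not Glow's implementation."""

    def is_missing(v):
        # Treat None, NaN, and the sentinel value as missing.
        if v is None:
            return True
        if isinstance(v, float) and math.isnan(v):
            return True
        return v == missing_value

    present = [float(v) for v in values if not is_missing(v)]
    if not present:
        # Assumption of this sketch: nothing to average, so fill with the sentinel.
        return [float(missing_value)] * len(values)

    mean = sum(present) / len(present)
    return [mean if is_missing(v) else float(v) for v in values]


# Input from the minimal example (default sentinel of -1).
print(mean_substitute_reference([0, 1, 2, 3, -1, None]))
# Input from the traceback (sentinel of 0.0, as passed via lit(0.0)).
print(mean_substitute_reference(
    [float('nan'), None, 0.0, 1.0, 2.0, 3.0, 4.0], missing_value=0.0))
```

Under those assumptions the first call gives [0.0, 1.0, 2.0, 3.0, 1.5, 1.5] and the second gives [2.5, 2.5, 2.5, 1.0, 2.0, 3.0, 4.0], which can serve as a check once the AnalysisException is resolved.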