Open vpipkt opened 5 years ago
For reference and motivation
@vpipkt To feel more confident in the relative performance measures in those plots, I'd like to see them incorporate some compute operation that is implemented in the JVM. Just creating a DataFrame in Python and then calling `toPandas` may be optimized to never send data across the runtime barrier.
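A minimal sketch of that kind of check, assuming a benchmark DataFrame `df` with a numeric column `value` (the names and timing harness are illustrative, not taken from the referenced benchmark):

import time
from pyspark.sql import functions as F

def timed(label, fn):
    start = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - start:.3f}s")

# Baseline: round-trip with no JVM-side compute; this path might be optimized
# so that little or no data actually crosses the runtime barrier.
timed("toPandas only", lambda: df.toPandas())

# Force a JVM-implemented operation: the aggregate has to be evaluated by
# Catalyst before any result can come back to Python.
timed("JVM aggregate + toPandas", lambda: df.agg(F.sum(F.sqrt(df["value"]))).toPandas())

If the two timings are comparable, the transfer is being measured; if the baseline is suspiciously fast, the benchmark may not be exercising the runtime barrier at all.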
As of today, there doesn't seem to be any support in Arrow for UDTs. It's too bad, since Parquet supports them, and all they'd have to do is convert the schema of the underlying UDT encoding.
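Until Arrow grows UDT support, a user-level workaround is to replace the UDT column with a primitive encoding before anything Arrow-backed touches it. A minimal sketch, assuming a RasterFrame `rf` with a `tile` column; whether `array<double>` is Arrow-convertible for `toPandas` depends on the Spark version, though it does work as scalar pandas UDF input in the test below:

from pyrasterframes.rasterfunctions import rf_tile_to_array_double  # assumed import path

# Replace the Tile UDT column with a plain array<double> column so that only
# Arrow-supported primitive types cross the JVM/Python boundary.
arrow_safe = rf.withColumn("cells", rf_tile_to_array_double(rf["tile"])).drop("tile")
pdf = arrow_safe.toPandas()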
This test:
def test_pandas_udf(self):
    import numpy as np
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    @pandas_udf('double', PandasUDFType.SCALAR)
    def tile_mean(cells):
        # `cells` is a Pandas `Series` of array<double> values.
        return cells.apply(np.mean)

    df = self.rf.select(
        tile_mean(rf_tile_to_array_double(self.rf.tile)).alias('pandas_udf_mean'),
        rf_tile_mean(self.rf.tile))
    df.show(truncate=False)
Generates this result:
+------------------+------------------+
|pandas_udf_mean |rf_tile_mean(tile)|
+------------------+------------------+
|10488.786144329897|10488.786144329897|
|null |10573.227770833333|
|9672.139422680413 |9672.139422680413 |
|null |10122.969770833333|
|10606.11498969072 |10606.11498969072 |
|9912.923030927835 |9912.923030927835 |
|10305.663113402063|10305.663113402063|
|9605.92806185567 |9605.92806185567 |
+------------------+------------------+
Not entirely sure, but I suspect the `null` values have to do with NoData values not getting handled properly.
Try `np.nanmean`?
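A sketch of that suggestion: the same UDF as in the test above with the reducer swapped to `np.nanmean` (whether NoData cells actually arrive as NaN in the `array<double>` values is an assumption here):

import numpy as np
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf('double', PandasUDFType.SCALAR)
def tile_nanmean(cells):
    # If NoData cells come through as NaN, np.nanmean ignores them instead of
    # letting a single NaN turn the whole mean into the null seen above.
    return cells.apply(np.nanmean)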
How is the performance?
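One rough way to answer that, assuming a DataFrame `rf` with a `tile` column and the `tile_mean` pandas UDF from the test above (the harness and names are illustrative):

import time
from pyrasterframes.rasterfunctions import rf_tile_mean, rf_tile_to_array_double  # assumed import path

rf.cache().count()  # materialize the input once so the timings compare only the mean computation

def time_action(label, df):
    start = time.perf_counter()
    df.collect()  # force full evaluation of the plan
    print(f"{label}: {time.perf_counter() - start:.3f}s")

time_action("JVM rf_tile_mean", rf.select(rf_tile_mean(rf["tile"])))
time_action("pandas_udf mean", rf.select(tile_mean(rf_tile_to_array_double(rf["tile"]))))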
Given the following correct-looking code in an environment with `pyarrow` installed:

Actual Result

Results in `java.lang.UnsupportedOperationException: Unsupported data type: tile`. Full stack below.
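The snippet the report refers to is not preserved in this excerpt, but given the error message, the failing pattern is presumably a pandas UDF applied directly to the Tile UDT column. A hypothetical reconstruction (the UDF body and names are illustrative only):

import numpy as np
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf('double', PandasUDFType.SCALAR)
def tile_mean(tiles):
    # Hypothetical body; it never runs, because Arrow rejects the input type first.
    return tiles.apply(lambda t: np.mean(t.cells))

# Passing the UDT column straight into the pandas UDF asks Arrow to serialize
# the `tile` type, which it cannot, hence "Unsupported data type: tile".
rf.select(tile_mean(rf["tile"])).show()

The test above sidesteps this by converting to `array<double>` with `rf_tile_to_array_double` before the UDF, so Arrow only ever sees primitive types.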
Expected result

Basically we expect the same behavior as a `udf`, but with the claimed performance enhancements of Pandas UDFs. Returns something like

Stack trace