Column name conflict when using raster_join on data that was already raster joined.

locationtech / rasterframes

Geospatial Raster support for Spark DataFrames

http://rasterframes.io

Apache License 2.0


Column name conflict when using raster_join on data that was already raster joined. #465

Open mjgolebiewski opened 4 years ago

mjgolebiewski commented 4 years ago

Py4JJavaError: An error occurred while calling o59.rasterJoin.
: org.apache.spark.sql.AnalysisException: Reference 'spatial_index_agg' is ambiguous, could be: spatial_index_agg, spatial_index_agg.;

Easy to reproduce: just try to raster_join 3 rasters. The error above is raised on the second join. The current workaround is to call df.drop('spatial_index_agg') before the second join.
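The workaround described above could be wrapped in a small helper. This is only a sketch: `raster_join_all` is a hypothetical name, not part of pyrasterframes, and it assumes the inputs are Spark DataFrames on which pyrasterframes has made `raster_join` available, and that `spatial_index_agg` is the only helper column a previous join leaves behind.

```python
def raster_join_all(first, *others, helper_col="spatial_index_agg"):
    """Chain raster joins, dropping the leftover helper column before each one.

    Hypothetical helper (not a pyrasterframes API). Assumes every argument
    exposes `drop(name)` and `raster_join(other)` the way a Spark DataFrame
    does once pyrasterframes is imported.
    """
    joined = first
    for other in others:
        # Spark's drop() is a no-op when the column is absent, so this is
        # also safe before the very first join.
        joined = joined.drop(helper_col).raster_join(other)
    return joined
```

With three rasters this would be called as `raster_join_all(df1, df2, df3)`.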

metasim commented 4 years ago

@mjgolebiewski What would you expect the automatic behavior to be? Do you think random characters should be added? Some other mechanism?

Note: Any columns in the RHS dataframe are going to be propagated to the joined data frame as lists.

mjgolebiewski commented 4 years ago

If not random characters, then maybe something related to joined dataframes names? I am still exploring raster_join and its outputs, so I'm not sure.

metasim commented 4 years ago

@mjgolebiewski What do you mean by "joined dataframes names"? If you mean the names of the variables referencing them, then there's no way to get that information from within raster_join. My suspicion is that this is typical Spark behavior, in that you have to take care of renaming columns before joins to keep them unique.
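Renaming before the join, as suggested here, might look like the sketch below. `rename_clashing_columns` and its `tag` and `keep` parameters are hypothetical names introduced for illustration. The only DataFrame features it relies on are the `columns` attribute and `withColumnRenamed`, both of which exist on Spark DataFrames.

```python
def rename_clashing_columns(right, left_columns, tag, keep=()):
    """Rename columns of `right` that also appear in `left_columns`.

    Hypothetical helper (not part of Spark or pyrasterframes). `keep` lists
    columns that must retain their names, e.g. ones the join itself needs.
    """
    taken = set(left_columns) - set(keep)
    for name in right.columns:
        if name in taken:
            # withColumnRenamed returns a new DataFrame; the input is untouched.
            right = right.withColumnRenamed(name, f"{name}_{tag}")
    return right
```

A call such as `rename_clashing_columns(df2, df1.columns, "b")` would then be made on the right-hand DataFrame before each join.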

vpipkt commented 4 years ago

From a pandas user perspective and also experience with R data.frame, I would expect either:

1) All column names get a distinguishing suffix indicating the side of the join they came from: ('_left', '_right') or ('_x', '_y'). These suffixes may be an argument to the join method

2) Only column names appearing in both DataFrames are disambiguated by appending a suffix in the same fashion
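To make the two options concrete, here is a small sketch that works on plain lists of column names. `suffix_columns` and its `only_clashes` flag are hypothetical and do not describe how raster_join behaves today. Option 1 corresponds to `only_clashes=False`, option 2 to `only_clashes=True`.

```python
def suffix_columns(left_cols, right_cols, suffixes=("_left", "_right"),
                   only_clashes=True):
    """Return (new_left_names, new_right_names) after applying suffixes.

    Hypothetical illustration of the two proposals, not an existing API.
    """
    shared = set(left_cols) & set(right_cols)

    def rename(cols, suffix):
        # Suffix every column (option 1) or only the shared ones (option 2).
        return [c + suffix if (not only_clashes or c in shared) else c
                for c in cols]

    return rename(left_cols, suffixes[0]), rename(right_cols, suffixes[1])


left = ["tile", "spatial_index_agg"]
right = ["tile_2", "spatial_index_agg"]

print(suffix_columns(left, right))
# (['tile', 'spatial_index_agg_left'], ['tile_2', 'spatial_index_agg_right'])
print(suffix_columns(left, right, suffixes=("_x", "_y"), only_clashes=False))
# (['tile_x', 'spatial_index_agg_x'], ['tile_2_y', 'spatial_index_agg_y'])
```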