marcell-ferencz-databricks / dash-hackathon-0424

NY Flood/03a and 3b Index and Join #4

Closed (tomkdefra closed 7 months ago)

tomkdefra commented 7 months ago

Looks like all the dataframe joins are stacking up 'road_length' columns. Can we rename them or is there a cleverer way?

from pyspark.sql import functions as F

features_df = features_df.join(
  roads_df,
  on="cell_id",
  how="left"
)

features_df = features_df.groupby("cell_id").agg(
  F.mean("target").alias("target"),
  F.mean("comppct_l").alias("comppct_l"),
  F.mean("comppct_r").alias("comppct_r"),
  F.mean("comppct_h").alias("comppct_h"),
  F.mean("slope_l").alias("slope_l"),
  F.mean("slope_r").alias("slope_r"),
  F.mean("slope_h").alias("slope_h"),
  F.mean("airtempa_l").alias("airtempa_l"),
  F.mean("airtempa_r").alias("airtempa_r"),
  F.mean("airtempa_h").alias("airtempa_h"),
  F.mean("road_length").alias("road_length")
)

AnalysisException: [AMBIGUOUS_REFERENCE] Reference road_length is ambiguous, could be: [road_length, road_length, road_length, road_length].

marcell-ferencz-databricks commented 7 months ago

@tomkdefra this may be happening if you're re-running the cell with the joins multiple times. features_df is reassigned on every run, so each re-run joins roads_df onto the already-joined dataframe and adds another road_length column.
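
One quick way to check this diagnosis is to count repeated names in features_df.columns. This is only a minimal sketch: the helper name duplicated_columns and the example column list are hypothetical, not taken from the notebook.

```python
from collections import Counter


def duplicated_columns(columns):
    """Return {name: count} for every column name that appears more than once."""
    counts = Counter(columns)
    return {name: n for name, n in counts.items() if n > 1}


# Hypothetical column list, shaped like features_df.columns after the
# join cell has been run several times.
example_columns = [
    "cell_id", "target",
    "road_length", "road_length", "road_length", "road_length",
]
print(duplicated_columns(example_columns))  # {'road_length': 4}
```

In the notebook, calling duplicated_columns(features_df.columns) just before the groupby should return an empty dict if the join has only been applied once.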

One solution is to re-run the notebook from the cell where features_df is first defined. Another is to use a different variable for each new definition of the dataframe, e.g. features_with_roads_df, so that re-running the join cell always starts from the same input.
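
The separate-variable suggestion can be combined with a small guard that makes the join cell safe to re-run. The sketch below is an illustration, not the notebook's actual code: the helper names columns_to_add and join_new_columns are hypothetical, cell_id is assumed to be the join key as in the snippet above, and the function relies only on the columns, select and join members of a PySpark DataFrame.

```python
def columns_to_add(left_columns, right_columns, key="cell_id"):
    """Names from the right side that the left side does not have yet."""
    existing = set(left_columns)
    return [c for c in right_columns if c != key and c not in existing]


def join_new_columns(left_df, right_df, key="cell_id"):
    """Left-join right_df onto left_df, skipping columns left_df already has.

    Returns left_df unchanged when there is nothing new to add, so running
    the same cell twice does not stack up duplicate columns.
    """
    new_cols = columns_to_add(left_df.columns, right_df.columns, key)
    if not new_cols:
        return left_df
    return left_df.join(right_df.select(key, *new_cols), on=key, how="left")


# In the notebook (features_df and roads_df are assumed to exist already):
# features_with_roads_df = join_new_columns(features_df, roads_df)
```

Note that the guard only prevents new duplicates. If features_df already contains several road_length columns, it still needs to be rebuilt from its first definition.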