kraina-ai / srai

Spatial Representations for Artificial Intelligence - a Python library toolkit for geospatial machine learning focused on creating embeddings for downstream tasks
https://kraina-ai.github.io/srai/
Apache License 2.0

H3Neighbourhood non-determinism #441

Open Calychas opened 7 months ago

Calychas commented 7 months ago

After the changes in #436, H3Neighbourhood became non-deterministic. Since 4.0.0.b3 (https://github.com/uber/h3-py/pull/339), the underlying library (h3-py) returns neighbours in random order. That means downstream models cannot be forced to return the same results across sessions, even with fixed seeds.
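To illustrate why the order matters, here is a minimal sketch using only the standard library. It is not srai code: `pick_positive` and the cell names are hypothetical, and it only assumes that a model draws from the neighbours with a seeded random generator. The same neighbours in a different order give a different draw, while sorting first makes the draw repeatable.

```python
import random


def pick_positive(neighbours, seed=42):
    """Pick one neighbour with a fixed seed (hypothetical helper)."""
    rng = random.Random(seed)
    # The seeded generator picks the same position every time,
    # so the chosen element depends on the order of the input.
    return rng.choice(list(neighbours))


# Hypothetical neighbour IDs: same members, two different orders.
order_a = ["cell_a", "cell_b", "cell_c", "cell_d", "cell_e", "cell_f"]
order_b = list(reversed(order_a))

# Same seed, same members, different order: the picks differ.
unsorted_picks = (pick_positive(order_a), pick_positive(order_b))

# Sorting first removes the dependence on the incoming order.
sorted_picks = (pick_positive(sorted(order_a)), pick_positive(sorted(order_b)))

print(unsorted_picks[0] == unsorted_picks[1])
print(sorted_picks[0] == sorted_picks[1])
```

Fixing the seed alone is therefore not enough; the input order has to be fixed as well.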

Potential solutions:

  1. sort the values returned by the neighbourhoods inside the models - easiest for now, but it has to be remembered in every model
  2. change the interface and logic of the neighbourhoods to return sorted results (probably a list instead of a set) - preferred, one fix and done

For example, solution 1 applied to Hex2VecEmbedder could look like this:

def _build_lookup_tables(self, data: pd.DataFrame, neighbourhood: Neighbourhood[T]) -> None:
    anchor_df_locs_lookup: list[int] = []
    positive_df_locs_lookup: list[int] = []

    for region_df_loc, region_index in tqdm(enumerate(data.index), total=len(data)):
        # Sort the neighbours so the lookup tables are built in the same order every session.
        region_direct_neighbours = sorted(neighbourhood.get_neighbours(region_index))
        # A list (rather than a set) keeps the sorted order when extending the lookup.
        neighbours_df_locs = [
            self._region_index_to_df_loc[neighbour_index]
            for neighbour_index in region_direct_neighbours
        ]
        anchor_df_locs_lookup.extend([region_df_loc] * len(neighbours_df_locs))
        positive_df_locs_lookup.extend(neighbours_df_locs)

        indices_excluded_from_negatives = sorted(
            neighbourhood.get_neighbours_up_to_distance(
                region_index, self._negative_sample_k_distance
            )
        )
        self._excluded_from_negatives[region_df_loc] = {
            self._region_index_to_df_loc[excluded_index]
            for excluded_index in indices_excluded_from_negatives
        }

    self._anchor_df_locs_lookup = np.array(anchor_df_locs_lookup)
    self._positive_df_locs_lookup = np.array(positive_df_locs_lookup)
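For comparison, solution 2 could be sketched roughly as below. This is only an illustration of the idea, not the srai interface: `ToyNeighbourhood` and `SortedNeighbourhood` are hypothetical names, and the only assumption is that a neighbourhood exposes `get_neighbours` and returns an unordered set.

```python
class ToyNeighbourhood:
    """Hypothetical stand-in for a neighbourhood that returns unordered sets."""

    def __init__(self, adjacency: dict[str, set[str]]) -> None:
        self._adjacency = adjacency

    def get_neighbours(self, index: str) -> set[str]:
        # Unknown regions simply have no neighbours in this toy example.
        return set(self._adjacency.get(index, set()))


class SortedNeighbourhood:
    """Hypothetical wrapper that returns neighbours as a sorted list."""

    def __init__(self, inner: ToyNeighbourhood) -> None:
        self._inner = inner

    def get_neighbours(self, index: str) -> list[str]:
        # Sorting once here means callers no longer have to remember to do it.
        return sorted(self._inner.get_neighbours(index))


toy = ToyNeighbourhood({"r1": {"r3", "r2"}, "r2": {"r1"}})
neighbourhood = SortedNeighbourhood(toy)
result = neighbourhood.get_neighbours("r1")
print(result)
```

With the ordering guaranteed at the neighbourhood level, the models would not need their own `sorted(...)` calls.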