kraina-ai / srai

Spatial Representations for Artificial Intelligence - a Python library toolkit for geospatial machine learning focused on creating embeddings for downstream tasks
https://kraina-ai.github.io/srai/
Apache License 2.0
220 stars, 17 forks

Add spatial stratification algorithm for splitting datasets into training and testing #433

Open RaczeQ opened 7 months ago

RaczeQ commented 7 months ago

Add an algorithm for splitting the dataset based on spatial location instead of random sampling.
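
To illustrate the difference from random sampling, here is a minimal sketch of one possible approach, a grid-block split: points are grouped into square grid cells and whole cells are assigned to train or test, so neighbouring points stay on the same side. This is not the algorithm proposed in this issue and not part of srai; `grid_block_split`, `cell_size`, and `test_fraction` are hypothetical names, and plain `(x, y)` tuples stand in for geometries.

```python
import math
import random
from collections import defaultdict


def grid_block_split(points, cell_size, test_fraction=0.2, seed=0):
    """Label each (x, y) point "train" or "test", keeping grid cells intact."""
    # Group point indices by the square grid cell they fall into.
    blocks = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        cell = (math.floor(x / cell_size), math.floor(y / cell_size))
        blocks[cell].append(idx)

    # Shuffle the cells, not the points, so neighbours stay together.
    cells = sorted(blocks)
    random.Random(seed).shuffle(cells)

    # Move whole cells to "test" until the target count is reached.
    # The final test share can overshoot because cells are never divided.
    n_test_target = test_fraction * len(points)
    labels = ["train"] * len(points)
    n_test = 0
    for cell in cells:
        if n_test >= n_test_target:
            break
        for idx in blocks[cell]:
            labels[idx] = "test"
        n_test += len(blocks[cell])
    return labels


# Hypothetical usage: a 10 x 10 lattice of points split with 5 x 5 cells.
points = [(x + 0.5, y + 0.5) for x in range(10) for y in range(10)]
labels = grid_block_split(points, cell_size=5.0, test_fraction=0.25)
```

The cell size controls how far apart training and test points tend to be; it would need to be chosen per dataset.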

RaczeQ commented 7 months ago

Sources:

sabman commented 7 months ago

@RaczeQ Thanks for creating this issue. I'd like to see if I can contribute. I am assuming this refers to the training loop for the embedding models? If so, can you also point to the code module where this might be used? I am guessing it's this: https://github.com/kraina-ai/srai/blob/3e7a787f69e43835fb117fc5d9e21bd9b7050620/srai/embedders/hex2vec/neighbour_dataset.py#L154

RaczeQ commented 7 months ago

Hello @sabman, thank you for showing interest in expanding the library 😊

I created this issue specifically with end-tasks in mind, and I was planning to leave the training of the embedding models (hex2vec, geovex, etc.) unchanged: those will still be fitted on the whole provided dataset.

However, now that you've mentioned it, I can see a potential use case in combination with the existing embedders, purely for benchmarking purposes:

  1. Prepare regions / features geodataframes.
  2. Split them into training and validation data.
  3. Train embedder on training data.
  4. Transform validation data (with both encoder and decoder) and calculate the loss between the decoded and original values.
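
The four steps above could be sketched roughly as follows. This is only an illustration under loud assumptions: `ToyEmbedder` and `reconstruction_mse` are hypothetical stand-ins and do not reflect the srai embedder API, the feature rows are made-up numbers, and the split in steps 1-2 is assumed to have been done already (for example by a spatial split).

```python
from statistics import fmean


class ToyEmbedder:
    """Hypothetical stand-in for an embedder; NOT part of srai."""

    def fit(self, rows):
        # Learn the per-feature mean from the training rows only.
        self.means = [fmean(col) for col in zip(*rows)]
        return self

    def encode(self, row):
        # One-number "embedding": the average offset from the training means.
        return fmean(v - m for v, m in zip(row, self.means))

    def decode(self, code):
        # Reconstruct every feature as training mean plus the encoded offset.
        return [m + code for m in self.means]


def reconstruction_mse(embedder, rows):
    """Mean squared error between original and decoded feature values."""
    errors = [
        (value - decoded) ** 2
        for row in rows
        for value, decoded in zip(row, embedder.decode(embedder.encode(row)))
    ]
    return fmean(errors)


# Steps 1-2: feature rows already split into training and validation data.
train_rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
valid_rows = [[2.0, 3.0], [4.0, 7.0]]

# Step 3: fit on the training data only.
embedder = ToyEmbedder().fit(train_rows)

# Step 4: encode and decode the validation data, then compare with the originals.
valid_loss = reconstruction_mse(embedder, valid_rows)
```

Because the embedder never sees the validation rows during fitting, the loss measures how well it generalises to held-out regions rather than how well it memorised the training ones.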

Currently we don't have any specific examples with downstream tasks in the documentation; there is one in our dedicated tutorial repository (https://github.com/kraina-ai/srai-tutorial). I think of this functionality as a future utility that takes a given geodataframe and assigns a stratification class based on geometry (or, in a more sophisticated scenario, based on a class column AND geometry).

My previous comment lists the materials I've gathered on this topic. If there is a good off-the-shelf solution for this use case, we can just add it as a dependency and wrap it within the srai API. If you have more ideas, examples or sources, I'd be thankful if you shared them 🙇🏻.

RaczeQ commented 7 months ago
# just pseudo-coding here
from typing import Optional

import geopandas as gpd
import pandas as pd


def spatial_stratification(
    regions_gdf: gpd.GeoDataFrame,
    no_output_classes: int = 2,
    split_values: Optional[list[float]] = None,
    class_column: Optional[str] = None,
) -> pd.Series:
    """
    Generate a Pandas Series of stratification class values, indexed like the provided GeoDataFrame.

    Args:
        regions_gdf (gpd.GeoDataFrame): The regions that are being stratified.
        no_output_classes (int, optional): How many classes should be in the result series.
            Defaults to 2.
        split_values (Optional[list[float]], optional): The fractions of rows per class; they are
            normalized to sum to 1. When not provided, rows will be stratified equally.
            Defaults to None.
        class_column (Optional[str], optional): Name of a column to additionally take into
            consideration when stratifying geometries. Defaults to None.

    Returns:
        pd.Series: Stratification class for each row of regions_gdf.
    """
    if no_output_classes < 1:
        raise ValueError("Number of output classes should be positive.")

    if not split_values:
        # No fractions given: split the rows equally between classes.
        split_values = [1 / no_output_classes for _ in range(no_output_classes)]

    # Normalize so that the fractions sum to 1.
    total = sum(split_values)
    normalized_split_values = [split_value / total for split_value in split_values]
    ...
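
One way the elided part could use `normalized_split_values` is sketched below, assuming the regions have already been grouped into spatial blocks (for example grid cells). This is a guess at a possible continuation, not the implementation planned in this issue: `assign_blocks_to_classes` and `block_sizes` are hypothetical names, and the greedy rule is just one simple heuristic.

```python
def assign_blocks_to_classes(block_sizes, normalized_split_values):
    """Greedily map each spatial block to a stratification class index.

    block_sizes: dict of block id -> number of rows in that block.
    normalized_split_values: target fraction of rows per class (sums to 1).
    """
    total_rows = sum(block_sizes.values())
    targets = [fraction * total_rows for fraction in normalized_split_values]
    filled = [0] * len(targets)
    assignment = {}

    # Place the largest blocks first so smaller ones can even things out later.
    ordered = sorted(block_sizes.items(), key=lambda kv: (-kv[1], str(kv[0])))
    for block_id, size in ordered:
        # Pick the class that is currently furthest below its target row count.
        best_class = max(range(len(targets)), key=lambda c: targets[c] - filled[c])
        assignment[block_id] = best_class
        filled[best_class] += size
    return assignment


# Hypothetical usage: four blocks, 75% of rows for class 0 and 25% for class 1.
block_sizes = {"a": 45, "b": 30, "c": 15, "d": 10}
assignment = assign_blocks_to_classes(block_sizes, [0.75, 0.25])
```

Whole blocks are kept together, so the resulting class sizes only approximate the requested fractions; the block-to-class mapping would then be broadcast back to rows to build the returned Series.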