Open RaczeQ opened 7 months ago
Sources:
@RaczeQ Thanks for creating this issue. I'd like to see if I can contribute. I assume this refers to the training loop for the embedding models? If so, could you also point to the code module where this might be used? I'm guessing it's this: https://github.com/kraina-ai/srai/blob/3e7a787f69e43835fb117fc5d9e21bd9b7050620/srai/embedders/hex2vec/neighbour_dataset.py#L154
Hello @sabman, thank you for showing interest in expanding the library 😊
I've created this issue specifically with end-tasks in mind, and I was planning on leaving the embedding models training (hex2vec, geovex etc) without changes - those will still be fitted on the whole provided dataset.
However, now that you've mentioned it, I can see a potential use case in combination with an existing embedder, just for benchmarking purposes:
Currently, we don't have any specific examples with downstream tasks in the documentation; there is one in our dedicated tutorial repository (https://github.com/kraina-ai/srai-tutorial). I think of this functionality as a future utility that takes a given GeoDataFrame and assigns a stratification class based on geometry (or, in a more sophisticated scenario, on a class column AND geometry).
My previous comment lists the materials I've gathered on this topic. If there is a good off-the-shelf solution for this use case, we can just add it as a dependency and wrap it within the srai API. If you have more ideas, examples or sources, I'd be grateful if you shared them 🙇🏻.
# just pseudo-coding here
from typing import Optional

import pandas as pd
from geopandas import GeoDataFrame


def spatial_stratification(
    regions_gdf: GeoDataFrame,
    no_output_classes: int = 2,
    split_values: Optional[list[float]] = None,
    class_column: Optional[str] = None,
) -> pd.Series:
    """
    Generate a pandas Series of stratification class values, indexed like the provided GeoDataFrame.

    Args:
        regions_gdf (GeoDataFrame): The regions that are being stratified.
        no_output_classes (int, optional): How many classes should be in the result series.
            Defaults to 2.
        split_values (Optional[list[float]], optional): The fraction of rows assigned to each
            class. When not provided, rows will be stratified equally. Defaults to None.
        class_column (Optional[str], optional): Name of a column to additionally take into
            consideration when stratifying geometries. Defaults to None.
    """
    if no_output_classes < 1:
        raise ValueError("Number of output classes should be positive.")

    if not split_values:
        # No fractions given: split equally between all classes.
        split_values = [1 / no_output_classes for _ in range(no_output_classes)]
    elif len(split_values) != no_output_classes:
        raise ValueError("split_values should have one entry per output class.")

    # Normalize the fractions so that they sum to 1.
    total = sum(split_values)
    normalized_split_values = [split_value / total for split_value in split_values]

    ...
Add an algorithm for splitting the dataset based on spatial location instead of random sampling.
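
To make the idea of splitting by location rather than by random sampling more concrete, here is a minimal sketch of one possible approach: grid-block assignment. It is not an existing srai API. The function name `spatial_block_split`, the `block_size` parameter and the greedy allocation are all assumptions made for illustration. It works on plain centroid coordinates so that it has no dependencies; with GeoPandas these could be taken from `regions_gdf.centroid`.

```python
import math
import random
from collections import defaultdict


def spatial_block_split(
    centroids: list[tuple[float, float]],
    block_size: float,
    split_values: list[float],
    seed: int = 0,
) -> list[int]:
    """Assign a class index to each centroid, keeping whole grid blocks together.

    Hypothetical helper for illustration only, not part of srai.
    """
    if block_size <= 0:
        raise ValueError("block_size should be positive.")
    if not split_values or any(value <= 0 for value in split_values):
        raise ValueError("split_values should be non-empty and positive.")

    # Group point indices by the square grid block that contains them.
    blocks: dict[tuple[int, int], list[int]] = defaultdict(list)
    for idx, (x, y) in enumerate(centroids):
        key = (math.floor(x / block_size), math.floor(y / block_size))
        blocks[key].append(idx)

    # Shuffle blocks (not points), so neighbouring points share a class.
    block_keys = sorted(blocks)
    random.Random(seed).shuffle(block_keys)

    # Target number of points per class, from the normalized fractions.
    total = sum(split_values)
    targets = [value / total * len(centroids) for value in split_values]
    counts = [0] * len(split_values)
    labels = [0] * len(centroids)

    for key in block_keys:
        # Greedily give the block to the class furthest below its target.
        cls = max(range(len(targets)), key=lambda c: targets[c] - counts[c])
        for idx in blocks[key]:
            labels[idx] = cls
        counts[cls] += len(blocks[key])

    return labels


# Usage: 100 points on a 10 x 10 grid, 2 x 2 blocks, roughly an 80/20 split.
points = [(i + 0.5, j + 0.5) for i in range(10) for j in range(10)]
labels = spatial_block_split(points, block_size=2.0, split_values=[0.8, 0.2])
```

Because whole blocks are assigned at once, the resulting class sizes only approximate the requested fractions when blocks hold different numbers of points. Handling `class_column` on top of geometry is left out of this sketch.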