stevenpawley / Pyspatialml

Machine learning modelling for spatial data
GNU General Public License v3.0

Stratified random sampling - ValueError #42

Closed: CaioBertolini closed this issue 2 years ago

CaioBertolini commented 2 years ago

Hello @stevenpawley, thank you for this nice library!

I am having an issue with stratified random sampling: calling Raster.sample with a strata raster raises a ValueError.

Issue:

ValueError                                Traceback (most recent call last)
Input In [2], in <cell line: 5>()
      3 stack = Raster(predictors)
      5 with rasterio.open(nc.strata) as strata:
----> 7     df_strata = stack.sample(size=100, strata=strata, random_state=1)
      8     df_strata = df_strata.dropna()

File ~\.conda\envs\srtest\lib\site-packages\pyspatialml\raster.py:2059, in Raster.sample(self, size, strata, return_array, random_state)
   2056     valid_coordinates = np.column_stack((x, y))
   2058     # extract data
-> 2059     valid_samples = self.extract_xy(valid_coordinates)
   2061 # return as geopandas array as default (or numpy arrays)
   2062 if return_array is False:

File ~\.conda\envs\srtest\lib\site-packages\pyspatialml\raster.py:2112, in Raster.extract_xy(self, xys, return_array, progress)
   2105 for i, (layer, pbar) in enumerate(
   2106         zip(self.iloc,
   2107             tqdm(self.iloc, total=self.count, disable=not progress))
   2108 ):
   2109     sampler = sample_gen(
   2110         dataset=layer.ds, xy=xys, indexes=layer.bidx, masked=True
   2111     )
-> 2112     v = np.ma.asarray([i for i in sampler])
   2113     X[:, i] = v.flatten()
...
---> 61     pts = zip(*filter(None, pts))
     63     for row_off, col_off in zip(*rowcol(dt, *pts)):
     64         if row_off < 0 or col_off < 0 or row_off >= height or col_off >= width:

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Code:

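import rasterio
import pyspatialml.datasets.nc as nc
from pyspatialml import Raster

# stack the example predictor bands into a single Raster
predictors = [nc.band1, nc.band2, nc.band3, nc.band4, nc.band5, nc.band7]
stack = Raster(predictors)

# passing an open rasterio dataset as strata triggers the ValueError above
with rasterio.open(nc.strata) as strata:
    df_strata = stack.sample(size=100, strata=strata, random_state=1)
    df_strata = df_strata.dropna()

stevenpawley commented 2 years ago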

The sample method needed some TLC. I've updated the GitHub version and, to be consistent with the other methods, the strata argument should now be passed another Raster object. The syntax would be:

import pyspatialml.datasets.nc as nc
from pyspatialml import Raster

predictors = [nc.band1, nc.band2, nc.band3, nc.band4, nc.band5, nc.band7]
stack = Raster(predictors)
strata = Raster(nc.strata)

# return arrays
X, xy = stack.sample(size=100, strata=strata, return_array=True)

# return dataframe
samples = stack.sample(size=100, strata=strata)

Please test and see if that helps.

CaioBertolini commented 2 years ago

Thank you for your help!! I tested the stratified random sampling after your update and it worked.
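
Note on the traceback: the thread does not state the root cause, but one plausible reading is that the failing line, `pts = zip(*filter(None, pts))`, received the NumPy coordinate array built by `np.column_stack((x, y))`. `filter(None, ...)` asks each row for its truth value, and a NumPy row with two elements has no single truth value. The sketch below reproduces that message with NumPy alone. The coordinate values and the `drop_falsy` helper are hypothetical stand-ins for illustration, not code from pyspatialml or rasterio.

```python
import numpy as np

# Hypothetical coordinates, stacked the same way as in the traceback
x = np.array([0.5, 1.5, 2.5])
y = np.array([10.0, 20.0, 30.0])
valid_coordinates = np.column_stack((x, y))  # shape (3, 2)


def drop_falsy(points):
    """Mimic the `filter(None, pts)` step shown in the traceback."""
    return list(filter(None, points))


# Each row of a 2-D array is itself an array, so bool(row) is ambiguous
try:
    drop_falsy(valid_coordinates)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# Plain Python tuples have a well-defined truth value, so the filter passes
as_tuples = [tuple(row) for row in valid_coordinates.tolist()]
kept = drop_falsy(as_tuples)

print(error_message)
print(kept)
```

If that reading is right, converting the coordinates to a list of tuples before sampling would sidestep the error. The fix confirmed in this thread, though, is to update to the GitHub version and pass a Raster object as strata.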