stevenpawley / Pyspatialml

Machine learning modelling for spatial data
GNU General Public License v3.0

Mask missing values from evaluation matrix #31

Closed Itunuadedeji closed 3 years ago

Itunuadedeji commented 3 years ago

Hello Stephen,

I am currently fitting machine learning models to spatial data where my predictors are rasters and my response is a point geometry (in-stream concentration). The point data is not available everywhere, and the raster files contain many empty cells that fall outside the watershed boundary. Is there a way to mask the NA values and empty cells (-9999 values) from the machine learning evaluation metrics, so that the evaluation (R2, RMSE, etc.) covers only the available observations? My professor thinks I can do this by reaching out to the package creators and modifying the source code.

stevenpawley commented 3 years ago

Hello, I'm not sure I completely understand what you are looking to achieve, but the predict method should usually not output predictions for pixels where one or more of the RasterLayers in the Raster contain missing values, i.e. a single missing value in one of the grids will cause that pixel to be omitted from the prediction, even if the other layers do contain values there. That assumes the nodata values are set correctly for each of the input datasets, i.e. rasterio recognizes your -9999 as a nodata value.

Is this what you mean, i.e. you are currently getting predictions for pixels that contain -9999 values, which should represent nodata?
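The pixel-wise behaviour described above can be sketched with plain NumPy. This is only an illustration of the masking logic, not Pyspatialml's actual implementation: the array `stack`, the `NODATA` constant and the "sum of layers" stand-in for a model are all assumptions made for the example.

```python
import numpy as np

NODATA = -9999.0  # assumed nodata marker, as mentioned in the thread

# Hypothetical 2-layer raster stack with shape (layers, rows, cols)
stack = np.array([
    [[1.0, 2.0, NODATA],
     [4.0, 5.0, 6.0]],
    [[10.0, NODATA, 30.0],
     [40.0, 50.0, 60.0]],
])

# A pixel is valid only if every layer holds a real value at that location
valid = np.all(stack != NODATA, axis=0) & ~np.isnan(stack).any(axis=0)

# Stand-in "model": the sum of the layers (a fitted estimator would go here)
prediction = np.full(valid.shape, np.nan)
prediction[valid] = stack[:, valid].sum(axis=0)

print(valid)
print(prediction)
```

Only the two pixels with a -9999 in at least one layer are left as NaN; every other pixel still gets a prediction.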

Itunuadedeji commented 3 years ago

This is good information. So far, I have not been able to fit an ML model to a raster stack (coerced into a data frame) that contains NA values. I get an error saying that NA values are present.

What I want specifically is to fit an ML model to a raster stack. My response is a rasterized point feature dataset with few observations, meaning over 90% of the cells are missing. I want the ML algorithm to skip the missing values and report evaluation metrics only for pixels with available data, so that I can then use the trained model to predict the missing values.

Also, you mentioned that a single missing value will cause the pixel to be omitted. Do you mean that all the aligned pixels (i.e. the cell with ID [2,3] in every raster across the stack) will be omitted even if some of them have values, or that just the single empty cell will be omitted?

On another note, is it possible to have the predictors as rasters and the response as a point feature, or do they all have to be in the same raster format?
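The workflow described above (fit only where the response is observed, evaluate on those cells, then predict the rest) can be sketched generically with NumPy. This is a rough illustration under stated assumptions, not the Pyspatialml API: the synthetic `predictors` and `response` grids, the -9999 marker and the ordinary least-squares fit standing in for an ML model are all hypothetical.

```python
import numpy as np

NODATA = -9999.0  # assumed nodata marker
rng = np.random.default_rng(0)

# Hypothetical data: 2 predictor layers and a sparse response on a 20x20 grid
predictors = rng.normal(size=(2, 20, 20))
response = np.full((20, 20), NODATA)
obs = rng.random((20, 20)) < 0.1  # roughly 10% of cells have an observation
noise = rng.normal(scale=0.1, size=obs.sum())
response[obs] = 3.0 * predictors[0][obs] - 2.0 * predictors[1][obs] + 1.0 + noise

predictors[0, 0, 0] = NODATA  # e.g. a cell outside the watershed boundary

# Train only where the response and every predictor layer are available
has_x = np.all(predictors != NODATA, axis=0)
has_y = response != NODATA
train = has_x & has_y

X = np.column_stack([predictors[:, train].T, np.ones(train.sum())])
y = response[train]

# Least squares as a stand-in for any ML estimator
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ coef

# Metrics computed on observed cells only
rmse = np.sqrt(np.mean((y - y_hat) ** 2))
r2 = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

# Predict the cells that have predictors but no observed response
target = has_x & ~has_y
X_new = np.column_stack([predictors[:, target].T, np.ones(target.sum())])
filled = np.where(has_y, response, np.nan)
filled[target] = X_new @ coef

print(f"n_train={train.sum()}, RMSE={rmse:.3f}, R2={r2:.3f}")
```

The metrics here are in-sample; for an honest estimate, the observed cells would normally be split further into training and test sets (or cross-validated) before computing R2 and RMSE.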

RichardScottOZ commented 3 years ago

Any pixel with no data will become nodata, not the entire grid.