stevenpawley / Pyspatialml

Machine learning modelling for spatial data
GNU General Public License v3.0
145 stars 29 forks

Categorical features for catboost prediction #35

Closed szaman769 closed 2 years ago

szaman769 commented 3 years ago

Hi, I love your work; for geospatial machine learning projects it is an essential one. Recently I've encountered an issue which is relatively simple to solve in R, but I cannot find how to solve it with Pyspatialml. When training ML models I use categorical features (e.g. with catboost). Would it be possible to add such functionality here? I assume that the cat_features params could be passed to the estimator without the need to create new rasters for them (at least this is how it is done in R (https://rdrr.io/cran/raster/man/predict.html) using the const argument). To be more precise, I use it e.g. for real estate price prediction, and it would be nice to be able to have a spatial prediction where I give constant values for a property (like floor number or submarket, etc.) that are independent of space. Thanks!

stevenpawley commented 2 years ago

Incorporated your pull request - thanks for the contribution. I did make some changes; maybe these need more discussion.

The raster/terra const argument appears to be intended for adding new constant features to a raster without needing to make actual raster datasets that contain only a single value. Your request appeared to modify the value of an existing layer by setting it to a constant value?

I would think that the more common use case is where you would want to add a constant feature to the raster without needing to create an actual raster, for example adding a depth or time value as a feature. For this, a list of values, one for each new feature, should suffice, rather than a dict of key-value pairs. The new constant features are added as columns to the existing data. This does mean that the user needs to ensure that the order of these columns is the same as during training (unlike in R, where most models look up the columns by name, not position), but that applies to most things in sklearn in general.

I've also included this approach in all of the predict and predict_proba methods.
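The column-appending behaviour described above can be sketched with plain NumPy. This is only an illustration under stated assumptions, not Pyspatialml's actual implementation: the function name `append_constants` and the example values are hypothetical, and the sketch assumes the constant columns come last in the training feature order.

```python
import numpy as np


def append_constants(X, constants):
    """Append one constant-valued column per entry in `constants` to X.

    Hypothetical helper for illustration only. X holds one row per pixel
    and one column per raster band; `constants` is a list of scalars.
    """
    X = np.asarray(X, dtype=float)
    # Repeat the list of constants once per row, giving shape (n_rows, n_constants).
    const_cols = np.tile(np.asarray(constants, dtype=float), (X.shape[0], 1))
    # New features are appended on the right, so their order must match training.
    return np.hstack([X, const_cols])


# Three pixels with two raster-derived features each (made-up values).
pixels = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])

# Add two constant features, e.g. a depth and a year (made-up values).
X_new = append_constants(pixels, [10.0, 2021.0])
print(X_new.shape)
```

Because the estimator sees only column positions, swapping the two constants here would silently feed the depth into the year column, which is the ordering caveat mentioned above.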

szaman769 commented 2 years ago

Hi, sorry for not answering all of the comments earlier. I wanted to work on them but I've had too much on my plate recently.

  1. I guess predict_proba could also use the const parameter. I didn't use it, however, because catboost (for which I needed the feature) doesn't have the predict_proba method.
  2. "Your request appeared to modify the value of an existing layer" - frankly it can do both, and as far as I remember from using it in the raster package in R, it could be used in both cases there as well. I use it when I don't have rasters, and cannot even have rasters unless I decide to change string factor variables into categorical numeric variables (e.g. market: secondary vs primary in predicting real estate price). I think having a dictionary is useful since you don't have to remember the order of all features. If you know the name you can easily assign a value (e.g. the median value from the training dataset), as I do in my pipeline. Still, it is your package and I'm glad you continue to develop it because it is one of a kind :) . Best wishes, Adam
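The dictionary-based variant argued for in point 2 could look roughly like the following. This is a minimal sketch, not the code from the pull request: `build_features` and all feature names and values are hypothetical, and categorical values are assumed to be already encoded as numbers.

```python
import numpy as np


def build_features(raster_values, raster_names, const, training_order):
    """Assemble a feature matrix by name, filling some columns with constants.

    Hypothetical helper for illustration only. `const` maps a feature name
    to a scalar; it can add a feature that has no raster or override one
    that does. Columns are returned in `training_order`.
    """
    n_rows = raster_values.shape[0]
    columns = {name: raster_values[:, i] for i, name in enumerate(raster_names)}
    for name, value in const.items():
        # A constant either adds a new column or replaces an existing layer.
        columns[name] = np.full(n_rows, value)
    missing = [name for name in training_order if name not in columns]
    if missing:
        raise KeyError(f"No values supplied for features: {missing}")
    return np.column_stack([columns[name] for name in training_order])


# Two pixels with two raster-derived features (made-up values).
raster_values = np.array([[1200.0, 85.0], [300.0, 90.0]])
raster_names = ["distance_to_center", "elevation"]

# Constants independent of location, e.g. floor number and an encoded market type.
const = {"floor": 3, "market": 1}
training_order = ["floor", "distance_to_center", "market", "elevation"]

X = build_features(raster_values, raster_names, const, training_order)
print(X.shape)
```

Looking columns up by name removes the need to remember positions, at the cost of having to carry the training feature names alongside the estimator.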
