stevenpawley / Pyspatialml

Machine learning modelling for spatial data
GNU General Public License v3.0

addition to work with NGBoost #24

Open RichardScottOZ opened 3 years ago

RichardScottOZ commented 3 years ago

Hi Steven,

FYI, did this last year to use your work with NGBoost, finally got around to updating.

stevenpawley commented 3 years ago

Thanks for the contribution Richard - a prediction method that can output a distribution will be useful. I'll merge this in the next few days.

RichardScottOZ commented 3 years ago

Yes, I haven't put any sensible comments or docs on it, as it was literally just a quick change so I could get the output.

RichardScottOZ commented 3 years ago

I saw a prediction module? Yesterday I was working on a hack for using hdbscan... are the functions in raster.py going to migrate, or be versioned for other uses?

stevenpawley commented 3 years ago

Hi Richard,

I'm just working on a couple of problems relating to the in-memory files feature that I added to Pyspatialml, but I'd like to return to this. NGBoost looks like it uses a pred_dist method. Do you know if this works within scikit-learn's structures, e.g. can it function inside a pipeline?

Scikit-learn doesn't appear to support prediction intervals very uniformly or extensively. GradientBoostingRegressor enables prediction intervals via quantile predictions, but it does this without a new method: you set the quantile loss and the 'alpha' parameter on the estimator, fit it, and then use the regular predict function for the specified quantile.
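As a rough illustration of the GradientBoostingRegressor approach described above, here is a minimal sketch on synthetic data. The data, the chosen quantiles and the variable names are assumptions for the example only. Note that `alpha` only selects the quantile when `loss="quantile"`, and the model has to be fitted once per quantile.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data, purely illustrative
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

# One fitted estimator per quantile: `alpha` picks the quantile and is
# only used when loss="quantile"
quantiles = [0.05, 0.5, 0.95]
models = {
    q: GradientBoostingRegressor(
        loss="quantile", alpha=q, n_estimators=50, random_state=0
    ).fit(X, y)
    for q in quantiles
}

# The regular predict() returns the estimate for that model's quantile
preds = {q: m.predict(X) for q, m in models.items()}
lower, median, upper = (preds[q] for q in quantiles)

# Fraction of training targets that fall inside the 5-95% interval
coverage = np.mean((y >= lower) & (y <= upper))
print(f"empirical coverage of the 5-95% interval: {coverage:.2f}")
```

Because everything goes through the ordinary `predict`, each of these models could sit inside a Pipeline unchanged, at the cost of one fit per quantile.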

My favourite R random forest implementation, ranger, for which there is also a Python wrapper around the C++ libraries (skranger), also allows quantile prediction, but in Python it uses a predict_quantiles method to do this. That is a different approach again, so I don't think quantile predictions can be made easily if the estimator is encapsulated within another structure like a Pipeline.

RichardScottOZ commented 3 years ago

I haven't tried it, but I would guess probably? Only thing I think I remember seeing is a grid search mentioned there.

RichardScottOZ commented 3 years ago

I was wondering about that a little when I saw your apply function, e.g. if I needed a StandardScaler-transformed raster stack based on the original for clustering. Would it take a function and an argument dictionary along with the array, or anything else?

stevenpawley commented 3 years ago

Yes, I was wondering the same thing: whether the apply method could be used for applying predictions with arbitrary/non-standard methods. I think it can, but I should work through it with an example, because I'd still like to use NGBoost or skranger for prediction intervals. When I tried with skranger, it wouldn't work if wrapped inside pipelines or other wrappers because they don't have a predict_quantiles method to pass through.

RichardScottOZ commented 3 years ago

Yes, so that case might need some sort of custom pipeline overloading hackery, which isn't ideal.
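One possible shape for that kind of workaround, as a sketch only: run the data through every transformer step of a fitted Pipeline, then call the non-standard method on the final estimator by name. `ToyQuantileRegressor` and `call_final_method` are hypothetical names made up for this example. The toy class only stands in for an estimator that exposes something like predict_quantiles, and is not skranger's or NGBoost's actual implementation.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class ToyQuantileRegressor(RegressorMixin, BaseEstimator):
    """Hypothetical stand-in for an estimator with a non-standard method."""

    def __init__(self, quantiles=(0.1, 0.9)):
        self.quantiles = quantiles

    def fit(self, X, y):
        # One quantile-loss model per requested quantile
        self.models_ = [
            GradientBoostingRegressor(
                loss="quantile", alpha=q, n_estimators=20, random_state=0
            ).fit(X, y)
            for q in self.quantiles
        ]
        return self

    def predict(self, X):
        # Crude point estimate: mean of the quantile predictions
        return self.predict_quantiles(X).mean(axis=1)

    def predict_quantiles(self, X):
        # One column per quantile
        return np.column_stack([m.predict(X) for m in self.models_])


def call_final_method(pipeline, X, method, **kwargs):
    """Apply each transformer step, then call `method` on the last step."""
    Xt = X
    for _, step in pipeline.steps[:-1]:
        if step is not None and step != "passthrough":
            Xt = step.transform(Xt)
    final_estimator = pipeline.steps[-1][1]
    return getattr(final_estimator, method)(Xt, **kwargs)


# Synthetic data, purely illustrative
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X[:, 0] + rng.normal(scale=0.1, size=100)

pipe = Pipeline(
    [("scale", StandardScaler()), ("model", ToyQuantileRegressor())]
).fit(X, y)

# The Pipeline itself has no predict_quantiles, so go around it
q = call_final_method(pipe, X, "predict_quantiles")
print(q.shape)  # (100, 2)
```

This keeps the standard Pipeline for fitting and only sidesteps it at prediction time, but it would not help with other wrappers that hide the final estimator differently.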

RichardScottOZ commented 3 years ago

and hdbscan class label estimation looks like this, basically

result, result_strengths_t = hdbscan.approximate_predict(estimator, flat_pixels)

(so 2 outputs to do)

and there is also

#result = estimator.predict_proba(flat_pixels)
result = hdbscan.prediction.membership_vector(estimator, flat_pixels)

which gives the probabilities of being in any particular cluster
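To make the "2 to do" part concrete, here is a small NumPy-only sketch of how a raster stack might be flattened into flat_pixels and how two returned arrays could be reshaped back into two single-band rasters. The (bands, rows, cols) layout is an assumption, and `fake_approximate_predict` is a hypothetical stand-in that only mimics the (labels, strengths) return shape. It is not hdbscan, and nodata masking is ignored.

```python
import numpy as np


def fake_approximate_predict(pixels):
    """Hypothetical stand-in returning (labels, strengths) per pixel."""
    labels = (pixels[:, 0] > 0).astype(int)
    strengths = np.abs(np.tanh(pixels[:, 0]))
    return labels, strengths


# Raster stack assumed to be (bands, rows, cols); values are synthetic
bands, rows, cols = 3, 4, 5
stack = np.random.default_rng(0).normal(size=(bands, rows, cols))

# One row per pixel, one column per band
flat_pixels = stack.reshape(bands, -1).T

labels, strengths = fake_approximate_predict(flat_pixels)

# Two outputs become two single-band rasters in the original rows x cols shape
label_raster = labels.reshape(rows, cols)
strength_raster = strengths.reshape(rows, cols)
print(label_raster.shape, strength_raster.shape)
```

A membership vector would follow the same pattern, except that each cluster column would be reshaped into its own band.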