jasperroebroek / sklearn-quantile

BSD 3-Clause "New" or "Revised" License
18 stars 1 forks source link

Request to add functionality for training / test inputs that contain NaN #11

Open cosmosLukas opened 2 months ago

cosmosLukas commented 2 months ago

The current version of RandomForestRegressor supports nans and missing values in training data, however, sklearn-quantile does not.

If you edit the lines of code (e.g., in quantile.pyx) that use "check_X_y" or "check_array" from sklearn to include force_all_finite="allow-nan" and recompile, then from what I can tell, everything works fine.

However, you also need to replace "np.quantile" with "np.nanquantile" or else nans are output in some of the quantiles.

I have not done extensive testing, but in the use cases with my own data, this appears to work. Maybe someone else will benefit from this too.

jasperroebroek commented 2 months ago

This is unfortunately not really the right way to implement this. The 'sampling' implementation that I assume you are using stores a random sample of data per tree per leaf. If this value is NaN this basically reduces the number of trees that your calculated value is based on.

I don't really know how sklearn implemented this, if they either ignore the entries with NaN values or they are doing something different. Depending on that, I could do this same before calling the default random forest implementation.

The same goes for the test step. Conceptually, what is the result supposed to be if a NaN is present in the data? NaN? Something else?