Open cosmosLukas opened 2 months ago
This is unfortunately not really the right way to implement this. The 'sampling' implementation that I assume you are using stores a random sample of data per tree per leaf. If this value is NaN this basically reduces the number of trees that your calculated value is based on.
I don't really know how sklearn implemented this, if they either ignore the entries with NaN values or they are doing something different. Depending on that, I could do this same before calling the default random forest implementation.
The same goes for the test step. Conceptually, what is the result supposed to be if a NaN is present in the data? NaN? Something else?
The current version of RandomForestRegressor supports nans and missing values in training data, however, sklearn-quantile does not.
If you edit the lines of code (e.g., in quantile.pyx) that use "check_X_y" or "check_array" from sklearn to include force_all_finite="allow-nan" and recompile, then from what I can tell, everything works fine.
However, you also need to replace "np.quantile" with "np.nanquantile" or else nans are output in some of the quantiles.
I have not done extensive testing, but in the use cases with my own data, this appears to work. Maybe someone else will benefit from this too.