scikit-learn-contrib / skope-rules

machine learning with logical rules in Python
http://skope-rules.readthedocs.io

Classification vs regression #37

Open benman1 opened 4 years ago

benman1 commented 4 years ago

Hi, I think this package looks fantastic. I am wondering, however, whether there are any plans for implementing SkopeRules for regression.

I've made a start on adding regression, and I had to make a lot of changes; I made this up as I went through the code, really. I had to come up with measures comparable to precision and recall: the precision-like measure is based on the expected reduction in standard deviation, and the recall-like measure is based on the z-score of the prediction versus the population of y. At the end, scores are integrated via softmax-weighted rules. At the moment I still get a lot of NaNs in predictions, because there are not enough rules, and the overall MSE is still much worse than a linear regression baseline.
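Roughly, a toy version of those measures and the softmax aggregation looks like this (a simplified sketch of what I described above, not the actual code in my branch; the rule and the numbers are made up):

import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=200)
covered = y > 0.5  # toy "rule": a boolean mask of the samples it fires on

# precision-like: expected reduction in standard deviation on the covered set
precision_like = 1.0 - y[covered].std() / y.std()

# recall-like: z-score of the rule's prediction versus the population of y
recall_like = abs(y[covered].mean() - y.mean()) / y.std()

# integrate several rules via softmax weights over their scores
scores = np.array([precision_like + recall_like, 0.4, 0.1])  # one score per rule
outputs = np.array([y[covered].mean(), 0.3, -0.2])           # one prediction per rule
weights = np.exp(scores) / np.exp(scores).sum()
prediction = weights @ outputs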

I've also added comments and a test for regression. This is WIP, but I am happy for anyone to jump in.

Thanks!

benman1 commented 4 years ago

After more testing, it seems that on the diabetes dataset I am using for benchmarking, the linear model actually outperforms both the random forest regressor and the decision tree regressor (the latter by a lot), so I might have been a bit too strict in judging the performance I was getting. I am now getting performance very similar to both the random forest and the linear model, although without rule filtering and without deduplication.
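For reference, the baseline comparison can be reproduced along these lines (a sketch; my actual benchmark setup may differ slightly):

from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
for model in (LinearRegression(),
              RandomForestRegressor(random_state=0),
              DecisionTreeRegressor(random_state=0)):
    # 5-fold cross-validated mean squared error for each baseline
    mse = -cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    print(f"{type(model).__name__}: MSE = {mse:.0f}")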

wjj5881005 commented 3 years ago

I think the OOB score computed in the fit function is wrong.

The authors get the OOB samples via "mask = ~samples" and then take X[mask, :]. I actually tested this case and found that samples and X[mask, :] share many of the same elements.
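Presumably samples was a boolean mask in the older scikit-learn version this was written against, but estimators_samples_ now returns arrays of integer indices, so ~samples is a bitwise NOT rather than a set complement. A small example shows the problem (samples here is a made-up bootstrap draw):

import numpy as np

n_samples = 8
samples = np.array([0, 2, 2, 5])   # bootstrap indices, drawn with replacement
print(~samples)                    # [-1 -3 -3 -6]: bitwise NOT maps i to -i - 1,
                                   # and negative indices wrap around in X[mask, :]

# a correct complement uses membership, not bitwise negation:
oob = np.setdiff1d(np.arange(n_samples), samples)
print(oob)                         # [1 3 4 6 7]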

I also looked at the OOB implementation of the random forest, and found the following code:

import numpy as np
from sklearn.utils import check_random_state

# draw the bootstrap indices the same way the forest does
random_instance = check_random_state(random_state)
sample_indices = random_instance.randint(0, n_samples, n_samples_bootstrap)

# any index that was never drawn is out-of-bag
sample_counts = np.bincount(sample_indices, minlength=n_samples)
unsampled_mask = sample_counts == 0
indices_range = np.arange(n_samples)
unsampled_indices = indices_range[unsampled_mask]

Then unsampled_indices contains the true OOB sample indices.
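For example, with concrete values (same variable names as assumed above):

import numpy as np
from sklearn.utils import check_random_state

random_state = 0
n_samples = n_samples_bootstrap = 10

random_instance = check_random_state(random_state)
sample_indices = random_instance.randint(0, n_samples, n_samples_bootstrap)
sample_counts = np.bincount(sample_indices, minlength=n_samples)
unsampled_indices = np.arange(n_samples)[sample_counts == 0]

print(sample_indices)     # [5 0 3 3 7 9 3 5 2 4]
print(unsampled_indices)  # [1 6 8]: disjoint from the bootstrap draw, as expected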