neurodata / honest-forests

Honest decision forests and trees implemented efficiently and scikit-learn compliant.

Investigate combined honest trees + isotonic calibration #2

Open rflperry opened 2 years ago

rflperry commented 2 years ago

Background

Honest decision trees build upon conventional decision trees by splitting the samples into two sets: one for learning the decision tree structure and the other for learning the classification posterior probabilities. In practice, this provides better calibration (i.e. the estimated probabilities are closer to the true probabilities). See this paper for details.
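The sample-splitting idea can be sketched with plain scikit-learn pieces. This is a minimal illustration under assumed names, not the optimized implementation in this repo:

```python
# Minimal sketch of honest posterior estimation (assumed names, not the
# repo's API): learn the tree structure on one half of the data, then
# re-estimate each leaf's class frequencies on the held-out half.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, random_state=0)
X_struct, X_honest, y_struct, y_honest = train_test_split(
    X, y, test_size=0.5, random_state=0
)

# Structure set: fit an ordinary decision tree.
tree = DecisionTreeClassifier(random_state=0).fit(X_struct, y_struct)

# Honest set: assign the held-out samples to leaves and recompute the
# per-leaf class frequencies from them.
leaf_ids = tree.apply(X_honest)
classes = np.unique(y)
uniform = np.full(len(classes), 1.0 / len(classes))
posteriors = {
    leaf: np.array([(y_honest[leaf_ids == leaf] == c).mean() for c in classes])
    for leaf in np.unique(leaf_ids)
}

def honest_predict_proba(X_new):
    # Leaves that received no honest samples fall back to a uniform prior.
    return np.vstack([posteriors.get(leaf, uniform) for leaf in tree.apply(X_new)])
```

Because the leaf posteriors come from samples the structure never saw, they are not biased toward the training splits, which is where the calibration benefit comes from.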

The code and experiments for the above paper are located in a fork of ProgLearn. The minimum working code and tutorial is seen in this notebook.

This issue has been copied over from the proglearn repository, but some of the requirements there have already been met by creation of this repository.

Request

An issue was made in sklearn and the simulations and paper attracted developer interest. The paper explored the performance of honest decision forests against the traditional forest as well as two other calibration methods, sigmoid and isotonic. A developer expressed interest in the results of combining honest trees with isotonic calibration given that isotonic calibration seems to do better than just honest posteriors. The request is thus to run the simulations and cc18 experiments from the paper with the added honest + isotonic forest method to see if this combined approach gives better calibration results than either approach alone.

Proposed Workflow

As the current honest forest code and experiments lie on a fork, it may be worthwhile to first create a new repository for just the optimized honest forest code and experiments as a separate entity from proglearn. Either way, the rough workflow would be:

  1. Verify that this honest decision tree can be used as the base estimator for the sklearn isotonic calibration just like the regular sklearn decision tree can be. This may require editing the honest tree class to conform to sklearn-specific requirements. This is probably the hardest step.
  2. Rerun the overlapping Gaussian simulation using this method too and determine the results.
  3. If the method seems promising, run on the real cc18 data experiments.
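Step 1 can be sketched with scikit-learn's `CalibratedClassifierCV`. A stock `RandomForestClassifier` stands in below; slotting the honest forest into this exact position is precisely what step 1 would verify:

```python
# Sketch of step 1: wrapping a forest in sklearn's isotonic calibration.
# A stock RandomForestClassifier stands in for the honest forest here;
# the honest tree/forest would need to satisfy the same estimator API
# to be dropped into this exact position.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=0)
calibrated = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    method="isotonic",  # the calibration method this issue asks about
    cv=3,               # internal folds used to fit the isotonic map
)
calibrated.fit(X, y)
proba = calibrated.predict_proba(X)
```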
jzheng17 commented 2 years ago

I saw your email and will work on the calibration. I'm still learning about honest trees and how they work. I'm looking to make my first commit this week.

jzheng17 commented 2 years ago

Dear Ronan, Your honest decision tree can indeed be used as the base estimator for isotonic calibration. The behavior is normal programmatically, but I'll verify that it's doing what it should mathematically.

I will now rerun the overlapping Gaussian simulation using this method.

Audrey

jzheng17 commented 2 years ago

Dear Ronan, I've done steps 1 and 2 from the workflow. You can see the results here: https://github.com/jzheng17/honest-forests/blob/main/honest_forests/estimators/tests/isotonic_calibration_test.ipynb Note that I haven't integrated the tests as a pull request to the main repo.

The isotonic calibrated HF works fine and performs slightly better than HF itself. However, when comparing to isotonic calibrated RF using the overlapping Gaussian methods, the curves look almost identical.

I will run the cc18 data experiments now.

Regards, Audrey

rflperry commented 2 years ago

Okay, awesome. I'll try to take a look by Friday. I honestly didn't expect a huge difference, but I think the difference between HF and Iso HF is probably the most interesting thing to see in simulations. CC18 is a big set of experiments, by the way: I used a computing cluster, and it still took a long time with many cores running in parallel. Given what the simulations already say, I'm not sure it's worth running; it would be a waste of energy/time/compute unless you have a strong reason to think otherwise.

jzheng17 commented 2 years ago

Dear Ronan, I'm using Google Colab Pro to run these experiments. Would you suggest any smaller-scale experiments? Because the curves for calibrated HF and calibrated RF are so similar on the toy dataset, I'm curious how they would look on real-world experiments.

Audrey

rflperry commented 2 years ago

You could try individual CC18 datasets. You can probably find the original csv file results on my repo and see on which datasets IRF and HF did well compared to RF. But if the curves don't look different on a toy dataset, I wouldn't expect them to look different on a real dataset. A better use of time would be thinking about the reasons why they are the same or different, explaining that, and maybe testing through new simulations whether there is a meaningful difference.

jzheng17 commented 2 years ago

What exactly do you mean by new simulations?

rflperry commented 2 years ago

So the initial honest-forest paper asked how honest forests compare to other forests and calibration methods. The overlapping Gaussian simulation showed that RF wasn't calibrated well, and that honest forests improved calibration, as did the other methods. The simulation was designed such that RF would fail.

The initial question posed in this GitHub issue was: is there a difference between Iso HF on the one hand and HF or Iso RF alone on the other? The simulation you added answers that: no, there isn't a difference in this example. If you think there is a difference, you would (1) come up with a hypothesis for how the methods differ and (2) design an experiment where one method fails and the other succeeds.
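For context, the overlapping-Gaussian setup described above can be sketched like this (the separation `mu` is an arbitrary choice for illustration, not the paper's setting). Its appeal is that the Bayes posterior is known in closed form, so calibration is directly measurable:

```python
# Sketch of an overlapping-Gaussian simulation (mu is an arbitrary
# separation chosen for illustration, not the paper's setting): two
# unit-variance classes whose overlap keeps the true posterior away
# from 0/1, so over-confident estimators are visibly miscalibrated.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, mu = 500, 1.0

y = rng.integers(0, 2, size=n)
x = rng.normal(loc=np.where(y == 1, mu, -mu), scale=1.0)

# The Bayes-optimal posterior P(y=1 | x) is available in closed form,
# which is what makes calibration directly measurable here.
p1 = norm.pdf(x, loc=mu)
p0 = norm.pdf(x, loc=-mu)
true_posterior = p1 / (p1 + p0)
```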

jzheng17 commented 2 years ago

Within this scheme of questions about overlap between Gaussian clusters, there is a visible difference between Iso HF and HF (and between Iso RF and HF), but there isn't a noticeable difference between Iso RF and Iso HF. I see what you mean now. Do you have any suggestions for papers/readings on other common methods for testing performance differences in posterior probabilities (which is what I think the paper was trying to address by proposing HF)? Thank you.

rflperry commented 2 years ago

I don't think it is an issue with the metric; I think isotonic calibration leads to better calibration, which is what we showed in the HF paper. In the HF paper, I cite a Guo (2018?) paper which was seminal. You can also google papers and tutorials on conformal prediction; those are related.
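For concreteness, the kind of calibration measurement these comparisons rely on can be sketched with scikit-learn's built-ins (a stock random forest stands in for whichever method is being evaluated):

```python
# Sketch of the calibration measurements these comparisons rely on:
# a reliability curve and the Brier score from scikit-learn. A stock
# random forest stands in for whichever method is being evaluated.
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

proba = (
    RandomForestClassifier(random_state=0)
    .fit(X_tr, y_tr)
    .predict_proba(X_te)[:, 1]
)

# Fraction of positives vs. mean predicted probability per bin: for a
# perfectly calibrated model these two arrays would coincide.
frac_pos, mean_pred = calibration_curve(y_te, proba, n_bins=10)
brier = brier_score_loss(y_te, proba)
```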

jzheng17 commented 2 years ago

Sorry for any confusion my earlier wording may have caused. Just to clarify the objective of the future experiments I'm designing: are we trying to show the superiority of isotonically calibrated HF over HF, or the superiority of HF over RF when both are isotonically calibrated? As for the Guo paper you mentioned, is it this one: https://arxiv.org/abs/1706.04599 (On Calibration of Modern Neural Networks)?

rflperry commented 2 years ago

That's the paper. My HF paper showed that HF is less accurate than RF but improves calibration. It also showed that IRF is generally still better calibrated than HF. The question of this GitHub issue was whether the combination of HF and isotonic calibration is even more calibrated than IRF (and hence than HF). My instinct was no, and your results suggest no. Future experiments could address this, but honestly I don't think it's an interesting enough question to spend more time on. There are more interesting questions regarding forest methods.