tyler-tomita / RandomerForest

Discriminant Projection Forest results, datasets, etc.

Response to reviewers #81

Closed tyler-tomita closed 8 years ago

tyler-tomita commented 8 years ago

Revised here

jovo commented 8 years ago

Dear Reviewers,

We sincerely thank you for your feedback and criticisms. We have carefully considered your comments and would like to respond to some of the major concerns.

In our view, our contribution is two-fold. First, it is a re-analysis and re-interpretation of oblique decision forests, including, for example, Breiman's Forest-RC (FRC). Second, by virtue of this improved perspective, we provide a number of novel advancements. We provide more details below.

  1. When Breiman introduced FRC in his seminal paper, he concluded: "Overall, it compares more favorably to Adaboost than Forest-RI (FRI)." And yet it is FRI, the axis-aligned counterpart, that has been lauded. In particular, two recent studies (Delgado 2014; Caruana 2008) that found FRI to be the overall best-performing classification method among a variety of methods on a variety of benchmark datasets did not include FRC in the comparisons. We conjecture that one of the main reasons people focus on FRI rather than FRC is that FRC has an additional hyperparameter to tune, which makes FRC computationally several-fold less tractable than FRI. We therefore formulated a variant with performance similar to FRC but with only one parameter, as in FRI, thereby achieving the best of both worlds.

    We have since conducted extensive experiments demonstrating that our RerF and FRC indeed have similar performance, though RerF is several-fold faster to tune. We will include both the accuracy and the timing results in the revision.

  2. A question remains as to why FRC & RerF outperform their axis-aligned counterpart FRI. We conjecture that this is because RerF & FRC, at each node of each tree, generate a random matrix that satisfies the conditions under which a variety of theoretical results on random projections and sketching hold, which is not true for FRI. Because those theoretical advancements are more recent than Breiman's original proposal, he never made this connection, and to our knowledge, neither has anybody else. This is therefore the first theoretically grounded explanation of the performance of FRC over FRI. We will clarify this important point in the revision.
  3. Although FRC & RerF outperform FRI under certain assumptions, it is clear that any oblique method will lose one of the most appealing properties of FRI: unit and scale invariance. While Breiman did not propose an approach for mitigating this problem, we have proposed converting to ranks as a pre-processing step.

    In the revision, we have modified and extended the experiments that transform the data, in particular with regard to scale, to show that our RerF(rank) is significantly more robust to these transformations than RerF and FRC.

  4. Breiman seems to have viewed FRC as a variant of FRI, and perhaps therefore only considered d (the number of features to generate per node) to be relatively small, and in particular, always much less than p (the number of observed features). While this is a logical constraint for FRI, FRC and RerF have no such limitation. Therefore, we have conducted extensive experiments letting d be significantly larger than p (up to p^2), for the first time to our knowledge. Indeed, we have discovered that in many settings, d > p further improves the performance of FRC & RerF over FRI. We will include these additional experiments in the revision.
  5. Any estimation in high-dimensional settings calls for an analysis of bias vs. variance. For decision forests, we cast this as an analysis of weak learner strength vs. diversity. We have therefore conducted experiments to investigate the trade-off between these two factors, and have preliminary results suggesting that as d increases, the strength of the weak learners can continue to improve, even though their diversity decreases. This suggests that RerF & FRC improve over FRI because of the increased strength of the weak learners, rather than increased diversity. In the revision, we will include these experiments and discuss potential avenues for future exploration.
  6. Finally, the reviewers point out a number of shortcomings in the text itself, with regard to the description of random projections, etc. We will clarify all of those points.
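
To make points 2 to 4 concrete, here is a minimal sketch of the two ingredients discussed above: a per-feature rank transform as pre-processing, and a sparse random matrix used to generate d candidate features at a node, with d allowed to exceed p. This is an illustration only, not our implementation. The function names, the entries in {-1, 0, +1}, and the `density` parameter are assumptions made for the example.

```python
import random


def rank_transform(X):
    """Replace each feature (column) by the ranks of its values.

    Tied values receive the average of their 1-based ranks.
    """
    n, p = len(X), len(X[0])
    R = [[0.0] * p for _ in range(n)]
    for j in range(p):
        order = sorted(range(n), key=lambda i: X[i][j])
        k = 0
        while k < n:
            # Find the end of the run of tied values starting at position k.
            m = k
            while m + 1 < n and X[order[m + 1]][j] == X[order[k]][j]:
                m += 1
            avg = (k + m) / 2 + 1  # average 1-based rank of the tied run
            for t in range(k, m + 1):
                R[order[t]][j] = avg
            k = m + 1
    return R


def sparse_random_matrix(p, d, density, rng):
    """Assumed distribution: each entry is +1 or -1 with probability
    density / 2 each, and 0 otherwise. Returns a p x d list of lists."""
    A = [[0] * d for _ in range(p)]
    for r in range(p):
        for c in range(d):
            u = rng.random()
            if u < density / 2:
                A[r][c] = 1
            elif u < density:
                A[r][c] = -1
    return A


def project(X, A):
    """Candidate features at a node: X (n x p) times A (p x d)."""
    p, d = len(A), len(A[0])
    return [[sum(x[r] * A[r][c] for r in range(p)) for c in range(d)] for x in X]


rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(5)]
# Rescale two of the three features by very different positive constants.
X_scaled = [[x[0] * 1000.0, x[1] * 0.01, x[2]] for x in X]

A = sparse_random_matrix(p=3, d=9, density=1 / 3, rng=rng)  # d = p^2 > p
Z = project(rank_transform(X), A)
Z_scaled = project(rank_transform(X_scaled), A)
```

Because ranks depend only on the ordering within each feature, `Z` and `Z_scaled` are identical for the same matrix `A`, whereas projecting the raw `X` and `X_scaled` would generally give different candidate features.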

mine is a bit too long, so maybe you can shorten? i like mine better, but i'm not convinced it is. one point: you make comments that will be of general interest in response to 1 of the reviewers, but they should be made to everyone perhaps?