suiji / Arborist

Scalable decision tree training and inference.

autoCompress grows only stumps on binary data #48

Closed mnwright closed 4 years ago

mnwright commented 4 years ago

With the default value for autoCompress, the R package grows only tree stumps when x-data is binary. Example:


library(Rborist)

# Simulate data
n <- 100
p <- 4
x <- replicate(p, rbinom(n, 1, .5))
y <- factor(rbinom(n, 1, .5))

# Default autoCompress -> Less than 3 nodes per tree on average
rf <- Rborist(x = x, y = y, nTree = 100, minInfo = 1e-20)
rb <- Export(rf)
length(rb$tree[[length(rb$tree)]]$internal$split)/length(rb$tree)

# Disable autoCompress -> Expected behaviour (several splits per tree)
rf <- Rborist(x = x, y = y, nTree = 100, minInfo = 1e-20, autoCompress = 1)
rb <- Export(rf)
length(rb$tree[[length(rb$tree)]]$internal$split)/length(rb$tree)

suiji commented 4 years ago

Thank you for catching this - and for using the Export() command.

suiji commented 4 years ago

This appears to be repaired in 0.2-3.

Training with and without autocompression should yield nearly identical results. Differences, if there are any, should all be attributable to the ordering of floating point operations.

Leaving this open for further testing.

mnwright commented 4 years ago

Thanks, seems to be working.

suiji commented 4 years ago

Thank you for confirming.

We see increasing divergence between the two training modes as the number of categories increases. Hawk does not feel that this should be attributable solely to differences in floating point accumulation, so we need to take a deeper look before closing the Issue.

suiji commented 4 years ago

In the case of regression, at least, differences in accuracy between training with and without autocompression appear to be due to floating point accumulation. In particular, neither regime is uniformly superior to the other.
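As an aside, accumulation order alone is enough to change a floating-point result; neither ordering is "more correct" in general. A minimal illustration, independent of Arborist (plain Python, assuming IEEE-754 doubles):

```python
# Same three addends, two accumulation orders, two different sums.
# In order (a), 1.0 is absorbed into 1e16 before the cancellation,
# so it is lost; in order (b), the large terms cancel first.
a = [1e16, 1.0, -1e16]
b = [1e16, -1e16, 1.0]

print(sum(a))  # → 0.0
print(sum(b))  # → 1.0
```

The two training regimes visit splits in different orders, so their impurity accumulations can diverge in exactly this way without either being wrong.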

Thank you for the very useful test case.

Please feel free to reopen should the issue re-emerge.