ryanbressler opened this issue 10 years ago
do you mean categorical variables with many samples or large cardinality?
On Wed, Apr 30, 2014 at 7:35 AM, Ryan Bressler notifications@github.com wrote:
BestSplitBigCatIter is too slow. Possibly investigate using fully randomized search earlier, or "Stochastic Greedy Algorithms: A learning based approach to combinatorial optimization" [1] as in rf-ace.
[1] http://www.thinkmind.org/index.php?view=article&articleid=soft_v4_n12_2011_1
— Reply to this email directly or view it on GitHub: https://github.com/ryanbressler/CloudForest/issues/38
I was unclear, but it is actually both: a user provided me with a data set with 200k+ samples consisting mostly of high-cardinality features. I attached a pprof screenshot, though I used -nCores 1 -nTrees 1 -mTry 1 -leafSize 1000 to get it to finish one tree in about a minute. The iterative search for high-cardinality features uses the vast majority of the time, and most of that is spent bit checking.
I need to see whether bit packing big ints is actually the most efficient strategy to use here (maybe preallocate a bool array instead?) and also whether there is a way to avoid repeatedly scanning the feature (maybe generate a random label order, sort the cases, and move labels from left to right as if it were a numerical feature?).
The email posting didn't include the attachments. Here they are:
Not sure why the markdown won't behave; click on the links, I guess.
A potentially useful theorem lets you perform a split on a categorical variable with L categories by evaluating only L-1 partitions (instead of 2^L - 1). See Theorem 3.4 on page 51: https://arxiv.org/pdf/1407.7502.pdf
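The theorem's trick is to order the categories by their mean target value and then score only the L-1 "prefix" partitions of that ordering, exactly like scanning a sorted numerical feature. A hedged sketch, with illustrative names rather than CloudForest's API:

```go
package main

import (
	"fmt"
	"sort"
)

// orderCategories returns the distinct categories in cats sorted by
// the mean of their target values. Per the theorem, the optimal
// binary partition for a regression (or two-class) target is one of
// the L-1 prefix cuts of this ordering, so only those need scoring.
func orderCategories(cats []int, target []float64) []int {
	sum := map[int]float64{}
	count := map[int]float64{}
	for i, c := range cats {
		sum[c] += target[i]
		count[c]++
	}
	order := make([]int, 0, len(sum))
	for c := range sum {
		order = append(order, c)
	}
	sort.Slice(order, func(i, j int) bool {
		return sum[order[i]]/count[order[i]] < sum[order[j]]/count[order[j]]
	})
	return order
}

func main() {
	cats := []int{0, 0, 1, 1, 2, 2}
	target := []float64{5, 7, 1, 1, 3, 3}

	// Means: cat 0 -> 6, cat 1 -> 1, cat 2 -> 3, so the order is [1 2 0].
	order := orderCategories(cats, target)
	fmt.Println(order)

	// Candidate splits are just the L-1 prefixes of that order; each
	// prefix is the category set sent to the left child.
	for i := 1; i < len(order); i++ {
		fmt.Println("left side:", order[:i])
	}
}
```

This would replace the exhaustive subset iteration in BestSplitBigCatIter for the cases the theorem covers; multiclass targets would still need the randomized or greedy search discussed above.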