Slow performance splitting on large categorical variables.

ryanbressler commented 10 years ago

BestSplitBigCatIter is too slow. Possibly investigate using fully randomized search earlier or "Stochastic Greedy Algorithms: A leaning based approach to combinatorial optimization" [1] as in rf-ace.

[1] http://www.thinkmind.org/index.php?view=article&articleid=soft_v4_n12_2011_1

rbkreisberg commented 10 years ago

do you mean categorical variables with many samples or large cardinality?

On Wed, Apr 30, 2014 at 7:35 AM, Ryan Bressler notifications@github.comwrote:

BestSplitBigCatIter is too slow. Possibly investigate using fully randomized search earlier or "Stochastic Greedy Algorithms: A leaning based approach to combinatorial optimization" [1] as in rf-ace.

[1] http://www.thinkmind.org/index.php?view=article&articleid=soft_v4_n12_2011_1

— Reply to this email directly or view it on GitHubhttps://github.com/ryanbressler/CloudForest/issues/38 .

ryanbressler commented 10 years ago

I was unclear but it is actually both, a user provided me with a data set with 200k+ samples mostly high cardinality features. I attached a pprof screen shot though I used -nCores 1 -nTrees 1 -mTry 1 -leafSize 1000 to get it to finish one tree in ~ a minute. The iterative search for high cardinality features is using the vast majority of the time and most of that doing bitchecking.

I need to see if bit packing big ints is actually the most efficient strategy to use here (maybe preallocate a bool array instead?) and also if there is a way to avoid repeatedly scanning the feature (maybe generate a random label order, sort the cases and move labels from left to right as if it was a numerical feature?).

Email posting didn't include attachments. Here they are: pprof-WebList

Not sure why the markdown won't behave, click on the links i guess.

AndrewHannigan commented 8 years ago

Potentially useful theorem which lets you perform a split on a categorical variable with L categories by only valuating L-1 (instead 2^L - 1) partitions. Theorem 3.4 on page 51: https://arxiv.org/pdf/1407.7502.pdf

ryanbressler / CloudForest

Slow performance splitting on large categorical variables. #38