GoogleCodeExporter opened 9 years ago
I didn't mean to submit this as a defect, by the way!
Original comment by nicolas....@gmail.com
on 7 Sep 2012 at 4:00
No issues. It's a limitation of almost all classifiers I know of :)
The package doesn't allow you to do that.
In theory, it may or may not be possible with random forests or decision trees in general.
The trees split the examples based on the tgini (Gini importance) value. For a new example, one could check whether the existing tree already classifies it correctly; if so, no change to the tree is (usually) required, and if not, some branches of the tree may need to be recreated.
Alternatively, one could recalculate the tgini value at each node the example passes through and, if the value exceeds some threshold, recreate that part of the tree. This would require saving the current tgini values at the nodes.
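As a rough illustration of that second idea (the node structure and names here are my own assumptions, not the package's internals), one could cache the impurity recorded at build time and flag a subtree for rebuilding when the impurity over newly routed examples degrades past a threshold:

```python
# Sketch only: route a new example down a tree, track the Gini impurity of
# examples seen at each node, and report when a subtree should be rebuilt.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    saved_gini: float = 0.0       # impurity recorded when the tree was built
    feature: int = -1             # split feature index (-1 marks a leaf)
    threshold: float = 0.0        # split threshold
    label: int = -1               # majority-class label at a leaf
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    labels: List[int] = field(default_factory=list)  # labels routed here

def gini(labels):
    """Gini impurity of the labels accumulated at a node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def update(node, x, y, drift_threshold=0.1):
    """Route (x, y) down the tree; return True if the impurity at some node
    on the path degraded enough that its subtree should be recreated."""
    node.labels.append(y)
    if gini(node.labels) - node.saved_gini > drift_threshold:
        return True               # impurity degraded: recreate this subtree
    if node.feature == -1:
        return False              # leaf still agrees: no change needed
    child = node.left if x[node.feature] <= node.threshold else node.right
    return update(child, x, y, drift_threshold)
```

Actually rebuilding the flagged subtree would additionally require keeping (or re-routing) the training examples at that node, which is why saving per-node state is unavoidable here.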
But then random forests are greedy and approximate, so you may be able to get away with a lot of approximation and avoid retraining altogether, unless you encounter many new training examples and the distribution of the data changes.
I vaguely remember a paper in the time-series domain where, in order to learn under drift, the authors retrain a batch of trees, discard a batch of trees from the original forest, and treat the resulting ensemble as a forest trained on the current set of data.
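The general shape of that forest-level strategy (a hypothetical sketch, not the paper's actual algorithm) is just a bounded pool of trees in which newly trained trees push out the oldest ones:

```python
# Sketch only: adapt a forest to drift by training trees on recent data and
# retiring the same number of old trees, keeping the ensemble size fixed.
from collections import deque

def make_drifting_forest(max_trees):
    # A deque with maxlen drops the oldest tree automatically on append.
    return deque(maxlen=max_trees)

def refresh(forest, recent_data, train_tree, n_new):
    """Train n_new trees on recent data; the oldest trees fall out."""
    for _ in range(n_new):
        forest.append(train_tree(recent_data))
    return forest
```

The appeal of this over per-node surgery is that each tree stays immutable; drift is handled purely at the ensemble level.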
Original comment by abhirana
on 7 Sep 2012 at 6:05
Thanks for the reply!
After I posted this, I stumbled upon MATLAB's built-in ensemble classifiers, and it appears they have a method to merge and append to bagged trees
(http://www.mathworks.co.uk/help/toolbox/stats/treebagger.append.html). I found the built-in methods to be much slower than your implementation, but I guess it might be worth a try. The documentation is very succinct; any idea how these methods work?
Cheers,
Nicolas
Original comment by nicolas....@gmail.com
on 7 Sep 2012 at 6:10
Another quick question: I trained some very large classifiers (~10 million examples, 28 features). The memory footprint is pretty big (it easily fills my 16 GB of RAM). I was wondering whether there are fields in the structure that I could safely discard to reduce the size in memory?
Original comment by nicolas....@gmail.com
on 7 Sep 2012 at 6:17
Cool.
Looks like they are doing tree-level merging to create a new forest, like the paper I mentioned about time-series random forests. I don't think they are removing existing trees from the forest, only adding more trees.
I could add the append/remove functions if you want? Something like append/remove(forest1, forest2, forest1_tree_indices, forest2_tree_indices).
The issue then will be how to calculate the OOB error, which depends on which training examples were used and cannot be propagated unless the training examples are the same.
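A hypothetical sketch of such an interface, assuming a forest is little more than a list of trees plus some bookkeeping (this is not the package's actual representation):

```python
# Sketch only: merge selected trees from two forests into a new forest.
# The OOB error cannot be carried over unless both forests were trained on
# the same examples, so it is reset to None here.

def forest_append(forest1, forest2, idx1=None, idx2=None):
    """Build a new forest from the selected trees of two existing forests."""
    idx1 = range(len(forest1["trees"])) if idx1 is None else idx1
    idx2 = range(len(forest2["trees"])) if idx2 is None else idx2
    merged = {
        "trees": [forest1["trees"][i] for i in idx1] +
                 [forest2["trees"][i] for i in idx2],
        # OOB bookkeeping depends on which examples each tree held out,
        # so the merged forest's OOB error must be recomputed, not copied.
        "oob_error": None,
    }
    # Only when both forests share a training set does OOB even make sense.
    if forest1.get("train_id") == forest2.get("train_id"):
        merged["train_id"] = forest1.get("train_id")
    return merged
```

A matching `forest_remove` would simply be `forest_append` of one forest with a subset of its own indices.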
About the memory requirements: right now there is no straightforward fix. The issue is that the arrays are stored with size ntree x num_nodes, where num_nodes is the maximum node count, and most trees are much smaller than that, so memory is wasted. If the arrays were instead stored as num_nodes x ntree, one could dynamically grow the node dimension when a tree gets bigger. Actually, come to think of it, append/remove may not even be possible due to this memory arrangement.
I can give it a try if you cannot find a reasonable solution.
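A toy calculation (with made-up node counts) of how much the padded ntree x num_nodes layout wastes compared to storing each tree at its own size:

```python
# Sketch only: compare cells allocated by a rectangular, padded layout
# against the cells the trees actually need.

def padded_cells(tree_sizes):
    """Cells allocated when every tree's arrays are padded to the largest tree."""
    return len(tree_sizes) * max(tree_sizes)

def needed_cells(tree_sizes):
    """Cells actually used if each tree stored only its own nodes."""
    return sum(tree_sizes)

# Hypothetical node counts for five trees: one deep tree pads out the rest.
sizes = [120, 95, 400, 88, 102]
waste = padded_cells(sizes) - needed_cells(sizes)
```

With one 400-node tree among much smaller ones, roughly 60% of the allocation is padding, which is the effect described above at the scale of thousands of trees.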
Original comment by abhirana
on 7 Sep 2012 at 6:26
Well, if you could give it a try, that would be great! I will evaluate the built-in ensemble methods, but sadly my preliminary tests were not very encouraging.
Original comment by nicolas....@gmail.com
on 7 Sep 2012 at 6:43
OK, let me give that a try; it will take me about a week or so to get it done.
Original comment by abhirana
on 7 Sep 2012 at 6:47
Thanks, much appreciated!
Original comment by nicolas....@gmail.com
on 17 Sep 2012 at 5:13
Abhirana,
I was wondering if you had any chance to try implementing such a feature?
Thanks!
Nicolas
Original comment by nicolas....@gmail.com
on 17 Jan 2013 at 1:43
Thanks for the reminder, Nicolas, and I apologize for the delay.
I found that it will require a major rewrite of the code, and I haven't had a chance to make and test such major changes yet. (I am planning to defend my PhD in a month or two, so that's what is distracting me for the time being.)
I'll make it a priority to get it done soon. I am guessing that for the time being I can implement it for classification (do tell me if regression is more important). Is support for categorical data required for now? And how about the implementation for variable importance?
Original comment by abhirana
on 21 Jan 2013 at 5:19
Thanks for taking the time to work on this. I only use classification; I do not use categorical data, and importance is not critical for my work.
Good luck with the PhD defense, my turn is only a few months away :)
Original comment by nicolas....@gmail.com
on 21 Jan 2013 at 9:43
Original issue reported on code.google.com by
nicolas....@gmail.com
on 7 Sep 2012 at 3:59