sanity / quickml

A fast and easy to use decision tree learner in java
http://quickml.org/
GNU Lesser General Public License v3.0
231 stars 54 forks source link

Higher computation time for new versions #67

Open ADiegoCAlonso opened 9 years ago

ADiegoCAlonso commented 9 years ago

I am comparing two versions of quickdt, namely:

In the following plot, we have the old version (0.0.8.11) in blue and a new one (0.2.2.1) in red. In the old version, the computation time is linear w.r.t. the number of samples; but in the newer version, the time grows exponentially.

computation_time

Any idea that justifies this result?

sanity commented 9 years ago

@athawk81 Any ideas?

florianlaws commented 9 years ago

I am a colleague of Diego and would like to follow up on the issue. It seems that this is probably not a memory usage issue. I created a sampling profile using jvisualvm and it looks like the code is spending a lot of its time in TreeBuilder.createNClassCategoricalNode(). We're using quite a few features that correspond to words in a text, i.e. categorical features with a lot of levels. I wonder if there is some inefficiencies inside that method. Note that this is still on version 0.2.2.1. I'll try to reproduce that using the most current version when I have the time. screen shot 2015-06-08 at 12 10 08

sanity commented 9 years ago

Florian, I would definitely encourage you to try again with the latest version, as there have been many improvements. If since version 0.2.2.1, including efficiency improvements.

If you still have a problem please let us know.

Kind regards,

Ian.

On Mon, Jun 8, 2015 at 5:10 AM, Florian Laws notifications@github.com wrote:

I am a colleague of Diego and would like to follow up on the issue. It seems that this is probably not a memory usage issue. I created a sampling profile using jvisualvm and it looks like the code is spending a lot of its time in TreeBuilder.createNClassCategoricalNode(). We're using quite a few features that correspond to words in a text, i.e. categorical features with a lot of levels. I wonder if there is some inefficiencies inside that method. Note that this is still on version 0.2.2.1. I'll try to reproduce that using the most current version when I have the time. [image: screen shot 2015-06-08 at 12 10 08] https://cloud.githubusercontent.com/assets/498420/8032606/4f88bb64-0dd7-11e5-9ed4-78756437ba83.png

— Reply to this email directly or view it on GitHub https://github.com/sanity/quickml/issues/67#issuecomment-109941269.

Ian Clarke Blog: http://blog.locut.us/

florianlaws commented 9 years ago

I'm not too thrilled that later versions dropped support for bagging, but we'll try.

sanity commented 9 years ago

@athawk81 can comment on why we did that, but I believe that in our testing on a variety of datasets it didn't help predictive accuracy. There have been many improvements to predictive accuracy in more recent versions, including a flexible hyper-parameter optimizer that you should try.

sanity commented 9 years ago

Ok, we discussed it, here is the situation:

There have been a number of bugfixes, some of which have probably resulted in an increase in training time, but with significant benefits to predictive accuracy.

Our usecase for QuickML is using it for 2-class classification, and so the approach it uses for this (see TreeBuilder.createTwoClassCategoricalNode()) is quite a bit more efficient than TreeBuilder.createNClassCategoricalNode(), which builds the "inset" progressively based on the impact on the split's score.

Regarding bagging, this would be easy enough to add back, but if you are in a position to contribute improvements, I might hold-off for a few days as we have a significant refactor that's likely to be merged soon.

florianlaws commented 9 years ago

Testing with quickml 0.7.14 shows that training times are still much longer than with 0.0.8.11, too long for us to be useful. So we don't really need bagging until the training time issues are solved.

sanity commented 9 years ago

@florianlaws Looks like the solution here is to reimplement createNClassCategoricalNode() in a more efficient way. Unfortunately our focus is elsewhere right now and so we don't have the resource to allocate to this, but it shouldn't be a significant amount of work (I would guess less than a day for a proficient Java programmer - although we'd need to decide on an algorithm).

Do you have anyone that could take a whack at it? We'd certainly provide whatever assistance we can.