hugobuddel / orange3

Orange fork to add data pulling

Decision tree with data pulling. #14

Open DavidWilliams81 opened 10 years ago

DavidWilliams81 commented 10 years ago

We need a new decision tree widget that takes advantage of the LazyTable and data pulling. Orange 3 uses scikit-learn for most of its classification algorithms, but that is not suitable for us because scikit-learn does not operate on such 'infinite' data sources.

We will implement our own decision tree learner in Python and have it operate directly on LazyTable and/or related classes. This can be a multi-step process:

  1. Implement a standard decision tree learner operating on fixed-size Orange tables. This serves to test the algorithms related to entropy calculation and split-point selection (a minimal sketch appears after this list).
  2. Implement a standard decision tree classifier. This takes the tree from the previous step and applies it to new data.
  3. Extend the classifier to use the LazyTable. This means it should only request the attributes which are being used for classification, and it should only request them for the instances which need to use them. Some of this should come for free from the LazyTable, but it will be the first demonstration of something new.
  4. Extend the learner to use the LazyTable. This means it should not consider every attribute when splitting the dataset, but should instead obtain an estimate of how much each attribute costs to compute, so that it can split on cheaper attributes first (see the cost-aware sketch after this list).
  5. Consider the use of incremental decision tree construction. The idea here is that the tree can evolve over time as new data becomes available. This seems pretty powerful and would tie in really nicely with the concepts in our project, but I don't know how complex it is.
  6. There is currently no decision tree visualization in Orange 3. For practical purposes it would be useful to have one, and we should consider whether we can do something novel here.
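For step 1, here is a minimal sketch of the core calculations, using plain NumPy arrays as a stand-in for an Orange `Table`; the function names are ours, not part of any Orange API:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_split(X, y):
    """Exhaustively search for the (attribute, threshold) pair with the
    highest information gain; returns (attribute, threshold, gain)."""
    base = entropy(y)
    best = (None, None, 0.0)
    for attr in range(X.shape[1]):
        values = np.unique(X[:, attr])
        # Candidate thresholds: midpoints between consecutive sorted values.
        for t in (values[:-1] + values[1:]) / 2.0:
            mask = X[:, attr] <= t
            left, right = y[mask], y[~mask]
            children = (len(left) * entropy(left)
                        + len(right) * entropy(right)) / len(y)
            if base - children > best[2]:
                best = (attr, t, base - children)
    return best
```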
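And for step 4, one possible shape for the cost-aware variant. This is only a sketch under the assumption that the data source can estimate per-attribute computation costs; `cost_of` and the fixed-budget policy are placeholders, not an agreed interface:

```python
def best_cheap_split(X, y, cost_of, budget):
    """Like best_split, but considers attributes in order of increasing
    estimated cost and stops once the budget is spent.

    cost_of(attr) stands in for whatever cost estimate the LazyTable
    (or the data source behind it) can provide.
    """
    best, spent = (None, None, 0.0), 0.0
    for attr in sorted(range(X.shape[1]), key=cost_of):
        spent += cost_of(attr)
        if spent > budget:
            break  # every remaining attribute is at least as expensive
        # Reuse the step-1 search on the single column.
        _, threshold, gain = best_split(X[:, [attr]], y)
        if gain > best[2]:
            best = (attr, threshold, gain)
    return best
```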

Any further thoughts?

hugobuddel commented 10 years ago

Clear terminology: classifier and learner.

Extending the classifier to use the LazyTable seems a bit weird for a decision tree. Since the decision algorithm is just a sequence of comparisons on the attributes, the only thing necessary to apply the classifier would be to request all the objects matching the criterion constructed from the whole tree. This would effectively run the classifier inside the database as an SQL query; a toy sketch of such a translation follows below.
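To make that concrete, here is a toy translation of a tree into a single SQL `CASE` expression. The node layout and the column names ('mag', 'redshift') are made up for the example, not Orange's actual tree representation:

```python
def tree_to_sql(node):
    """Compile a decision tree into one SQL CASE expression so the whole
    classification runs inside the database.

    Assumed node layout (a placeholder): a leaf is a class label, an
    internal node is (attribute_name, threshold, left_subtree, right_subtree).
    """
    if not isinstance(node, tuple):
        return repr(node)  # leaf: quote the predicted class label
    attr, threshold, left, right = node
    return "CASE WHEN {} <= {} THEN {} ELSE {} END".format(
        attr, threshold, tree_to_sql(left), tree_to_sql(right))

# Hypothetical two-level tree over columns 'mag' and 'redshift'.
tree = ("mag", 20.0, "star", ("redshift", 0.5, "galaxy", "quasar"))
print("SELECT *, " + tree_to_sql(tree) + " AS class FROM objects")
```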

Alternatively, we could assume that each step in the classifier is very complex and has to be done in Orange. This would match the idea of a classifier network. Then we could iterate over all the objects (rows) and request (difficult-to-calculate) attributes as we need them. However, requesting attributes per individual object is costly, so we want to group objects together. Yet creating such a grouping is only easy when the classification is easy, which violates our assumption. Perhaps we need to think this through some more.

Similar problems arise for the learner, but perhaps we can just see how far we get. If we can somehow guarantee that subsets are representative of the whole table, then we can probably 'group' objects automatically to get efficient calculations and data transfers. Furthermore, I think it wouldn't really matter for the InfiniTable, so we can already develop the lazy learner and lazy classifier using the InfiniTable (or perhaps LazyFile).

The incremental decision tree seems potentially very powerful.