sylvaticus / BetaML.jl

Beta Machine Learning Toolkit
MIT License

Trouble interpolating feature names in a wrapped tree #48

Closed: ablaom closed this issue 1 year ago

ablaom commented 1 year ago

What am I missing here?

using MLJ
import BetaML.Trees
import DataFrames as DF

table = OpenML.load(42638)
df = DF.select(DF.DataFrame(table), DF.Not(:cabin))

cleaner = FillImputer()
machc = machine(cleaner, df) |> fit!
dfc = transform(machc, df)

y, X = unpack(dfc, ==(:survived))

Tree = @load DecisionTreeClassifier pkg=BetaML
tree = Tree(max_depth=3)
mach = machine(tree, X, y) |> fit!

raw_tree = fitted_params(mach).fitresult[1]
wrapped_tree = Trees.wrap(raw_tree, (feature_names=DF.names(X),))

# 2 == female?
# ├─ 1 == 3?
# │  ├─ "1" => 0.5
# │  │  "0" => 0.5
# │  │
# │  └─ "1" => 0.9470588235294117
# │     "0" => 0.052941176470588235
# │
# └─ 3 >= 7.0?
#    ├─ "1" => 0.16817359855334538
#    │  "0" => 0.8318264014466547
#    │
#    └─ "1" => 0.6666666666666666
#       "0" => 0.3333333333333333

cc @roland-KA

roland-KA commented 1 year ago

It seems to me that the renaming of the keyword argument for feature names of wrap, which @sylvaticus introduced with commit 86a003f, is causing some confusion. In DecisionTree.jl the keyword argument for this purpose is called featurenames.

In BetaML it somehow became feature_names, and then, with the above-mentioned commit, features_names (an additional s). But the documentation for wrap still says featurenames, and the example above uses feature_names. I.e. the InfoNode created by wrap in the example has the list of names in an attribute called feature_names, but printnode is looking for an attribute called features_names.

So we have every possible combination and a bit of chaos 😀.
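
To illustrate the failure mode with a small sketch (the names and the NamedTuple here are only illustrative, not the actual BetaML internals):

info = (feature_names = ["sex", "pclass", "fare"],)  # what the example above stored
haskey(info, :features_names)  # false: the key printnode looked for, hence the fallback to column indices
haskey(info, :feature_names)   # true: the key that was actually stored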

My suggestion is to go back to featurenames, in order to be consistent with DecisionTree.jl (and the documentation).

sylvaticus commented 1 year ago

Sorry, I missed the original comment notification. I'll look into this tomorrow...

ablaom commented 1 year ago

@roland-KA Thanks for looking into this and for the diagnosis.

sylvaticus commented 1 year ago

Thanks @ablaom for reporting and @roland-KA for digging into the cause of the issue. I followed your suggestion and just reset it to "featurenames". This should be in the newly released v0.9.6.

roland-KA commented 1 year ago

As this issue shows, it is quite easy to run into trouble when using wrap. So I'm thinking about adding a parameter check to each wrap implementation that verifies that only the keywords featurenames and classnames are used. It could throw an ArgumentError if something is wrong.
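
For illustration, a minimal sketch of what such a check might look like (check_wrap_kwargs is a hypothetical name, not an actual BetaML function):

function check_wrap_kwargs(info::NamedTuple)
    allowed = (:featurenames, :classnames)
    for k in keys(info)
        # reject any keyword outside the supported set
        k in allowed || throw(ArgumentError(
            "unsupported keyword `$k`; only `featurenames` and `classnames` are allowed"))
    end
end

check_wrap_kwargs((featurenames = ["dim1", "dim2"],))   # passes
check_wrap_kwargs((feature_names = ["dim1", "dim2"],))  # throws ArgumentError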

@ablaom , @sylvaticus What's your opinion about this?

ablaom commented 1 year ago

Sounds like a good idea.

ablaom commented 1 year ago

Mmm. I'm still pretty confused. Now I don't get any nice printout at all, just this:

julia> wrapped_tree = Trees.wrap(raw_tree, (featurenames=DF.names(X),))
A wrapped Decision Tree

Same if I use feature_names.

sylvaticus commented 1 year ago

I understood that the wrap function was intended for plotting only, not for printing. The decision tree is already printed in full (but without feature names) when the DecisionTreeEstimator is explicitly printed, but I may have misunderstood the needs. If there is a need to get the tree printed, other than plotted, perhaps at this point it is better if I add another parameter featurenames directly in the estimator constructor... what do you think?

roland-KA commented 1 year ago

@sylvaticus you are right, the wrap function was intended for plotting only. But the plot recipe also uses AbstractTrees.printnode (which is implemented together with each wrap version). And the AbstractTrees.print_tree function is based on printnode. So it is also possible to print a text-based version of the tree using print_tree.
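
To sketch that mechanism (ToyNode below is a toy stand-in, not BetaML's actual InfoNode), both the plot recipe and print_tree are driven by the same two AbstractTrees methods:

import AbstractTrees

struct ToyNode
    label::String
    children::Vector{ToyNode}
end

# Traversal goes through `children`; node labels come from `printnode`:
AbstractTrees.children(n::ToyNode) = n.children
AbstractTrees.printnode(io::IO, n::ToyNode) = print(io, n.label)

tree = ToyNode("dim2 >= 18.0?", [ToyNode("-27.2", ToyNode[]), ToyNode("3.4", ToyNode[])])
AbstractTrees.print_tree(tree)  # prints the question node with its two leaves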

roland-KA commented 1 year ago

> Mmm. I'm still pretty confused. Now I don't get any nice printout at all, just this:
>
> julia> wrapped_tree = Trees.wrap(raw_tree, (featurenames=DF.names(X),))
> A wrapped Decision Tree
>
> Same if I use feature_names.

@ablaom How did you print the text-based version? Using AbstractTrees.print_tree?

show doesn't use the wrap logic, so just printing the wrapped_tree won't show the feature names.

sylvaticus commented 1 year ago

To extend @roland-KA's answer, this works:

julia> using BetaML

julia> X = [1.8 2.5; 0.5 20.5; 0.6 18; 0.7 22.8; 0.4 31; 1.7 3.7];

julia> y = 2 .* X[:,1] .- X[:,2] .+ 3;

julia> mod = DecisionTreeEstimator(max_depth=10)
DecisionTreeEstimator - A Decision Tree model (unfitted)

julia> ŷ   = fit!(mod,X,y);

julia> hcat(y,ŷ)
6×2 Matrix{Float64}:
   4.1    3.4
 -16.5  -17.45
 -13.8  -13.8
 -18.4  -17.45
 -27.2  -27.2
   2.7    3.4

julia> println(mod)
DecisionTreeEstimator - A Decision Tree regressor (fitted on 6 records)
Dict{String, Any}("job_is_regression" => 1, "fitted_records" => 6, "max_reached_depth" => 4, "avg_depth" => 3.25, "xndims" => 2)
*** Printing Decision Tree: ***

1. BetaML.Trees.Question{Float64}(2, 18.0)
--> True :
                1.2. BetaML.Trees.Question{Float64}(2, 31.0)
                --> True :  -27.2
                --> False:
                        1.2.3. BetaML.Trees.Question{Float64}(2, 20.5)
                        --> True :  -17.450000000000003
                        --> False:  -13.8
--> False:  3.3999999999999995

julia> wmod = wrap(mod,featurenames=["dim1","dim2"])
A wrapped Decision Tree

julia> import AbstractTrees:print_tree

julia> print_tree(wmod)
dim2 >= 18.0?
├─ dim2 >= 31.0?
│  ├─ -27.2
│  │  
│  └─ dim2 >= 20.5?
│     ├─ -17.450000000000003
│     │  
│     └─ -13.8
│        
└─ 3.3999999999999995

(I modified the docstring to mention print_tree.)
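
Applied back to the MLJ example from the first post (reusing raw_tree and X from there, and assuming BetaML >= v0.9.6), the missing step was the explicit print_tree call:

import AbstractTrees: print_tree

# wrap with the corrected keyword, then print via AbstractTrees:
wrapped_tree = Trees.wrap(raw_tree, (featurenames=DF.names(X),))
print_tree(wrapped_tree)  # renders the tree with the actual feature names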

ablaom commented 1 year ago

@sylvaticus @roland-KA Thanks for the detailed explanations. I must have been sloppy with my first post and dropped the print_tree. I apologise for not checking this more carefully - very bad form.

roland-KA commented 1 year ago

No problem, we are here to clarify and explain things 🤓