Closed ablaom closed 2 years ago
Maybe you are making some assumptions about the raw type of the categorical arrays you accept?
Yes, I do (the assumption is `String`) in `trees.jl` (which extends to `nodes.jl`):
mutable struct OneTree
    feature_name
    nodes :: Dict{String, OneNode}
What is the best (or expected) behavior? To make no assumptions at all about the raw type?
Yes, you should not make any assumption about the raw type. So you could make your `OneTree` parametric on the type of your `CategoricalValue`s (and use generic code to construct them), or convert the values to `String` using `string`.
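A minimal sketch of the two options, assuming only CategoricalArrays. `SketchNode`, `SketchTree`, `sketch_tree` and `string_keys` are hypothetical names made up for illustration and are not the actual OneRule.jl types:

```julia
using CategoricalArrays

# Hypothetical stand-in for the package's node type; the field is illustrative.
struct SketchNode
    count::Int
end

# Option 1: parametrize the tree on the raw type T of the categorical values.
mutable struct SketchTree{T}
    feature_name::Symbol
    nodes::Dict{T, SketchNode}
end

# Generic constructor: the keys keep whatever raw type the column carries.
function sketch_tree(name::Symbol, column::CategoricalVector{T}) where {T}
    nodes = Dict{T, SketchNode}()
    for value in column
        key = unwrap(value)  # raw value behind the CategoricalValue
        previous = get(nodes, key, SketchNode(0))
        nodes[key] = SketchNode(previous.count + 1)
    end
    return SketchTree{T}(name, nodes)
end

# Option 2: keep String keys and convert every raw value with `string`.
string_keys(column::CategoricalVector) = [string(unwrap(v)) for v in column]

column = categorical([1, 2, 2, 3])       # raw type is Int, not String
tree = sketch_tree(:x, column)           # keys are Int
keys_as_strings = string_keys(column)    # keys are String
```

Option 1 avoids any conversion but makes every type that holds the keys parametric; option 2 leaves the existing `Dict{String, ...}` fields untouched at the cost of one `string` call per value.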
I'm assuming that you preserve the raw type in your target predictions? I mean, the pools of the training target and the pools of predictions should match, which implicitly requires this.
Or you can encode using the reference integers of the categorical array, which you can get with `MMI.int` and, if needed, decode using `MMI.decoder`:
https://alan-turing-institute.github.io/MLJ.jl/dev/working_with_categorical_data/#Extracting-an-integer-representation-of-Finite-data
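A rough sketch of that round trip, following the pattern in the linked MLJ manual section. It assumes MLJBase is loaded alongside MLJModelInterface so that the full interface is available; the sample data is made up:

```julia
import MLJModelInterface
const MMI = MLJModelInterface
import MLJBase            # assumption: loading MLJBase makes the full interface available
using CategoricalArrays

raw = categorical([:low, :high, :mid, :low])  # raw type is Symbol, not String

codes = MMI.int(raw)          # integer code for each element
d = MMI.decoder(raw[1])       # decoder built from one element of the pool
decoded = d.(codes)           # back to categorical values with the original pool
```

Because the model then only ever handles integers, nothing in it depends on the raw type, and the decoded predictions share the pool of the training data.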
Thanks for the hints! I've chosen the variant converting the values to `String` (see: https://github.com/roland-KA/OneRule.jl/commit/88321f53d4692b598824cfa921978d99204ccb3a). This was the easiest way to go. For larger datasets, using `int` and `decoder` might be more performant, but as a first approach I think it's OK.
Apart from this, I also added a testset to check whether other base types work (see: https://github.com/roland-KA/OneRule.jl/commit/1ddb6c43e0c415d86351eff56f0da5736d520439).
So I hope that the first version for use within MLJ is now really ready 😊.
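For illustration only, a testset of this kind might look roughly like the sketch below. It is not the test from the linked commit; `key_table` is a hypothetical helper that maps each `String` key back to the categorical value it came from, so the check also covers the point above about predictions sharing the pool of the training target:

```julia
using Test
using CategoricalArrays

# Hypothetical helper, not part of OneRule.jl: String key => original CategoricalValue.
function key_table(y::CategoricalVector)
    table = Dict{String, eltype(y)}()
    for v in y
        table[string(unwrap(v))] = v
    end
    return table
end

@testset "raw types other than String" begin
    samples = ([1, 2, 2, 3], [:a, :b, :b, :c], ['x', 'y', 'y', 'z'], [1.5, 2.5, 2.5, 3.5])
    for raw in samples
        y = categorical(raw)
        table = key_table(y)
        @test length(table) == 3            # one entry per distinct raw value
        yhat = table[string(raw[2])]        # look up via the String key
        @test unwrap(yhat) == raw[2]        # raw value and raw type survive
        @test levels(yhat) == levels(y)     # prediction shares the training pool
    end
end
```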
That's great, thanks. I've checked this, so all good. Can you please tag a new (patch) release? I'll go ahead with the registry update.