roland-KA / OneRule.jl

Implementation of the 1-Rule data mining algorithm using the Julia programming language
MIT License

Problem training with `Char` elements in my categorical arrays #4

Closed. ablaom closed this issue 2 years ago.

ablaom commented 2 years ago

@roland-KA Maybe you are making some assumptions about the raw type of the categorical arrays you accept?

using MLJBase, OneRule

x1 = coerce(rand("ab", 100), Multiclass);
x2 = coerce(rand("cde", 100), Multiclass)
X = (; x1, x2)
y = x1

julia> schema(X)
┌───────┬───────────────┬────────────────────────────────┐
│ names │ scitypes      │ types                          │
├───────┼───────────────┼────────────────────────────────┤
│ x1    │ Multiclass{2} │ CategoricalValue{Char, UInt32} │
│ x2    │ Multiclass{3} │ CategoricalValue{Char, UInt32} │
└───────┴───────────────┴────────────────────────────────┘

model = OneRuleClassifier()

julia> machine(model, X, y)|>fit!
[ Info: Training Machine{OneRuleClassifier,…}.
┌ Error: Problem fitting the machine Machine{OneRuleClassifier,…}. 
└ @ MLJBase ~/.julia/packages/MLJBase/rMXo2/src/machines.jl:553
[ Info: Running type checks... 
[ Info: Type checks okay. 
ERROR: MethodError: Cannot `convert` an object of type Char to an object of type String
Closest candidates are:
  convert(::Type{T}, ::PyCall.PyObject) where T<:AbstractString at ~/.julia/packages/PyCall/7a7w0/src/conversions.jl:92
  convert(::Type{T}, ::StringEncodings.Encodings.Encoding{enc}) where {T<:AbstractString, enc} at ~/.julia/packages/StringEncodings/hHXRr/src/encodings.jl:17
  convert(::Type{S}, ::CategoricalArrays.CategoricalValue) where S<:Union{AbstractChar, AbstractString, Number} at ~/.julia/packages/CategoricalArrays/4xBnG/src/value.jl:92
  ...
Stacktrace:
  [1] setindex!(h::Dict{String, OneRule.OneNode}, v0::OneRule.OneNode, key0::Char)
    @ Base ./dict.jl:373
  [2] get_nodes(column::CategoricalArrays.CategoricalVector{Char, UInt32, Char, CategoricalArrays.CategoricalValue{Char, UInt32}, Union{}}, target::CategoricalArrays.CategoricalVector{Char, UInt32, Char, CategoricalArrays.CategoricalValue{Char, UInt32}, Union{}}, target_labels::Vector{Char})
    @ OneRule ~/.julia/packages/OneRule/CHbXO/src/nodes.jl:31
  [3] all_trees(X::NamedTuple{(:x1, :x2), Tuple{CategoricalArrays.CategoricalVector{Char, UInt32, Char, CategoricalArrays.CategoricalValue{Char, UInt32}, Union{}}, CategoricalArrays.CategoricalVector{Char, UInt32, Char, CategoricalArrays.CategoricalValue{Char, UInt32}, Union{}}}}, y::CategoricalArrays.CategoricalVector{Char, UInt32, Char, CategoricalArrays.CategoricalValue{Char, UInt32}, Union{}})
    @ OneRule ~/.julia/packages/OneRule/CHbXO/src/trees.jl:38
  [4] get_best_tree
    @ ~/.julia/packages/OneRule/CHbXO/src/trees.jl:20 [inlined]
  [5] fit(model::OneRule.OneRuleClassifier, verbosity::Int64, X::NamedTuple{(:x1, :x2), Tuple{CategoricalArrays.CategoricalVector{Char, UInt32, Char, CategoricalArrays.CategoricalValue{Char, UInt32}, Union{}}, CategoricalArrays.CategoricalVector{Char, UInt32, Char, CategoricalArrays.CategoricalValue{Char, UInt32}, Union{}}}}, y::CategoricalArrays.CategoricalVector{Char, UInt32, Char, CategoricalArrays.CategoricalValue{Char, UInt32}, Union{}})
    @ OneRule ~/.julia/packages/OneRule/CHbXO/src/OneRule_MLJ.jl:13
  [6] fit_only!(mach::Machine{OneRule.OneRuleClassifier, true}; rows::Nothing, verbosity::Int64, force::Bool)
    @ MLJBase ~/.julia/packages/MLJBase/rMXo2/src/machines.jl:551
  [7] fit_only!
    @ ~/.julia/packages/MLJBase/rMXo2/src/machines.jl:504 [inlined]
  [8] #fit!#60
    @ ~/.julia/packages/MLJBase/rMXo2/src/machines.jl:618 [inlined]
  [9] fit!
    @ ~/.julia/packages/MLJBase/rMXo2/src/machines.jl:616 [inlined]
 [10] |>(x::Machine{OneRule.OneRuleClassifier, true}, f::typeof(fit!))
    @ Base ./operators.jl:966
 [11] top-level scope
    @ REPL[28]:1
 [12] top-level scope
    @ ~/.julia/packages/CUDA/Uurn4/src/initialization.jl:52
roland-KA commented 2 years ago

> Maybe you are making some assumptions about the raw type of the categorical arrays you accept?

Yes, I do (the assumption is `String`), in trees.jl (which carries over to nodes.jl):

mutable struct OneTree
    feature_name                              # name of the feature the rule is based on
    nodes         :: Dict{String, OneNode}    # feature value (as String) => node
    # ... remaining fields omitted
end
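
(That `Dict{String, OneNode}` is exactly where the `MethodError` above comes from; reduced to a stand-alone example, independent of OneRule:)

d = Dict{String, Int}()
d['a'] = 1    # ERROR: MethodError: Cannot `convert` an object of type Char to an object of type String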

What is the best (or expected) behavior? To make no assumptions at all about the raw type?

ablaom commented 2 years ago

Yes, you should not make any assumptions about the raw type. So you could either make your `OneTree` parametric in the type of your `CategoricalValue`s (and use generic code to construct them), or convert the values to `String` using `string`.

I'm assuming that you preserve the raw type in your target predictions? That is, the pool of the training target and the pool of the predictions should match, which implicitly requires this.
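
For concreteness, a rough sketch of the parametric option (illustrative names only, not the actual OneRule.jl structs):

using CategoricalArrays

# Keep the raw label type T generic, so Char, String, Symbol, ... all work:
struct SketchNode{T}
    prediction :: T
end

struct SketchTree{T}
    feature_name :: String
    nodes        :: Dict{T, SketchNode{T}}   # feature value => node
end

# Build keys generically from a categorical column, preserving the raw type:
col = categorical(['a', 'b', 'a'])
raw = unwrap.(col)                            # Vector{Char}: ['a', 'b', 'a']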

ablaom commented 2 years ago

Or you can encode using the reference integers of the categorical array, which you can get with `MMI.int` and, if needed, decode using `MMI.decoder`: https://alan-turing-institute.github.io/MLJ.jl/dev/working_with_categorical_data/#Extracting-an-integer-representation-of-Finite-data
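
Roughly like this (illustrative snippet, with `MMI = MLJModelInterface`):

using MLJModelInterface
const MMI = MLJModelInterface
using CategoricalArrays

y    = categorical(['a', 'b', 'a', 'b'])
yint = MMI.int(y)          # reference integers (UInt32 by default)
d    = MMI.decoder(y)      # callable mapping reference integers back to levels
d(yint) == y               # true: round-trips to the original categorical values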

roland-KA commented 2 years ago

Thanks for the hints! I've chosen the variant of converting the values to `String` (see: https://github.com/roland-KA/OneRule.jl/commit/88321f53d4692b598824cfa921978d99204ccb3a), as it was the easiest way to go. For larger datasets, using `int` and `decoder` might be more performant, but as a first approach I think it's ok.
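
In essence the change boils down to something like this (simplified sketch; see the commit for the actual diff):

feature_col = string.(X.x1)       # CategoricalVector{Char} -> Vector{String}
labels      = string.(levels(y))  # target labels as Strings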

Apart from this, I also added a testset to check whether other base types work (see: https://github.com/roland-KA/OneRule.jl/commit/1ddb6c43e0c415d86351eff56f0da5736d520439).
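
The testset is roughly of this shape (illustrative, not the exact code from the commit):

using Test, MLJBase, OneRule

@testset "non-String raw label types" begin
    for raw in (rand("ab", 50), rand([:yes, :no], 50), rand(1:3, 50))
        x = coerce(raw, Multiclass)
        mach = machine(OneRuleClassifier(), (; x), x)
        fit!(mach, verbosity=0)
        @test length(predict(mach, (; x))) == length(x)
    end
end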

So I hope that the first version for use within MLJ is now really ready 😊.

ablaom commented 2 years ago

That's great, thanks. I've checked this, so all good. Can you please tag a new (patch) release? I'll then go ahead with the registry update.