Open ablaom opened 2 years ago
Hi @ablaom, thank you very much for the detailed analysis of the OneRule code! I had already suspected that the use of DataFrames at that point might be a problem, but I had no idea how or in which way. So this is really a great help. I will now create a new release of OneRule to address all the points you mentioned.

I have one additional question with respect to the use of different data types for `X` and `y`:
As you've mentioned above, the features in `X` are processed column by column in `get_best_tree`. Then `get_nodes` (from `nodes.jl`) is called, passing one feature column and the target vector to this function. `get_nodes` expects an `AbstractVector` for this data:

```julia
function get_nodes(column::AbstractVector, target::AbstractVector) ...
```
Let's assume that I implement an additional variant of `get_nodes` (applying Julia's dispatch mechanism) that accepts only `CategoricalVector`s for these two parameters:

```julia
function get_nodes(column::CategoricalVector, target::CategoricalVector) ...
```

Would that be ok for MLJ (as there is one variant accepting `AbstractVector`s), or do you see any issues with this?
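For illustration, such a dispatch-based specialisation might look like the following sketch (the function body and name `get_nodes_demo` are hypothetical; only the pair of signatures reflects the question above):

```julia
using CategoricalArrays

# Generic fallback: works for any AbstractVector, so MLJ can always
# dispatch to this method, whatever vector type it passes.
function get_nodes_demo(column::AbstractVector, target::AbstractVector)
    return "generic"
end

# Specialised method: Julia's dispatch selects this one whenever both
# arguments are bona fide CategoricalVectors, so level-based
# optimisations can be applied here.
function get_nodes_demo(column::CategoricalVector, target::CategoricalVector)
    return "categorical"
end
```

Because `CategoricalVector <: AbstractVector`, the generic method still catches any vector of `CategoricalValue`s that happens not to be a `CategoricalVector`.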
This is just a theoretical question for OneRule. But I'm currently working on another model that also processes `X` column-wise. And some benchmarking showed that if the data is passed as `CategoricalVector`s, I can achieve a significant performance gain on the `fit` operation by implementing a version that utilizes the specific features of a `CategoricalVector`.
Your `input_scitype` guarantees that the columns you peel off have `Finite` scientific element type. Strictly speaking, that only guarantees that the elements of your vector have type `CategoricalValue`. So, for example, you could get a vector like `v` below:
```julia
julia> y = categorical(1:5)
5-element CategoricalArrays.CategoricalArray{Int64,1,UInt32}:
 1
 2
 3
 4
 5

julia> v = Any[y[1], y[2]]
2-element Vector{Any}:
 CategoricalArrays.CategoricalValue{Int64, UInt32} 1
 CategoricalArrays.CategoricalValue{Int64, UInt32} 2

julia> scitype(v)
AbstractVector{Multiclass{5}} (alias for AbstractArray{Multiclass{5}, 1})
```
So, strictly speaking, you may still want a fallback for `get_nodes` to catch such objects. 99.9% of the time, however, you'll get a bona fide `CategoricalVector`, which your specialised method will catch.

(It's a remote possibility that the behavior of ScientificTypes would change slightly here, but only to rule out the exceptional case just mentioned, rather than to become even more relaxed.)

In case you had not noticed, it's actually difficult to construct vectors of `CategoricalValue` that are not `CategoricalVector`s. For example, `[y[1], y[2]]` and `[y[i] for i in eachindex(y)]` are bona fide `CategoricalVector`s.
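The point above can be checked directly; a minimal sketch, relying only on CategoricalArrays:

```julia
using CategoricalArrays

y = categorical(1:5)

# Both the array literal and the comprehension collect to a CategoricalVector,
# because the elements are CategoricalValues drawn from the same pool.
v1 = [y[1], y[2]]
v2 = [y[i] for i in eachindex(y)]

# Only an explicitly typed Any[] container escapes this promotion.
v3 = Any[y[1], y[2]]

println(v1 isa CategoricalVector)
println(v2 isa CategoricalVector)
println(v3 isa CategoricalVector)
```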
Hmm, am I understanding this correctly? Even if an 'ordinary' array of, e.g., `String`s is passed to `fit`, this gets converted to a `CategoricalArray`, because the `input_scitype` is declared as `Finite` (and thus it has to be `coerce`d to `Multiclass` before being passed to `fit`)?
Hi @ablaom, the changes to OneRule have just been released. Perhaps you could have another look at this model and check if all requirements for MLJ are now met?
BTW: The new docstring standard is really a great idea. I think it makes the use of MLJ models much easier. 😊👍
Hi @ablaom, I've added a comment to alan-turing-institute/MLJ.jl#907, where I summarized what I've learned about ScientificTypes and CategoricalArrays from our discussion. I think that (hopefully 😊) I now have a better understanding of several aspects which weren't quite clear to me in the beginning. So it would be nice if you could check that comment too.
Perhaps some of these points are also helpful for other MLJ users and can be added somewhere to the documentation/tutorials.
> Hmm, am I understanding this correctly? Even if an 'ordinary' Array of e.g. Strings is passed to `fit`, this gets converted to a `CategoricalArray` because the `input_scitype` is declared being `Finite` (and thus it has to be `coerce`d to `Multiclass` before passing it to `fit`)?
No, the user must convert an array of strings to a categorical array (using `coerce(Xtable, Textual=>Multiclass)`, say) for use with your model. MLJ will not do this for her. If she tries to pass strings, she will get a warning about the scitype of her data not matching the scitype declared by the model.

Okay, I see from the doc-string you understand ✔️
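To illustrate the point, a sketch assuming ScientificTypes and CategoricalArrays are available (the table and its column name `x1` are made up):

```julia
using ScientificTypes   # provides `coerce`, `scitype`, `Textual`, `Multiclass`

# A plain table whose column holds raw Strings: its element scitype is
# Textual, which does not match an input_scitype requiring Finite/Multiclass.
Xtable = (x1 = ["red", "green", "red", "blue"],)

# The user must coerce explicitly before binding the data to the model;
# MLJ only *checks* scitypes, it never converts silently.
Xcoerced = coerce(Xtable, Textual => Multiclass)

println(scitype(Xcoerced.x1))
```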
BTW, I will review the latest changes and carry out some integration tests shortly. For future reference, making changes as a PR would make review easier.
Great work on the docstring, thanks!
I've inserted some (minor) comments on the commit.
Ping me when you've addressed these and I'll run my final integration tests.
Hi @ablaom, I've changed the few points you mentioned (and left some comments and answers there) and committed the changes to the OneRule repository (but I haven't published a new release of OneRule yet).

Perhaps you can have a look at them now? ... and at my comment at the end of https://github.com/alan-turing-institute/MLJ.jl/issues/907 too, in order to see if my understanding is now correct?
> For future reference, making changes as a PR would make review easier.

To be honest, I didn't quite understand this point. Do you mean a PR to the OneRule repository or a PR to MLJ for integrating the model there?
Thanks for your patience. I'm happy with your responses. The only thing you may decide to address before tagging a release is my comment about `fitted_params`, as changes there could be breaking.

Do ping me again when you're ready for registration. I'm planning to update the registry early next week.
> To be honest, I didn't quite understand this point. Do you mean a PR to the OneRule repository or a PR to MLJ for integrating the model there?
I meant a PR onto OneRule. That way I don't have to dig out the commits, and we get a clearer record of the discussion.
> Thanks for your patience
No problem, I've been busy with other things anyway. I hope you are well again.
> The only thing you may decide to address before tagging a release is my comment about `fitted_params`
I've changed the second parameter of `fitted_params` so that it is useful externally as well as for my internal purposes. It now contains all target labels present in the `levels` of the training data and is called `all_classes`. So it is quite similar to what `DecisionTreeClassifier` in `MLJDecisionTreeInterface` does.

The information about feature and class labels has already been present in the return values of `report` (see also my comment about the purposes of `fitted_params` and `report`).
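A minimal sketch of the shape this could take; the struct and function names here are hypothetical, and only the `all_classes` field follows the description above:

```julia
# Hypothetical fitresult container: the learned rule/tree plus all target
# levels seen during training (mirroring `all_classes` as described above).
struct OneRuleFitResultDemo{T}
    tree::T
    all_classes::Vector{String}
end

# In a real MLJ interface this logic would live in
# `MLJModelInterface.fitted_params(model, fitresult)`; a plain function
# suffices here to show the named tuple that reaches the user.
fitted_params_demo(fr::OneRuleFitResultDemo) =
    (tree = fr.tree, all_classes = fr.all_classes)
```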
Depending on your feedback, I will tag a new release.
See my comments added to the latest commit, thanks.
Hi @ablaom, thanks for checking my code so thoroughly 🙏.
- I've changed `fitted_params` (and the docstring) accordingly.
- I've updated `runtests.jl` for `report` and `fitted_params`.
So, hopefully things are now ready to tag a new release and for integration into MLJ 😀.
Looks good to me. Go ahead and tag a new release.
Done 😊 ... @ablaom is there now any further action from my side required to integrate the model into MLJ?
No. I'll register the model tomorrow. Thanks for your patience.
Ok perfect! ... and thanks for your help.
Providing here a code review with a view to registering the OneRule.jl model with MLJ.

@roland-KA Thanks for taking the time to wrap your head around the MLJ model interface. I realize this is a bit of work.

The MLJ interface is looking pretty solid. The main issue is that the data `X` that the MLJ user binds to your model is presently restricted to be a DataFrame, but you need to allow any object implementing the Tables.jl interface. I've browsed the details of your `get_best_tree` and I think this can be addressed pretty easily:

- Remove the `::DataFrame` type annotations everywhere. `X` has unknown type (but implements Tables.jl).
- You can remove the hefty DataFrames pkg as a dependency and replace it with Tables.jl (very lightweight). If you want, you can keep DataFrames for testing, but I usually just test with a Julia-native table, such as a named tuple of equal-length vectors, to avoid the test dependency.
- Convert `X` with `Tables.columns(X)` and work with that. The Tables.jl docs include an example that probably provides most of what you need. I believe you need to `collect` the `column` object if you need it to be a concrete vector, and not merely an "indexable collection with known length".

Some more minor comments:

- In the `metadata_model` declaration, the docstring now falls back to the doc-string you provide.
- `load_path` is incomplete; you want "OneRule.OneRuleClassifier".
- You can omit `is_wrapper` and `supports_weights`, as the fallbacks are already correct.
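A sketch of the column-by-column pattern described above, assuming only Tables.jl (the table itself is a plain named tuple, so no DataFrames dependency is needed; the column names are made up):

```julia
using Tables

# Any Tables.jl-compatible source works; a named tuple of equal-length
# vectors is the lightest-weight option for testing.
X = (outlook = ["sunny", "rain", "sunny"], windy = [true, false, true])

cols = Tables.columns(X)   # column-accessible form of the table
for name in Tables.columnnames(cols)
    # `collect` guarantees a concrete Vector, not merely an
    # "indexable collection with known length".
    column = collect(Tables.getcolumn(cols, name))
    println(name, " => ", column)
end
```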