roland-KA / OneRule.jl

Implementation of the 1-Rule data mining algorithm using the Julia programming language
MIT License

Feedback on MLJ integration #2

Open ablaom opened 2 years ago

ablaom commented 2 years ago

Providing here a code review with a view to registering the OneRule.jl model with MLJ.

@roland-KA Thanks for taking the time to wrap your head around the MLJ model interface. I realize this is a bit of work.

The MLJ interface is looking pretty solid. The main issue is that the data X that the MLJ user binds to your model is presently restricted to be a DataFrame, but you need to allow any object implementing the Tables.jl interface. I've browsed the details of your get_best_tree and I think this can be addressed pretty easily:

# access data of input table `x` column-by-column
# Tables.columns returns an object where individual, entire columns can be accessed
columns = Tables.columns(x)

# iterate through each column name in table
for col in Tables.columnnames(columns)
    # retrieve entire column by column name
    # a column is an indexable collection
    # with known length (i.e. supports
    # `length(column)` and `column[i]`)
    column = Tables.getcolumn(columns, col)
end

I believe you need to collect the column object if you need it to be a concrete vector, and not merely an "indexable collection with known length".
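Putting the two points together, a minimal sketch of the pattern looks like this (the `eachfeature` helper and the example table are hypothetical illustrations, not OneRule code):

```julia
using Tables

# Hypothetical helper: return the columns of any Tables.jl-compatible
# table `X` as concrete Vectors.
function eachfeature(X)
    cols = Tables.columns(X)  # column-accessible form of the table
    # `collect` turns each "indexable collection with known length"
    # into a concrete Vector
    return [collect(Tables.getcolumn(cols, name)) for name in Tables.columnnames(cols)]
end

# A NamedTuple of vectors is the simplest Tables.jl table:
X = (outlook = ["sunny", "rainy"], windy = [true, false])
features = eachfeature(X)
# features[1] == ["sunny", "rainy"]; features[2] == [true, false]
```

The same code works unchanged for a DataFrame, since DataFrames.jl implements the Tables.jl interface.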

Some more minor comments:

roland-KA commented 2 years ago

Hi @ablaom, thank you very much for the detailed analysis of the OneRule code! I had already suspected that the use of DataFrames at that point might be a problem, but I had no idea how or in what way. So this is really a great help.

I will now create a new release of OneRule to address all the points you mentioned.

I have one additional question with respect to the use of different data types for X and y: As you've mentioned above, the features in X are processed column by column in get_best_tree. Then get_nodes (from nodes.jl) is called passing one feature column and the target vector to this function.

get_nodes expects an AbstractVector for this data:

function get_nodes(column::AbstractVector, target::AbstractVector ...

Let's assume that I implement an additional variant of get_nodes (using Julia's dispatch mechanism) that accepts only CategoricalVectors for these two parameters:

function get_nodes(column::CategoricalVector, target::CategoricalVector ...

Would that be ok for MLJ (as there is one variant accepting AbstractVectors) or do you see any issues with this?

This is just a theoretical question for OneRule. But I'm currently working on another model that also processes X column-wise. And some benchmarking showed that if the data is passed as CategoricalVectors, I can achieve a significant performance gain in the fit operation by implementing a version that exploits the specific features of a CategoricalVector.
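The two-method setup in question can be sketched as follows (the method bodies are placeholders, just to show which method dispatch selects; assumes CategoricalArrays.jl is available):

```julia
using CategoricalArrays

# Generic fallback: works for any pair of vectors.
get_nodes(column::AbstractVector, target::AbstractVector) = "generic path"

# Specialised method: could exploit `levels`, integer refs, etc.
get_nodes(column::CategoricalVector, target::CategoricalVector) = "categorical path"

x = categorical(["a", "b", "a"])
y = categorical(["yes", "no", "yes"])

get_nodes(x, y)                        # → "categorical path"
get_nodes(["a", "b"], ["yes", "no"])   # → "generic path"
```

Since the AbstractVector fallback remains in place, any vector is still accepted; the specialised method is picked automatically whenever both arguments are CategoricalVectors.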

ablaom commented 2 years ago

Your input_scitype guarantees that the columns you peel off have Finite scientific element type. Strictly speaking, that only guarantees that the elements of your vector have type CategoricalValue. So, for example, you could get a vector like v below:

julia> y = categorical(1:5)
5-element CategoricalArrays.CategoricalArray{Int64,1,UInt32}:
 1
 2
 3
 4
 5

julia> v = Any[y[1], y[2]]
2-element Vector{Any}:
 CategoricalArrays.CategoricalValue{Int64, UInt32} 1
 CategoricalArrays.CategoricalValue{Int64, UInt32} 2

julia> scitype(v)
AbstractVector{Multiclass{5}} (alias for AbstractArray{Multiclass{5}, 1})

So, strictly speaking, you may still want a fallback for get_nodes to catch such objects.

99.9% of the time, however, you'll get a bona fide CategoricalVector, which your specialised method will catch.

(It's a remote possibility that the behavior of ScientificTypes would change slightly here, but only to rule out the exceptional case just mentioned, rather than to become even more relaxed.)

ablaom commented 2 years ago

In case you had not noticed, it's actually difficult to construct vectors of CategoricalValue that are not CategoricalVectors. For example, [y[1], y[2]] and [y[i] for i in eachindex(y)] are bona fide CategoricalVectors.
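This is easy to check at the REPL (assuming CategoricalArrays.jl is available); only an explicitly Any-typed container escapes:

```julia
using CategoricalArrays

y = categorical(1:5)

v1 = [y[1], y[2]]                   # array literal of CategoricalValues
v2 = [y[i] for i in eachindex(y)]   # comprehension over CategoricalValues
# per the comment above, both come back as bona fide CategoricalVectors

w = Any[y[1], y[2]]                 # explicitly Any-typed container
w isa CategoricalVector             # false: this is the exceptional case
```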

roland-KA commented 2 years ago

Hmm, am I understanding this correctly? Even if an 'ordinary' Array of e.g. Strings is passed to fit, it gets converted to a CategoricalArray, because the input_scitype is declared as Finite (and thus it has to be coerced to Multiclass before being passed to fit)?

roland-KA commented 2 years ago

Hi @ablaom, the changes to OneRule have just been released. Perhaps you could have another look at this model and check whether all requirements for MLJ are now met?

BTW: The new docstring standard is really a great idea. I think it makes the use of MLJ models much easier. 😊👍

roland-KA commented 2 years ago

Hi @ablaom, I've added a comment to alan-turing-institute/MLJ.jl#907, where I summarize what I've learned about ScientificTypes and CategoricalArrays from our discussion. I think (hopefully 😊) I now have a better understanding of several aspects that weren't quite clear to me at the beginning. So it would be nice if you could check that comment too.

Perhaps some of these points are also helpful for other MLJ users and can be added somewhere to the documentation/tutorials.

ablaom commented 2 years ago

Hmm, am I understanding this correctly? Even if an 'ordinary' Array of e.g. Strings is passed to fit, it gets converted to a CategoricalArray, because the input_scitype is declared as Finite (and thus it has to be coerced to Multiclass before being passed to fit)?

No, the user must convert an array of strings to a categorical array (using coerce(Xtable, Textual=>Multiclass), say) for use with your model. MLJ will not do this for her. If she tries to pass strings, she will get a warning about the scitype of her data not matching the scitype declared by the model. Okay, I see from the docstring that you understand ✔️
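A minimal sketch of that user-side coercion (the column name is made up; assumes ScientificTypes.jl and CategoricalArrays.jl are available):

```julia
using ScientificTypes, CategoricalArrays

# A plain String column has scitype Textual:
X = (outlook = ["sunny", "rainy", "overcast"],)

# The user coerces it herself before binding the data to the model:
Xc = coerce(X, Textual => Multiclass)

Xc.outlook isa CategoricalVector   # true: now compatible with a Finite input_scitype
```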

BTW, I will review the latest changes and carry out some integration tests shortly. For future reference, making changes as a PR would make review easier.

ablaom commented 2 years ago

Great work on the docstring, thanks!

I've inserted some (minor) comments on the commit.

Ping me when you've addressed these and I'll run my final integration tests.

roland-KA commented 2 years ago

Hi @ablaom, I've changed the few points you mentioned (and left some comments and answers there) and committed the changes to the OneRule repository (but I haven't published a new release of OneRule yet).

Perhaps you can have a look at them now? ... and at my comment at the end of https://github.com/alan-turing-institute/MLJ.jl/issues/907 too, to see whether my understanding is now correct?

For future reference, making changes as a PR would make review easier.

To be honest, I didn't quite understand this point. Do you mean a PR to the OneRule-repository or a PR to MLJ for integrating the model there?

ablaom commented 2 years ago

Thanks for your patience. I'm happy with your responses. The only thing you may decide to address before tagging a release is my comment about fitted_params, as changes there could be breaking.

Do ping me again when you're ready for registration. I'm planning to update the registry early next week.

ablaom commented 2 years ago

To be honest, I didn't quite understand this point. Do you mean a PR to the OneRule-repository or a PR to MLJ for integrating the model there?

I meant a PR onto OneRule. That way I don't have to dig out the commits, and we get a clearer record of the discussion.

roland-KA commented 2 years ago

Thanks for your patience

No problem, I've been busy with other things anyway. I hope you are well again.

The only thing you may decide to address before tagging a release is my comment about fitted_params

I've changed the 2nd parameter of fitted_params so that it is useful externally as well as for my internal purposes. It now contains all target labels present in the levels of the training data and is called all_classes. So it is quite similar to what DecisionTreeClassifier in MLJDecisionTreeInterface does.

The information about feature and class labels was already present in the return values of report (see also my comment about the purposes of fitted_params and report).

Depending on your feedback, I will tag a new release.

ablaom commented 2 years ago

See my comments added to latest commit, thanks.

roland-KA commented 2 years ago

Hi @ablaom, thanks for checking my code so thoroughly 🙏.

So, hopefully things are now ready to tag a new release and for integration into MLJ 😀.

ablaom commented 2 years ago

Looks good to me. Go ahead and tag a new release.

roland-KA commented 2 years ago

Done 😊 ... @ablaom is there now any further action from my side required to integrate the model into MLJ?

ablaom commented 2 years ago

No. I'll register the model tomorrow. Thanks for your patience.

roland-KA commented 2 years ago

Ok perfect! ... and thanks for your help.