ablaom opened 2 years ago
Thank you for taking the time to review the repo. I will make all the changes to bring it to the MLJ model standard. I thought that the scitypes were incorrect, so thank you for the information on that too.
You might find this new tool useful. You can use it to test your interfaces. Eventually you might want to include it in CI but for now it remains experimental.
You should use the `dev` branch.
Thank you for sharing. This looks like a great addition.
@ablaom Thank you again for reviewing the library. Here is a PR with some of the improvements that you suggested: #25
- This scitype declaration for a classifier is not appropriate for an MLJ model, as users provide targets for classification in the form of categorical vectors (element scitype `Finite`, which includes `OrderedFactor`, for ordered categoricals, and `Multiclass`, for unordered ones). So, for classification, the `target_scitype` you typically have is `AbstractVector{<:Finite}`, unless it's binary only, which would be `AbstractVector{<:Finite{2}}`. Furthermore, one needs to take into account all the classes in the pool of the target, not just those manifest as entries seen in the training vector. This is so that the target you `predict` includes all classes. (See this discussion.) This generally means re-encoding the data using the categorical reference integers for passing to the core model, but also somehow recording the full pool and a method to invert the conversion in the prediction phase. This is discussed in the manual, but there are also lots of existing implementations to adapt, for example this DecisionTreeClassifier for probabilistic predictors, and this SVM model for deterministic (point) predictions.
Unless I am missing something, all the classification methods should be able to handle multi-class classification.
I am a little confused about implementing the encoding of categorical variables. Because we pass the `y` vector to Python, the model will be able to predict any class that is present in the training data.
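To make the full-pool handling discussed above concrete, here is a minimal sketch of the pattern used by existing MLJ interfaces, assuming a core model that consumes integer-coded targets. `SomeClassifier`, `train_core_model`, and `core_predict` are placeholder names for illustration only, not part of RAPIDS.jl or MLJModelInterface:

```julia
import MLJModelInterface as MMI

function MMI.fit(model::SomeClassifier, verbosity, X, y)
    pool = MMI.classes(y[1])     # *all* classes in the pool, not just those seen in y
    yint = MMI.int(y)            # reference integers for the core model
    core = train_core_model(model, X, yint)  # placeholder for the real training call
    decode = MMI.decoder(y[1])   # method to invert the integer encoding later
    fitresult = (core, decode)
    return fitresult, nothing, nothing
end

function MMI.predict(model::SomeClassifier, fitresult, Xnew)
    core, decode = fitresult
    yint = core_predict(core, Xnew)  # placeholder: integer predictions
    return decode.(yint)             # categorical vector with the full pool attached
end
```

Because `decode` is built from the original target, the returned predictions carry the complete class pool, including classes that never appeared as entries in the training vector.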
- If your clustering models have a `predict`, then they should predict categorical vectors. I didn't check this.
This should now be the case. I added the casting to `MMI.categorical`, because we get a plain array back from Python.
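As a tiny illustration of that cast (the input vector here is a stand-in for whatever labels the Python side returns):

```julia
import MLJModelInterface as MMI

raw = [0, 1, 1, 2, 0]        # stand-in for the plain label array returned from Python
yhat = MMI.categorical(raw)  # wrapped as a CategoricalVector, as MLJ expects from predict
```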
- The same scitype declaration referenced in 1. also allows a `Table` type and I wonder if that was intentional. Tabular targets are only used in multitarget models (one column per target). Is this actually supported in the current case?
This was incorrect. I have fixed the target scitype.
- A lot of `model_metadata` entries in this pkg indicate that you can use tables and matrices for input, but the examples in the doc-strings only use matrices. It's fine if your model indeed can handle both (is that the case?), but generally MLJ users are used to inputs being tabular (wrapping matrices as tables if necessary), so probably the doc-strings should include tabular examples. See the DecisionTreeClassifier docstring for examples.
The machines should work with table/matrix inputs for the features. I will add some examples with this.
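One common way to support both input forms is a small conversion layer at the top of `fit` and `predict`. A sketch, where `to_matrix` is a hypothetical helper name, not an existing API:

```julia
import MLJModelInterface as MMI

# Tables are converted to a matrix up front; matrices pass straight through.
to_matrix(X) = MMI.matrix(X)      # X is any Tables.jl-compatible table
to_matrix(X::AbstractMatrix) = X  # already a matrix, nothing to do
```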
- This is probably an issue you want to address last, but MLJ now has a hard requirement for model docstrings, which must comply with this standard.
I updated the docstrings for the classifiers (LogisticRegression,...). Let me know what you think.
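For anyone following along, those docstrings follow a fixed section layout; a rough skeleton, with section names per the MLJ standard but all details illustrative:

```julia
"""
    LogisticRegression

A model type for fitting a logistic regression classifier. (Illustrative
skeleton only; see the MLJ docstring standard for the full requirements.)

# Training data

In MLJ or MLJBase, bind an instance `model` to data with

    mach = machine(model, X, y)

# Hyperparameters

- `penalty=:l2`: regularization type (example entry).

# Operations

- `predict(mach, Xnew)`: return predictions of the target given features `Xnew`.

# Fitted parameters

The fields of `fitted_params(mach)` are documented here.

# Examples

A short, runnable usage example goes here.
"""
```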
A side note: if you do try to install it, there is a problem with micromamba installing RAPIDS right now. The workaround is setting the solver to mamba via the following env variable: `JULIA_CONDAPKG_EXE=mamba`.
@tylerjthomas9 Thanks indeed for this update. Currently swamped with reviews and JuliaCon prep, but I will return to this in due course.
Having trouble getting `using RAPIDS` to work. I'm using mamba as suggested above.
Any ideas?
```
WARNING: Method definition load_path(Type{var"#s446"} where var"#s446"<:RAPIDS.HDBSCAN) in module RAPIDS at /home/ubuntu/Julia/RAPIDS/src/cuml/clustering.jl:135 overwritten at /home/ubuntu/Julia/RAPIDS/src/cuml/dimensionality_reduction.jl:227.
** incremental compilation may be fatally broken for this module **

ERROR: InitError: Python: ImportError: /opt/julia-1.7.3/bin/../lib/julia/libstdc++.so.6: version `GLIBCXX_3.4.30' not found (required by /home/ubuntu/for_extra_software/julia_depot/environments/rapids/.CondaPkg/env/lib/python3.9/site-packages/pyarrow/../../../libarrow.so.700)
Python stacktrace:
 [1] <module>
   @ ~/for_extra_software/julia_depot/environments/rapids/.CondaPkg/env/lib/python3.9/site-packages/pyarrow/__init__.py:65
 [2] <module>
   @ ~/for_extra_software/julia_depot/environments/rapids/.CondaPkg/env/lib/python3.9/site-packages/cudf/core/dtypes.py:10
 [3] <module>
   @ ~/for_extra_software/julia_depot/environments/rapids/.CondaPkg/env/lib/python3.9/site-packages/cudf/api/types.py:18
 [4] <module>
   @ ~/for_extra_software/julia_depot/environments/rapids/.CondaPkg/env/lib/python3.9/site-packages/cudf/api/__init__.py:3
 [5] <module>
   @ ~/for_extra_software/julia_depot/environments/rapids/.CondaPkg/env/lib/python3.9/site-packages/cudf/__init__.py:12
Stacktrace:
 [1] pythrow()
   @ PythonCall ~/for_extra_software/julia_depot/packages/PythonCall/XgP8G/src/err.jl:94
 [2] errcheck
   @ ~/for_extra_software/julia_depot/packages/PythonCall/XgP8G/src/err.jl:10 [inlined]
 [3] pyimport(m::String)
   @ PythonCall ~/for_extra_software/julia_depot/packages/PythonCall/XgP8G/src/concrete/import.jl:11
 [4] __init__()
   @ RAPIDS ~/Julia/RAPIDS/src/RAPIDS.jl:37
 [5] _include_from_serialized(path::String, depmods::Vector{Any})
   @ Base ./loading.jl:768
 [6] _require_from_serialized(path::String)
   @ Base ./loading.jl:821
 [7] _require(pkg::Base.PkgId)
   @ Base ./loading.jl:1130
 [8] require(uuidkey::Base.PkgId)
   @ Base ./loading.jl:1013
 [9] require(into::Module, mod::Symbol)
   @ Base ./loading.jl:997
during initialization of module RAPIDS
```
I'm using the version on the `main` branch.
I am trying to narrow down this issue. I can build it without issues on one of my computers, but I get the same error as you on a different computer. I'll let you know as soon as I can get the builds working everywhere again. It seems that something broke right after I published v0.1.0, and building the environment has become less consistent.
Yeah, I can appreciate this kind of multi-language GPU installation is very tricky. I will wait for you to sort this out before proceeding with a review.
The nightly builds of Julia should contain the GCC updates required for the most recent versions of RAPIDS (https://github.com/JuliaLang/julia/pull/45582#event-7730373179). Additionally, the micromamba build bug has been fixed. I am going to start development on this library again now that these roadblocks are clearing up. I will let you know when it's ready for review.
@ablaom I think that the package is in a good place for review. You will need to use a nightly Julia build to run it. Let me know if you have any questions.
Awesome. Will 1.8.3 (currently release-1.8) suffice? If so I might wait for that.
I think the backport is scheduled to come with 1.8.4. We can circle back when that comes out.
Yes. I know this is frustrating, but I prefer to prioritise review to interfaces that will work with current Julia. Hang in there.
Sounds good. I'll ping you when it's ready.
Hey @ablaom, I put a little time into getting this library to a similar level as CatBoost.jl. I think that it is in a good spot for a review when you have the capacity. There is absolutely no rush.
I want to eventually move away from Python and wrap the C++/CUDA code directly, but I won't have the capacity to tackle this transition for a while. Because of this eventual transition at some point in the future, I think that this would be a good addition to the third-party MLJ models.
Okay, I'm having trouble because of this error:
"ERROR: LoadError: InitError: IOError: write: no space left on device (ENOSPC)"
I've checked my disk and there is enough space. I understand this may be tricky to diagnose. Two facts lead me to kick the can on this one: my old GPU is to be replaced, and after next week I am between contracts and I'm going to lose access. So I'll try again when the dust settles. Thanks for your patience.
I have never seen that error before. If it is a pre-Pascal GPU, then it might not work. There is absolutely no rush.
Further to https://github.com/JuliaAI/MLJModels.jl/issues/454, I have been asked to provide feedback on the implementation of the MLJ model interface package for models provided by RAPIDS.jl, with a view to registering the models in MLJ's model registry.
First, it's clear that a lot of effort has been put into understanding and implementing the interface, which is really awesome. I'm pretty optimistic we can close the necessary gaps to get this across the finish line.
I have not had an opportunity for a thorough review but wanted to get the ball rolling with a few comments. Only the first is more serious.
1. This scitype declaration for a classifier is not appropriate for an MLJ model, as users provide targets for classification in the form of categorical vectors (element scitype `Finite`, which includes `OrderedFactor`, for ordered categoricals, and `Multiclass`, for unordered ones). So, for classification, the `target_scitype` you typically have is `AbstractVector{<:Finite}`, unless it's binary only, which would be `AbstractVector{<:Finite{2}}`. Furthermore, one needs to take into account all the classes in the pool of the target, not just those manifest as entries seen in the training vector. This is so that the target you `predict` includes all classes. (See this discussion.) This generally means re-encoding the data using the categorical reference integers for passing to the core model, but also somehow recording the full pool and a method to invert the conversion in the prediction phase. This is discussed in the manual, but there are also lots of existing implementations to adapt, for example this DecisionTreeClassifier for probabilistic predictors, and this SVM model for deterministic (point) predictions.
2. If your clustering models have a `predict`, then they should predict categorical vectors. I didn't check this.
3. The same scitype declaration referenced in 1. also allows a `Table` type and I wonder if that was intentional. Tabular targets are only used in multitarget models (one column per target). Is this actually supported in the current case?
4. A lot of `model_metadata` entries in this pkg indicate that you can use tables and matrices for input, but the examples in the doc-strings only use matrices. It's fine if your model indeed can handle both (is that the case?), but generally MLJ users are used to inputs being tabular (wrapping matrices as tables if necessary), so probably the doc-strings should include tabular examples. See the DecisionTreeClassifier docstring for examples.
5. This is probably an issue you want to address last, but MLJ now has a hard requirement for model docstrings, which must comply with this standard.
At this early stage, you may just want to comment on these. If you do want to work on new commits to address them, can you please put them in an unmerged PR, so I can more easily comment on any changes?
Happy to take a call if you think that would be efficient at this early stage of the review.