[INSTALL]: Proposal to resolve the cuML integration problems

beckernick commented 2 years ago

Installation check

[X] I have read the installation guide.

Platform

Linux-5.8.0-38-generic-x86_64-with-glibc2.31

Installation Method

Built from source

pycaret Version

Source build

Python Version

3.9.13

Description

https://github.com/pycaret/pycaret/issues/2710, https://github.com/pycaret/pycaret/issues/2914, and https://github.com/pycaret/pycaret/issues/2987 are open issues that illustrate the challenges of using recent releases of cuML with PyCaret.

I investigated these issues and believe the following summary captures the current state:

The version checking utilities cannot parse the tuple of integers produced by the cuml_version processing (e.g., "22.10" -> (22, 10))
cuML depends on cuDF, which requires numba >= 0.56.2
PyCaret pins numba to 0.55. The pin was added in https://github.com/pycaret/pycaret/pull/2336 and may have been related to the sktime library (please correct me if I'm wrong).
sktime constrains numba >= 0.53 and numpy>=1.21.0,<1.23, so pinning numba to 0.55 solely for sktime may not be necessary
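
The first point can be reproduced in isolation. The following is a minimal sketch, assuming that `version` in the PyCaret code refers to `packaging.version` (the diff does not show the import). The helper name `cuml_is_recent_enough` is hypothetical and is not part of PyCaret:

```python
from typing import Optional

from packaging import version  # assumption: this is the `version` used by PyCaret


def cuml_is_recent_enough(cuml_version: Optional[str], minimum: str = "22.10") -> bool:
    """Return True if a version string is present and at least `minimum`."""
    if cuml_version is None:
        return False
    # Both operands stay strings until version.parse turns them into Version objects.
    return version.parse(cuml_version) >= version.parse(minimum)


# Handing version.parse the pre-split tuple, as the existing code does,
# fails because version.parse expects a string.
try:
    version.parse((22, 10))
    tuple_parse_failed = False
except Exception:
    tuple_parse_failed = True

print(tuple_parse_failed)
print(cuml_is_recent_enough("22.10"))
```

Keeping the version as a string until it reaches `version.parse` avoids the failure without any manual splitting.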

I believe that the following changes allow using the current cuML (22.10) or higher with PyCaret smoothly and do not cause any additional test failures.

Relaxing the numba constraint
Updating the version checking utilities

diff --git a/pycaret/internal/pycaret_experiment/tabular_experiment.py b/pycaret/internal/pycaret_experiment/tabular_experiment.py
index d070e916..cb76b08e 100644
--- a/pycaret/internal/pycaret_experiment/tabular_experiment.py
+++ b/pycaret/internal/pycaret_experiment/tabular_experiment.py
@@ -346,11 +346,8 @@ class _TabularExperiment(_PyCaretExperiment):
                 cuml_version = __version__
                 self.logger.info(f"cuml=={cuml_version}")

-                cuml_version = cuml_version.split(".")
-                cuml_version = (int(cuml_version[0]), int(cuml_version[1]))
-
             if cuml_version is None or not version.parse(cuml_version) >= version.parse(
-                "0.15"
+                "22.10"
             ):
                 message = f"cuML is outdated or not found. Required version is >=0.15, got {__version__}"
                 if use_gpu == "force":
diff --git a/requirements.txt b/requirements.txt
index aec84845..da543a87 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -12,7 +12,7 @@ pyod>=0.9.8
 imbalanced-learn>=0.8.1
 category-encoders>=2.4.0
 lightgbm>=3.0.0
-numba~=0.55.0
+numba>=0.55.0
 requests>=2.27.1  # Required by pycaret.datasets
 psutil>=5.9.0
 markupsafe>=2.0.1  # Fixes Google Colab issue
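
To illustrate why the `~=` pin blocks cuDF while `>=` does not, here is a small sketch using `packaging.specifiers`. It only evaluates the two specifier strings from the diff against the numba lower bound quoted above for cuDF; the variable names are illustrative:

```python
from packaging.specifiers import SpecifierSet

# "~=0.55.0" is a compatible-release pin, equivalent to ">=0.55.0, ==0.55.*".
pinned = SpecifierSet("~=0.55.0")
relaxed = SpecifierSet(">=0.55.0")

cudf_numba_minimum = "0.56.2"  # lower bound quoted in the summary above

pinned_allows = pinned.contains(cudf_numba_minimum)
relaxed_allows = relaxed.contains(cudf_numba_minimum)

print(pinned_allows, relaxed_allows)
```

Under these assumptions, no numba release can satisfy both the old pin and the cuDF requirement, whereas the relaxed constraint leaves room for a resolver to pick 0.56.2 or later.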

With PyCaret installed from source and patched as above in the following conda environment, things work as expected:

mamba create -n rapids-22.10-pycaret -c rapidsai -c nvidia -c conda-forge rapids=22.10 python=3.9 cudatoolkit=11.5 jupyterlab strings_udf
conda activate rapids-22.10-pycaret
git clone https://github.com/pycaret/pycaret.git
cd pycaret
python -m pip install .

Testing

I ran the test suite locally with pytest, with the patch above applied, in an environment that included the full set of dependencies from requirements-test.txt and requirements-optional.txt. Several tests failed, so I repeated the run in a clean environment with a fresh, unmodified PyCaret source build. The same 8 tests failed in both environments, which suggests that this change probably does not introduce any new test failures:

Test failures with standard PyCaret built from source:

===================================================================== short test summary info =====================================================================
FAILED tests/test_check_fairness.py::test_check_fairness_multiclass_classification - TypeError: object of type 'bool' has no len()
FAILED tests/test_classification_plots.py::test_plot - _tkinter.TclError: invalid command name ".!navigationtoolbar2tk.!button2"
FAILED tests/test_nlp.py::test_nlp - OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a Python package or a valid path to a data directory.
FAILED tests/test_nlp.py::TestNLPExperimentCustomTags::test_nlp_setup_fails_with_experiment_custom_tags - OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a Python package or a valid path to a data directory.
FAILED tests/test_nlp.py::TestNLPExperimentCustomTags::test_nlp_create_model_fails_with_experiment_custom_tags - OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a Python package or a valid path to a data directory.
FAILED tests/test_nlp.py::TestNLPExperimentCustomTags::test_nlp_setup_with_experiment_custom_tags - OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a Python package or a valid path to a data directory.
FAILED tests/test_nlp.py::TestNLPExperimentCustomTags::test_nlp_create_models_with_experiment_custom_tags - OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a Python package or a valid path to a data directory.
FAILED tests/test_regression_plots.py::test_plot - _tkinter.TclError: invalid command name ".!navigationtoolbar2tk.!button2"
============================================== 8 failed, 549 passed, 3 skipped, 4823 warnings in 3515.72s (0:58:35) ===============================================

Identical test failures with PyCaret built from source with the above patch:

===================================================================== short test summary info =====================================================================
FAILED tests/test_check_fairness.py::test_check_fairness_multiclass_classification - TypeError: object of type 'bool' has no len()
FAILED tests/test_classification_plots.py::test_plot - _tkinter.TclError: invalid command name ".!navigationtoolbar2tk.!button2"
FAILED tests/test_nlp.py::test_nlp - OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a Python package or a valid path to a data directory.
FAILED tests/test_nlp.py::TestNLPExperimentCustomTags::test_nlp_setup_fails_with_experiment_custom_tags - OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a Python package or a valid path to a data directory.
FAILED tests/test_nlp.py::TestNLPExperimentCustomTags::test_nlp_create_model_fails_with_experiment_custom_tags - OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a Python package or a valid path to a data directory.
FAILED tests/test_nlp.py::TestNLPExperimentCustomTags::test_nlp_setup_with_experiment_custom_tags - OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a Python package or a valid path to a data directory.
FAILED tests/test_nlp.py::TestNLPExperimentCustomTags::test_nlp_create_models_with_experiment_custom_tags - OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a Python package or a valid path to a data directory.
FAILED tests/test_regression_plots.py::test_plot - _tkinter.TclError: invalid command name ".!navigationtoolbar2tk.!button2"
============================================== 8 failed, 549 passed, 3 skipped, 4689 warnings in 3481.17s (0:58:01) ===============================================
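
The comparison of the two runs can also be done mechanically rather than by eye. The sketch below assumes the summaries have been saved as text; the two strings are abridged stand-ins for the logs above, and `failed_tests` is a hypothetical helper:

```python
def failed_tests(summary: str) -> set:
    """Collect test IDs from the 'FAILED <id> - <reason>' lines of a pytest summary."""
    failed = set()
    for line in summary.splitlines():
        if line.startswith("FAILED "):
            # The test ID is the token right after "FAILED"
            # (assumes the ID itself contains no spaces).
            failed.add(line.split()[1])
    return failed


# Abridged excerpts of the two summaries shown above.
baseline = """\
FAILED tests/test_check_fairness.py::test_check_fairness_multiclass_classification - TypeError
FAILED tests/test_classification_plots.py::test_plot - _tkinter.TclError
FAILED tests/test_nlp.py::test_nlp - OSError
"""
patched = """\
FAILED tests/test_check_fairness.py::test_check_fairness_multiclass_classification - TypeError
FAILED tests/test_classification_plots.py::test_plot - _tkinter.TclError
FAILED tests/test_nlp.py::test_nlp - OSError
"""

# Tests that fail only with the patch applied.
new_failures = failed_tests(patched) - failed_tests(baseline)
print(sorted(new_failures))
```

An empty difference means the patch introduced no failures that were not already present in the baseline run.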

Given the above, @ngupta23 @Yard1, would you be open to accepting a PR to unblock using the current cuML with the current PyCaret?

cc @dantegd @wphicks (awareness)

moezali1 commented 2 years ago

@Yard1 What are your thoughts on this?

ngupta23 commented 2 years ago

The changes to numba pinning (i.e. relaxing it) look to be ok from my perspective.

ngupta23 commented 2 years ago

@beckernick Feel free to submit the PR. If it passes on GitHub, we can accept it. I think you are missing the vocabulary dictionary locally, which is why the local tests are failing.

beckernick commented 1 year ago

Yeah, you're right. Looks like 5/8 are from not downloading the spacy model ahead of time.
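
For anyone reproducing the run, a quick pre-flight check along these lines can flag the missing model before the test suite starts. This is only a sketch: `spacy_model_installed` is a hypothetical helper, and it relies on spaCy models being installed as regular Python packages, as the error message above implies:

```python
import importlib.util


def spacy_model_installed(name: str = "en_core_web_sm") -> bool:
    """Return True if the model is importable as a top-level Python package."""
    return importlib.util.find_spec(name) is not None


if not spacy_model_installed():
    # spaCy's CLI can fetch the model: python -m spacy download en_core_web_sm
    print("en_core_web_sm is missing; the tests in tests/test_nlp.py will fail with E050")
```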

And sounds good, thanks! Will open a PR (it may take a few more days due to some internal approvals needed for contributing to a new project).

ngupta23 commented 1 year ago

Sounds good. If you get the latest master, some of the remaining 3 tests should be fixed as well.
