rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[BUG] `forest_inference_demo.ipynb` is broken #6008

Closed · jameslamb closed this 3 months ago

jameslamb commented 3 months ago

Describe the bug

The notebooks/forest_inference_demo.ipynb notebook is broken: XGBoost model loading with FIL is failing.

I've observed this behavior on the 24.08 release of cuml and all of its dependencies. I suspect it's a problem on 24.10 as well, but haven't tested that yet.

Steps/Code to reproduce bug

Created a conda environment and installed cuml, jupyterlab, and xgboost into it.

setup (click me)

Ran the following from the root of the repo, on a machine with V100s and CUDA 12.2.

```shell
conda env create \
  --name cuml-cu12-dev \
  --file ./conda/environments/all_cuda-125_arch-x86_64.yaml

source activate cuml-cu12-dev

conda install \
  -c conda-forge \
  -c rapidsai-nightly \
  -c rapidsai \
  --yes \
  cuml=24.8.* \
  jupyterlab
```

Then launched JupyterLab.

```shell
jupyter lab --ip 0.0.0.0 --port 1234
```

Ran the cells in notebooks/forest_inference_demo.ipynb in order.

This call to ForestInference.load()

https://github.com/rapidsai/cuml/blob/e571abaf068b21173984e07b73c91bf0be8da7b5/notebooks/forest_inference_demo.ipynb#L273

Fails like this:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[12], line 1
----> 1 fil_model = ForestInference.load(
      2     filename=model_path,
      3     algo='BATCH_TREE_REORG',
      4     output_class=True,
      5     threshold=0.50,
      6     model_type='xgboost'
      7 )

File fil.pyx:1033, in cuml.fil.fil.ForestInference.load()

File fil.pyx:212, in cuml.fil.fil.TreeliteModel.from_filename()

RuntimeError: Failed to load xgb.model (basic_string::_M_replace_aux)

This same error can be seen in the most recent run of this notebook in the CI for rapidsai/docker: https://github.com/rapidsai/docker/actions/runs/10244736365/job/28356773321#step:9:15

Expected behavior

Expected this notebook to run end-to-end without error.

Environment details (please complete the following information):

output of 'conda info' (click me)

```text
     active environment : cuml-cu12-dev
    active env location : /raid/jlamb/miniforge/envs/cuml-cu12-dev
            shell level : 1
       user config file : /home/nfs/jlamb/.condarc
 populated config files : /raid/jlamb/miniforge/.condarc
                          /home/nfs/jlamb/.condarc
          conda version : 23.7.4
    conda-build version : 24.5.1
         python version : 3.10.12.final.0
       virtual packages : __archspec=1=x86_64
                          __cuda=12.2=0
                          __glibc=2.31=0
                          __linux=5.4.0=0
                          __unix=0=0
       base environment : /raid/jlamb/miniforge  (writable)
      conda av data dir : /raid/jlamb/miniforge/etc/conda
  conda av metadata url : None
           channel URLs : https://conda.anaconda.org/conda-forge/linux-64
                          https://conda.anaconda.org/conda-forge/noarch
          package cache : /raid/jlamb/miniforge/pkgs
                          /home/nfs/jlamb/.conda/pkgs
       envs directories : /raid/jlamb/miniforge/envs
                          /home/nfs/jlamb/.conda/envs
               platform : linux-64
             user-agent : conda/23.7.4 requests/2.32.3 CPython/3.10.12 Linux/5.4.0-182-generic ubuntu/20.04.6 glibc/2.31
                UID:GID : 10349:10004
             netrc file : None
           offline mode : False
```
output of 'nvidia-smi' (click me)

```text
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-SXM2-32GB           On  | 00000000:06:00.0 Off |                    0 |
| N/A   31C    P0              41W / 300W |      0MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2-32GB           On  | 00000000:07:00.0 Off |                    0 |
| N/A   33C    P0              42W / 300W |      0MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2-32GB           On  | 00000000:0A:00.0 Off |                    0 |
| N/A   31C    P0              42W / 300W |      0MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2-32GB           On  | 00000000:0B:00.0 Off |                    0 |
| N/A   29C    P0              41W / 300W |      0MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2-32GB           On  | 00000000:85:00.0 Off |                    0 |
| N/A   31C    P0              42W / 300W |      0MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2-32GB           On  | 00000000:86:00.0 Off |                    0 |
| N/A   30C    P0              42W / 300W |      0MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2-32GB           On  | 00000000:89:00.0 Off |                    0 |
| N/A   35C    P0              43W / 300W |      0MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2-32GB           On  | 00000000:8A:00.0 Off |                    0 |
| N/A   31C    P0              43W / 300W |      0MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
```

Additional context

This was only noticed because of a CI failure over in rapidsai/docker: https://github.com/rapidsai/docker/pull/699#discussion_r1704654898.

Ideally, this would be caught in cuml's CI. As of this writing, this notebook is not tested there:

```text
SKIPPING: ./forest_inference_demo.ipynb (suspected Dask usage, not currently automatable)
```

(build link)

This notebook has been running in rapidsai/docker CI for a while. It passed on 24.08 as recently as 2 weeks ago.

```text
Testing cuml/forest_inference_demo.ipynb
Completed cuml/forest_inference_demo.ipynb with 1 warnings and 0 errors
```

(build link)

So I suspect this is a result of a recent change. Maybe some mix of these:

hcho3 commented 3 months ago

The error is likely due to a change in the XGBoost version. Starting with version 2.1.0, XGBoost defaults to the UBJSON format when saving models.

Treelite 4.3 added support for UBJSON, but regrettably FIL has not yet been updated to recognize the UBJSON format, hence the error. Let me prepare a pull request.
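For context, here is a minimal sketch (not from the issue; `detect_xgboost_model_format` is a hypothetical helper) of why the failure mode is subtle: both a JSON and a UBJSON model file can begin with a `{` byte, since `{` is also UBJSON's object marker, so telling them apart in practice amounts to attempting a JSON parse.

```python
import json


def detect_xgboost_model_format(path: str) -> str:
    """Best-effort guess at an XGBoost model file's serialization format.

    Returns "json" if the file parses as JSON text, otherwise "binary"
    (covering both UBJSON, the default since XGBoost 2.1.0, and the
    legacy binary format). A first-byte check is not sufficient: a
    UBJSON object also starts with the marker byte b"{".
    """
    with open(path, "rb") as f:
        data = f.read()
    try:
        json.loads(data)
        return "json"
    except (ValueError, UnicodeDecodeError):
        # Not JSON text; assume UBJSON or the legacy binary format.
        return "binary"
```

Given the new default described above, a likely interim workaround (an assumption, not the actual fix from the PR) is to save the model under a `.json` filename, e.g. `bst.save_model("xgb.json")`, since XGBoost selects the serialization format from the file extension; a name with no recognized extension falls through to the UBJSON default that older FIL releases cannot parse.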