The error is likely due to a change in XGBoost. Starting with version 2.1.0, XGBoost defaults to the UBJSON format when saving models.
Treelite 4.3 supports UBJSON, but regrettably FIL has not yet been updated to recognize the UBJSON format, hence the error. Let me prepare a pull request.
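Until that lands, one possible workaround is to force XGBoost to write the older JSON format, which Treelite and FIL already parse. Below is a minimal sketch of that idea; the training data, file names, and the exact `ForestInference.load()` arguments (e.g. `model_type`) are assumptions based on the notebook's flow, not its actual code.

```python
import xgboost as xgb
from sklearn.datasets import make_classification
from cuml import ForestInference

# Train a small stand-in model (the notebook trains its own).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
clf = xgb.XGBClassifier(n_estimators=10, max_depth=3)
clf.fit(X, y)

# XGBoost >= 2.1 chooses the serialization format from the file extension:
# ".json" forces the JSON format that FIL already understands, while
# unrecognized extensions now default to UBJSON.
clf.get_booster().save_model("xgb_model.json")

# Hypothetical FIL load call mirroring the notebook; argument names may
# differ between cuml versions.
fil_model = ForestInference.load(
    "xgb_model.json",
    output_class=True,
    model_type="xgboost_json",
)
preds = fil_model.predict(X.astype("float32"))
```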
Describe the bug
The `forest_inference_demo.ipynb` notebook here is broken: XGBoost model loading with FIL is failing.

I've observed this behavior on the 24.08 release of `cuml` and all its dependencies. I suspect it's a problem on 24.10 as well, but I haven't tested that yet.

Steps/Code to reproduce bug
Created a conda environment and installed `cuml`, `jupyterlab`, and `xgboost` into it.

setup (click me)

Ran the following from the root of the repo, on a machine with V100s and CUDA 12.2.

```shell
conda env create \
  --name cuml-cu12-dev \
  --file ./conda/environments/all_cuda-125_arch-x86_64.yaml

source activate cuml-cu12-dev

conda install \
  -c conda-forge \
  -c rapidsai-nightly \
  -c rapidsai \
  --yes \
  cuml=24.8.* \
  jupyterlab
```

Then launched JupyterLab.
Ran the cells in `notebooks/forest_inference_demo.ipynb` in order.

This call to `ForestInference.load()`

https://github.com/rapidsai/cuml/blob/e571abaf068b21173984e07b73c91bf0be8da7b5/notebooks/forest_inference_demo.ipynb#L273

fails like this:
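A roughly equivalent standalone reproduction, outside the notebook, might look like the sketch below; the model file name and load arguments here are assumptions for illustration rather than the notebook's exact code.

```python
import xgboost as xgb
from sklearn.datasets import make_classification
from cuml import ForestInference

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
clf = xgb.XGBClassifier(n_estimators=10, max_depth=3)
clf.fit(X, y)

# With xgboost>=2.1, a file name without a ".json"/".ubj" extension is
# written in UBJSON rather than the old formats.
clf.get_booster().save_model("xgb.model")

# FIL then fails while parsing the file, since it does not yet recognize UBJSON.
fil_model = ForestInference.load("xgb.model", output_class=True)
```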
This same error can be seen in the most recent run of this notebook in the CI for `rapidsai/docker`: https://github.com/rapidsai/docker/actions/runs/10244736365/job/28356773321#step:9:15

Expected behavior
Expected this notebook to run end-to-end without error.
Environment details (please complete the following information):
output of 'conda info' (click me)
```text
     active environment : cuml-cu12-dev
    active env location : /raid/jlamb/miniforge/envs/cuml-cu12-dev
            shell level : 1
       user config file : /home/nfs/jlamb/.condarc
 populated config files : /raid/jlamb/miniforge/.condarc
                          /home/nfs/jlamb/.condarc
          conda version : 23.7.4
    conda-build version : 24.5.1
         python version : 3.10.12.final.0
       virtual packages : __archspec=1=x86_64
                          __cuda=12.2=0
                          __glibc=2.31=0
                          __linux=5.4.0=0
                          __unix=0=0
       base environment : /raid/jlamb/miniforge  (writable)
      conda av data dir : /raid/jlamb/miniforge/etc/conda
  conda av metadata url : None
           channel URLs : https://conda.anaconda.org/conda-forge/linux-64
                          https://conda.anaconda.org/conda-forge/noarch
          package cache : /raid/jlamb/miniforge/pkgs
                          /home/nfs/jlamb/.conda/pkgs
       envs directories : /raid/jlamb/miniforge/envs
                          /home/nfs/jlamb/.conda/envs
               platform : linux-64
             user-agent : conda/23.7.4 requests/2.32.3 CPython/3.10.12 Linux/5.4.0-182-generic ubuntu/20.04.6 glibc/2.31
                UID:GID : 10349:10004
             netrc file : None
           offline mode : False
```

output of 'nvidia-smi' (click me)
```text
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-SXM2-32GB           On  | 00000000:06:00.0 Off |                    0 |
| N/A   31C    P0              41W / 300W |      0MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2-32GB           On  | 00000000:07:00.0 Off |                    0 |
| N/A   33C    P0              42W / 300W |      0MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2-32GB           On  | 00000000:0A:00.0 Off |                    0 |
| N/A   31C    P0              42W / 300W |      0MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2-32GB           On  | 00000000:0B:00.0 Off |                    0 |
| N/A   29C    P0              41W / 300W |      0MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2-32GB           On  | 00000000:85:00.0 Off |                    0 |
| N/A   31C    P0              42W / 300W |      0MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2-32GB           On  | 00000000:86:00.0 Off |                    0 |
| N/A   30C    P0              42W / 300W |      0MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2-32GB           On  | 00000000:89:00.0 Off |                    0 |
| N/A   35C    P0              43W / 300W |      0MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2-32GB           On  | 00000000:8A:00.0 Off |                    0 |
| N/A   31C    P0              43W / 300W |      0MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
```

Additional context
This was only noticed because of a CI failure over in `rapidsai/docker`: https://github.com/rapidsai/docker/pull/699#discussion_r1704654898.

Ideally, it could be caught here in `cuml`'s CI. As of this writing, this notebook is not tested in CI. (build link)
This notebook has been running in `rapidsai/docker` CI for a while. It passed on 24.08 as recently as 2 weeks ago. (build link)
So I suspect this is a result of a recent change. Maybe some mix of these: