rapidsai / legate-boost

GBM implementation on Legate
https://rapidsai.github.io/legate-boost/
Apache License 2.0

add conda builds #129

Closed jameslamb closed 2 weeks ago

jameslamb commented 1 month ago

Contributes to #115

Adds conda builds.

Notes for Reviewers

This is not even close to ready for review. Just putting it up here to show the direction I'm going. This diff will get smaller as other PRs are merged.

Design choices

using GitHub Artifact store instead of RAPIDS S3 bucket

RAPIDS is looking to eventually move away from its S3-based strategy for storing and hosting CI artifacts. I wanted to use this as a test case for that. GitHub Artifacts is also nice because, unlike https://downloads.rapids.ai/, it's on the public internet... so any outside-of-NVIDIA contributors can see their CI artifacts.

building *_gpu and *_cpu* variants

`legate-core` and `cunumeric` do something very similar. `legate-core` / `cunumeric` did not want to add a `*_gpu` portion to the build string when I proposed it, because they considered changing the build string format to be an unacceptable breaking change. Since these packages for `legate-boost` are brand new, I think we *should* do that here, to achieve the following semantics:

```shell
# only installs the GPU variant (or fails if it's not installable)
conda install legate-core=24.06=*_gpu

# only installs the CPU variant (or fails if it's not installable)
conda install legate-core=24.06=*_cpu

# installs the GPU variant if CUDA is detected (look for '__cuda' in the output of 'conda info'),
# otherwise installs the CPU variant
conda install legate-core
```
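To illustrate the build-string semantics above, here is a minimal Python sketch of glob-style matching against build strings. It is not how conda's solver works internally; the build strings in `BUILDS` and the helper names are hypothetical, and the real default choice depends on conda's `__cuda` virtual package rather than a boolean flag.

```python
from fnmatch import fnmatchcase

# Hypothetical build strings, invented for illustration only.
BUILDS = ["py311_0_gpu", "py311_0_cpu"]


def select_builds(pattern, builds=BUILDS):
    """Return the build strings matched by a glob-style pattern such as '*_gpu'."""
    return [b for b in builds if fnmatchcase(b, pattern)]


def default_choice(cuda_detected, builds=BUILDS):
    """Mimic the intended default: GPU variant if CUDA is detected, else CPU variant."""
    suffix = "_gpu" if cuda_detected else "_cpu"
    return select_builds(f"*{suffix}", builds)[0]


if __name__ == "__main__":
    print(select_builds("*_gpu"))
    print(select_builds("*_cpu"))
    print(default_choice(cuda_detected=False))
```

The point of the suffix is that a user can pin a variant explicitly with a pattern, while an unpinned install still resolves to something sensible for the machine.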

Building and testing in separate environments

On `main`, this project's CI is building and testing in the same environment. This can lead to issues like those described in #144, where packaging problems are silently missed.

Building in one place and testing in another has a few nice benefits:

* less GPU runner utilization
  - *builds happen on CPU-only runner*
* CI is more likely to be able to detect packaging problems
  - *tests just install pre-built packages and all the dependencies they declare... could catch things like "this package needs `scipy` at runtime but it isn't in the package dependencies"*
* greater confidence in CPU support
  - *CPU tests run on a machine that doesn't have CUDA or a GPU*
  - *installation of the CPU-only variant tested there too*
* tests can run concurrently
  - *CPU tests and GPU tests run in physically separate instances, so can happen at the same time, which increases the amount of testing that can be done for any fixed amount of CI time*
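The "needs `scipy` at runtime but it isn't in the package dependencies" class of problem can be sketched in a few lines of Python. This is only an illustration of the idea, not something this PR's CI does: the helper names are hypothetical, and the check is naive (it treats import names as package names and would also flag standard-library modules).

```python
import ast


def top_level_imports(source):
    """Collect the top-level module names imported by a piece of Python source."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names.add(node.module.split(".")[0])
    return names


def undeclared(source, declared):
    """Return imported modules that are missing from the declared dependencies."""
    return sorted(top_level_imports(source) - set(declared))


if __name__ == "__main__":
    example = "import numpy as np\nfrom scipy.special import erf\n"
    print(undeclared(example, declared={"numpy"}))
```

When building and testing share one environment, build-time dependencies mask gaps like this; installing the pre-built package into a fresh environment surfaces them as import errors.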

How to test this locally

See the notes added to contributing.md in the diff here (which also includes more explanation of these commands).

In short:

docker run \
  --rm \
  -v $(pwd):/opt/legate-boost \
  -w /opt/legate-boost \
  -it rapidsai/ci-conda:cuda12.5.1-ubuntu22.04-py3.11 \
  bash

gh auth login

RUN_ID=10584679029
gh run download \
    --dir "${RAPIDS_CONDA_BLD_OUTPUT_DIR}" \
    --repo rapidsai/legate-boost \
    --name "legate-boost-conda-cuda${RAPIDS_CUDA_VERSION}-amd64-py${PYTHON_VERSION}" \
    "${RUN_ID}"

rapids-dependency-file-generator \
  --output conda \
  --file-key test_python \
  --matrix "cuda=${RAPIDS_CUDA_VERSION%.*};arch=$(arch);py=${RAPIDS_PY_VERSION}" \
| tee /tmp/env.yaml

conda env create \
    --yes \
    --file /tmp/env.yaml \
    --name test-env

conda install \
  --name test-env \
  --channel "${RAPIDS_CONDA_BLD_OUTPUT_DIR}" \
  --channel legate \
  --channel conda-forge \
    'legate-boost=*=*_cpu'

source activate test-env

# run all the CPU tests
ci/run_pytests_cpu.sh

# or, just run one set of test cases
legate \
    --sysmem 28000 \
    --module pytest \
    legateboost/test/test_estimator.py::test_regressor \
    -sv \
    --durations=0
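
The `--matrix` argument in the commands above relies on the shell expansion `${RAPIDS_CUDA_VERSION%.*}`, which drops the last dot-separated component of the version. A small Python sketch of the same string construction follows; the example values (`12.5.1`, `x86_64`, `3.11`) are assumptions based on the image tag used above, and the function names are hypothetical.

```python
def strip_last_component(version):
    """Equivalent of shell ${version%.*}: drop the last '.' and everything after it."""
    return version.rsplit(".", 1)[0]


def matrix_string(cuda_version, arch, py_version):
    """Build the 'cuda=...;arch=...;py=...' selector string."""
    return f"cuda={strip_last_component(cuda_version)};arch={arch};py={py_version}"


if __name__ == "__main__":
    print(matrix_string("12.5.1", "x86_64", "3.11"))
```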

jameslamb commented 3 weeks ago

I think that this is mostly working and ready for review!

Opening it up and tagging reviewers... I need some help.

All of the tests using the GPU variant, on a system with a GPU, are passing (build link) 🎉

However, several tests are failing with the CPU variant, on a CPU-only system.

24 failed, 163 passed, 7 skipped, 2 xfailed, 6 xpassed, 2738 warnings in 350.72s

(build link).

Those tests are mostly failing like this:

E   IndexError: Library legateboost does not have task 10

There are more detailed stacktraces available at that link.

If it helps, here's one specific test case that's failing:

legate \
    --sysmem 28000 \
    --module pytest \
    'legateboost/test/test_estimator.py::test_regressor[base_models3-squared_error-1]' \
    -sv \
    --durations=0

Full trace:

```text
_______________ test_regressor[base_models3-squared_error-1] _______________

num_outputs = 1, objective = 'squared_error', base_models = (,)

    @pytest.mark.parametrize("num_outputs", [1, 5])
    @pytest.mark.parametrize("objective", ["squared_error", "normal", "quantile"])
    @pytest.mark.parametrize(
        "base_models",
        [
            (lb.models.Tree(max_depth=5),),
            (lb.models.Linear(),),
            (lb.models.Tree(max_depth=1), lb.models.Linear()),
            (lb.models.KRR(),),
        ],
    )
    def test_regressor(num_outputs, objective, base_models):
        if objective == "quantile" and num_outputs > 1:
            pytest.skip("Quantile objective not implemented for multi-output")
        np.random.seed(2)
        X = np.random.random((100, 10))
        y = np.random.random((X.shape[0], num_outputs))
        eval_result = {}
        model = lb.LBRegressor(
            n_estimators=20,
            objective=objective,
            random_state=2,
            learning_rate=0.1,
            base_models=base_models,
>       ).fit(X, y, eval_result=eval_result)

legateboost/test/test_estimator.py:83:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../conda/envs/test-env/lib/python3.1/site-packages/legateboost/legateboost.py:703: in fit
    return super().fit(
../conda/envs/test-env/lib/python3.1/site-packages/legateboost/legateboost.py:426: in fit
    return self._partial_fit(X, y, sample_weight, eval_set, eval_result)
../conda/envs/test-env/lib/python3.1/site-packages/legateboost/legateboost.py:251: in _partial_fit
    .fit(X, g, h)
../conda/envs/test-env/lib/python3.1/site-packages/legateboost/models/krr.py:190: in fit
    return self._fit_components(X, g, h)
../conda/envs/test-env/lib/python3.1/site-packages/legateboost/models/krr.py:177: in _fit_components
    return self._direct_solve(X, g, h)
../conda/envs/test-env/lib/python3.1/site-packages/legateboost/models/krr.py:97: in _direct_solve
    K_nm = self._apply_kernel(X)
../conda/envs/test-env/lib/python3.1/site-packages/legateboost/models/krr.py:93: in _apply_kernel
    return self.rbf_kernel(X, self.X_train)
../conda/envs/test-env/lib/python3.1/site-packages/legateboost/models/krr.py:162: in rbf_kernel
    return rbf(D_2, self.sigma)
../conda/envs/test-env/lib/python3.1/site-packages/legateboost/models/krr.py:27: in rbf
    task = get_legate_runtime().create_auto_task(user_context, user_lib.cffi.RBF)
runtime.pyx:126: in legate.core._lib.runtime.runtime.Runtime.create_auto_task
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   IndexError: Library legateboost does not have task 10

runtime.pyx:146: IndexError
============================= warnings summary =============================
legateboost/test/test_estimator.py::test_regressor[base_models3-squared_error-1]
  /opt/legate-boost/legateboost/test/test_estimator.py:73: Warning: Seeding the random number generator with a non-constant value inside Legate can lead to undefined behavior and/or errors when the program is executed with multiple ranks.
    np.random.seed(2)

legateboost/test/test_estimator.py::test_regressor[base_models3-squared_error-1]
  /opt/conda/envs/test-env/lib/python3.1/site-packages/scipy/special/_lambertw.py:149: RuntimeWarning: cuNumeric has not implemented _lambertw.__call__ and is falling back to canonical NumPy. You may notice significantly decreased performance for this function call.
    return _lambertw(z, k, tol)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================ slowest durations =============================
0.12s call     legateboost/test/test_estimator.py::test_regressor[base_models3-squared_error-1]

(2 durations < 0.005s hidden.  Use -vv to show these durations.)
========================= short test summary info ==========================
FAILED legateboost/test/test_estimator.py::test_regressor[base_models3-squared_error-1] - IndexError: Library legateboost does not have task 10
====================== 1 failed, 2 warnings in 1.67s =======================
```

And a smaller reproducible example:

import numpy as np
import legateboost as lb

np.random.seed(2)
X = np.random.random((100, 10))
y = np.random.random((X.shape[0], 1))
eval_result = {}
model = lb.LBRegressor(
    n_estimators=20,
    objective="squared_error",
    random_state=2,
    learning_rate=0.1,
    base_models=(lb.models.KRR(),),
).fit(X, y, eval_result=eval_result)

Here's where in legate-core that error message comes from:

https://github.com/nv-legate/legate.core/blob/d9171e5b4d1a072ae897b412b5cb9eb4c5b62f76/legate/core/context.py#L164-L173

jakirkham commented 2 weeks ago

^ @manopapad do you know who would be able to help with the test failures James found above?

trivialfis commented 2 weeks ago

I think you need to share this with the legate team. The error seems internal to legate/legion.

manopapad commented 2 weeks ago

LegateBoost is not defining a CPU variant for the RBF task, but is invoking it even in CPU-only codepaths. The fix might simply be adding a rbf.cc with similar contents to the existing rbf.cu, but I see that similar unary operators like ErfTask are similarly missing CPU variants, so this might be by design.

@RAMitchell would know if this is just an oversight, or the corresponding test(s) are not meant to run w/o GPUs.

RAMitchell commented 2 weeks ago

The CPU versions were being compiled before AFAIK. I tried to set it up so that a unary task is generated for both GPU and CPU from a single C++ lambda.

manopapad commented 2 weeks ago

Oh I see the problem. You are doing the register_tasks call in a __attribute__((constructor)) function in a .cu file. If you do a CPU-only build those files are skipped, so the registration never happens. Previously you were probably doing a GPU-enabled build, and just running it w/o GPUs. You need to move these registration callbacks to a .cc file, to make sure they get compiled (and executed at initialization time) even on CPU-only builds.
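
The failure mode described here can be sketched as a pure-Python analogy. This is not Legate's API: `TaskLibrary`, `build`, and `rbf_registration` are hypothetical stand-ins, and the task ID `10` is chosen only to match the error message reported earlier in the thread. The "build" skips `.cu` sources when CUDA is disabled, so a registration hook that lives in a `.cu` file never runs.

```python
class TaskLibrary:
    """Toy stand-in for a task registry; not Legate's actual API."""

    def __init__(self, name):
        self.name = name
        self._tasks = {}

    def register(self, task_id, fn):
        self._tasks[task_id] = fn

    def get(self, task_id):
        if task_id not in self._tasks:
            raise IndexError(f"Library {self.name} does not have task {task_id}")
        return self._tasks[task_id]


RBF = 10  # hypothetical task ID, chosen to match the error message in this thread


def build(library, sources, use_cuda):
    """Run each source file's registration hook, skipping .cu files in CPU-only builds."""
    for filename, register_hook in sources.items():
        if filename.endswith(".cu") and not use_cuda:
            continue  # file is never compiled, so its hook never runs
        register_hook(library)


def rbf_registration(library):
    """Registration hook; the analogue of a constructor-time register_tasks call."""
    library.register(RBF, lambda x: x)


def try_lookup(sources, use_cuda):
    """Build a fresh library and report whether the RBF task can be found."""
    lib = TaskLibrary("legateboost")
    build(lib, sources, use_cuda)
    try:
        lib.get(RBF)
        return "ok"
    except IndexError as err:
        return str(err)


if __name__ == "__main__":
    print(try_lookup({"rbf.cu": rbf_registration}, use_cuda=False))
    print(try_lookup({"rbf.cu": rbf_registration}, use_cuda=True))
    print(try_lookup({"rbf.cc": rbf_registration}, use_cuda=False))
```

In this analogy, moving the hook from `rbf.cu` to `rbf.cc` is what makes the lookup succeed in a CPU-only build, which mirrors the suggested fix.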

jameslamb commented 2 weeks ago

Thanks very much @manopapad @RAMitchell @trivialfis ! I've pushed a patch @RAMitchell sent me in https://github.com/rapidsai/legate-boost/pull/129/commits/7e5ba6d51fb63cd78805a08109b1e4e8cb4b82c3.

Looks like now just 11 tests are failing with the CPU-only variant.

11 failed, 176 passed, 7 skipped, 2 xfailed, 6 xpassed, 1597 warnings in 242.23s (0:04:02)

(build link)

RAMitchell commented 2 weeks ago

Not sure why that fixed the RBF task but not the special functions. I'll take another look tomorrow.

RAMitchell commented 2 weeks ago

There are a bunch of warnings in the builds that should also perhaps be addressed.

/opt/conda/lib/python3.10/site-packages/conda_build/environ.py:538: UserWarning: The environment variable 'AWS_ACCESS_KEY_ID' specified in script_env is undefined.

RAMitchell commented 2 weeks ago

Interestingly the python 3.10 build is picking up legate-core=24.06.00 and python 3.11 and 3.12 are picking up legate-core=24.06.01. So I need to update some of the C++ code to get it building with 24.06.01.

jameslamb commented 2 weeks ago

> There are a bunch of warnings in the builds that should also perhaps be addressed. /opt/conda/lib/python3.10/site-packages/conda_build/environ.py:538: UserWarning: The environment variable 'AWS_ACCESS_KEY_ID' specified in script_env is undefined.

Thanks, I can address these in a follow-up.

> Interestingly the python 3.10 build is picking up legate-core=24.06.00 and python 3.11 and 3.12 are picking up legate-core=24.06.01. So I need to update some of the C++ code to get it building with 24.06.01.

Oh great, didn't know that 24.06.01 had been released! There are several things we can hopefully simplify away now that that's out... will push some follow-up PRs.

> LGTM. I can't really say too much as it's outside my expertise, but we should definitely go ahead and then fix any issues later.

Alright thanks so much for the fixes you pushed here!

Yeah there's plenty to clean up here, but I'll merge this now so that can be done in smaller and more focused PRs.