Closed jameslamb closed 2 weeks ago
I think that this is mostly working and ready for review!
Opening it up and tagging reviewers... I need some help.
All of the tests using the GPU variant, on a system with a GPU, are passing (build link) 🎉
However, several tests are failing with the CPU variant, on a CPU-only system.
```
24 failed, 163 passed, 7 skipped, 2 xfailed, 6 xpassed, 2738 warnings in 350.72s
```
(build link).
Those tests are mostly failing like this:
```
E       IndexError: Library legateboost does not have task 10
```
There are more detailed stacktraces available at that link.
If it helps, here's one specific test case that's failing:
```shell
legate \
  --sysmem 28000 \
  --module pytest \
  'legateboost/test/test_estimator.py::test_regressor[base_models3-squared_error-1]' \
  -sv \
  --durations=0
```
And a smaller reproducible example:
```python
import numpy as np

import legateboost as lb

np.random.seed(2)

# small random regression problem
X = np.random.random((100, 10))
y = np.random.random((X.shape[0], 1))

eval_result = {}
model = lb.LBRegressor(
    n_estimators=20,
    objective="squared_error",
    random_state=2,
    learning_rate=0.1,
    base_models=(lb.models.KRR(),),  # kernel ridge regression base model
).fit(X, y, eval_result=eval_result)
```
Here's where in legate-core that error message comes from:
^ @manopapad do you know who would be able to help with the test failures James found above?
I think you need to share this with the legate team. The error seems internal to legate/legion.
LegateBoost is not defining a CPU variant for the RBF task, but is invoking it even in CPU-only codepaths. The fix might simply be adding an `rbf.cc` with contents similar to the existing `rbf.cu`, but I see that similar unary operators like `ErfTask` are also missing CPU variants, so this might be by design.
@RAMitchell would know if this is just an oversight, or the corresponding test(s) are not meant to run w/o GPUs.
The CPU versions were being compiled before, AFAIK. I've tried building it to generate a unary task for GPU and CPU based on a C++ lambda.
Oh I see the problem. You are doing the `register_tasks` call in an `__attribute__((constructor))` function in a `.cu` file. If you do a CPU-only build those files are skipped, so the registration never happens. Previously you were probably doing a GPU-enabled build and just running it w/o GPUs. You need to move these registration callbacks to a `.cc` file, to make sure they get compiled (and executed at initialization time) even on CPU-only builds.
Thanks very much @manopapad @RAMitchell @trivialfis ! I've pushed a patch @RAMitchell sent me in https://github.com/rapidsai/legate-boost/pull/129/commits/7e5ba6d51fb63cd78805a08109b1e4e8cb4b82c3.
Looks like now just 11 tests are failing with the CPU-only variant.
```
11 failed, 176 passed, 7 skipped, 2 xfailed, 6 xpassed, 1597 warnings in 242.23s (0:04:02)
```
Not sure why that fixed the RBF task but not the special functions. I'll take another look tomorrow.
There are a bunch of warnings in the builds that should also perhaps be addressed.
```
/opt/conda/lib/python3.10/site-packages/conda_build/environ.py:538: UserWarning: The environment variable 'AWS_ACCESS_KEY_ID' specified in script_env is undefined.
```
Interestingly the python 3.10 build is picking up legate-core=24.06.00 and python 3.11 and 3.12 are picking up legate-core=24.06.01. So I need to update some of the C++ code to get it building with 24.06.01.
> There are a bunch of warnings in the builds that should also perhaps be addressed.
>
> ```
> /opt/conda/lib/python3.10/site-packages/conda_build/environ.py:538: UserWarning: The environment variable 'AWS_ACCESS_KEY_ID' specified in script_env is undefined.
> ```
Thanks, I can address these in a follow-up.
> Interestingly the python 3.10 build is picking up legate-core=24.06.00 and python 3.11 and 3.12 are picking up legate-core=24.06.01. So I need to update some of the C++ code to get it building with 24.06.01.
Oh great, didn't know that 24.06.01 had been released! There are several things we can hopefully simplify away now that that's out... will push some follow-up PRs.
LGTM. I can't really say too much as it's outside my expertise, but we should definitely go ahead and then fix any issues later.
Alright thanks so much for the fixes you pushed here!
Yeah there's plenty to clean up here, but I'll merge this now so that can be done in smaller and more focused PRs.
Contributes to #115
Adds conda builds.
Notes for Reviewers
This is not even close to ready for review. Just putting it up here to show the direction I'm going. This diff will get smaller as other PRs are merged.
Design choices
using GitHub Artifact store instead of RAPIDS S3 bucket
details (click me)
RAPIDS is looking to eventually move away from its S3-based strategy for storing and hosting CI artifacts. I wanted to use this as a test case for that. GitHub Artifacts is also nice because, unlike https://downloads.rapids.ai/, it's on the public internet... so any outside-of-NVIDIA contributors can see their CI artifacts.

building `*_gpu` and `*_cpu` variants

details (click me)
`legate-core` and `cunumeric` do something very similar. `legate-core` / `cunumeric` did not want to add a `*_gpu` portion to the build string when I proposed it, because they considered changing the build string format to be an unacceptable breaking change. Since these packages for `legate-boost` are brand new, I think we *should* do that here, to achieve the following semantics:

```shell
# only installs the GPU variant (or fails if it's not installable)
conda install legate-core=24.06=*_gpu

# only installs the CPU variant (or fails if it's not installable)
conda install legate-core=24.06=*_cpu

# installs the GPU variant if CUDA is detected (look for '__cuda' in the output of 'conda info'),
# otherwise installs the CPU variant
conda install legate-core
```

Building and testing in separate environments
details (click me)
On `main`, this project's CI is building and testing in the same environment. This can lead to issues like those described in #144, where packaging problems are silently missed. Building in one place and testing in another has a few nice benefits:

* less GPU runner utilization
  - *builds happen on CPU-only runner*
* CI is more likely to be able to detect packaging problems
  - *tests just install pre-built packages and all the dependencies they declare... could catch things like "this package needs `scipy` at runtime but it isn't in the package dependencies"*
* greater confidence in CPU support
  - *CPU tests run on a machine that doesn't have CUDA or a GPU*
  - *installation of the CPU-only variant tested there too*
* tests can run concurrently
  - *CPU tests and GPU tests run in physically separate instances, so can happen at the same time, which increases the amount of testing that can be done for any fixed amount of CI time*

How to test this locally
See the notes added to `contributing.md` in the diff here (which also includes more explanation of these commands). In short: