rapidsai / build-planning

Tracking for RAPIDS-wide build tasks
https://github.com/rapidsai

Remove usage of the NumPy C API #41

Closed vyasr closed 2 months ago

vyasr commented 5 months ago

RAPIDS currently uses the NumPy C API in a handful of places, mostly in Cython code. The NumPy C API is quite good and has remained stable, making it easy to work with. However, it introduces additional build and packaging complexity that would be nice to avoid. With minimal changes to RAPIDS code, we should be able to remove numpy as a build dependency entirely, which may simplify our builds and would also save us from needing to rebuild packages when numpy 2.0 is released. If we were getting a lot of value out of the C API the calculus might be different, but in practice our usage of it is minimal and can generally be avoided.

I propose that we expend a little bit of development effort to stop relying on the NumPy C API altogether. This will help us on two fronts: 1) we'll more easily be able to support multiple major versions of NumPy (see #38), since we only have to worry about Python compatibility, not C compatibility; and 2) we won't have to worry about NumPy C APIs when considering whether we can use the Python limited API to produce a single package across Python versions (will open a separate issue for that next). The latter is the more important piece here, since as of this writing the numpy C API is not compatible with the Python limited API based on the author's current experimentation.

The changes required basically boil down to two things:

  1. cudf/cuspatial: cudf and cuspatial both use the C API transitively only, via pyarrow. There is no direct usage of the C API. Therefore, the cudf/cuspatial piece of this issue will be addressed when rapidsai/cudf#15193 is completed.
  2. ucxx: ucxx uses the C API to expose host buffers to other APIs. This usage should be possible to remove by directly implementing the buffer protocol on a custom object. It will require a bit of extra work, but should be easy to maintain going forward.
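
The ucxx change in item 2 can be illustrated at the Python level. The sketch below is not the ucxx implementation (that code is Cython, where a class would implement `__getbuffer__`/`__releasebuffer__`); it only shows the idea of a custom object exposing host memory through the buffer protocol with no NumPy involved. `HostBuffer` and its members are hypothetical names, and a `ctypes` array stands in for memory that would really be allocated by the C++ layer.

```python
import ctypes


class HostBuffer:
    """Hypothetical owner of a block of host memory (illustration only)."""

    def __init__(self, nbytes: int) -> None:
        # Assumption: a ctypes array stands in for memory owned by native code.
        self._storage = (ctypes.c_ubyte * nbytes)()
        self.nbytes = nbytes

    @property
    def ptr(self) -> int:
        # Raw address of the allocation, as a native library would report it.
        return ctypes.addressof(self._storage)

    def view(self) -> memoryview:
        # ctypes arrays implement the buffer protocol, so memoryview can wrap
        # them without copying. Casting to "B" gives a plain unsigned-byte view
        # that supports indexing and assignment.
        return memoryview(self._storage).cast("B")


buf = HostBuffer(8)
view = buf.view()
view[0] = 0xFF  # writes through to the underlying storage, no copy made
```

Any consumer that accepts buffer-protocol objects, including `numpy.asarray`, could then wrap the returned `memoryview` without copying, so NumPy would remain a Python-level, runtime-only dependency.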

rgommers commented 5 months ago

The latter is the more important piece here, since as of this writing the numpy C API is not compatible with the Python limited API based on the author's current experimentation.

This is correct, NumPy uses too much of the CPython C API to make supporting the limited API feasible any time soon.

vyasr commented 5 months ago

Thanks for confirming @rgommers! RAPIDS would definitely benefit from producing abi3 wheels, and since we don't really deal with host data buffers there's not really much reason for us to need the numpy C API, so I think it's worth a small investment from us to remove numpy as a build dep.
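
For context, here is a minimal sketch of what opting into the limited API looks like at build time. It assumes a setuptools-based build, which may not match how RAPIDS packages are actually built; `limited_api_hex` and `ext_kwargs` are hypothetical names, and Python 3.9 is an arbitrary example floor. The keyword arguments shown are the ones `setuptools.Extension` accepts for this purpose.

```python
def limited_api_hex(major: int, minor: int) -> str:
    """Format a Python version as a Py_LIMITED_API value (PY_VERSION_HEX layout)."""
    return f"0x{major:02X}{minor:02X}0000"


# Assumption: the oldest Python version the single abi3 wheel should support.
MIN_PYTHON = (3, 9)

# Keyword arguments that would be passed to setuptools.Extension(...).
ext_kwargs = {
    # Restricts the CPython headers to the stable ABI for this version and up.
    "define_macros": [("Py_LIMITED_API", limited_api_hex(*MIN_PYTHON))],
    # Asks setuptools to give the extension module an abi3 filename suffix.
    "py_limited_api": True,
}
```

The wheel itself also has to be tagged accordingly, for example with `bdist_wheel --py-limited-api=cp39`. Any header included by the extension, NumPy's included, has to compile with `Py_LIMITED_API` defined, which is why the C API question matters here.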

seberg commented 5 months ago

While it is correct that NumPy itself will keep using the full API, that isn't a limitation on downstream. The NumPy headers use the limited API for only a few things and Matti added a test for that. Specifically:

Yes, limited API support isn't really tested or used, but problems shouldn't be confined to macros, so since including the headers works (and is tested) I wouldn't expect further problems.

EDIT: I forgot to add that the necessary disabling of the above two bullets was missing from the headers before NumPy 2.0, meaning that you must compile with NumPy 2. But this shouldn't be a limitation in practice.


That doesn't mean I disagree with the sentiment: I think a lot, or even all, of NumPy C-API use is probably simply unnecessary, and fewer dependencies are great.

vyasr commented 5 months ago

Thanks for clarifying that. Yeah, we only need to care about the numpy headers on our end. I'll update this thread based on how much effort I see removing usage of the NumPy C API taking us.

vyasr commented 5 months ago

Here's the ucxx removal. Not sure I got everything right, but I don't think we need a whole lot more than what's in there.

jakirkham commented 2 months ago

Given the UCXX piece has been addressed and the remaining work is in cuDF ( https://github.com/rapidsai/cudf/issues/15193 ), should we go ahead and close this out?

vyasr commented 2 months ago

Sure

jakirkham commented 2 months ago

There was some NumPy C API usage in cuCIM via pybind11

This has now been removed with PR: https://github.com/rapidsai/cucim/pull/751

vyasr commented 1 month ago

With rapidsai/cudf#16640 the cudf transitive usage via pyarrow has also been removed.