mlpack / mlpack

mlpack: a fast, header-only C++ machine learning library
https://www.mlpack.org/
Other
5.11k stars 1.61k forks source link

Replace Cython-based Arma -- Numpy conversions with Pybind11 + CARMA #2982

Closed RUrlus closed 1 year ago

RUrlus commented 3 years ago

What is the desired addition or change?

Current conversion from and to Numpy arrays is done with Cython which requires a copy both in and out on Windows. Armadillo introduced foreign/alien allocators in version 10.5.2 that enable usage, for example, Numpy's (de)allocator allowing better interoperability with CARMA. Using Numpy's allocator/deallocator prevents the need to copy in and out on Windows.

CARMA is a header-only library that extends Pybind11 with bilateral conversions between Numpy and Armadillo.

@rcurtin mentioned that this could be a good approach for MLPack's Python bindings. Before I start working on a PR I wanted to discuss compatibility and the potential impact of a change.

If you believe the benefits are worth the stricter requirements I would be happy to start working on Pybind11 + CARMA based conversions.

Changes

Change to minimum required versions dependencies:

The armadillo version is a significant version increase compared to the current >= 8.400.4. I don't think the CMake version is a big deal as it can easily be installed with pip which you'll have anyway when building the bindings. Numpy 1.14 is already a few years old so that shouldn't be too much of an issue as well.

C-order vs F-order

I noticed that the current conversion enforces C-order whereas CARMA, optionally, enforces F-order as Armadillo is column-major. I am not sure why the C-order was chosen, but if it is required, it isn't too much work to make CARMA configurable to return C-order Numpy arrays. By default CARMA copies to F-order on the conversion from Numpy to Armadillo if needed but that can be turned off at compile time.

Stricter condition checks on Numpy arrays

By default CARMA will copy/convert incoming arrays if one of the below is true:

  1. memory is not aligned
  2. memory is not writeable
  3. Numpy does not own the data
  4. array has ndim >= 2 and memory is not F-contiguous

Options 3 and 4 can be, independently, turned off at compile-time. Current implementation of the bindings doesn't seem to check writeable, CARMA provides a view on the array which doesn't consider the writeable flag.

Benefits

Pybind11 is a popular C++ 11 library for which bindings can be automatically generated using Binder from RosettaCommons. This means that the bindings are all C++ so easier to use with templated code.

Consistent behaviour across platforms, current approach requires a copy in and out on Windows.

Pybind11 and CARMA support all numerical types supported by Armadillo and Numpy rather than only int and double

shrit commented 3 years ago

@RUrlus thanks for putting effort to design caram, it is really nice project, I personally think it is very nice to have carma to convert from Numpy matrices to Armadillo. The best part is that carma is header-only.

Moving from armadillo 8.4 to 10.5 might be a really hard thing to do since it might require a lot of users to download and install armadillo manually. This might be fine for part of the users but not for everyone, and eventually leaving a part of the users behind in this case. However, if the community is ready to move the 8.4 to 10.5 then, we can ignore the solutions proposed further.

However, there are workarounds that can be proposed but we need to have more ideas about how carma is going to be used with mlpack, is carma going to be downloaded by the users? included in mlpack source code? is it distributed via the system package manager?

Also, we need to know if carma is going to be used exclusively for python binding? (which I think is the case) in this scenario, the user is writing in Python, and the conversion from Numpy to armadillo will be hidden from the user and will be done internally inside mlpack. Therefore, instead of moving the dependency from 8.4 to 10.5 We can for only this case distribute armadillo 10.5 headers with carma using the autodownloaders, I am not sure if this the best solution but this can be possible.

In addition, I rarely use python so I have no idea about the dependency management on that side, but I believe that most of the users are using anaconda or maybe not, so I am not sure if they have armadillo latest version in this case too.

These are some thoughts I had in my mind, let me know what do you think.

RUrlus commented 3 years ago

@shrit Thanks for the kind words and the quick response.

However, there are workarounds that can be proposed but we need to have more ideas about how carma is going to be used with mlpack, is carma going to be downloaded by the users? included in mlpack source code? is it distributed via the system package manager?

I would suggest to include carma in mlpack, either directly or downloaded/installed during the build. It's light enough for either. Download could happen through CMake or similar to Pybind11, I could look into making it into a pip package. From a user's perspective it's better to avoid having to manually install different packages and with Python's pip or anaconda it's easy enough to install requirements automatically.

Also, we need to know if carma is going to be used exclusively for python binding? (which I think is the case) in this scenario, the user is writing in Python, and the conversion from Numpy to armadillo will be hidden from the user and will be done internally inside mlpack. Therefore, instead of moving the dependency from 8.4 to 10.5 We can for only this case distribute armadillo 10.5 headers with carma using the autodownloaders, I am not sure if this the best solution but this can be possible.

Yes it's only relevant to the conversions in the Python bindings, the user should never have to know it's there. carma CMake build downloads and creates (header-only) interface targets for armadillo and pybind11 is they don't exist. We could use this to grab the 10.5 headers only for the bindings.

In addition, I rarely use python so I have no idea about the dependency management on that side, but I believe that most of the users are using anaconda or maybe not, so I am not sure if they have armadillo latest version in this case too.

Python's dependency management is a bit split, a large group, particularly on Windows, tend to use anaconda. Pip is the most common dependency manager and can be used through anaconda. It seems that the armadillo version on anaconda is 9.900.5 so not too far off.

rcurtin commented 3 years ago

I noticed that the current conversion enforces C-order whereas CARMA, optionally, enforces F-order as Armadillo is column-major.

Actually the reason for this is to match the conventions used in Python machine learning libraries, where an individual row represents a data point. In mlpack, due to the fact that Armadillo is column-major, we represent a point as a column. So we look for C-order, and if we get it, then we just "logically" transpose the matrix (e.g. change the number of rows and columns), and now we have a column-major matrix with data points as columns, without having had to change anything. :)

The armadillo version is a significant version increase compared to the current >= 8.400.4.

Hmm, there are a couple possibilities here---the first is to just do the current solution of copying matrices on Windows when versions less than 10.5 are detected. But I am not sure if that will require changes to CARMA.

We can for only this case distribute armadillo 10.5 headers with carma using the autodownloaders, I am not sure if this the best solution but this can be possible.

Actually we have a little more control here. The vast, vast majority of users will install prebuilt Python bindings from pip or conda, and when we are building wheels for PyPI or packages for conda, we have control over the environment.

Here is the script that builds for PyPI:

https://github.com/mlpack/mlpack-wheels/blob/master/config.sh

You can see that we have control over the Armadillo and CMake version. So, we could leave the base version requirements for mlpack the same, but then if you have set -DBUILD_PYTHON_BINDINGS=ON, then we enforce a more rigid version check in src/mlpack/bindings/python/CMakeLists.txt.

I bet it would not be hard to update the anaconda Armadillo version---usually the language-specific package managers move pretty quickly.

By default CARMA will copy/convert incoming arrays if one of the below is true:

  1. memory is not aligned
  2. memory is not writeable
  3. Numpy does not own the data
  4. array has ndim >= 2 and memory is not F-contiguous

Can you describe more about what (3) means? In many cases a Python user may call, e.g., mlpack.random_forest() or some other binding, and inside that binding (in C++) we may allocate a matrix with Armadillo that is meant to be passed back to the user as a numpy matrix. Can we avoid a copy in that case? (The current code does, unless, as you pointed out, it's on Windows.)

Pybind11 is a popular C++ 11 library for which bindings can be automatically generated using Binder from RosettaCommons. This means that the bindings are all C++ so easier to use with templated code.

I have looked into PyBind11 and it is really nice! I think that using it, instead of Cython, could significantly simplify our Python binding build process. Right now, we actually generate a bunch of C++ code and pyx code, and then pass it all into setuptools in some awful way to convince setuptools to use its own Cython support to correctly compile the code into a .so that can then be loaded by Python. If this can all be done without setuptools now, that would be... amazing. :rocket:

If we shift away from Cython entirely, I'm not quite sure what we have to do to handle the other type mappings that we have, but, luckily, the set of types is pretty small... size_t, bool, double, std::vector<int>, std::vector<std::string>, std::string, Armadillo matrix/column vector/row vector types, and std::tuple<data::DatasetInfo, arma::mat>, which is a categorical matrix (e.g. just a matrix along with some information that indicates which rows in the matrix correspond to categorical variables).

If you believe the benefits are worth the stricter requirements I would be happy to start working on Pybind11 + CARMA based conversions.

Personally, I think this would be great! But we should probably let the proposal sit for a little while to see what others think too. :+1:

RUrlus commented 3 years ago

Actually the reason for this is to match the conventions used in Python machine learning libraries, where an individual row represents a data point. In mlpack, due to the fact that Armadillo is column-major, we represent a point as a column. So we look for C-order, and if we get it, then we just "logically" transpose the matrix (e.g. change the number of rows and columns), and now we have a column-major matrix with data points as columns, without having had to change anything. :)

That's a good point, I can just check whether the array has more rows than cols with C-order and transpose rather than assume people have already transposed the array.

The armadillo version is a significant version increase compared to the current >= 8.400.4.

Hmm, there are a couple possibilities here---the first is to just do the current solution of copying matrices on Windows when versions less than 10.5 are detected. But I am not sure if that will require changes to CARMA.

Yes this would require a change to CARMA but if really needed we could maintain a separate branch that uses the default Armadillo allocators and copies when on windows. Although I would obviously prefer avoiding this.

Actually we have a little more control here. The vast, vast majority of users will install prebuilt Python bindings from pip or conda, and when we are building wheels for PyPI or packages for conda, we have control over the environment. ... You can see that we have control over the Armadillo and CMake version. So, we could leave the base version requirements for mlpack the same, but then if you have set -DBUILD_PYTHON_BINDINGS=ON, then we enforce a more rigid version check in src/mlpack/bindings/python/CMakeLists.txt. I bet it would not be hard to update the anaconda Armadillo version---usually the language-specific package managers move pretty quickly.

Cool, that makes things a lot easier. We could always open a PR on the anaconda armadillo feedstock.

By default CARMA will copy/convert incoming arrays if one of the below is true:

  1. memory is not aligned
  2. memory is not writeable
  3. Numpy does not own the data
  4. array has ndim >= 2 and memory is not F-contiguous

Can you describe more about what (3) means? In many cases a Python user may call, e.g., mlpack.random_forest() or some other binding, and inside that binding (in C++) we may allocate a matrix with Armadillo that is meant to be passed back to the user as a numpy matrix. Can we avoid a copy in that case? (The current code does, unless, as you pointed out, it's on Windows.)

These checks only occur on the conversion from Numpy to Armadillo. 3) is meant to guard against freeing data, when taking ownership, for which we don't know who owns it and/or how it was allocated. It's a safety check which can be turned off at compile time.

Any data allocated by Armadillo (or Numpy's C-API) can be, and is unless explicitly requested otherwise, returned without a copy by carma.

If we shift away from Cython entirely, I'm not quite sure what we have to do to handle the other type mappings that we have, but, luckily, the set of types is pretty small... size_t, bool, double, std::vector<int>, std::vector<std::string>, std::string, Armadillo matrix/column vector/row vector types, and std::tuple<data::DatasetInfo, arma::mat>, which is a categorical matrix (e.g. just a matrix along with some information that indicates which rows in the matrix correspond to categorical variables).

Pybind11 can handle size_t, bool, double, std::vector<int>, the vector is cast to a list or Numpy array. std::vector<std::string>, std::string I am pretty sure as well, probably through c_str. std::tuple is supported as py::tuple. For data::DatasetInfo we can wrap the class using Pybind11.

CARMA supports Mat, Col, Row, Vec, Cube for all common (non-complex) numeric types. I've not tested/used CARMA with complex types but I think it should work, pybind11 supports it.

Personally, I think this would be great! But we should probably let the proposal sit for a little while to see what others think too. 👍

Sounds good.

rcurtin commented 3 years ago

Yes this would require a change to CARMA but if really needed we could maintain a separate branch that uses the default Armadillo allocators and copies when on windows. Although I would obviously prefer avoiding this.

Yeah, on second thought, let's avoid that by just requiring Armadillo 10.5+ to manually build the Python bindings.

By default CARMA will copy/convert incoming arrays if one of the below is true:

  1. memory is not aligned
  2. memory is not writeable
  3. Numpy does not own the data
  4. array has ndim >= 2 and memory is not F-contiguous

Can you describe more about what (3) means? In many cases a Python user may call, e.g., mlpack.random_forest() or some other binding, and inside that binding (in C++) we may allocate a matrix with Armadillo that is meant to be passed back to the user as a numpy matrix. Can we avoid a copy in that case? (The current code does, unless, as you pointed out, it's on Windows.)

These checks only occur on the conversion from Numpy to Armadillo. 3) is meant to guard against freeing data, when taking ownership, for which we don't know who owns it and/or how it was allocated. It's a safety check which can be turned off at compile time.

Any data allocated by Armadillo (or Numpy's C-API) can be, and is unless explicitly requested otherwise, returned without a copy by carma.

Ahh, ok, when I reread what you originally wrote I think this makes more sense now. From my side the priority is just to ensure, if possible, no copies when passing matrices back and forth between C++ and Python. I think you are saying that's possible, even on Windows (where our current code is not able to avoid a copy). :+1:

Pybind11 can handle size_t, bool, double, std::vector<int>, the vector is cast to a list or Numpy array. std::vector<std::string>, std::string I am pretty sure as well, probably through c_str. std::tuple is supported as py::tuple. For data::DatasetInfo we can wrap the class using Pybind11.

Nice! Actually we can cut corners a little bit for the DatasetInfo; really what that is meant to represent is a matrix containing categorical data, for which the more common representation in Python is going to be a Pandas dataframe. So we already have code to extract the types of a Pandas matrix into a list of dimension types:

https://github.com/mlpack/mlpack/blob/master/src/mlpack/bindings/python/mlpack/matrix_utils.py#L74

and, actually, there is no code for the reverse direction because we don't have any bindings that output a categorical matrix from C++ back into Python.

Anyway, everything you are writing sounds great to me! Let me know if I can help out during the process or clarify any of the existing code. :+1:

RUrlus commented 3 years ago

@rcurtin The issue has been open for a while now without any additional comments/considerations. So I'll start working on this, I expect it will take a bit of time to get an initial PR. I'll reach out on slack(?) when I have questions or to discuss approaches.

rcurtin commented 3 years ago

Sounds good! Our Slack room is bridged to our Matrix room and Gitter room (but currently not IRC ...), so you can use any of those protocols without a problem. :)

conradsnicta commented 3 years ago

@RUrlus

It seems that the armadillo version on anaconda is 9.900.5 so not too far off.

The Anaconda recipe for Armadillo was recently updated to 10.6.2:

conradsnicta commented 3 years ago

A possible alternative option would be to use PyArmadillo (repo: https://gitlab.com/jason-rumengan/pyarma/) instead of going through conversions to/from numpy arrays in order to use mlpack functions and classes.

The native matrices in mlpack are armadillo matrices, so using PyArmadillo would considerably simplify mlpack's interface to Python. More specifically, mlpack functions could be simply exposed as accepting PyArmadillo matrices.

Furthermore, PyArmadillo takes care of linking with OpenBLAS etc on various platforms, so there should be no need for the mlpack Python bindings to worry about that. PyArmadillo matrices are easily convertible to/from numpy arrays.

CC @jason-rumengan

pree-T commented 2 years ago

Is this issue still open? May I work on it? Can someone help me?

RUrlus commented 2 years ago

@pree-T yes it is still open, I've been working on it but I haven't had much time. Currently, writing the binding generator code. When that's done it should be easier for you to join the work

rcurtin commented 2 years ago

@RUrlus thanks for the update---please don't hesitate to ask if you'd like me to explain anything about the binding generator setup. It would be exciting to transition away from Cython (and make builds easier), so I am more than happy to help out where I can. :+1: