ing-bank / sparse_dot_topn

Python package to accelerate the sparse matrix multiplication and top-n similarity selection
Apache License 2.0

Release candidate v1.0.0 #94

Closed RUrlus closed 9 months ago

RUrlus commented 9 months ago

Changes

The primary changes are:

  1. Redesigned Python API
  2. New build system
  3. Performance improvements

Python API

This PR introduces a number of significant changes.

awesome_cossim_topn has been renamed to sp_matmul_topn. The reasoning is that the function has little to do with cosine similarity, other than that being a common use for it. By disabling the lower_bound (now threshold) by default and adding support for integers, the function is now capable of generic sparse matrix multiplication with top-n selection. Additionally, sp_matmul_topn returns the top-n values per row in the order they would have if you performed normal multiplication, i.e. sp_matmul_topn(A, B, B.shape[1]) == sp_matmul(A, B) == A.dot(B).
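To make the top-n semantics concrete, here is a dense reference sketch (not the package's implementation): per row, keep the top_n largest values of the product and zero out the rest. With top_n equal to B.shape[1] nothing is dropped, so the result equals the full product, matching the equivalence claimed above.

```python
import numpy as np
from scipy.sparse import csr_matrix, random as sparse_random

def topn_reference(A, B, top_n):
    """Dense reference for top-n selection: per row of A @ B,
    keep the top_n largest values and zero the rest."""
    C = (A @ B).toarray()
    out = np.zeros_like(C)
    for i, row in enumerate(C):
        idx = np.argsort(row)[::-1][:top_n]
        out[i, idx] = row[idx]
    return csr_matrix(out)

A = sparse_random(20, 15, density=0.3, format="csr", random_state=0)
B = sparse_random(15, 12, density=0.3, format="csr", random_state=1)

# top_n == B.shape[1]: nothing is dropped, equal to ordinary multiplication.
full = topn_reference(A, B, B.shape[1])
assert np.allclose(full.toarray(), (A @ B).toarray())
```

This is only a readability aid; the package performs the selection during the sparse multiplication itself rather than materialising the dense product.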

Backwards compatibility

Backwards compatibility has been largely retained for the deprecated awesome_cossim_topn. Migrating to sp_matmul_topn is a matter of setting non-default values for threshold and sort. Only the return_best_ntop option has been removed, as it can easily be determined from the resulting matrix, and setting threshold to relieve memory pressure is no longer needed.
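A minimal migration sketch, assuming the old awesome_cossim_topn signature (ntop, lower_bound, use_threads, n_jobs) and the new parameter names described in this PR (top_n, threshold, sort, n_threads); treat the exact names as illustrative rather than authoritative:

```python
def migrate_kwargs(ntop, lower_bound=0.0, use_threads=False, n_jobs=1):
    """Translate deprecated awesome_cossim_topn-style arguments to the
    sp_matmul_topn names described in this PR (hypothetical mapping)."""
    return {
        "top_n": ntop,
        # lower_bound was always applied; the new threshold is off by default,
        # so pass the old value explicitly to keep the same filtering.
        "threshold": lower_bound,
        # awesome_cossim_topn returned sorted rows; sp_matmul_topn does not
        # sort by default, so request it explicitly.
        "sort": True,
        "n_threads": n_jobs if use_threads else 1,
    }
```

For example, a call that used ntop=10, lower_bound=0.8 and 4 threads becomes sp_matmul_topn(A, B, top_n=10, threshold=0.8, sort=True, n_threads=4).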

Build system

The Cython bindings have been a major source of pain, judging by the issues; this is resolved by switching to Nanobind. Nanobind also opens the door to future GPU support (if desired), as it uses the DLPack protocol. CMake is becoming the de facto standard for C++, and scikit-build-core is actively developed by a core contributor of Pybind11.

Parallelisation is now handled by OpenMP, which significantly reduces the maintenance burden and resolves buffer overruns hiding in the previous threading implementation. For ease of use we ship the OpenMP binary in the wheels. Note that this is not without risk, but given the size of the user base it is probably OK for now. Users who encounter issues can recompile from source or download the OpenMP-free wheels on the release page.

Performance improvements

Reduced memory complexity

The biggest performance improvement is a significantly reduced memory footprint due to using a max-heap of constant size top_n. The previous approach (for a given row) collected all values above the threshold and then selected the top-n values. For relatively dense/large matrices or a low threshold, this could result in the collection of many values and significant memory pressure. Benchmarking shows that the max-heap is also faster for top-n values up to at least 100.
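The constant-size heap strategy can be sketched in pure Python (the real implementation is in C++; this is only an illustration). The heap never holds more than top_n candidates, with the current worst candidate at the root, so memory stays O(top_n) per row regardless of how many values pass the threshold:

```python
import heapq

def topn_heap(values, top_n):
    """Keep the top_n largest values using a bounded heap (a min-heap over
    the current candidates, so the smallest kept value sits at the root).

    Unlike collecting every value first and selecting afterwards, memory
    use is constant in the number of input values."""
    heap = []
    for v in values:
        if len(heap) < top_n:
            heapq.heappush(heap, v)
        elif v > heap[0]:
            # New value beats the worst kept candidate: swap it in.
            heapq.heapreplace(heap, v)
    return sorted(heap, reverse=True)
```

For example, topn_heap([3, 1, 4, 1, 5, 9, 2, 6], 3) yields [9, 6, 5] while never storing more than three candidates at once.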

Additionally, when threshold=None we pre-compute the number of non-zero values (given top-n) and allocate the right size. This does incur a performance penalty, but users who favour runtime over memory pressure can avoid the overhead by setting the threshold to np.finfo(A.data.dtype).min, or np.iinfo(A.data.dtype).min for integers. For the single-threaded path we added the density parameter, which is used to determine the pre-allocation size when threshold is not None: n_alloc = ceil(n_rows * top_n * density). This allows users to reduce the pre-allocated memory when they have a good expectation of the result density; being wrong incurs a copy penalty if the vectors have to resize. The multithreaded implementation allocates the right amount before copying the results from the thread blocks.
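The two knobs above can be sketched as follows; the helper names are hypothetical, only the n_alloc formula and the dtype-min trick come from the description:

```python
from math import ceil
import numpy as np

def prealloc_size(n_rows, top_n, density=1.0):
    """Single-threaded pre-allocation size when a threshold is set:
    n_alloc = ceil(n_rows * top_n * density)."""
    return ceil(n_rows * top_n * density)

def no_filter_threshold(dtype):
    """Smallest representable value for the matrix dtype. Passing this as
    the threshold filters nothing, skipping the nnz pre-computation
    (trading memory for runtime), as described above."""
    dtype = np.dtype(dtype)
    return np.iinfo(dtype).min if dtype.kind in "iu" else np.finfo(dtype).min
```

For instance, with 100,000 rows, top_n=10 and an expected result density of 0.5, the pre-allocation would be 500,000 entries instead of 1,000,000; underestimating merely triggers a resize-and-copy, not an error.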

Benchmarking

The new candidate version has been benchmarked against the previous implementation over word-based TF-IDF matrices of American company names from the EDGAR company database. richbench was used to measure the performance using the matrix product (100_000, 193190) x (193190, 100_000), repeated 30 times. See bench/ for detailed results and the code.

General findings are that:

  * The biggest performance difference is observed for threshold=0.0, top_n=10 and n_threads=8, where v1.0.0 (0.501s) is on average 2.4 times faster than v0.3.6 (1.215s).
  * v1.0.0 is on average 1.195 times faster (median: 1.06) over all the benchmark settings.

It appears that there is a buffer overrun in the v0.3.6 multithreaded implementation that is masked by the large over-allocation for C. Additionally, that implementation appears to be significantly slower than the OpenMP implementation used in v1.0.0.

Memory profiling is still an open item as memory_profiler does not correctly track the memory allocated by v1.0 and memray requires a symbolized debug build of Python.

Release notes

API

Internal

RUrlus commented 9 months ago

Hi @ymwdalex, @stephanecollot,

As discussed in #92 I've worked on the package, which turned into a fairly complete refactor. I tagged it as v1.0.0rc0 as there are a number of API changes.

Keen to hear your thoughts!

RUrlus commented 9 months ago

@ymwdalex @stephanecollot Let us know if you are interested in reviewing but haven't had the time yet.

Otherwise we'll take over maintenance entirely and merge it in. I am off at the end of the week and keen to get this merged before then.

ymwdalex commented 9 months ago

@RUrlus thanks for the work! Sorry I have no time to review. I am fine for you and the ING team to take over the maintenance.