ing-bank / sparse_dot_topn

Python package to accelerate the sparse matrix multiplication and top-n similarity selection
Apache License 2.0

Release candidate v1.0.0 #94

Closed RUrlus closed 9 months ago

RUrlus commented 9 months ago

Changes

The primary changes are:

  1. Redesigned Python API
  2. New build system
  3. Performance improvements

Python API

This PR introduces a number of significant changes.

awesome_cossim_topn has been renamed to sp_matmul_topn. The reasoning is that the function has little to do with cosine similarity, other than that being a common use for it. By disabling the lower_bound (now threshold) by default and adding support for integers, the function is now capable of generic sparse matrix multiplication with top-n selection. Additionally, sp_matmul_topn returns the top-n values per row in the order they would have if you performed normal multiplication, i.e. sp_matmul_topn(A, B, B.shape[1]) == sp_matmul(A, B) == A.dot(B).
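To make the top-n semantics concrete, here is a dense reference sketch (not the package's implementation): per row, keep the top_n largest values of the product and zero out the rest. With top_n equal to B.shape[1] nothing is dropped, so the result equals the full product, matching the equivalence claimed above.

```python
import numpy as np
from scipy.sparse import csr_matrix, random as sparse_random

def topn_reference(A, B, top_n):
    """Dense reference for top-n selection: per row of A @ B,
    keep the top_n largest values and zero the rest."""
    C = (A @ B).toarray()
    out = np.zeros_like(C)
    for i, row in enumerate(C):
        idx = np.argsort(row)[::-1][:top_n]
        out[i, idx] = row[idx]
    return csr_matrix(out)

A = sparse_random(20, 15, density=0.3, format="csr", random_state=0)
B = sparse_random(15, 12, density=0.3, format="csr", random_state=1)

# top_n == B.shape[1]: nothing is dropped, equal to ordinary multiplication.
full = topn_reference(A, B, B.shape[1])
assert np.allclose(full.toarray(), (A @ B).toarray())
```

This is only a readability aid; the package performs the selection during the sparse multiplication itself rather than materialising the dense product.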

Backwards compatibility

Backwards compatibility has been largely retained for the deprecated awesome_cossim_topn. Migrating to sp_matmul_topn is a matter of setting non-default values for threshold and sort. Only the return_best_ntop option has been removed, as it can easily be determined from the resulting matrix, and setting threshold to relieve memory pressure is no longer needed.
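A minimal migration sketch, assuming the old awesome_cossim_topn signature (ntop, lower_bound, use_threads, n_jobs) and the new parameter names described in this PR (top_n, threshold, sort, n_threads); treat the exact names as illustrative rather than authoritative:

```python
def migrate_kwargs(ntop, lower_bound=0.0, use_threads=False, n_jobs=1):
    """Translate deprecated awesome_cossim_topn-style arguments to the
    sp_matmul_topn names described in this PR (hypothetical mapping)."""
    return {
        "top_n": ntop,
        # lower_bound was always applied; the new threshold is off by default,
        # so pass the old value explicitly to keep the same filtering.
        "threshold": lower_bound,
        # awesome_cossim_topn returned sorted rows; sp_matmul_topn does not
        # sort by default, so request it explicitly.
        "sort": True,
        "n_threads": n_jobs if use_threads else 1,
    }
```

For example, a call that used ntop=10, lower_bound=0.8 and 4 threads becomes sp_matmul_topn(A, B, top_n=10, threshold=0.8, sort=True, n_threads=4).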

Build system

The Cython bindings have been a major source of pain, judging by the issues; this is resolved by switching to Nanobind. Nanobind also opens the door to future GPU support (if desired), as it uses the DLPack protocol. CMake is becoming the de facto standard for C++, and scikit-build-core is actively developed by a core contributor of Pybind11.

Parallelisation is now handled by OpenMP, which significantly reduces the maintenance burden and resolves buffer overruns hiding in the previous threading implementation. For ease of use we ship the OpenMP binary in the wheels. Note that this is not without risk, but given the size of the user base it is probably OK for now. Users who encounter issues can recompile from source or download the OpenMP-free wheels on the release page.

Performance improvements

Reduced memory complexity

The biggest performance improvement is a significantly reduced memory footprint due to using a max-heap of constant size top_n. The previous approach (for a given row) collected all values above the threshold and then selected the top-n values. For relatively dense/large matrices or a low threshold, this could result in the collection of many values and significant memory pressure. Benchmarking shows that the max-heap is also faster for top-n values up to at least 100.
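The constant-size heap strategy can be sketched in pure Python (the real implementation is in C++; this is only an illustration). The heap never holds more than top_n candidates, with the current worst candidate at the root, so memory stays O(top_n) per row regardless of how many values pass the threshold:

```python
import heapq

def topn_heap(values, top_n):
    """Keep the top_n largest values using a bounded heap (a min-heap over
    the current candidates, so the smallest kept value sits at the root).

    Unlike collecting every value first and selecting afterwards, memory
    use is constant in the number of input values."""
    heap = []
    for v in values:
        if len(heap) < top_n:
            heapq.heappush(heap, v)
        elif v > heap[0]:
            # New value beats the worst kept candidate: swap it in.
            heapq.heapreplace(heap, v)
    return sorted(heap, reverse=True)
```

For example, topn_heap([3, 1, 4, 1, 5, 9, 2, 6], 3) yields [9, 6, 5] while never storing more than three candidates at once.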

Additionally, when threshold=None we pre-compute the number of non-zero values (given top-n) and allocate the right size. This does incur a performance penalty, but users who favour runtime over memory pressure can avoid the overhead by setting the threshold to np.finfo(A.data.dtype).min, or np.iinfo(A.data.dtype).min for integers. For the single-threaded path we added the density parameter, which is used to determine the pre-allocation size when threshold is not None: n_alloc = ceil(n_rows * top_n * density). This allows users to reduce the pre-allocated memory when they have a good expectation of the result density; being wrong incurs a copy penalty if the vectors have to resize. The multithreaded implementation allocates the right amount before copying the results from the thread blocks.
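The two knobs above can be sketched as follows; the helper names are hypothetical, only the n_alloc formula and the dtype-min trick come from the description:

```python
from math import ceil
import numpy as np

def prealloc_size(n_rows, top_n, density=1.0):
    """Single-threaded pre-allocation size when a threshold is set:
    n_alloc = ceil(n_rows * top_n * density)."""
    return ceil(n_rows * top_n * density)

def no_filter_threshold(dtype):
    """Smallest representable value for the matrix dtype. Passing this as
    the threshold filters nothing, skipping the nnz pre-computation
    (trading memory for runtime), as described above."""
    dtype = np.dtype(dtype)
    return np.iinfo(dtype).min if dtype.kind in "iu" else np.finfo(dtype).min
```

For instance, with 100,000 rows, top_n=10 and an expected result density of 0.5, the pre-allocation would be 500,000 entries instead of 1,000,000; underestimating merely triggers a resize-and-copy, not an error.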

Benchmarking

The new candidate version has been benchmarked against the previous implementation over word-based TF-IDF matrices of American company names from the EDGAR company database. richbench was used to measure the performance using the matrix product (100_000, 193190) x (193190, 100_000), repeated 30 times. See bench/ for detailed results and the code.

General findings are that:

  * The biggest performance difference is observed for threshold=0.0, top_n=10 and n_threads=8, where v1.0.0 (0.501s) is on average 2.4 times faster than v0.3.6 (1.215s).
  * v1.0.0 is on average 1.195 times faster (median: 1.06) over all the benchmark settings.

It appears that there is a buffer overrun in the v0.3.6 multithreaded implementation that is masked by the large over-allocation for C. Additionally, that implementation appears to be significantly slower than the OpenMP implementation used in v1.0.0.

Memory profiling is still an open item as memory_profiler does not correctly track the memory allocated by v1.0 and memray requires a symbolized debug build of Python.

Release notes

API

Internal

RUrlus commented 9 months ago

Hi @ymwdalex, @stephanecollot,

As discussed in #92 I've worked on the package, which turned into a fairly complete refactor. I tagged it as v1.0.0rc0 as there are a number of API changes.

Keen to hear your thoughts!

RUrlus commented 9 months ago

@ymwdalex @stephanecollot Let us know if you are interested in reviewing but haven't had the time yet.

Otherwise we'll take over maintenance entirely and merge it in. I am off at the end of the week and keen to get this merged before then.

ymwdalex commented 9 months ago

@RUrlus thanks for the work! Sorry I have no time to review. I am fine for you and the ING team to take over the maintenance.