Closed: RUrlus closed this 9 months ago
Hi @ymwdalex, @stephanecollot,
As discussed in #92 I've worked on the package, which turned into a fairly complete refactor. I tagged it as v1.0.0rc0 as there are a number of API changes.
Keen to hear your thoughts!
@ymwdalex @stephanecollot Let us know if you are interested in reviewing but haven't had the time yet.
Otherwise we'll take over maintenance entirely and merge it in. I am off at the end of the week and keen to get this merged before then.
@RUrlus thanks for the work! Sorry I have no time to review. I am fine for you and the ING team to take over the maintenance.
Changes
The primary changes are:
Python API
This PR introduces a number of significant changes.
`awesome_cossim_topn` has been deprecated and is superseded by `sp_matmul_topn`.
The reasoning here is that the function has little to do with cosine similarity, other than that being a common use case for it. By disabling the `lower_bound` (now `threshold`) by default and adding support for integers, the function is now capable of generic sparse matrix multiplication with top-n selection. Additionally, `sp_matmul_topn` returns the top-n values per row in the order they would have if you performed the full multiplication, i.e. `sp_matmul_topn(A, B, B.shape[1]) == sp_matmul(A, B) == A.dot(B)`.
Backwards compatibility
Backwards compatibility has been (largely) retained for the deprecated `awesome_cossim_topn`. Migrating to `sp_matmul_topn` is a matter of setting non-default values for `threshold` and `sort`. Only the `return_best_ntop` option has been removed, as this can easily be determined from the resulting matrix, and setting `threshold`
for memory-pressure purposes is no longer needed.
Build system
The Cython bindings have been a major pain judging by the issues; this is resolved by switching to Nanobind. Nanobind also opens the door to future GPU support (if desired) as it uses the DLPack protocol. CMake is becoming the de-facto standard for C++ builds, and Scikit-build-core is actively developed by a core contributor of Pybind11.
Parallelisation is now handled by OpenMP, which significantly reduces the maintenance burden and resolves buffer-overruns hiding in the previous threading implementation. For ease of use we ship the OpenMP binary in the wheels. Note, this is not without risk, but given the size of the user-base it is probably OK for now. People who encounter issues can recompile from source or download the OpenMP-free wheels on the release page.
Performance improvements
Reduced memory complexity
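The constant-size heap selection described in this section can be sketched in pure Python. This is an illustrative sketch only (the package's actual selection is implemented in C++); `heapq` is a min-heap, so a size-`top_n` min-heap retains the n largest values while using O(top_n) memory:

```python
import heapq

def topn_row(values, top_n):
    """Keep the top_n largest values of one row using a bounded heap.

    The smallest current candidate sits at heap[0]; any incoming value
    smaller than it is discarded immediately, so memory stays O(top_n)
    instead of O(number of values above the threshold).
    """
    heap = []
    for v in values:
        if len(heap) < top_n:
            heapq.heappush(heap, v)
        elif v > heap[0]:
            heapq.heapreplace(heap, v)  # pop the smallest, push v
    return sorted(heap, reverse=True)

print(topn_row([0.1, 0.9, 0.4, 0.8, 0.3, 0.7], 3))  # → [0.9, 0.8, 0.7]
```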
The biggest performance improvement is a significantly reduced memory footprint due to using a Max-Heap of constant size `top_n`. The previous approach (for a given row) collected all values above the threshold and then selected the top-n values. For relatively dense/large matrices or a low threshold, this could result in the collection of many values and significant memory pressure. Benchmarking shows that the Max-Heap is also faster for top-n values up to at least 100.
Additionally, when `threshold=None`
we pre-compute the number of non-zero values (given top-n) and allocate exactly the right size. This incurs a performance penalty, but users that favour runtime over memory pressure can avoid this overhead by setting the threshold to `np.finfo(A.data.dtype).min`, or `np.iinfo(A.data.dtype).min` for integers. For the single-threaded path we added the `density` parameter, which is used to determine the pre-allocation size when `threshold` is not `None`: `n_alloc = ceil(n_rows * top_n * density)`.
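A quick check of that allocation formula in plain Python (the values here are chosen arbitrarily for illustration):

```python
from math import ceil

n_rows, top_n, density = 100_000, 10, 0.5
# Pre-allocate for the expected number of stored results rather than the
# worst case n_rows * top_n; a density of 0.5 halves the initial allocation.
n_alloc = ceil(n_rows * top_n * density)
print(n_alloc)  # → 500000
```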
This allows users to reduce the pre-allocated memory when they have a good expectation of the result density; being wrong incurs a copy penalty if the vectors have to resize. The multithreaded implementation allocates the right amount before copying the results from the thread blocks.
Benchmarking
The new candidate version has been benchmarked against the previous implementation over word-based TF-IDF matrices of American company names from the EDGAR company database.
`richbench` was used to measure the performance using the matrix product `(100_000, 193190) x (193190, 100_000)`, repeated 30 times. See `bench/` for detailed results and the code.
General findings are that:
The biggest performance difference is observed for `threshold=0.0`, `top_n=10` and `n_threads=8`, where v1.0.0 (0.501s) is on average 2.4 times faster than v0.3.6 (1.215s). v1.0.0 is on average 1.195 times faster (median: 1.06) over all the benchmark settings.
It appears that there is a buffer-overrun hiding in the v0.3.6 multithreaded implementation that is masked by the large over-allocation for C. Additionally, that implementation appears to be significantly slower than the OpenMP implementation used in v1.0.
Memory profiling is still an open item, as `memory_profiler` does not correctly track the memory allocated by v1.0 and `memray` requires a symbolized debug build of Python.
Release notes
API
ENH: Add support for 32- and 64-bit integers
ENH: Add `sp_matmul`, a parallelised implementation of CSR matrix multiplication
BLD: Add support for CPython 3.12, closes #92
API: `awesome_cossim_topn` is superseded by `sp_matmul_topn`.
API: `awesome_cossim_topn` has been deprecated and will be removed in a future version.
API: `ntop` parameter has been renamed to `top_n`
API: `lower_bound` parameter has been renamed to `threshold`
API: `use_threads` and `n_jobs` parameters have been combined into `n_threads`
API: `return_best_ntop` parameter has been removed
API: `test_nnz_max` parameter has been removed
API: default parameter value for `threshold` changed from `0.0` to `None` (disabled)
API: default parameter value for `sort` changed to `False`
Internal