mierzejk closed this pull request 1 year ago.
The pull request may, at least partially, resolve the following issues: #6 (estimate density ratio of large training set and test set) and #8 (density ratio estimation of high dimension data). According to my tests, both the `numpy` and `numba` targets can deal with `x_list` and `y_list` matrices that consume over 20 GB altogether, provided enough virtual memory is available.
The pull request also offers the prospect of even greater performance improvements for large data sets by taking advantage of the `numba` `cuda` target. Yet that would require some extra work, not fully aligned with the currently implemented `numba.guvectorize` approach.
A side note with respect to the performance results: just recently I ran the benchmark with the same `densratio_py` codebase I have submitted in the following two environments:
And to my surprise, despite the fact that all 32 cores were being utilized in the Windows environment, the process executed a few times faster on my reportedly less powerful laptop. I am not really sure what the real cause is. It might be the operating system itself. But perhaps it is because I have set up my laptop with `PyTorch` performance in mind, namely I built `numpy`, `numba`, `Cython` and `mkl` from sources myself. On Windows all packages were delivered pre-built, either by Anaconda or pip.
The original benchmark results I attached to the first pull request post were measured in the first environment, i.e. my Dell Precision M4800 running Ubuntu 18.04.4 LTS.
It is the greatest contribution!
`densratio.RuLSIF.compute_kernel_Gaussian` has been updated with a performance-improved implementation. A sheet comparing the baseline (original) and performance-improved implementations is also available at https://bit.ly/3X7asIm; I hope it is pretty self-explanatory.

`densratio.RuLSIF.set_compute_kernel_target` (also available to be imported directly from `densratio`) accepts one of the following string arguments and sets the underlying engine that carries out the calculations:

- `numpy`: numpy broadcasting optimized. It must be noted that the underlying BLAS library (e.g. Intel's MKL) can take advantage of a multi-threading model.
- `cpu`: numba generalized universal function, single-thread optimized.
- `parallel`: numba generalized universal function, multi-thread optimized. Please be advised that all threading layer specifics apply.

Because of the aforementioned multi-threading technicalities, the engine defaults to `cpu` when `numba` is available, and to `numpy` otherwise. I do not think adding `numba` as a requirement is the best idea, as it could break backward compatibility with existing projects that already depend on `densratio`.

The performance-improved `densratio.RuLSIF.compute_kernel_Gaussian` implementation returns a `numpy.matrix` if either of its first two arguments is of the `numpy.matrix` type. Otherwise it returns (and expects) a `numpy.ndarray`, in case future commits replace the deprecated `numpy.matrix` with plain `numpy.ndarray`.
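For context, that return-type convention can be illustrated with a small, self-contained numpy sketch. This is a hypothetical stand-in mirroring the behaviour described above, not the library's actual code:

```python
import numpy as np

def compute_kernel_gaussian(x, y, sigma):
    # Hypothetical sketch of the convention described above: a
    # broadcasting-based Gaussian kernel that returns numpy.matrix only
    # when one of its first two arguments is a numpy.matrix, and a
    # plain numpy.ndarray otherwise.
    xa, ya = np.asarray(x), np.asarray(y)
    sq_dist = ((xa[:, None, :] - ya[None, :, :]) ** 2).sum(axis=-1)
    kernel = np.exp(-sq_dist / (2.0 * sigma ** 2))
    if isinstance(x, np.matrix) or isinstance(y, np.matrix):
        return np.asmatrix(kernel)
    return kernel

x = np.random.default_rng(1).normal(size=(5, 2))
k_arr = compute_kernel_gaussian(x, x, 1.0)               # ndarray in, ndarray out
k_mat = compute_kernel_gaussian(np.asmatrix(x), x, 1.0)  # matrix in, matrix out
```

With this shape, existing callers can keep passing `numpy.matrix` for now, while the same code path continues to work once the deprecated `numpy.matrix` is eventually replaced by plain `numpy.ndarray`.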