Optimize BSS use of FFT with cupy, speed up of up to 3x for full tracks

sevagh commented 3 years ago

Hello, I have been working on some potential performance optimizations for the BSS evaluation (which is rather slow/compute intensive for full tracks).

Baseline measurement with original museval code (the total execution involves also computing the IRM, adapted from https://github.com/sigsep/sigsep-mus-oracle/blob/master/IRM.py):

museval bss original execution time, 1 track of musdb
pybin: /home/sevagh/venvs/museval-orig/bin/python3
evaluating track AM Contra - Heart Peripheral

real    3m22.702s
user    3m21.577s
sys     0m39.376s

The original code takes ~3:20 minutes.

The second optimization uses cupy and the GPU, which is in my opinion a big cost/burden for end users: installing the CUDA toolkit etc. is no joke. Here is the code: https://github.com/sigsep/sigsep-mus-eval/compare/master...sevagh:feat/cupy-accel However, the performance is rather good at ~1:20 minutes, roughly 2.5x faster than the original code:

museval bss optimization 2 (cupy on gpu) execution time, 1 track of musdb
pybin: /home/sevagh/venvs/museval-optimization-2/bin/python3
evaluating track AM Contra - Heart Peripheral

real    1m19.801s
user    1m27.077s
sys     0m29.615s

One final note is that the CUDA/cupy version has slight differences in the outputs due to numerical precision differences. It doesn't look too significant to me - here's an excerpt of a diff between the evaluated json files, showing small differences in the BSS scores:

@@ -10459,8 +10459,8 @@
-            "SAR": 30.60528,
-            "ISR": 30.67039
+            "SAR": 30.60525,
+            "ISR": 30.67036
@@ -10469,8 +10469,8 @@
-            "SAR": 30.45440,
-            "ISR": 30.52629
+            "SAR": 30.45438,
+            "ISR": 30.52627
@@ -10480,7 +10480,7 @@
-            "ISR": 20.99668
+            "ISR": 20.99667
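
One way to check that differences like these stay small is to compare the two sets of scores numerically instead of reading a text diff. Below is a minimal sketch using the values from the excerpt above; the 1e-4 dB tolerance is an assumption chosen for illustration, not a threshold taken from museval or its regression tests.

```python
import numpy as np

# Scores copied from the diff excerpt above (original run vs. cupy/GPU run).
cpu_scores = np.array([30.60528, 30.67039, 30.45440, 30.52629, 20.99668])
gpu_scores = np.array([30.60525, 30.67036, 30.45438, 30.52627, 20.99667])

# Largest absolute deviation between the two runs, in dB.
max_abs_diff = np.max(np.abs(cpu_scores - gpu_scores))

# Assumed tolerance, for illustration only.
TOLERANCE_DB = 1e-4
within_tolerance = np.allclose(cpu_scores, gpu_scores, rtol=0.0, atol=TOLERANCE_DB)

print(f"max abs diff: {max_abs_diff:.5f} dB, within tolerance: {within_tolerance}")
```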

I'm also trying to find a way to use CPU parallelism with scipy.fft and to combine several of the FFTs into a single call, but this isn't helping nearly as much as the CUDA change. My attempts can be seen here: https://github.com/sigsep/sigsep-mus-eval/compare/master...sevagh:multiple-1d-fft
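
To illustrate the idea of combining several FFTs into one call, here is a rough sketch (not the code from the linked branch). The array shapes and FFT length are made-up stand-ins; the only library feature relied on is the workers argument of scipy.fft.fft, which lets a batched transform use multiple CPU cores.

```python
import numpy as np
import scipy.fft

rng = np.random.default_rng(0)
# Hypothetical stand-in data: 8 signals of 4096 samples each.
signals = rng.standard_normal((8, 4096))
n_fft = 8192  # zero-padded FFT length (illustrative value)

# Baseline: one FFT call per signal.
looped = np.stack([scipy.fft.fft(s, n=n_fft) for s in signals])

# Batched: all FFTs in a single call along the last axis.
# workers=-1 asks scipy.fft to use all available CPU cores.
batched = scipy.fft.fft(signals, n=n_fft, axis=-1, workers=-1)

print("max difference:", np.max(np.abs(looped - batched)))
```

Both variants compute the same transforms; whether the batched call is faster depends on the number and size of the FFTs.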

I'm aware of the separate repo for bss at https://github.com/sigsep/bsseval/ but I wasn't sure which project to discuss it in - I'm using museval because I'm trying to recreate the SiSec 2018 testbench.

sevagh commented 3 years ago

Also there could be a "super-performant" config with cupy, stacking multiple 1D FFTs (respecting GPU memory allocation limits), and using pinned host/gpu memory and FFT plans - I'll continue working in that direction.

sevagh commented 3 years ago

Optimized every slow line (discovered through kernprof + line_profiler): https://github.com/sigsep/sigsep-mus-eval/compare/master...sevagh:feat/cupy-accel

This leads to just about 1 minute to compute the IRM mask and perform a BSS evaluation on 1 full-length MUSDB18 track:

real    1m1.762s
user    0m50.948s
sys     0m13.620s

This is down from the 3+ minutes originally:

real    3m22.702s
user    3m21.577s
sys     0m39.376s
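
For reference, the speedups implied by the wall-clock ("real") times quoted in this thread can be worked out with a few lines of Python. The parse_time helper is a hypothetical name introduced here only for the arithmetic.

```python
import re

def parse_time(text):
    """Convert a `time`-style duration such as '3m22.702s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", text)
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)

baseline = parse_time("3m22.702s")    # original museval code
first_cupy = parse_time("1m19.801s")  # first cupy version
optimized = parse_time("1m1.762s")    # after line-by-line profiling

print(f"first cupy version: {baseline / first_cupy:.2f}x")
print(f"fully optimized:    {baseline / optimized:.2f}x")
```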

faroit commented 3 years ago

@sevagh I think this would be great. Do the regression tests pass using this?

sevagh commented 3 years ago

How can I run the tests? python setup.py test?

faroit commented 3 years ago

Install the test environment with pip install .[tests] and then run:

py.test tests/test_regression.py -vs

sevagh commented 3 years ago

OK. My most recent commits get the regression tests passing. Casting explicitly to float32 was creating huge errors in SAR/SIR/ISR, so I just removed the casts.

I made the cupy install optional (although fixed to CUDA 11.4, which is rather recent).
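
A common way to keep a GPU dependency optional is to fall back to NumPy when cupy cannot be imported. The sketch below shows that general pattern; it is not the code from the branch, and fft_power is a hypothetical function used only as an example.

```python
import numpy as np

try:
    import cupy as cp  # optional GPU dependency
    xp = cp
except ImportError:
    cp = None
    xp = np  # CPU fallback with the same array API

def fft_power(signal, n_fft):
    """Return |FFT|^2 of a 1-D signal as a NumPy array.

    Hypothetical example function: runs on the GPU when cupy is
    importable (a working CUDA device is then assumed), else on the CPU.
    """
    x = xp.asarray(signal)
    spectrum = xp.fft.fft(x, n=n_fft)
    power = xp.abs(spectrum) ** 2
    # cupy arrays need an explicit copy back to host memory.
    return cp.asnumpy(power) if cp is not None else power

print(fft_power(np.ones(4), 4))
```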

One other idiosyncrasy is that it's best to clear the cupy FFT cache between BSS evaluations of large songs. That's why I added this helper function: https://github.com/sigsep/sigsep-mus-eval/compare/master...sevagh:feat/cupy-accel#diff-cc17d32a9d811e616624c2f2699f853dd06b143931ea9e37a6cc0dab6a4b8ab9R75-R88

In real code you would do:

for track in mus.tracks:
    ...
    scores = museval.eval_mus_track(...) # cupy under the hood
    museval.clear_cupy_cache()
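
For readers curious what such a helper might involve, here is a sketch built from cupy's public cache and memory-pool calls. It is an assumption about the general approach, not the clear_cupy_cache implementation in the linked branch, and clear_gpu_caches is a hypothetical name.

```python
try:
    import cupy as cp
except ImportError:
    cp = None  # cupy is optional; the helper becomes a no-op

def clear_gpu_caches():
    """Hypothetical helper: release cached cupy FFT plans and pooled memory.

    Returns True if cupy was available and its caches were cleared.
    """
    if cp is None:
        return False
    # Drop cached cuFFT plans, which hold on to GPU work areas.
    cp.fft.config.get_plan_cache().clear()
    # Return unused blocks from cupy's device and pinned-host memory pools.
    cp.get_default_memory_pool().free_all_blocks()
    cp.get_default_pinned_memory_pool().free_all_blocks()
    return True

print("caches cleared:", clear_gpu_caches())
```

Calling something like this once per track, as in the loop above, keeps plans and buffers sized for one long song from accumulating across the whole dataset.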

Passing regression test:

(museval-cupy) sevagh:sigsep-mus-eval $ py.test tests/test_regression.py -vs
===================================================== test session starts =====================================================
platform linux -- Python 3.9.6, pytest-6.2.4, py-1.10.0, pluggy-0.13.1 -- /home/sevagh/venvs/museval-cupy/bin/python
cachedir: .pytest_cache
rootdir: /home/sevagh/repos/sigsep-mus-eval, configfile: setup.cfg
collected 4 items

tests/test_regression.py::test_aggregate[Music Delta - 80s Rock]     time         target metric     score                   track
[...]
Aggrated Scores (median over frames, median over tracks)
vocals          ==> SDR: -15.622  SIR:   9.165  ISR:  -8.476  SAR:  -7.327
accompaniment   ==> SDR: -13.290  SIR: -18.765  ISR:  -0.322  SAR:  -7.427

PASSED
tests/test_regression.py::test_track_scores[Music Delta - 80s Rock] PASSED
tests/test_regression.py::test_random_estimate[Music Delta - 80s Rock] PASSED
tests/test_regression.py::test_one_estimate[Music Delta - 80s Rock] PASSED

====================================================== warnings summary =======================================================
../../venvs/museval-cupy/lib/python3.9/site-packages/past/builtins/misc.py:45
  /home/sevagh/venvs/museval-cupy/lib/python3.9/site-packages/past/builtins/misc.py:45: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
    from imp import reload

tests/test_regression.py: 12 warnings
  /home/sevagh/repos/sigsep-mus-eval/museval/metrics.py:601: DeprecationWarning: `np.float` is a deprecated alias for the builtin `float`. To silence this warning, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here.
  Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
    eps = np.finfo(np.float).eps

-- Docs: https://docs.pytest.org/en/stable/warnings.html
=============================================== 4 passed, 13 warnings in 46.33s ===============================================

sigsep / sigsep-mus-eval

Optimize BSS use of FFT with cupy, speed up of up to 3x for full tracks #83