twosigma / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

TRACKER: Numba engine performance with rolling operations #52

Open mroeschke opened 3 years ago

mroeschke commented 3 years ago

As of pandas version

In [1]: pd.__version__
Out[1]: '1.4.0.dev0+1085.g01b86edbbb'

From this ASV benchmark:

import numpy as np
import pandas as pd


class NumbaVSCython:
    # ASV benchmarks every combination of these parameter lists.
    params = (
        ["sum", "max", "min", "median", "mean"],
        [
            ("cython", None),
            ("numba", {"parallel": True}),
            ("numba", {"parallel": False}),
        ],
        [1, 100],
    )

    param_names = ["method", "engine_kwargs", "cols"]

    def setup(self, method, engine_kwargs, cols):
        # Each engine parameter is an (engine, engine_kwargs) pair.
        self.engine, self.engine_kwargs = engine_kwargs
        self.roll = pd.DataFrame(np.random.randn(10_000, cols)).rolling(100)
        # Warm-up call so Numba JIT compilation is excluded from the timing.
        getattr(self.roll, method)(engine=self.engine, engine_kwargs=self.engine_kwargs)

    def time_method(self, method, engine_kwargs, cols):
        getattr(self.roll, method)(engine=self.engine, engine_kwargs=self.engine_kwargs)

mean and sum have sliding-window algorithms implemented. min, max, and median instead call the corresponding np.nan<method> function (e.g. np.nanmin) on each window.
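The difference between a sliding algorithm and recomputing every window from scratch can be sketched as follows. This is a minimal pure-Python illustration, not the pandas kernels themselves: the function names are hypothetical, NaN handling is omitted, and the first `window - 1` outputs are simply set to NaN.

```python
import math


def rolling_mean_naive(values, window):
    """Recompute each window from scratch: O(n * window) work."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(math.nan)  # not enough observations yet
        else:
            chunk = values[i + 1 - window : i + 1]
            out.append(sum(chunk) / window)
    return out


def rolling_mean_sliding(values, window):
    """Update a running sum as the window advances: O(n) work."""
    out = []
    running = 0.0
    for i, x in enumerate(values):
        running += x  # add the value entering the window
        if i >= window:
            running -= values[i - window]  # drop the value leaving it
        out.append(running / window if i + 1 >= window else math.nan)
    return out


data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
naive = rolling_mean_naive(data, 3)
sliding = rolling_mean_sliding(data, 3)
```

Both functions return the same values here; the sliding version touches each element a constant number of times, which is why the per-window approach is expected to cost more as the window grows.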

[ 50.00%] ··· ======== ================================ ========== ==========
              --                                                 cols
              ----------------------------------------- ---------------------
               method           engine_kwargs               1         100
              ======== ================================ ========== ==========
                sum            ('cython', None)          297±0μs    27.5±0ms
                sum     ('numba', {'parallel': True})    436±0μs    16.9±0ms
                sum     ('numba', {'parallel': False})   318±0μs    19.3±0ms
                max            ('cython', None)          439±0μs    46.0±0ms
                max     ('numba', {'parallel': True})    818±0μs    86.1±0ms
                max     ('numba', {'parallel': False})   3.15±0ms   328±0ms
                min            ('cython', None)          454±0μs    48.8±0ms
                min     ('numba', {'parallel': True})    801±0μs    93.5±0ms
                min     ('numba', {'parallel': False})   3.23±0ms   332±0ms
               median          ('cython', None)          6.36±0ms   643±0ms
               median   ('numba', {'parallel': True})    5.47±0ms   567±0ms
               median   ('numba', {'parallel': False})   15.5±0ms   1.55±0s
                mean           ('cython', None)          299±0μs    28.3±0ms
                mean    ('numba', {'parallel': True})    545±0μs    19.1±0ms
                mean    ('numba', {'parallel': False})   474±0μs    31.9±0ms
              ======== ================================ ========== ==========

mroeschke commented 2 years ago

As of pandas version

In [1]: pd.__version__
Out[1]: '1.5.0.dev0+110.g439906e07d'
[ 75.00%] ··· rolling.NumbaVSCython.time_gb_method                                                                          12/30 failed
[ 75.00%] ··· ======== ================================ ============= =============
              --                                                    cols
              ----------------------------------------- ---------------------------
               method           engine_kwargs                 1            100
              ======== ================================ ============= =============
                sum            ('cython', None)            351±5μs     2.33±0.02ms
                sum     ('numba', {'parallel': True})    1.34±0.01ms    8.52±0.1ms
                sum     ('numba', {'parallel': False})   1.07±0.01ms   12.1±0.05ms
                max            ('cython', None)             failed        failed
                max     ('numba', {'parallel': True})       failed        failed
                max     ('numba', {'parallel': False})      failed        failed
                min            ('cython', None)             failed        failed
                min     ('numba', {'parallel': True})       failed        failed
                min     ('numba', {'parallel': False})      failed        failed
                var            ('cython', None)            157±5μs     2.27±0.05ms
                var     ('numba', {'parallel': True})    1.38±0.01ms    10.6±0.1ms
                var     ('numba', {'parallel': False})   1.09±0.01ms   16.4±0.06ms
                mean           ('cython', None)            148±2μs     2.00±0.03ms
                mean    ('numba', {'parallel': True})    1.40±0.01ms    10.7±0.2ms
                mean    ('numba', {'parallel': False})   1.10±0.03ms   16.9±0.07ms
              ======== ================================ ============= =============

[100.00%] ··· rolling.NumbaVSCython.time_roll_method                                                                        12/30 failed
[100.00%] ··· ======== ================================ ========== ============
              --                                                  cols
              ----------------------------------------- -----------------------
               method           engine_kwargs               1          100
              ======== ================================ ========== ============
                sum            ('cython', None)          461±2μs    27.8±0.2ms
                sum     ('numba', {'parallel': True})    365±10μs   12.3±0.2ms
                sum     ('numba', {'parallel': False})   358±2μs    14.8±0.4ms
                max            ('cython', None)           failed      failed
                max     ('numba', {'parallel': True})     failed      failed
                max     ('numba', {'parallel': False})    failed      failed
                min            ('cython', None)           failed      failed
                min     ('numba', {'parallel': True})     failed      failed
                min     ('numba', {'parallel': False})    failed      failed
                var            ('cython', None)          598±9μs     34.7±2ms
                var     ('numba', {'parallel': True})    428±10μs   13.1±0.4ms
                var     ('numba', {'parallel': False})   447±3μs    19.7±0.3ms
                mean           ('cython', None)          502±4μs     30.8±2ms
                mean    ('numba', {'parallel': True})    484±20μs   14.8±0.5ms
                mean    ('numba', {'parallel': False})   496±20μs    25.6±1ms
              ======== ================================ ========== ============
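The "12/30 failed" counts in the two tables above are consistent with ASV running every combination of the parameter lists. A small sketch, assuming the method list for this run is the one shown in the table rows (sum, max, min, var, mean) and that the engine and column lists are unchanged from the benchmark class:

```python
from itertools import product

# Parameter lists as they appear in the tables (assumed, not copied from source code).
methods = ["sum", "max", "min", "var", "mean"]
engines = [
    ("cython", None),
    ("numba", {"parallel": True}),
    ("numba", {"parallel": False}),
]
cols = [1, 100]

# ASV benchmarks the Cartesian product of all parameter lists.
combos = list(product(methods, engines, cols))

# Every max and min combination is reported as failed in the tables.
failed = [combo for combo in combos if combo[0] in ("max", "min")]

print(len(combos), len(failed))
```

That gives 5 × 3 × 2 = 30 combinations per benchmark, of which the 2 × 3 × 2 = 12 max and min combinations failed.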

mroeschke commented 2 years ago

Runs with different thread counts (set via NUMBA_NUM_THREADS) and more columns:

% NUMBA_NUM_THREADS=4 asv run -b rolling.NumbaVSCython

[ 75.00%] ··· rolling.NumbaVSCython.time_gb_method                                                                                                             ok
[ 75.00%] ··· ======== ================================ ============= ============= =============
              --                                                           cols
              ----------------------------------------- -----------------------------------------
               method           engine_kwargs                 1            100           1000
              ======== ================================ ============= ============= =============
                sum            ('cython', None)            379±50μs     2.35±0.2ms    19.8±0.3ms
                sum     ('numba', {'parallel': True})    1.37±0.01ms   8.30±0.03ms    76.6±0.5ms
                sum     ('numba', {'parallel': False})   1.09±0.01ms   12.0±0.01ms    119±0.9ms
                max            ('cython', None)            307±2μs     2.40±0.01ms   20.4±0.08ms
                max     ('numba', {'parallel': True})    3.11±0.02ms     70.0±3ms      695±20ms
                max     ('numba', {'parallel': False})   2.24±0.01ms    131±0.3ms      1.36±0s
                min            ('cython', None)            309±2μs     2.44±0.01ms   21.0±0.06ms
                min     ('numba', {'parallel': True})    3.11±0.01ms     69.3±3ms      705±2ms
                min     ('numba', {'parallel': False})   2.24±0.01ms    132±0.4ms     1.37±0.01s
                var            ('cython', None)           155±0.5μs    2.24±0.08ms    21.5±0.3ms
                var     ('numba', {'parallel': True})    1.41±0.01ms   10.3±0.06ms     101±1ms
                var     ('numba', {'parallel': False})   1.13±0.01ms    16.3±0.1ms     169±1ms
                mean           ('cython', None)           146±0.7μs    1.95±0.02ms    18.7±0.5ms
                mean    ('numba', {'parallel': True})     1.45±0.1ms    10.8±0.6ms     107±3ms
                mean    ('numba', {'parallel': False})   1.13±0.01ms   16.6±0.09ms     188±1ms
              ======== ================================ ============= ============= =============

[100.00%] ··· rolling.NumbaVSCython.time_roll_method                                                                                                           ok
[100.00%] ··· ======== ================================ ============= ============ ==========
              --                                                         cols
              ----------------------------------------- -------------------------------------
               method           engine_kwargs                 1           100         1000
              ======== ================================ ============= ============ ==========
                sum            ('cython', None)            452±5μs     26.3±0.2ms   309±20ms
                sum     ('numba', {'parallel': True})      372±20μs     12.6±1ms    132±3ms
                sum     ('numba', {'parallel': False})    346±0.7μs    14.4±0.3ms   190±2ms
                max            ('cython', None)            641±3μs      46.0±1ms    504±2ms
                max     ('numba', {'parallel': True})     3.18±0.2ms   32.4±0.5ms   328±3ms
                max     ('numba', {'parallel': False})    2.87±0.3ms   59.0±0.4ms   614±2ms
                min            ('cython', None)            653±10μs    43.9±0.1ms   503±3ms
                min     ('numba', {'parallel': True})    1.13±0.06ms   30.4±0.3ms   327±2ms
                min     ('numba', {'parallel': False})     1.28±0ms    58.1±0.8ms   614±2ms
                var            ('cython', None)            586±1μs     32.7±0.7ms   359±1ms
                var     ('numba', {'parallel': True})      440±10μs    13.0±0.4ms   140±5ms
                var     ('numba', {'parallel': False})     450±6μs     20.0±0.3ms   240±1ms
                mean           ('cython', None)            468±3μs     26.9±0.2ms   319±3ms
                mean    ('numba', {'parallel': True})      457±10μs    14.2±0.5ms   152±2ms
                mean    ('numba', {'parallel': False})     472±9μs     24.3±0.3ms   286±2ms
              ======== ================================ ============= ============ ==========

% NUMBA_NUM_THREADS=2 asv run -b rolling.NumbaVSCython

[ 75.00%] ··· rolling.NumbaVSCython.time_gb_method                                                                                                             ok
[ 75.00%] ··· ======== ================================ ============= ============= ============
              --                                                          cols
              ----------------------------------------- ----------------------------------------
               method           engine_kwargs                 1            100          1000
              ======== ================================ ============= ============= ============
                sum            ('cython', None)           438±100μs      3.90±2ms    38.1±20ms
                sum     ('numba', {'parallel': True})      1.87±1ms      12.5±6ms    80.5±10ms
                sum     ('numba', {'parallel': False})   1.12±0.08ms    12.3±0.7ms   120±0.6ms
                max            ('cython', None)            308±1μs     2.31±0.02ms   20.5±0.1ms
                max     ('numba', {'parallel': True})    2.30±0.01ms    72.7±0.5ms    758±6ms
                max     ('numba', {'parallel': False})   2.24±0.02ms    131±0.4ms     1.36±0s
                min            ('cython', None)            306±3μs     2.40±0.03ms   20.9±0.1ms
                min     ('numba', {'parallel': True})    2.31±0.01ms    71.2±0.7ms    752±7ms
                min     ('numba', {'parallel': False})     2.23±0ms     132±0.5ms     1.38±0s
                var            ('cython', None)            156±1μs     2.16±0.08ms   21.5±0.2ms
                var     ('numba', {'parallel': True})    1.17±0.01ms    10.8±0.6ms    104±1ms
                var     ('numba', {'parallel': False})   1.13±0.01ms   16.4±0.08ms   169±0.9ms
                mean           ('cython', None)            146±3μs     2.01±0.03ms   18.3±0.3ms
                mean    ('numba', {'parallel': True})    1.17±0.01ms    11.2±0.3ms    113±2ms
                mean    ('numba', {'parallel': False})   1.13±0.01ms    17.0±0.3ms    187±1ms
              ======== ================================ ============= ============= ============

[100.00%] ··· rolling.NumbaVSCython.time_roll_method                                                                                                           ok
[100.00%] ··· ======== ================================ ============= ============ ===========
              --                                                         cols
              ----------------------------------------- --------------------------------------
               method           engine_kwargs                 1           100          1000
              ======== ================================ ============= ============ ===========
                sum            ('cython', None)            455±6μs     26.4±0.4ms    306±2ms
                sum     ('numba', {'parallel': True})      286±1μs     10.4±0.4ms   131±0.3ms
                sum     ('numba', {'parallel': False})     351±3μs     14.5±0.3ms    190±1ms
                max            ('cython', None)            633±5μs     44.6±0.7ms    505±1ms
                max     ('numba', {'parallel': True})    2.04±0.02ms   33.5±0.1ms    358±3ms
                max     ('numba', {'parallel': False})   2.45±0.03ms   59.4±0.4ms    609±1ms
                min            ('cython', None)            647±5μs     45.7±0.2ms    505±3ms
                min     ('numba', {'parallel': True})      798±5μs     32.2±0.4ms    348±4ms
                min     ('numba', {'parallel': False})   1.27±0.01ms   58.2±0.4ms   608±0.8ms
                var            ('cython', None)            585±6μs     32.6±0.8ms    370±6ms
                var     ('numba', {'parallel': True})     336±0.8μs    13.1±0.4ms    157±2ms
                var     ('numba', {'parallel': False})     455±3μs     20.1±0.3ms   239±0.7ms
                mean           ('cython', None)            468±2μs     27.0±0.6ms    316±2ms
                mean    ('numba', {'parallel': True})      396±8μs     15.5±0.5ms   182±0.9ms
                mean    ('numba', {'parallel': False})     452±4μs     24.6±0.5ms    273±1ms
              ======== ================================ ============= ============ ===========
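To compare engines across these tables it can help to convert ASV's formatted timings into seconds and take ratios. The helper below is a hypothetical convenience (`parse_asv_time` is not part of ASV), shown with values copied from the 4-thread run at 1000 columns:

```python
import re

# Scale factor per SI prefix; "u" stands in for either micro character.
_SCALE = {"n": 1e-9, "u": 1e-6, "m": 1e-3, "": 1.0}
_PATTERN = re.compile(r"([\d.]+)±[\d.]+([nm\u03bc\u00b5u]?)s")


def parse_asv_time(text):
    """Convert a string such as '26.3±0.2ms' to seconds, ignoring the error term."""
    match = _PATTERN.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognized timing: {text!r}")
    value, prefix = match.groups()
    if prefix in ("\u03bc", "\u00b5"):
        prefix = "u"
    return float(value) * _SCALE[prefix]


# time_roll_method, sum, 1000 cols: cython vs numba parallel=True
roll_speedup = parse_asv_time("309±20ms") / parse_asv_time("132±3ms")

# time_gb_method, sum, 1000 cols: numba parallel=True vs cython
gb_slowdown = parse_asv_time("76.6±0.5ms") / parse_asv_time("19.8±0.3ms")

print(round(roll_speedup, 2), round(gb_slowdown, 2))
```

For these two rows, the Numba parallel engine is about 2.34 times faster than Cython for the plain rolling sum, and about 3.87 times slower for the groupby rolling sum.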

mroeschke commented 2 years ago

No code modifications; the benchmark is now parametrized over thread count (results are unchanged from above).

[ 75.00%] ··· ======== ================================ ====== ============= =============
              --                                                         threads
              ------------------------------------------------ ---------------------------
               method           engine_kwargs            cols        2             4
              ======== ================================ ====== ============= =============
                sum            ('cython', None)           1       363±10μs      376±20μs
                sum            ('cython', None)          100     2.38±0.1ms    2.45±0.7ms
                sum            ('cython', None)          1000    19.8±0.8ms     21.1±4ms
                sum     ('numba', {'parallel': True})     1     1.15±0.08ms   1.36±0.03ms
                sum     ('numba', {'parallel': True})    100     8.54±0.3ms    8.44±0.3ms
                sum     ('numba', {'parallel': True})    1000     82.3±8ms      82.1±3ms
                sum     ('numba', {'parallel': False})    1     1.08±0.01ms   1.09±0.01ms
                sum     ('numba', {'parallel': False})   100    12.0±0.03ms   12.0±0.09ms
                sum     ('numba', {'parallel': False})   1000     121±2ms       121±2ms
                max            ('cython', None)           1       315±1μs      317±0.2μs
                max            ('cython', None)          100    2.39±0.01ms   2.39±0.01ms
                max            ('cython', None)          1000    20.5±0.2ms    20.5±0.2ms
                max     ('numba', {'parallel': True})     1     2.33±0.01ms   3.10±0.01ms
                max     ('numba', {'parallel': True})    100      73.7±1ms      70.1±1ms
                max     ('numba', {'parallel': True})    1000     767±3ms      712±0.7ms
                max     ('numba', {'parallel': False})    1     2.25±0.02ms   2.26±0.04ms
                max     ('numba', {'parallel': False})   100      132±2ms       131±2ms
                max     ('numba', {'parallel': False})   1000     1.37±0s       1.37±0s
                min            ('cython', None)           1       312±1μs      315±0.3μs
                min            ('cython', None)          100    2.43±0.01ms   2.43±0.01ms
                min            ('cython', None)          1000    20.9±0.1ms    20.9±0.2ms
                min     ('numba', {'parallel': True})     1       2.31±0ms    3.13±0.01ms
                min     ('numba', {'parallel': True})    100     71.7±0.3ms     69.5±2ms
                min     ('numba', {'parallel': True})    1000     766±2ms       714±4ms
                min     ('numba', {'parallel': False})    1     2.30±0.04ms   2.26±0.02ms
                min     ('numba', {'parallel': False})   100      133±1ms       132±1ms
                min     ('numba', {'parallel': False})   1000    1.40±0.03s     1.38±0s
                var            ('cython', None)           1       163±1μs       161±3μs
                var            ('cython', None)          100    2.17±0.03ms   2.31±0.04ms
                var            ('cython', None)          1000    21.5±0.1ms   21.5±0.07ms
                var     ('numba', {'parallel': True})     1     1.17±0.01ms   1.42±0.02ms
                var     ('numba', {'parallel': True})    100    10.9±0.02ms    10.5±0.1ms
                var     ('numba', {'parallel': True})    1000     104±1ms       101±2ms
                var     ('numba', {'parallel': False})    1       1.13±0ms    1.15±0.01ms
                var     ('numba', {'parallel': False})   100     16.3±0.1ms   16.4±0.09ms
                var     ('numba', {'parallel': False})   1000     170±2ms       170±1ms
                mean           ('cython', None)           1       154±1μs      154±0.9μs
                mean           ('cython', None)          100    1.99±0.04ms   1.95±0.05ms
                mean           ('cython', None)          1000    19.1±0.9ms    18.8±0.3ms
                mean    ('numba', {'parallel': True})     1     1.18±0.01ms   1.44±0.01ms
                mean    ('numba', {'parallel': True})    100    10.9±0.03ms    10.5±0.1ms
                mean    ('numba', {'parallel': True})    1000     118±1ms       104±3ms
                mean    ('numba', {'parallel': False})    1     1.15±0.01ms   1.14±0.01ms
                mean    ('numba', {'parallel': False})   100    16.6±0.03ms    17.0±0.2ms
                mean    ('numba', {'parallel': False})   1000     186±1ms      187±0.9ms
              ======== ================================ ====== ============= =============

[100.00%] ··· rolling.NumbaVSCython.time_roll_method                                                                                                           ok
[100.00%] ··· ======== ================================ ============= ============= ============= ============= ========== ==========
              --                                                                        cols / threads
              ----------------------------------------- -----------------------------------------------------------------------------
               method           engine_kwargs               1 / 2         1 / 4        100 / 2       100 / 4     1000 / 2   1000 / 4
              ======== ================================ ============= ============= ============= ============= ========== ==========
                sum            ('cython', None)            459±7μs       461±5μs      26.3±0.2ms   26.0±0.09ms   308±3ms    308±4ms
                sum     ('numba', {'parallel': True})      293±2μs       371±20μs     10.4±0.4ms    12.1±0.5ms   131±1ms    131±3ms
                sum     ('numba', {'parallel': False})     370±8μs       367±5μs      14.4±0.3ms    14.9±0.3ms   192±1ms    191±3ms
                max            ('cython', None)            646±10μs      642±2μs       45.5±1ms     44.1±0.2ms   504±2ms    505±3ms
                max     ('numba', {'parallel': True})    2.07±0.01ms    3.33±0.2ms    33.9±0.4ms    32.0±0.5ms   362±2ms    329±2ms
                max     ('numba', {'parallel': False})   2.49±0.01ms   2.48±0.01ms    58.1±0.3ms    59.1±0.6ms   611±3ms    614±10ms
                min            ('cython', None)            654±5μs       652±5μs     44.2±0.08ms    44.7±0.8ms   506±5ms    510±5ms
                min     ('numba', {'parallel': True})      812±10μs    1.12±0.05ms    33.0±0.5ms    30.8±0.9ms   364±1ms    331±4ms
                min     ('numba', {'parallel': False})   1.31±0.02ms   1.29±0.01ms    57.7±0.4ms    57.7±0.4ms   616±2ms    614±4ms
                var            ('cython', None)            599±2μs       597±8μs      31.8±0.5ms    31.7±0.1ms   368±3ms    366±1ms
                var     ('numba', {'parallel': True})     343±0.6μs      442±20μs     12.8±0.4ms    13.2±0.4ms   158±2ms    141±3ms
                var     ('numba', {'parallel': False})     459±4μs       461±4μs      20.0±0.3ms    19.8±0.3ms   239±2ms    240±3ms
                mean           ('cython', None)            485±6μs       480±10μs     27.1±0.3ms    27.0±0.1ms   323±3ms    322±2ms
                mean    ('numba', {'parallel': True})      393±3μs       476±8μs      15.3±0.4ms    14.6±0.4ms   181±3ms    156±3ms
                mean    ('numba', {'parallel': False})     459±1μs       466±1μs      24.8±0.2ms    24.5±0.6ms   288±5ms    282±10ms
              ======== ================================ ============= ============= ============= ============= ========== ==========

mroeschke commented 2 years ago

Cross-reference: https://github.com/numba/numba/issues/4031. Note, however, that our functions do not specify parallel=False in the inner function kernels.

mroeschke commented 2 years ago

Some other parallel diagnostics from this local timeit test setup:

import numba
import numpy as np
import pandas as pd

cols = 1000
df = pd.DataFrame(np.random.randn(10_000, cols))
roll = df.rolling(100)
# Warm-up call so the compiled function is cached before timing
roll.mean(engine="numba", engine_kwargs={"nopython": True, "nogil": True, "parallel": True})
%timeit roll.mean(engine="numba", engine_kwargs={"nopython": True, "nogil": True, "parallel": True})

Threading backend

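=========== =====================================================================
 Backend     timeit
=========== =====================================================================
 omp         209 ms ± 7.12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
 tbb         221 ms ± 5.19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
 workqueue   220 ms ± 6.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
=========== =====================================================================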

Setting threads with omp backend

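========= =====================================================================
 Threads   timeit
========= =====================================================================
 1         347 ms ± 26 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
 2         201 ms ± 2.97 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
 3         206 ms ± 4.67 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
========= =====================================================================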
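The backend and thread-count comparisons above can be reproduced by setting Numba's `NUMBA_THREADING_LAYER` and `NUMBA_NUM_THREADS` environment variables before the benchmark process starts. The sketch below only builds the environment mappings; `numba_env` is a hypothetical helper, and how the benchmark is launched is left as an assumption.

```python
import os

# Threading layers compared in the table above.
VALID_LAYERS = ("omp", "tbb", "workqueue")


def numba_env(threads, layer="omp", base=None):
    """Return an environment mapping for launching a benchmark subprocess."""
    if layer not in VALID_LAYERS:
        raise ValueError(f"unknown threading layer: {layer!r}")
    if threads < 1:
        raise ValueError("threads must be at least 1")
    # Copy so the caller's (or the process's) environment is not mutated.
    env = dict(os.environ if base is None else base)
    env["NUMBA_NUM_THREADS"] = str(threads)
    env["NUMBA_THREADING_LAYER"] = layer
    return env


# One environment per (backend, thread count) pair; base={} keeps the demo minimal.
configs = [numba_env(t, layer, base={}) for layer in VALID_LAYERS for t in (1, 2, 3)]
```

Each mapping could then be passed as `env=` to `subprocess.run` for a command such as `asv run -b rolling.NumbaVSCython`. Setting the variables before Numba is imported matters because they are read at import time.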

Numba function setup

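====================== =====================================================================
 Setup                  timeit
====================== =====================================================================
 2D w/ np.nanmean       933 ms ± 16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
 2D w/ custom nanmean   634 ms ± 34 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
====================== =====================================================================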
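The thread does not show the custom nanmean that was timed. As a rough illustration only, a hand-written NaN-skipping mean applied per column over a 2D input might look like the sketch below. It is plain Python with hypothetical names; in the setting discussed here such loops would presumably be compiled with Numba, and pandas' `min_periods` semantics are not modeled.

```python
import math


def nanmean(values):
    """Mean of the non-NaN entries; NaN if there are none."""
    total = 0.0
    count = 0
    for v in values:
        if not math.isnan(v):
            total += v
            count += 1
    return total / count if count else math.nan


def rolling_nanmean_2d(table, window):
    """Apply nanmean over each trailing window, one column at a time."""
    n_rows = len(table)
    n_cols = len(table[0]) if n_rows else 0
    out = [[math.nan] * n_cols for _ in range(n_rows)]
    for j in range(n_cols):  # columns are independent of each other
        column = [row[j] for row in table]
        for i in range(window - 1, n_rows):
            out[i][j] = nanmean(column[i + 1 - window : i + 1])
    return out


table = [[1.0, math.nan], [3.0, 4.0], [math.nan, 6.0]]
result = rolling_nanmean_2d(table, 2)
```

Because each column is processed independently, the outer loop over columns is the natural place for parallelism in a 2D kernel of this shape.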