mflova closed this issue 5 months ago.
On the other hand, I saw that, of those 0.336 s for the pd.Series, 0.1 s (30%) was spent just parsing the file containing the unit definitions. That file is not expected to change during the Python runtime, so wouldn't it be possible to cache this part? This happens inside pint itself, though...
Reading the definitions file is cached in the latest pint release.
Can you try the different methods in https://pint-pandas.readthedocs.io/en/latest/user/initializing.html?
It would be good to add a note in the docs suggesting the most performant method.
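For reference, a minimal sketch of enabling the definitions cache explicitly; this assumes a pint version that supports the cache_folder argument, and pointing pint-pandas at the shared registry follows the initializing docs linked above:

import pint
import pint_pandas

# ":auto:" lets pint pick a platform cache directory, so the parsed
# definitions file is reused across runs instead of re-parsed each time.
ureg = pint.UnitRegistry(cache_folder=":auto:")
pint.set_application_registry(ureg)
pint_pandas.PintType.ureg = ureg  # share the same registry with pint-pandas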
You can also use a SparseArray as the magnitude of the PintArray:

# Build a sparse float array and wrap it as the PintArray's magnitude
sa = pd.arrays.SparseArray([1, 2, 3] * M, fill_value=np.nan, dtype=np.float64)
pa = pint_pandas.PintArray(sa, dtype="pint[rpm]")
type(pa.data)  # inspect the type of the stored magnitudes
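A short follow-up usage sketch (reusing the names above; the dtype comment is what I would expect to see, not verified here):

# A sparse-backed PintArray works as a regular DataFrame column:
# the magnitudes stay sparse while the column carries pint units.
df = pd.DataFrame({"speed": pa})
df.dtypes  # the "speed" column should report a pint[...] dtype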
If you want better support for storing data in SparseArrays or other array types, do comment in https://github.com/hgrecco/pint-pandas/issues/192.
Sure, I did a quick benchmark. Ordered from quickest to slowest (test_series is just a standard pd.Series with np.float64; the number in parentheses indicates the factor relative to the best result found):
Code used:
# Requires pytest, pytest-benchmark and the pint-related dependencies
import numpy as np
import pandas as pd
import pint_pandas
import pytest

PA_ = pint_pandas.PintArray
ureg = pint_pandas.PintType.ureg
Q_ = ureg.Quantity


@pytest.fixture
def M() -> int:
    return 1_000


# Baseline: a plain float64 Series, no units involved
def test_series(M: int, benchmark) -> None:
    benchmark(lambda: pd.DataFrame({"A": pd.Series([0] * M, dtype=np.float64)}))


def test_A(M: int, benchmark) -> None:
    benchmark(lambda: pd.DataFrame({"A": pd.Series([0] * M, dtype="pint[m]")}))


def test_B(M: int, benchmark) -> None:
    benchmark(lambda: pd.DataFrame({"B": pd.Series([0] * M).astype("pint[m]")}))


def test_C(M: int, benchmark) -> None:
    benchmark(lambda: pd.DataFrame({"C": PA_([0] * M, dtype="pint[m]")}))


def test_D(M: int, benchmark) -> None:
    benchmark(lambda: pd.DataFrame({"D": PA_([0] * M, dtype="m")}))


def test_E(M: int, benchmark) -> None:
    benchmark(lambda: pd.DataFrame({"E": PA_([0] * M, dtype=ureg.m)}))


def test_F(M: int, benchmark) -> None:
    benchmark(lambda: pd.DataFrame({"F": PA_.from_1darray_quantity(Q_([0] * M, ureg.m))}))


def test_G(M: int, benchmark) -> None:
    benchmark(lambda: pd.DataFrame({"G": PA_(Q_([0] * M, ureg.m))}))
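With pytest-benchmark installed, a plain pytest run on this file collects and times all of the cases, e.g. pytest benchmark_pint.py --benchmark-columns=mean,stddev (the file name and column selection here are illustrative).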
I also benchmarked the time penalties for both pd.Series and pd.arrays.SparseArray in a more realistic way. To be honest, compared to the built-in pandas implementation, it is not that bad when there is not much data:

Sparse data: [benchmark results figure]

Dense data: [benchmark results figure]

Here, "quick" refers to the quickest method found in the previous comment; the other one is just the "standard" way. In this case, the pint implementation is only 1.35 times slower than the built-in pandas one for the SparseArray. Looking at these numbers, I am not sure there is room for improvement. I will close the issue and re-open it if this becomes a problem :)
I would like to use this tool, but such a big performance issue makes it unusable with big sparse arrays. Below is the code I use to benchmark the issue. I usually work with dataframes with over 1M columns, but the benchmark uses just 100k. Although this tendency can also be seen with pd.Series (as expected), sparse arrays suffer a much bigger performance penalty:

[benchmark code and output]

Here is the output of pyinstrument:

[profiler output]

It seems the main penalty comes from creating a list in which N Quantity objects are instantiated. I tried some quick alternatives but did not manage to find anything that does not break. Any ideas? Wouldn't it be possible to just create a numpy array with one assigned quantity instead of N Quantity objects? Thanks!
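For illustration, here is a minimal sketch of the two construction patterns being contrasted (plain pint, arbitrary values; this is not the pint-pandas internals themselves):

import numpy as np
import pint

ureg = pint.UnitRegistry()
N = 100_000

# One Quantity wrapping a whole ndarray: a single wrapper object,
# with the unit stored once.
q_array = ureg.Quantity(np.zeros(N), "m")

# N scalar Quantity objects: one wrapper per element -- this is the
# pattern the profiler output points at.
q_scalars = [ureg.Quantity(0.0, "m") for _ in range(N)]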