mflova closed this issue 5 months ago.
On the other hand, I saw that, of those 0.336 s for the pd.Series, 0.1 s (30%) was spent just parsing the file containing the unit definitions. That file is not expected to change during the Python runtime, so wouldn't it be possible to cache this part? This happens inside pint itself, though...
Reading the definitions file is cached in the latest pint release.
Can you try the different methods in https://pint-pandas.readthedocs.io/en/latest/user/initializing.html?
It would be good to add a note in the docs suggesting the most performant method.
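For reference, a minimal sketch of enabling the definitions cache explicitly; this assumes a pint version that supports the cache_folder argument, and pointing pint-pandas at the shared registry follows the initializing docs linked above:

import pint
import pint_pandas

# ":auto:" lets pint pick a platform cache directory, so the parsed
# definitions file is reused across runs instead of re-parsed each time.
ureg = pint.UnitRegistry(cache_folder=":auto:")
pint.set_application_registry(ureg)
pint_pandas.PintType.ureg = ureg  # share the same registry with pint-pandas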
You can also use a SparseArray as the magnitude of the PintArray:

# Build a sparse float array and wrap it as the PintArray's magnitude
sa = pd.arrays.SparseArray([1, 2, 3] * M, fill_value=np.nan, dtype=np.float64)
pa = pint_pandas.PintArray(sa, dtype="pint[rpm]")
type(pa.data)  # inspect the type of the stored magnitudes
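A short follow-up usage sketch (reusing the names above; the dtype comment is what I would expect to see, not verified here):

# A sparse-backed PintArray works as a regular DataFrame column:
# the magnitudes stay sparse while the column carries pint units.
df = pd.DataFrame({"speed": pa})
df.dtypes  # the "speed" column should report a pint[...] dtype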
If you want better support for storing data in SparseArrays or other array types, do comment in https://github.com/hgrecco/pint-pandas/issues/192.
Sure, I did a quick benchmark. Ordered from quickest to slowest (test_series is just a standard pd.Series with np.float64; the number in parentheses indicates the factor relative to the best result found):
Code used:
# Requires pytest, pytest-benchmark and the pint-related dependencies
import numpy as np
import pandas as pd
import pint_pandas
import pytest

PA_ = pint_pandas.PintArray
ureg = pint_pandas.PintType.ureg
Q_ = ureg.Quantity


@pytest.fixture
def M() -> int:
    return 1_000


# Baseline: a plain float64 Series, no units involved
def test_series(M: int, benchmark) -> None:
    benchmark(lambda: pd.DataFrame({"A": pd.Series([0] * M, dtype=np.float64)}))


def test_A(M: int, benchmark) -> None:
    benchmark(lambda: pd.DataFrame({"A": pd.Series([0] * M, dtype="pint[m]")}))


def test_B(M: int, benchmark) -> None:
    benchmark(lambda: pd.DataFrame({"B": pd.Series([0] * M).astype("pint[m]")}))


def test_C(M: int, benchmark) -> None:
    benchmark(lambda: pd.DataFrame({"C": PA_([0] * M, dtype="pint[m]")}))


def test_D(M: int, benchmark) -> None:
    benchmark(lambda: pd.DataFrame({"D": PA_([0] * M, dtype="m")}))


def test_E(M: int, benchmark) -> None:
    benchmark(lambda: pd.DataFrame({"E": PA_([0] * M, dtype=ureg.m)}))


def test_F(M: int, benchmark) -> None:
    benchmark(lambda: pd.DataFrame({"F": PA_.from_1darray_quantity(Q_([0] * M, ureg.m))}))


def test_G(M: int, benchmark) -> None:
    benchmark(lambda: pd.DataFrame({"G": PA_(Q_([0] * M, ureg.m))}))
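With pytest-benchmark installed, a plain pytest run on this file collects and times all of the cases, e.g. pytest benchmark_pint.py --benchmark-columns=mean,stddev (the file name and column selection here are illustrative).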
I also benchmarked the time penalties for both pd.Series and pd.arrays.SparseArray in a more realistic way. To be honest, compared to the built-in pandas implementation, it is not that bad when there is not much data:

Sparse data: [benchmark results figure]

Dense data: [benchmark results figure]

Here, "quick" refers to the quickest method found in the previous comment; the other one is just the "standard" way. In this case, the pint implementation is only 1.35 times slower than the built-in pandas one for the SparseArray. Looking at these numbers, I am not sure there is room for improvement. I will close the issue and re-open it if this becomes a problem :)
I would like to use this tool, but such a big performance issue makes it unusable with big sparse arrays. Below is the code I use to benchmark the issue. I usually work with dataframes with over 1M columns, but the benchmark uses just 100k. Although this tendency can also be seen with pd.Series (as expected), sparse arrays suffer a much bigger performance penalty:

[benchmark code and output]

Here is the output of pyinstrument:

[profiler output]

It seems the main penalty comes from creating a list in which N Quantity objects are instantiated. I tried some quick alternatives but did not manage to find anything that does not break. Any ideas? Wouldn't it be possible to just create a numpy array with one assigned quantity instead of N Quantity objects? Thanks!
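For illustration, here is a minimal sketch of the two construction patterns being contrasted (plain pint, arbitrary values; this is not the pint-pandas internals themselves):

import numpy as np
import pint

ureg = pint.UnitRegistry()
N = 100_000

# One Quantity wrapping a whole ndarray: a single wrapper object,
# with the unit stored once.
q_array = ureg.Quantity(np.zeros(N), "m")

# N scalar Quantity objects: one wrapper per element -- this is the
# pattern the profiler output points at.
q_scalars = [ureg.Quantity(0.0, "m") for _ in range(N)]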