nalepae / pandarallel

A simple and efficient tool to parallelize Pandas operations on all available CPUs
https://nalepae.github.io/pandarallel
BSD 3-Clause "New" or "Revised" License

Python 3.12.1 with pandarallel==1.6.5: parallel_apply run time increased almost 3x #261

Open mdclone-oa opened 9 months ago

mdclone-oa commented 9 months ago

General


After upgrading from Python 3.10 to Python 3.12, the run time of parallel_apply increased almost 3x. I am running in Docker on Red Hat Enterprise Linux 8.9 (Ootpa).

Here is the OS information for the Docker environment:

NAME="Red Hat Enterprise Linux"
VERSION="8.9 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.9"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.9 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::baseos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8"
BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_BUGZILLA_PRODUCT_VERSION=8.9
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.9"

Python 3.12 packages

annotated-types==0.6.0
astroid==3.0.1
attrs==23.1.0
Cerberus==1.3.5
certifi==2023.11.17
charset-normalizer==3.3.2
contourpy==1.2.0
coverage==7.3.2
cycler==0.12.1
debugpy==1.8.0
dill==0.3.7
distlib==0.3.7
docopt==0.6.2
execnet==2.0.2
fonttools==4.46.0
idna==3.6
iniconfig==2.0.0
isort==5.13.0
Jinja2==3.1.2
joblib==1.3.2
jsonschema==4.20.0
jsonschema-specifications==2023.11.2
kiwisolver==1.4.5
MarkupSafe==2.1.3
matplotlib==3.8.2
mccabe==0.7.0
mlxtend==0.23.0
numpy==1.26.2
packaging==23.2
pandarallel==1.6.5
pandas==2.1.3
pep517==0.13.1
pika==1.3.2
Pillow==10.1.0
pip-api==0.0.30
pipreqs==0.4.13
platformdirs==4.1.0
plette==0.4.4
pluggy==1.3.0
psutil==5.9.6
py-cpuinfo==9.0.0
pydantic==2.5.2
pydantic_core==2.14.5
pylint==3.0.2
pyparsing==3.1.1
pytest==7.4.3
pytest-benchmark==4.0.0
pytest-cov==4.1.0
pytest-html==4.1.1
pytest-metadata==3.0.0
pytest-mock==3.12.0
pytest-order==1.2.0
pytest-ordering==0.6
pytest-timeout==2.2.0
pytest-xdist==3.4.0
python-dateutil==2.8.2
pytz==2023.3.post1
redis==5.0.1
referencing==0.32.0
requests==2.31.0
requirementslib==3.0.0
rpds-py==0.13.2
scikit-learn==1.3.2
scipy==1.11.4
seaborn==0.13.0
setuptools==68.2.2
six==1.16.0
threadpoolctl==3.2.0
tomlkit==0.12.3
typing_extensions==4.9.0
tzdata==2023.3
urllib3==2.1.0
yarg==0.1.9

I can't share all of my code, but here is part of it:

results = combined.groupby(by='NewGroup').parallel_apply(
    lambda group: TestClass(data=group.drop(columns=columns, inplace=False)).run()
)

TestClass - the class is initialized with the new data after the drop
columns - a list of columns that we need to drop
run - the function that runs on each group
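For anyone trying to reproduce this, the pattern above can be sketched as a small self-contained script. The real TestClass is not shown in the issue, so the class below is a hypothetical stand-in with a placeholder workload, and run_groups, the sample data, and the "tmp" column are illustrative names only. The parallel branch assumes pandarallel is installed; the serial branch uses plain groupby().apply() so that the same function can be run both ways.

```python
import pandas as pd


class TestClass:
    """Hypothetical stand-in; the real TestClass is not shown in the issue."""

    def __init__(self, data: pd.DataFrame):
        self.data = data

    def run(self) -> float:
        # Placeholder workload: sum of all remaining numeric values.
        return float(self.data.select_dtypes(include="number").to_numpy().sum())


def run_groups(combined: pd.DataFrame, columns: list, parallel: bool = False) -> pd.Series:
    """Apply TestClass(...).run() to each 'NewGroup' group, serially or in parallel."""
    grouped = combined.groupby(by="NewGroup")

    def per_group(group: pd.DataFrame) -> float:
        return TestClass(data=group.drop(columns=columns)).run()

    if parallel:
        # Imported lazily so the serial path works without pandarallel installed.
        from pandarallel import pandarallel

        pandarallel.initialize(progress_bar=False)
        return grouped.parallel_apply(per_group)
    return grouped.apply(per_group)


# Illustrative data only; the real DataFrame is not part of the issue.
combined = pd.DataFrame(
    {"NewGroup": ["a", "a", "b"], "x": [1, 2, 3], "tmp": [10, 20, 30]}
)
columns = ["tmp"]
results = run_groups(combined, columns, parallel=False)
```

Running the serial path on both interpreters first may help show whether the slowdown comes from the per-group work itself or from the parallel dispatch.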

The servers are the same and the code didn't change, but the run time still increased almost 3x.

With Python 3.10.11, pandarallel==1.6.5, and pandas==2.0.0, the same DataFrame takes 2.49 min; with Python 3.12.1 it takes 7.22 min.
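Note that the package list above shows pandas==2.1.3 in the 3.12 environment, while the 3.10 run used pandas==2.0.0, so more than the interpreter differs between the two measurements. A minimal sketch for recording the installed versions next to a timing, using only the standard library, could look like the following. The helper names (package_version, time_call, slowdown) are hypothetical, and the figures in the last line are the ones reported above, read as decimal minutes.

```python
import sys
import time
from importlib import metadata


def package_version(name: str) -> str:
    """Return the installed version of a package, or a marker if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def time_call(func, *args, repeats: int = 3, **kwargs) -> float:
    """Return the best wall-clock time in seconds over `repeats` calls."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return best


def slowdown(baseline: float, candidate: float) -> float:
    """Ratio of the candidate run time to the baseline run time."""
    return candidate / baseline


if __name__ == "__main__":
    print("python", sys.version.split()[0])
    for name in ("pandas", "pandarallel", "dill", "numpy"):
        print(name, package_version(name))
    # Run times reported in the issue, taken as decimal minutes.
    print("slowdown:", round(slowdown(2.49, 7.22), 2))
```

Timing the same workload with time_call in both environments, first with pandas pinned to the same version, would help separate a Python 3.12 effect from a pandas 2.0 to 2.1 effect.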

nalepae commented 8 months ago

Pandaral·lel is looking for a maintainer! If you are interested, please open a GitHub issue.