pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

PERF: Slow pandas import on Kubernetes #48967

Closed saul-data closed 2 years ago

saul-data commented 2 years ago

Pandas version checks

Reproducible Example

The import of pandas 1.5 seems very slow on OpenShift: between 2 and 5 seconds. Is there any way to speed that up? I tried putting the pip files into a tmpfs folder using a memory volume, but that is a very tricky installation and the performance benefit was marginal. It seems to be OK for IPython-style use like Jupyter. We use pandas a lot for running Python scripts in a DAG / data pipeline; each Python script is executed with `/bin/sh -c python3 pythonfile.py`, using https://github.com/dataplane-app/dataplane as our data pipeline / ETL tool.

Here is my test:

from datetime import datetime

start = datetime.now()
import pandas as pd 
# import redis
duration = datetime.now() - start

print(str(duration))
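A slightly more robust variant of the same measurement uses `time.perf_counter`, which is better suited to timing short intervals than wall-clock `datetime`. In this self-contained sketch, `json` stands in for pandas so anyone can run it:

```python
import time

start = time.perf_counter()
import json  # stand-in for "import pandas as pd"
duration = time.perf_counter() - start

print(f"import took {duration * 1000:.1f} ms")
```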

Installed Versions

INSTALLED VERSIONS
------------------
commit : 87cfe4e38bafe7300a6003a1d18bd80f3f77c763
python : 3.10.6.final.0
python-bits : 64
OS : Linux
OS-release : 5.10.0-0.bpo.9-amd64
Version : #1 SMP Debian 5.10.70-1~bpo10+1 (2021-10-10)
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : None
LOCALE : en_US.UTF-8
pandas : 1.5.0
numpy : 1.23.3
pytz : 2022.2.1
dateutil : 2.8.2
setuptools : 59.6.0
pip : 22.0.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.9.3
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : 2022.8.2
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : 3.0.10
pandas_gbq : None
pyarrow : 9.0.0
pyreadstat : None
pyxlsb : None
s3fs : 2022.8.2
scipy : None
snappy : None
sqlalchemy : 1.4.40
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None

Prior Performance

Prior version 1.0.1 had a much better import time, around 1 second (which still feels pretty slow).

akx commented 2 years ago

If you can install the `moreutils` package, which contains `ts`, you can use Python's `-v` (verbose) mode

python -v -c 'import pandas' 2>&1 | ts "%.S"

to get a timestamped log of which imports are taking time. Naturally, you can also run the program under cProfile to get real profiling output and figure out where the bottlenecks are.
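As a concrete sketch of the cProfile route (assuming CPython 3.7+), the profiler can be driven from Python itself; `json` below is a stand-in for whatever heavy module you are investigating:

```python
import cProfile
import io
import pstats

# Profile an import and print the five most expensive call sites.
profiler = cProfile.Profile()
profiler.enable()
import json  # stand-in for the module under investigation
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

CPython also has a built-in alternative that needs no tooling at all: `python -X importtime -c 'import pandas'` prints a per-module import-time tree to stderr.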

saul-data commented 2 years ago

@akx thank you, I will give that a try this week and share the results.

phofl commented 2 years ago

Closing for now, please ping to reopen, when you can provide more information

saul-data commented 2 years ago

@phofl I think I have figured out why the import is so slow. You won't see the slowdown on a MacBook or any fast local SSD. You will see it with, say, 0.3 CPU in Kubernetes, where the disk is often mounted over the network. I have seen up to 12 seconds there.

Take for example

import pandas as pd
pd.read_csv() 

Because `__init__.py` imports everything (and exposes it via `__all__`) to make the package user friendly, this loads the entire library, not just `read_csv()`. That is a very large number of files to read, and it clearly needs more CPU to pull them all in.

https://github.com/pandas-dev/pandas/blob/main/pandas/__init__.py

I am not sure if it is possible to import only the specific modules that are needed instead of the entire package. Maybe I am doing something wrong, but `__all__` in the `__init__.py` file seems to prevent that.
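One caveat worth noting (standard Python import semantics, not specific to pandas): importing a submodule still executes every parent package's `__init__.py` first, so something like `from pandas.io.parsers import read_csv` would not skip `pandas/__init__.py`. A small stdlib demonstration:

```python
import importlib
import sys

# Importing a deeply nested submodule pulls in its parent packages too:
# each parent's __init__.py runs before the submodule itself executes.
importlib.import_module("email.mime.text")

# Both parent packages are now present in sys.modules.
print("email" in sys.modules, "email.mime" in sys.modules)
```

So any savings from a direct submodule import come from skipping *sibling* modules that `__init__.py` would otherwise pull in, not from skipping `__init__.py` itself.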

I saw this in my own Python package: I got a significant speed improvement when I didn't use `__all__` and referenced the modules directly. I tested it using @akx's command, and I could see all the files it was trying to import. The first command below imported far fewer files than the second.

So with pandas, I feel that if you are only going to use, say, `read_csv()`, it should only import the specific files needed, and that should give a significant speed improvement.

python -v -c 'from dataplane.pipelinerun.data_persist.redis_store import pipeline_redis_store'

100ms import time

vs

python -v -c 'import dataplane'

2 second import time

Example difference

from dataplane.pipelinerun.data_persist.redis_store import pipeline_redis_store
pipeline_redis_store()

instead of

import dataplane
dataplane.pipeline_redis_store()
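For completeness, here is a minimal sketch of the lazy-attribute mechanism (PEP 562, Python 3.7+) that a package `__init__.py` can use to keep a friendly top-level namespace while deferring heavy imports until first use. The `demo_pkg` module and its attribute map are hypothetical stand-ins, built in memory so the snippet is self-contained; whether pandas could adopt this is a separate design question:

```python
import importlib
import sys
import types

# Build a tiny demo package in memory. Its module-level __getattr__
# (the PEP 562 hook) imports the real provider module only when the
# attribute is first accessed, not at package import time.
demo = types.ModuleType("demo_pkg")

_providers = {"loads": "json"}  # hypothetical attribute -> module map

def _lazy_getattr(name):
    if name in _providers:
        module = importlib.import_module(_providers[name])
        return getattr(module, name)
    raise AttributeError(name)

demo.__getattr__ = _lazy_getattr
sys.modules["demo_pkg"] = demo

import demo_pkg  # cheap: the provider module has not been imported yet
print(demo_pkg.loads('{"a": 1}'))  # first access triggers the real import
```

In a real package this `__getattr__` would live in `__init__.py` itself, so `import pkg` stays cheap and `pkg.heavy_thing` pays the import cost only when actually used.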

saul-data commented 2 years ago

@akx @phofl here is that output; pandas in Kubernetes imports in 3.4 seconds:

python -v -c 'import pandas' 2>&1 | ts "%.S"

https://gist.github.com/saul-data/250d977df5f834a82c6bfea85d64f16a#file-pandas-import-time-kubernetes-log

phofl commented 2 years ago

Can you open a dedicated issue for this (improving import performance)? It is not really related to Kubernetes itself, and a separate issue should help focus the discussion.

saul-data commented 2 years ago

@phofl can I not just change the title and keep the thread going?

phofl commented 2 years ago

I think it would be helpful for the discussion if the issue does not contain much irrelevant information. I can open this one too, if you feel strongly about it, but in my experience keeping issues narrow helps a lot