Closed saul-data closed 2 years ago
If you can install the moreutils package, which includes ts, you can use Python's -v (verbose) mode:
python -v -c 'import pandas' 2>&1 | ts "%.S"
to get a timestamped log of which imports are taking time. Naturally, you can also run the program under cProfile
to get real profiling output and figure out where the bottlenecks are.
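To illustrate, here is a minimal sketch of profiling an import with cProfile from inside Python, using json as a lightweight stand-in for pandas:

```python
import cProfile
import io
import pstats

profiler = cProfile.Profile()
profiler.enable()
import json  # stand-in for a heavier import such as pandas
profiler.disable()

# Print the five most expensive calls by cumulative time
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

The same breakdown is available from the shell with python -m cProfile -c 'import pandas', sorted however you like.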
@akx thank you, I will give that a try this week and share the results.
Closing for now; please ping to reopen when you can provide more information.
@phofl I think I have figured out why the import is so slow. You won't see the slowdown on a MacBook or any fast local SSD. You will see it with, say, 0.3 CPU in Kubernetes, where the disk is often mounted over the network. I have seen imports take up to 12 seconds there.
Take for example
import pandas as pd
pd.read_csv()
Because the __init__.py
file eagerly imports everything (exposing it via __all__
) to make it user friendly, it is loading the entire library, not just read_csv() - this is a very large number of files to read, which clearly needs more CPU and I/O when the disk is over a network.
https://github.com/pandas-dev/pandas/blob/main/pandas/__init__.py
I am not sure if it is possible to import only the specific modules that are needed instead of the entire package. Maybe I am doing something wrong, but the eager imports behind __all__
in the __init__.py
file seem to prevent that.
I saw this in my own Python package: I got a significant speed improvement when I didn't use __all__
and referenced the modules directly. I tested it with @akx's command, so I could see all the files it was trying to import. The first command below imported far fewer files than the second.
So with Pandas - I feel if you are only going to use say read_csv(), it should only import those specific files and this should have a significant speed improvement.
python -v -c 'from dataplane.pipelinerun.data_persist.redis_store import pipeline_redis_store'
100ms import time
vs
python -v -c 'import dataplane'
2 second import time
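This kind of difference can also be measured in a self-contained way by timing fresh interpreters. A sketch, using stdlib modules as stand-ins for the dataplane imports (a fresh subprocess is needed so sys.modules caching doesn't skew the numbers):

```python
import subprocess
import sys
import time

def time_import(stmt: str) -> float:
    """Time `stmt` in a fresh interpreter so module caching doesn't skew results."""
    start = time.perf_counter()
    subprocess.run([sys.executable, "-c", stmt], check=True)
    return time.perf_counter() - start

# json / json.decoder are placeholders for a full package vs. one submodule
full = time_import("import json")
sub = time_import("from json.decoder import JSONDecoder")
print(f"full package: {full:.3f}s, submodule only: {sub:.3f}s")
```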
Example difference
from dataplane.pipelinerun.data_persist.redis_store import pipeline_redis_store
pipeline_redis_store()
instead of
import dataplane
dataplane.pipeline_redis_store()
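For what it's worth, a package can keep the flat, user-friendly namespace without eager imports by using PEP 562's module-level __getattr__. A minimal sketch, using a synthetic module and math as a stand-in for a heavy submodule (in a real package the hook would live in __init__.py):

```python
import importlib
import sys
import types

# Build a synthetic package module to demonstrate the hook
pkg = types.ModuleType("lazypkg")
_lazy = {"sqrt": "math"}  # attribute name -> providing module (placeholders)

def _pkg_getattr(name):
    # PEP 562: called only when `name` is not found in the module's dict,
    # so the heavy module is imported on first access, not at import time
    if name in _lazy:
        module = importlib.import_module(_lazy[name])
        return getattr(module, name)
    raise AttributeError(name)

pkg.__getattr__ = _pkg_getattr
sys.modules["lazypkg"] = pkg

import lazypkg
print(lazypkg.sqrt(9.0))  # math is imported only here
```

With this pattern, `import lazypkg` stays cheap and each attribute pays its own import cost on first use.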
@akx @phofl here is that output - pandas in Kubernetes loads in 3.4 seconds
python -v -c 'import pandas' 2>&1 | ts "%.S"
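As an aside, CPython 3.7+ also ships a built-in per-module import profiler, -X importtime, which needs no extra packages and reports self and cumulative time per imported module. A sketch of capturing its output (json stands in for pandas):

```python
import subprocess
import sys

# -X importtime writes a per-module self/cumulative time breakdown to stderr
result = subprocess.run(
    [sys.executable, "-X", "importtime", "-c", "import json"],
    capture_output=True,
    text=True,
)
for line in result.stderr.splitlines():
    print(line)
```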
Can you open a dedicated issue for this (improving import performance)? It is not really related to Kubernetes itself, and a separate issue should help facilitate the discussion.
@phofl can I not just change the title and keep the thread going?
I think it would be helpful for the discussion if the issue does not contain much irrelevant information. I can open one myself if you feel strongly about it, but in my experience, keeping issues narrow helps a lot.
Pandas version checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this issue exists on the latest version of pandas.
[X] I have confirmed this issue exists on the main branch of pandas.
Reproducible Example
The import of pandas 1.5 seems very slow - between 2 and 5 seconds on OpenShift. Not sure if there is any way to speed that up? I tried putting the pip-installed files into a tmpfs folder using a memory volume, but it is a very tricky installation and the performance benefit was marginal. It seems to be OK for IPython-style environments like Jupyter. We use pandas a lot for running Python scripts in a DAG / data pipeline; each Python script is executed with
/bin/sh -c python3 pythonfile.py
using https://github.com/dataplane-app/dataplane as our data pipeline / ETL tool. Here is my test:
Installed Versions
Prior Performance
Prior version 1.0.1 had much better import times, around 1 second (which still feels pretty slow).