unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License

Initial schema creation is very slow #1810

Closed bluenote10 closed 1 month ago

bluenote10 commented 2 months ago

Describe the ~bug~ issue

This is more of a usability issue than a bug: the initial creation of a schema is very slow. I'm measuring roughly 800 ms, which can be a significant slowdown, e.g. in quick/small CLI tools that otherwise have a sub-second runtime.

Code Sample, a copy-pastable example

I'm observing a runtime of >800 ms even for the simplest usages like this:

import time

import pandera as pa

t1 = time.monotonic()
schema = pa.DataFrameSchema(
    {
        "column1": pa.Column(int),
        "column2": pa.Column(float),
        "column3": pa.Column(str),
    }
)
t2 = time.monotonic()
print(t2 - t1)

Expected behavior

Faster execution of simple usages.


cosmicBboy commented 1 month ago

@bluenote10 thanks! if you have time, would you mind providing a runtime profile either with cProfile or your profiling library of choice?

This'll provide more actionable data on what parts of the execution path are slowing things down
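For reference, a minimal sketch of how such a profile could be collected with the standard library's `cProfile` and `pstats`. The helper name `profile_call` and the stand-in workload are hypothetical and not part of pandera; the stand-in only exists so the sketch runs without pandera installed.

```python
import cProfile
import io
import pstats


def profile_call(func, top=15):
    """Run func() under cProfile and return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.runcall(func)

    # Render the statistics into a string instead of printing to stdout.
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()


def example_workload():
    # Hypothetical stand-in workload. To profile the case from this issue,
    # replace the body with the pa.DataFrameSchema(...) construction.
    return sorted(str(i) for i in range(50_000))


report = profile_call(example_workload)
print(report)
```

Sorting by cumulative time tends to surface the call chain that dominates the runtime, which is usually the most useful view when the suspect is a slow import or a one-time initialization step.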

> Expected behavior
>
> Faster execution of simple usages.

How fast are you expecting?

cosmicBboy commented 1 month ago

Also, can you provide your python environment to repro? I get:

0.3683174999896437

When I run the script above

bluenote10 commented 1 month ago

> How fast are you expecting?

From a user perspective the pa.DataFrameSchema(...) expression only constructs a Python class instance, and there is no obvious work to do in the constructor (no data is involved yet), so it would be sensible to expect <1 ms.

A guess: could it be an effect of the lazy import system? I've seen that https://github.com/unionai-oss/pandera/issues/1644 mentions these ~800 ms as the import time as well. Unfortunately the Python ecosystem seems to suffer more and more from slow import times. Lazy imports largely "postpone" the issue, i.e., the cost may now simply be paid on the first use of that constructor.
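One way to check this guess is to compare the first and second call of the same function and to look at which modules appear in `sys.modules` after the first call. The following is only a sketch: `measure_first_and_second_call` and `deferred_import_example` are hypothetical names, and the deferred `colorsys` import merely stands in for whatever a library might import lazily.

```python
import sys
import time


def measure_first_and_second_call(func):
    """Time two consecutive calls and list the modules imported by the first."""
    modules_before = set(sys.modules)

    t0 = time.perf_counter()
    func()
    t1 = time.perf_counter()
    newly_imported = sorted(set(sys.modules) - modules_before)

    func()
    t2 = time.perf_counter()

    return {
        "first_call_s": t1 - t0,
        "second_call_s": t2 - t1,
        "newly_imported": newly_imported,
    }


def deferred_import_example():
    # Hypothetical stand-in: the import only runs when the function is called,
    # so its cost lands on the first call rather than at module import time.
    import colorsys

    return colorsys.rgb_to_hsv(0.2, 0.4, 0.6)


result = measure_first_and_second_call(deferred_import_example)
print(result)
```

If the first call is much slower than the second and `newly_imported` is long, the time is most likely spent in deferred imports rather than in the constructor logic itself. Passing a function that builds the schema from this issue would apply the same check to pandera.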

A module initialization time of 800 ms feels like a lot. I'm wondering what all these packages/modules are doing at import time to lead to such a slow import. I've attached some information on the Python environment and a cProfile run. Can you spot anything obvious that explains why it is taking so much time?
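To see which modules dominate the import time, CPython's `-X importtime` option (available since Python 3.7) can be run in a fresh interpreter and its stderr output parsed. This is a rough sketch: the helper name `import_time_report` is hypothetical, and `json` is used only as a placeholder module that is always available.

```python
import subprocess
import sys


def import_time_report(module_name, top=10):
    """Return the slowest imports, by cumulative microseconds, for one module."""
    result = subprocess.run(
        [sys.executable, "-X", "importtime", "-c", f"import {module_name}"],
        capture_output=True,
        text=True,
        check=True,
    )

    prefix = "import time:"
    rows = []
    for line in result.stderr.splitlines():
        if not line.startswith(prefix):
            continue
        fields = [field.strip() for field in line[len(prefix):].split("|")]
        # Skip the header line, whose columns are not numeric.
        if len(fields) != 3 or not fields[0].isdigit() or not fields[1].isdigit():
            continue
        self_us, cumulative_us, name = int(fields[0]), int(fields[1]), fields[2]
        rows.append((name, self_us, cumulative_us))

    rows.sort(key=lambda row: row[2], reverse=True)
    return rows[:top]


for name, self_us, cumulative_us in import_time_report("json"):
    print(f"{cumulative_us:>10} us  {name}")
```

Replacing `"json"` with `"pandera"` would list the imports that contribute most to its startup cost in a given environment. The cumulative column includes nested imports, so a package that pulls in many heavy dependencies tends to show up near the top.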

Python environment (pip freeze output) ``` actionlib==1.14.0 adal==1.2.7 aiofiles==22.1.0 aiohappyeyeballs==2.4.0 aiohttp==3.10.5 aiosignal==1.3.1 aiosqlite==0.20.0 alabaster==0.7.16 altair==5.4.1 angles==1.9.13 annotated-types==0.7.0 ansi2html==1.9.2 ansible==9.10.0 ansible-core==2.16.11 antlr4-python3-runtime==4.13.2 anyio==4.4.0 anys==0.3.0 appdirs==1.4.4 argcomplete==3.5.0 argon2-cffi==23.1.0 argon2-cffi-bindings==21.2.0 argparse-addons==0.12.0 arrow==1.3.0 asammdf==8.0.0 asttokens==2.4.1 async-timeout==4.0.3 attrs==24.2.0 avro-python3==1.10.2 azure-batch==14.2.0 azure-common==1.1.28 azure-containerregistry==1.2.0 azure-core==1.31.0 azure-cosmos==4.7.0 azure-data-tables==12.5.0 azure-devops==7.1.0b4 azure-eventgrid==4.20.0 azure-eventhub==5.12.1 azure-eventhub-checkpointstoreblob==1.1.4 azure-functions==1.20.0 azure-functions-durable==1.2.9 azure-identity==1.17.1 azure-keyvault==4.2.0 azure-keyvault-certificates==4.8.0 azure-keyvault-keys==4.9.0 azure-keyvault-secrets==4.8.0 azure-kusto-data==4.5.1 azure-kusto-ingest==4.5.1 azure-mgmt-batch==17.3.0 azure-mgmt-compute==33.0.0 azure-mgmt-consumption==10.0.0 azure-mgmt-containerinstance==10.1.0 azure-mgmt-core==1.4.0 azure-mgmt-datafactory==9.0.0 azure-mgmt-keyvault==10.3.1 azure-mgmt-network==26.0.0 azure-mgmt-resource==23.1.1 azure-mgmt-storage==21.2.1 azure-mgmt-web==7.3.1 azure-monitor-ingestion==1.0.4 azure-servicebus==7.12.2 azure-storage-blob==12.23.0 azure-storage-queue==12.12.0 babel==2.16.0 beautifulsoup4==4.12.3 bidict==0.23.1 bitstruct==8.19.0 black==22.12.0 bleach==6.1.0 blinker==1.8.2 blosc2==2.7.1 bokeh==3.5.2 bondpy==1.8.6 boolean.py==3.4 branca==0.7.2 build==1.2.2 cachetools==5.5.0 cachier==3.0.1 camera-calibration-parsers==1.12.0 canmatrix==1.0 cantools==39.4.5 catkin==0.8.10 catkin-pkg==1.0.0 certifi==2024.8.30 cffi==1.17.1 chardet==5.2.0 charset-normalizer==3.3.2 click==8.1.7 click-log==0.4.0 cloudpickle==3.0.0 codeowners==0.7.0 colorama==0.4.6 coloredlogs==15.0.1 colorlog==6.8.2 comm==0.2.2 
conan==2.7.1 contourpy==1.3.0 coverage==7.6.1 crccheck==1.3.0 cryptography==43.0.1 cssselect==1.2.0 cv-bridge==1.16.2 cycler==0.12.1 Cython==3.0.11 dacite==1.8.1 dash==2.18.1 dash-core-components==2.0.0 dash-html-components==2.0.0 dash-table==5.0.0 dask==2024.9.0 dask-expr==1.1.14 debugpy==1.8.5 decopatch==1.4.10 decorator==5.1.1 defusedxml==0.7.1 Deprecated==1.2.14 diagnostic-updater==1.11.0 dill==0.3.8 dirhash==0.5.0 diskcache==5.6.3 distributed==2024.9.0 distro==1.8.0 docker==7.1.0 docopt==0.6.2 docutils==0.20.1 dohq-artifactory==0.10.0 doxysphinx==3.3.10 dynamic-reconfigure==1.7.3 empy==3.3.4 entrypoints==0.4 exceptiongroup==1.2.2 execnet==2.1.1 executing==2.1.0 fastapi==0.115.0 fastapi-azure-auth==5.0.1 fasteners==0.19 fastjsonschema==2.20.0 filelock==3.16.1 flake8==7.1.1 flake8-bugbear==24.8.19 flake8-tidy-imports==4.10.0 Flask==3.0.3 Flask-Cors==5.0.0 Flask-PyMongo==2.3.0 folium==0.17.0 fonttools==4.53.1 fqdn==1.5.1 frozenlist==1.4.1 fsspec==2024.9.0 furl==2.1.3 future==1.0.0 gcovr==7.2 gencpp==0.7.0 geneus==3.0.0 genlisp==0.4.18 genmsg==0.6.0 gennodejs==2.0.2 genpy==0.6.15 geographiclib==1.52 geojson==3.1.0 geojson-pydantic==1.1.1 geopandas==1.0.1 geopy==2.4.1 gitdb==4.0.11 GitPython==3.1.43 gnupg==2.3.1 google-auth==2.34.0 gpstime==0.8.2 graphviz==0.20.3 gunicorn==23.0.0 h11==0.14.0 h2==4.1.0 h5py==3.11.0 hpack==4.0.0 httpcore==1.0.5 httpx==0.27.2 humanfriendly==10.0 hyperframe==6.0.1 icontract==2.7.0 idna==3.10 ijson==3.3.0 image-geometry==1.16.2 imageio==2.35.1 imagesize==1.4.1 importlib_metadata==8.4.0 importlib_resources==6.4.5 iniconfig==2.0.0 interactive-markers==1.12.0 ipykernel==6.29.5 ipympl==0.9.4 ipython==8.21.0 ipython-genutils==0.2.0 ipywidgets==7.8.4 isal==1.7.0 isodate==0.6.1 isoduration==20.11.0 isort==5.13.2 itsdangerous==2.2.0 jedi==0.19.1 Jinja2==3.1.4 jmespath==1.0.1 joblib==1.4.2 joint-state-publisher==1.15.1 jsk-recognition-utils==1.2.15 jsk_data==2.2.12 jsk_network_tools==2.2.12 jsk_rviz_plugins==2.1.8 jsk_tools==2.2.12 
jsk_topic_tools==2.2.12 json5==0.9.25 jsonpointer==3.0.0 jsonschema==4.23.0 jsonschema-specifications==2023.12.1 jupyter-events==0.10.0 jupyter-ydoc==0.2.5 jupyter_client==7.4.9 jupyter_core==5.7.2 jupyter_server==2.14.2 jupyter_server_fileid==0.9.3 jupyter_server_terminals==0.5.3 jupyter_server_ydoc==0.8.0 jupyterlab==3.6.8 jupyterlab_pygments==0.3.0 jupyterlab_server==2.27.3 jupyterlab_widgets==1.1.10 keplergl==0.3.2 kiwisolver==1.4.7 kubernetes==30.1.0 laser_geometry==1.6.7 lazy_loader==0.4 libsass==0.22.0 llvmlite==0.43.0 locket==1.0.0 lxml==4.9.4 lz4==4.3.3 lzstring==1.0.4 maison==2.0.0 makefun==1.15.4 markdown-it-py==3.0.0 MarkupSafe==2.1.5 matplotlib==3.8.4 matplotlib-inline==0.1.7 mccabe==0.7.0 mdit-py-plugins==0.4.2 mdurl==0.1.2 memory-profiler==0.61.0 mercantile==1.2.1 message-filters==1.16.0 microsoft-kiota-abstractions==1.3.3 microsoft-kiota-authentication-azure==1.1.0 microsoft-kiota-http==1.3.3 microsoft-kiota-serialization-form==0.1.1 microsoft-kiota-serialization-json==1.3.2 microsoft-kiota-serialization-multipart==0.1.0 microsoft-kiota-serialization-text==1.0.0 mistune==3.0.2 mock==5.1.0 mongomock==4.2.0.post1 mpire==2.10.2 mpld3==0.5.10 mpmath==1.3.0 msal==1.28.1 msal-extensions==1.1.0 msgpack==1.0.8 msgraph-core==1.1.3 msgraph-sdk==1.7.0 msgspec==0.18.6 msrest==0.7.1 msrestazure==0.6.4.post1 multidict==6.1.0 multimethod==1.10 multiprocess==0.70.16 mypy==1.11.2 mypy-extensions==1.0.0 myst-parser==4.0.0 narwhals==1.8.1 nbclassic==1.1.0 nbclient==0.10.0 nbconvert==7.16.4 nbformat==5.10.4 ndindex==1.8 nest-asyncio==1.6.0 netifaces==0.11.0 networkx==3.3 nose==1.3.7 notebook==6.5.7 notebook_shim==0.2.4 numba==0.60.0 numexpr==2.10.1 numpy==1.26.4 numpy-quaternion==2023.0.4 oauthlib==3.2.2 opencv-python-headless==4.10.0.84 openni2_launch==1.6.0 openrouteservice==2.3.3 opentelemetry-api==1.27.0 opentelemetry-sdk==1.27.0 opentelemetry-semantic-conventions==0.48b0 orderedmultidict==1.0.1 osm2geojson==0.2.5 osmium==3.7.0 overpass==0.7.2 overrides==7.7.0 
packaging==24.1 pandas==2.2.2 pandas-stubs==2.2.2.240909 pandera==0.20.4 pandocfilters==1.5.1 parso==0.8.4 partd==1.4.2 patch-ng==1.18.0 pathos==0.3.2 pathspec==0.12.1 pendulum==3.0.0 pexpect==4.9.0 pick==2.4.0 pillow==10.4.0 pip-tools==7.4.1 pkg_resources==0.0.0 platformdirs==4.3.6 plotly==5.24.1 pluggy==1.5.0 portalocker==2.10.1 pox==0.3.4 ppft==1.7.6.8 progressbar2==4.5.0 prometheus_client==0.20.0 prompt_toolkit==3.0.47 proto-schema-parser==1.3.6 protobuf==4.25.3 psutil==6.0.0 psycopg==3.2.2 psycopg-binary==3.2.2 ptyprocess==0.7.0 pure_eval==0.2.3 py==1.11.0 py-cpuinfo==9.0.0 pyarrow==17.0.0 pyarrow-hotfix==0.6 pyasn1==0.6.1 pyasn1_modules==0.4.1 pycodestyle==2.12.1 pycparser==2.22 pycryptodomex==3.20.0 pydantic==2.9.2 pydantic_core==2.23.4 pydeck==0.9.1 pyflakes==3.2.0 Pygments==2.18.0 pyjson5==1.6.6 PyJWT==2.9.0 pylddwrap==1.2.2 pymap3d==1.6.3 pymongo==3.13.0 Pympler==1.1 pyogrio==0.9.0 pyOpenSSL==24.2.1 pyparsing==3.1.4 pypcd==0.1.1 pyproj==3.6.1 pyproject_hooks==1.1.0 pyros==0.4.3 pyros-common==0.5.4 pyros-config==0.2.1 pyros-setup==0.3.0 pyrosbag==0.1.3 pyserial==3.5 PySocks==1.7.1 pysolr==3.10.0 pytest==8.3.3 pytest-asyncio==0.24.0 pytest-cases==3.8.5 pytest-cov==5.0.0 pytest-mock==3.14.0 pytest-timeout==2.3.1 pytest-watch==4.2.0 pytest-xdist==3.6.1 python-can==4.4.2 python-dateutil==2.9.0.post0 python-debian==0.1.49 python-geohash==0.8.5 python-intervals==1.10.0.post1 python-json-logger==2.0.7 python-lzf==0.2.6 python-qt-binding==0.4.4 python-utils==3.8.2 pytz==2024.2 PyYAML==6.0.2 pyzmp==0.0.17 pyzmq==26.2.0 qt-dotgraph==0.4.2 qt-gui==0.4.2 qt-gui-cpp==0.4.2 qt-gui-py-common==0.4.2 redis==5.0.8 referencing==0.35.1 requests==2.32.3 requests-file==2.1.0 requests-oauthlib==2.0.0 requests-toolbelt==1.0.0 resolvelib==1.0.1 resource_retriever==1.12.7 retry==0.9.2 retrying==1.3.4 rfc3339-validator==0.1.4 rfc3986-validator==0.1.1 rich==13.8.1 rosbag==1.16.0 rosclean==1.15.8 rosdep==0.25.1 rosdistro==0.9.1 rosgraph==1.16.0 roslaunch==1.16.0 roslib==1.15.8 
roslint==0.12.0 roslz4==1.16.0 rosmake==1.15.8 rosmaster==1.16.0 rosmsg==1.16.0 rosnode==1.16.0 rosparam==1.16.0 rospkg==1.5.1 rospy==1.16.0 rosservice==1.16.0 rostest==1.16.0 rostopic==1.16.0 rosunit==1.15.8 roswtf==1.16.0 rpds-py==0.20.0 rqt-image-view==0.4.17 rqt-reconfigure==0.5.5 rqt_action==0.4.9 rqt_bag==0.5.1 rqt_bag_plugins==0.5.1 rqt_console==0.4.11 rqt_dep==0.4.12 rqt_graph==0.4.14 rqt_gui==0.5.3 rqt_gui_py==0.5.3 rqt_launch==0.4.9 rqt_logger_level==0.4.11 rqt_msg==0.4.10 rqt_plot==0.4.13 rqt_publisher==0.4.10 rqt_py_common==0.5.3 rqt_py_console==0.4.10 rqt_service_caller==0.4.10 rqt_shell==0.4.11 rqt_srv==0.4.9 rqt_top==0.4.10 rqt_topic==0.4.13 rqt_web==0.4.10 rsa==4.9 Rtree==1.3.0 ruamel.yaml==0.18.6 ruamel.yaml.clib==0.2.8 ruyaml==0.91.0 rviz==1.14.20 s2cell==1.8.0 scantree==0.0.4 scikit-image==0.24.0 scikit-learn==1.5.2 scipy==1.14.1 seaborn==0.13.2 Send2Trash==1.8.3 sensor-msgs==1.13.1 sentinels==1.0.0 shapely==2.0.6 shellcheck-py==0.10.0.1 simplejson==3.19.3 six==1.16.0 smclib==1.8.6 smmap==5.0.1 smmap2==2.0.5 sniffio==1.3.1 snowballstemmer==2.2.0 sortedcontainers==2.4.0 sound-play==0.3.17 soupsieve==2.6 Sphinx==7.4.7 sphinx-autodoc-typehints==2.3.0 sphinx-charts==0.2.1 sphinx-click==6.0.0 sphinx-collections==0.0.1 sphinx-copybutton==0.5.2 sphinx-data-viewer==0.1.5 sphinx-needs==3.0.0 sphinx-rtd-theme==2.0.0 sphinx-tags==0.4 sphinx_design==0.6.1 sphinxcontrib-applehelp==2.0.0 sphinxcontrib-devhelp==2.0.0 sphinxcontrib-doxylink==1.12.3 sphinxcontrib-htmlhelp==2.1.0 sphinxcontrib-jquery==4.1 sphinxcontrib-jsmath==1.0.1 sphinxcontrib-plantuml==0.30 sphinxcontrib-qthelp==2.0.0 sphinxcontrib-serializinghtml==2.0.0 sphinxcontrib-svg2pdfconverter==1.2.3 splunk-handler==3.0.0 sqlitedict==2.1.0 stack-data==0.6.3 starlette==0.38.5 std-uritemplate==1.0.6 streamlit==1.38.0 sympy==1.13.3 tables==3.10.1 tabulate==0.9.0 tblib==3.0.0 tenacity==8.5.0 termcolor==2.4.0 terminado==0.18.1 textparser==0.24.0 tf==1.13.2 tf-conversions==1.13.2 tf2-geometry-msgs==0.7.6 
tf2-kdl==0.7.6 tf2-py==0.7.6 tf2-ros==0.7.6 threadpoolctl==3.5.0 tifffile==2024.8.30 time-machine==2.15.0 tinycss2==1.3.0 tokenize-rt==6.0.0 toml==0.10.2 tomli==2.0.1 toolz==0.12.1 topic-tools==1.16.0 toposort==1.10 torch @ https://download.pytorch.org/whl/cpu/torch-2.3.1%2Bcpu-cp310-cp310-linux_x86_64.whl#sha256=d679e21d871982b9234444331a26350902cfd2d5ca44ce6f49896af8b3a3087d torcheval==0.0.7 torchinfo==1.8.0 torchvision @ https://download.pytorch.org/whl/cpu/torchvision-0.18.1%2Bcpu-cp310-cp310-linux_x86_64.whl#sha256=2ae9d4e4e11bc43c7ee6c7c7e87b1e6adf5503ad0710e59cd86bc7b1a342d75a tornado==6.4.1 tqdm==4.66.5 traitlets==5.9.0 traittypes==0.2.1 typed-argparse==0.3.1 typeguard==4.3.0 types-beautifulsoup4==4.12.0.20240907 types-cffi==1.16.0.20240331 types-click==7.1.8 types-docutils==0.21.0.20240907 types-filelock==3.2.7 types-html5lib==1.1.11.20240806 types-Jinja2==2.11.9 types-jsonschema==4.23.0.20240813 types-lxml==2024.9.16 types-MarkupSafe==1.1.10 types-mock==5.1.0.20240425 types-protobuf==4.25.0.20240417 types-psutil==6.0.0.20240901 types-pyOpenSSL==24.1.0.20240722 types-python-dateutil==2.9.0.20240906 types-pytz==2024.2.0.20240913 types-PyYAML==6.0.12.20240917 types-redis==4.6.0.20240903 types-requests==2.31.0.6 types-retry==0.9.9.4 types-setuptools==75.1.0.20240917 types-simplejson==3.19.0.20240801 types-six==1.16.21.20240513 types-tabulate==0.9.0.20240106 types-termcolor==1.1.6.2 types-urllib3==1.26.25.14 typing-inspect==0.9.0 typing_extensions==4.12.2 tzdata==2024.1 tzlocal==5.2 urdfdom-py==0.4.6 uri-template==1.3.0 urllib3==1.26.20 uvicorn==0.30.6 uwsgidecorators==1.1.0 validators==0.33.0 watchdog==4.0.2 wcwidth==0.2.13 webcolors==24.8.0 webencodings==0.5.1 websocket-client==1.8.0 Werkzeug==3.0.4 widgetsnbextension==3.6.9 wrapt==1.16.0 xacro==1.14.16 xyzservices==2024.9.0 y-py==0.6.2 yachalk==0.1.6 yamlfix==1.17.0 yarl==1.11.1 ypy-websocket==0.8.4 zict==3.0.0 zipp==3.20.2 ```

And here is the output of a cProfile of that snippet: pandera_cprofile.txt

cosmicBboy commented 1 month ago

Thanks for the details! #1818 should bring schema initialization time close to 0: running the code snippet in the description of this issue yields

0.0005101249553263187

bluenote10 commented 1 month ago

> https://github.com/unionai-oss/pandera/pull/1818 should bring schema initialization time close to 0

Awesome! I had a quick look into the approach taken there, and the idea looks very sensible to me. Thanks for the fix!