Closed by bluenote10 1 month ago
@bluenote10 thanks! if you have time, would you mind providing a runtime profile either with cProfile or your profiling library of choice?
This'll provide more actionable data on what parts of the execution path are slowing things down
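For anyone who wants to produce such a profile, here is a minimal sketch using only the standard library's `cProfile` and `pstats`. The helper `profile_call` and the function `build_schema_stand_in` are hypothetical names introduced for illustration; in practice the stand-in would be replaced by the schema construction from the issue description.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=15, **kwargs):
    """Run func under cProfile and return (result, report_text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)

    # Render the statistics into a string, sorted by cumulative time,
    # limited to the `top` most expensive entries.
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


def build_schema_stand_in():
    # Hypothetical placeholder for the slow call being investigated
    # (assumption: swap in the real schema construction here).
    return sum(i * i for i in range(10_000))


if __name__ == "__main__":
    _, report = profile_call(build_schema_stand_in)
    print(report)
```

Sorting by cumulative time tends to surface the import or initialization chains that dominate a slow first call, which is usually the most actionable view for a report like this one.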
Expected behavior
Faster execution of simple usages.
How fast are you expecting?
Also, can you provide your python environment to repro? I get:
0.3683174999896437
When I run the script above
How fast are you expecting?
From a user perspective the pa.DataFrameSchema(...) expression only constructs a Python class instance, and there is no obvious work to do in the constructor (no data is involved yet), so it would be sensible to expect <1 ms.
A guess: could it be an effect of the lazy import system? I've seen that https://github.com/unionai-oss/pandera/issues/1644 also mentions ~800 ms as the import time. Unfortunately, the Python ecosystem seems to suffer more and more from slow import times. Lazy imports largely "postpone" the issue, i.e., the cost may now simply be paid on the first use of that constructor.
A module initialization time of 800 ms feels like a lot. I'm wondering what all these packages/modules are doing at import time to make the import so slow. I've attached some information on the Python environment and a cProfile run. Can you spot anything obvious that explains why it is taking so much time?
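To illustrate the "postponed cost" guess, here is a small sketch of how a lazy import shifts import time to the first use. It is not pandera's actual mechanism; `make_lazy_loader` and `time_call` are hypothetical helpers, and the stdlib module `decimal` only stands in for a heavy dependency.

```python
import importlib
import time


def make_lazy_loader(module_name):
    """Return a function that imports module_name only when first called."""
    cache = {}

    def load():
        # The import cost is paid here, on first use, not at definition time.
        if "module" not in cache:
            cache["module"] = importlib.import_module(module_name)
        return cache["module"]

    return load


def time_call(func):
    """Call func once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func()
    return result, time.perf_counter() - start


if __name__ == "__main__":
    # Assumption: "decimal" is just a stand-in for an expensive import.
    load_decimal = make_lazy_loader("decimal")
    for label in ("first use", "second use"):
        _, elapsed = time_call(load_decimal)
        print(f"{label}: {elapsed:.6f} s")
```

If the module has not been imported yet in the process, the first call typically takes noticeably longer than the second. To see where import time goes in a real environment, CPython's `python -X importtime` option prints a per-module breakdown.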
(pip freeze output)
And here is the output of a cProfile run of that snippet: pandera_cprofile.txt
Thanks for the details! #1818 should bring schema initialization time close to 0: running the code snippet in the description of this issue yields
0.0005101249553263187
https://github.com/unionai-oss/pandera/pull/1818 should bring schema initialization time close to 0
Awesome! I had a quick look into the approach taken there, and the idea looks very sensible to me. Thanks for the fix!
Describe the ~bug~ issue
This is more of a usability issue than a bug: the initial creation of a schema is very slow. I'm measuring roughly 800 ms, which can be a significant slowdown, e.g. in quick/small CLI tools that otherwise have a sub-second runtime.
Code Sample, a copy-pastable example
I'm observing a runtime of >800 ms even for the simplest usages like this:
Expected behavior
Faster execution of simple usages.
Desktop (please complete the following information):