Open koalp opened 1 year ago
Thanks for opening this @koalp! I think a good solution here is to check whether the type of the incoming data matches the expected type, and to coerce/re-assign only the columns that don't match.
Will circle back to this issue once https://github.com/unionai-oss/pandera/pull/913 is merged
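To make the idea above concrete, here is a minimal sketch using plain pandas. It is not pandera's implementation: the helper name `coerce_mismatched` and the `expected` mapping of column names to dtype strings are assumptions made for illustration only.

```python
import pandas as pd


def coerce_mismatched(df: pd.DataFrame, expected: dict) -> pd.DataFrame:
    """Hypothetical helper (not part of pandera): cast only mismatched columns."""
    # Collect the columns whose current dtype differs from the expected one.
    to_cast = {
        name: dtype
        for name, dtype in expected.items()
        if df[name].dtype != pd.api.types.pandas_dtype(dtype)
    }
    # If every column already matches, return the input untouched so that
    # no column is re-assigned at all.
    if not to_cast:
        return df
    # A single astype call with a mapping converts only the listed columns.
    return df.astype(to_cast)


df = pd.DataFrame({"a": [1, 2], "b": ["1", "2"]})
coerced = coerce_mismatched(df, {"a": "int64", "b": "int64"})
```

With this approach an already-valid dataframe costs only one dtype comparison per column, while the cost of the actual conversion is still paid for every column that does need it.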
Describe the bug
Validating against a `SchemaModel` with several hundred columns takes a lot of time when `coerce` is used, even if the dataframe is already valid. This doesn't occur when there is no `coerce`.
Code Sample, a copy-pastable example
In this gist you will find a script that compares execution time with and without `coerce`: https://gist.github.com/koalp/0e70303c014712a6f7f790b5743482a3
Expected behavior
Coercion should not take so much time when the dtypes are already correct. It would be even better if it were not slow when all the columns must be converted.
Desktop (please complete the following information):
Additional context
After running benchmarks, I found out that the `__setattr__` function¹ from pandas (replacing a column) takes a lot of time to run (Python 3.9). If I modify pandera to only `setattr` if the result from `try_coercion` differs from the previous column, it solves my issue, as I currently have at most one column that needs to be changed (wrong dtype). However, it isn't a generic solution, as it doesn't help when a lot of columns have a wrong dtype.
On Discord, a modification was suggested: