pydata / patsy

Describing statistical models in Python using symbolic formulas
Other
947 stars 103 forks source link

properly handle pandas 2.1.4 deprecation warning #203

Closed EpigeneMax closed 10 months ago

EpigeneMax commented 10 months ago

Following additional comments in #198, this PR improves on #199.

codecov-commenter commented 10 months ago

Codecov Report

Attention: 1 lines in your changes are missing coverage. Please review.

Comparison is base (f43e282) 98.37% compared to head (cc44908) 98.37%.

Files Patch % Lines
patsy/util.py 0.00% 0 Missing and 1 partial :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## master #203 +/- ## ======================================= Coverage 98.37% 98.37% ======================================= Files 30 30 Lines 3140 3140 Branches 695 695 ======================================= Hits 3089 3089 Misses 26 26 Partials 25 25 ``` | [Flag](https://app.codecov.io/gh/pydata/patsy/pull/203/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=pydata) | Coverage Δ | | |---|---|---| | [unittests](https://app.codecov.io/gh/pydata/patsy/pull/203/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=pydata) | `97.96% <0.00%> (ø)` | | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=pydata#carryforward-flags-in-the-pull-request-comment) to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

lesteve commented 10 months ago

Thanks for the PR, I'll try to check whether that gets rid of the warnings. In the mean time I have a few comments.

It would seem more straightforward to first try isinstance(..., pd.CategoricalDtype) since this is the recommended way since pandas 2.1.0 and I am guessing that it works for older pandas versions as well. Having said that I did not manage to find the relevant info about which pandas version supports it.

This does seem cleaner than trying to catch a warning with a deprecated code path and the warnings catching logic seems a bit brittle. Why catching the last warning rather than any warnings if there are more than one, although this is probably edge-casy.

Also when using the warnings catching logic, when pd.types.api.is_categorical_dtype gets removed from pandas you once again need to fix in patsy which does not seem great.

To finish with I don't think the code logic is correct. When you catch the warning, you set _pandas_is_categorical_dtype = None. I believe it was correct (although a bit convoluted) in your original PR #199 because you added a if _pandas_is_categorical_dtype is None afterwards, but a subsequent commit https://github.com/pydata/patsy/pull/199/commits/01c6b70fed8b6f00b85c79dedece01a297ad6f35 by @matthewwardrop broke it by removing the if clause.

All in all I do think that in 0.5.5 patsy.util.safe_is_categorical_pandas is currently broken with pandas 2.1.0 to 2.1.3:

import pandas as pd
import patsy
print(f"{pd.__version__=}")
import patsy
print(f"{patsy.__version__=}")

categorical_dtype = pd.CategoricalDtype(['a', 'b'])
print(patsy.util.safe_is_pandas_categorical_dtype(categorical_dtype))

Output:

pd.__version__='2.1.3'
patsy.__version__='0.5.5'
Traceback (most recent call last):
  File "/tmp/test.py", line 8, in <module>
    print(patsy.util.safe_is_pandas_categorical_dtype(categorical_dtype))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lesteve/micromamba/lib/python3.11/site-packages/patsy/util.py", line 681, in safe_is_pandas_categorical_dtype
    return _pandas_is_categorical_dtype(dt)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: 'NoneType' object is not callable
matthewwardrop commented 10 months ago

It would seem more straightforward to first try isinstance(..., pd.CategoricalDtype) since this is the recommended way since pandas 2.1.0 and I am guessing that it works for older pandas versions as well. Having said that I did not manage to find the relevant info about which pandas version supports it.

Agreed. I took a quick look into this and didn't find it either when reviewing the original PR. I figured that the code would unbreak things in the indefinite future, and so wasn't too bothered.

Also when using the warnings catching logic, when pd.types.api.is_categorical_dtype gets removed from pandas you once again need to fix in patsy which does not seem great.

+1.

To finish with I don't think the code logic is correct. When you catch the warning, you set _pandas_is_categorical_dtype = None. I believe it was correct (although a bit convoluted) in your original PR https://github.com/pydata/patsy/pull/199 because you added a if _pandas_is_categorical_dtype is None afterwards, but a subsequent commit https://github.com/pydata/patsy/commit/01c6b70fed8b6f00b85c79dedece01a297ad6f35 by @matthewwardrop broke it by removing the if clause.

Guilty as charged. Should have looked over these changes a little bit more closely. Sorry.


I took a look into switching the order of these lookups again and it is subtle, which is why I didn't recommend switching the order in the original PR. There's small differences between the old and new behaviour, especially around looking up things from dtype strings. For the most part I doubt it matters, since the series type is realised already from actual pandas Series objects; and so we should just be able to invert the lookup check and that should work for any version of pandas that defined the Categorical dtype. We might even be able to just drop the alternative lookup path, which would simplify this considerably. I'll check this and put up a PR tomorrow.

lesteve commented 10 months ago

Sounds great, let me know if I can help!

EpigeneMax commented 10 months ago

Also when using the warnings catching logic, when pd.types.api.is_categorical_dtype gets removed from pandas you once again need to fix in patsy which does not seem great.

Guilty as charged, I did focus too much on detecting deprecation. I agree with you that the warning catching logic is not great, but I think it is good enough for this use case.

It would seem more straightforward to first try isinstance(..., pd.CategoricalDtype) since this is the recommended way since pandas 2.1.0

Agreed, but I did not want to break the behavior on older versions of pandas.

I'll let @matthewwardrop go with a PR, let me know if I can help!

matthewwardrop commented 10 months ago

Fixed in #204

lesteve commented 10 months ago

Nice, thanks a lot for this and the 0.5.6 release!

Also, since there there is an error for patsy 0.5.5 and pandas>=2.1.0,<=2.1.3 as noted in https://github.com/pydata/patsy/pull/203#issuecomment-1874942671, it may be worth considering yanking patsy 0.5.5 from PyPI and marking the package as broken in conda-forge?

It could well be the case that this is not worth it, since that not that many people will bump into this error and if they do, they will find this issue where the easy fix is to update patsy to 0.5.6.

matthewwardrop commented 10 months ago

I don't think there was a whole tonne of utility, but givent that is trivial I went ahead and yanked it from PyPI. Marking packages as broken on conda (or updating the metadata to require pandas < 2.1) is a little more involved, and not worth the effort, IMHO. If you feel differently, feel free to propose the changes via the appropriate conda-forge channels.

lesteve commented 10 months ago

Thanks :pray:, this seems perfectly good enough.

I was not too sure it was worth it either, so I hope I did not twist your arm too much :wink:.