pylhc / tfs

Python package to handle TFS files
https://pylhc.github.io/tfs/
MIT License
9 stars 4 forks source link

Deprecations and FutureWarnings Fixes #128

Closed fsoubelet closed 4 months ago

fsoubelet commented 10 months ago

This PR brings changes to address the two deprecations / warnings which aren't very far away from being actual issues. I have bumped the package version as a patch, but this can be discussed.

Deprecated use of .applymap

The call to .applymap in tfs.writer._quote_string_columns has been changed to .map (a total equivalent) since .applymap is deprecated and will be removed in (most likely) the next major pandas version.

Deprecated use of delim_whitespace in pd.read_csv when reading

The delim_whitespace option is now deprecated. A simple fix was to use sep="\s+" as suggested in the issued warning. The option has been around for a long time and comes with no version / timeline considerations.

FutureWarning from pd.option_context('mode.use_inf_as_na', True)

The use of the pd.option_context('mode.use_inf_as_na', True) context manager in tfs.frame.validate raises a FutureWarning which recommends to convert instead. It's not clear from the warning when this functionality will stop being supported.

To circumvent this, the .isna() call is now made after calling .replace([np.inf, -np.inf], np.nan). Since .replace returns a copy we are not modifying the provided dataframe. Comments have been added above the relevant code for developpers.

According to the accepted answer in this StackOverflow thread, the performance impact should be quite small: about 0.1s on a dataframe with 1M rows (and most likely faster on newer Python versions since the .replace source code uses a lot of native Python). Could be compared with the runtime impact of the current with pd.option_context('mode.use_inf_as_na', True) call, through the impact seems small enought that I did not benchmark.

FutureWarning from .replace([np.inf, -np.inf], np.nan)

The above change now also raises a future warning about object downcasting during the operation. A fix is done by first calling .infer_dtypes() on the dataframe to infer types where possible as suggested in the issued warning. This call creates a lazy copy (for copy-on-write behaviour that will come in pandas 3.0 but the following call to .replace() creates a hard copy which erases all concerns about changing the original dataframe during validation. The .infer_dtypes() function has been around for a long time and comes with no version / timeline considerations.

Suggested Timeline

I suggest we leave this PR as a draft for now. According to the pandas repo, pandas 3.0 is due for April 1st, 2024 and we will most likely encounter issues with the current tfs-pandas code at this time.

By then, Python 3.8 support will be 6 months away from complete end of life and I believe it will be reasonable to simply drop support for it ourselves (which merging this PR and publishing will do).

In short: if we want to be conservative with our supported Python versions, we should keep this PR open until just before pandas 3.0 comes around (~ April), and merge + publish a new version of tfs-pandas then. Otherwise we can merge and publish sooner.

Note on CI

The CI is currently messed up since we have Python 3.8 in our tests matrix and the package, in this branch, can't install on this version (since we ask for pandas >= 2.1 which does not exist for Python 3.8).

If we're waiting until close to April to merge and drop Python 3.8 support, we could at the same time remove it from our CI matrices.

JoschD commented 9 months ago

To consider: maybe increase minor version instead of patch? I don't like when a patch version breaks the backwards compatibility with python/pandas packages like this.

fsoubelet commented 9 months ago

Fully agreed.

fsoubelet commented 4 months ago

@JoschD I suggest we drop 3.8 from our CI matrices as we did for omc3 and go forward with merging this PR and releasing a new version of the package.

The minimum required pandas version has been out for close to a year and I expect no friction with dependencies resolving.