microsoft / FLAML

A fast library for AutoML and tuning. Join our Discord: https://discord.gg/Cppx2vSPVP.
https://microsoft.github.io/FLAML/
MIT License

automl broken with pandas 2 #1300

Open jgukelberger opened 2 months ago

jgukelberger commented 2 months ago

Several of the official examples and tests currently error out due to incompatibility with pandas 2.

For example, when following the "AutoML - Time series forecast" example in a fresh Python 3.10 environment, pip install "flaml[automl,ts_forecast]" currently installs pandas 2.2.2. The first example then raises:

TypeError: cannot infer freq from a non-convertible index of dtype float64

Similarly, 3/9 tests in test/automl/test_forecast.py fail.

There's also #984, but that focuses on supporting pandas 2. In the meantime, the pandas dependency in setup.py should at least be constrained to <2 to avoid users ending up with a broken installation by default.
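Until such a constraint is in place, a project that depends on FLAML could guard against the incompatible major version at startup. The following is only a minimal sketch, not FLAML code: the helper names `major_version`, `pandas_is_supported`, and `installed_pandas_version` are hypothetical, and the `<2` bound is simply the one proposed above.

```python
from importlib import metadata
from typing import Optional


def major_version(version: str) -> int:
    """Return the leading integer of a version string such as '2.2.2'."""
    return int(version.split(".")[0])


def pandas_is_supported(version: str, max_major_exclusive: int = 2) -> bool:
    """True if the version satisfies the proposed 'pandas<2' bound (assumed)."""
    return major_version(version) < max_major_exclusive


def installed_pandas_version() -> Optional[str]:
    """Look up the installed pandas version, or None if pandas is absent."""
    try:
        return metadata.version("pandas")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_pandas_version()
    if found is None:
        print("pandas is not installed")
    elif pandas_is_supported(found):
        print(f"pandas {found} satisfies the <2 bound")
    else:
        print(f"pandas {found} is outside the <2 bound; try: pip install 'pandas<2'")
```

The check only inspects the major version, which is enough for a `<2` bound but would need a real version parser for finer-grained constraints.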

sonichi commented 2 months ago

Thanks. Feel free to make a PR. cc @thinkall

thinkall commented 2 months ago

Thanks @jgukelberger, @sonichi. I don't see the issue with either pandas 2.0.3 or 2.2.2.

numpy 1.24.3 pandas 2.2.2 / 2.0.3


jgukelberger commented 2 months ago

That's interesting @sonichi. Attached is the full pip list output for the environment I'm seeing these errors in. The errors are fixed with pip install "pandas<2".

The only difference between the two environments is pandas:

$ diff pipenv-works.txt pipenv-fails.txt
31c31
< pandas                1.5.3
---
> pandas                2.2.2

Here's the full output of a failing test case:

$ pytest -v test/automl/test_forecast.py -k test_numpy
=============================================== test session starts ================================================
platform linux -- Python 3.10.14, pytest-7.4.0, pluggy-1.0.0 -- /home/jagukelb/opt/miniconda3/envs/flaml-test/bin/python
cachedir: .pytest_cache
rootdir: /home/jagukelb/src/experiments/FLAML
configfile: pyproject.toml
collected 9 items / 7 deselected / 2 selected

test/automl/test_forecast.py::test_numpy FAILED                                                              [ 50%]
test/automl/test_forecast.py::test_numpy_large PASSED                                                        [100%]

===================================================== FAILURES =====================================================
____________________________________________________ test_numpy ____________________________________________________

    def test_numpy():
        X_train = np.arange("2014-01", "2021-01", dtype="datetime64[M]")
        y_train = np.random.random(size=len(X_train))
        automl = AutoML()
>       automl.fit(
            X_train=X_train[:72],  # a single column of timestamp
            y_train=y_train[:72],  # value for each timestamp
            period=12,  # time horizon to forecast, e.g., 12 months
            task="ts_forecast",
            time_budget=3,  # time budget in seconds
            log_file_name="test/ts_forecast.log",
            n_splits=3,  # number of splits
        )

test/automl/test_forecast.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
flaml/automl/automl.py:1664: in fit
    task.validate_data(
flaml/automl/task/time_series_task.py:166: in validate_data
    data = TimeSeriesDataset(
flaml/automl/time_series/ts_data.py:57: in __init__
    self.frequency = pd.infer_freq(train_data[time_col].unique())
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

index = array([1.3885344e+09, 1.3912128e+09, 1.3936320e+09, 1.3963104e+09,
       1.3989024e+09, 1.4015808e+09, 1.4041728e+09,...8e+09, 1.5593472e+09, 1.5619392e+09, 1.5646176e+09,
       1.5672960e+09, 1.5698880e+09, 1.5725664e+09, 1.5751584e+09])

    def infer_freq(
        index: DatetimeIndex | TimedeltaIndex | Series | DatetimeLikeArrayMixin,
    ) -> str | None:
        """
        Infer the most likely frequency given the input index.

        Parameters
        ----------
        index : DatetimeIndex, TimedeltaIndex, Series or array-like
          If passed a Series will use the values of the series (NOT THE INDEX).

        Returns
        -------
        str or None
            None if no discernible frequency.

        Raises
        ------
        TypeError
            If the index is not datetime-like.
        ValueError
            If there are fewer than three values.

        Examples
        --------
        >>> idx = pd.date_range(start='2020/12/01', end='2020/12/30', periods=30)
        >>> pd.infer_freq(idx)
        'D'
        """
        from pandas.core.api import DatetimeIndex

        if isinstance(index, ABCSeries):
            values = index._values
            if not (
                lib.is_np_dtype(values.dtype, "mM")
                or isinstance(values.dtype, DatetimeTZDtype)
                or values.dtype == object
            ):
                raise TypeError(
                    "cannot infer freq from a non-convertible dtype "
                    f"on a Series of {index.dtype}"
                )
            index = values

        inferer: _FrequencyInferer

        if not hasattr(index, "dtype"):
            pass
        elif isinstance(index.dtype, PeriodDtype):
            raise TypeError(
                "PeriodIndex given. Check the `freq` attribute "
                "instead of using infer_freq."
            )
        elif lib.is_np_dtype(index.dtype, "m"):
            # Allow TimedeltaIndex and TimedeltaArray
            inferer = _TimedeltaFrequencyInferer(index)
            return inferer.get_freq()

        elif is_numeric_dtype(index.dtype):
>           raise TypeError(
                f"cannot infer freq from a non-convertible index of dtype {index.dtype}"
            )
E           TypeError: cannot infer freq from a non-convertible index of dtype float64

../../../opt/miniconda3/envs/flaml-test/lib/python3.10/site-packages/pandas/tseries/frequencies.py:148: TypeError
================================================= warnings summary =================================================
test/automl/test_forecast.py::test_numpy
test/automl/test_forecast.py::test_numpy_large
  /home/jagukelb/src/experiments/FLAML/flaml/automl/time_series/ts_data.py:121: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
    return pd.concat([self.X_train, self.X_val], axis=0)

test/automl/test_forecast.py::test_numpy_large
  /home/jagukelb/src/experiments/FLAML/test/automl/test_forecast.py:158: FutureWarning: 'T' is deprecated and will be removed in a future version, please use 'min' instead.
    X_train = pd.date_range("2017-01-01", periods=70000, freq="T")

test/automl/test_forecast.py::test_numpy_large
  /home/jagukelb/opt/miniconda3/envs/flaml-test/lib/python3.10/site-packages/prophet/models.py:16: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
    import pkg_resources

test/automl/test_forecast.py: 20 warnings
  /home/jagukelb/opt/miniconda3/envs/flaml-test/lib/python3.10/site-packages/lightgbm/basic.py:696: UserWarning: Usage of np.ndarray subset (sliced data) is not recommended due to it will double the peak memory cost in LightGBM.
    _log_warning("Usage of np.ndarray subset (sliced data) is not recommended "

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================================= short test summary info ==============================================
FAILED test/automl/test_forecast.py::test_numpy - TypeError: cannot infer freq from a non-convertible index of dtype float64
============================= 1 failed, 1 passed, 7 deselected, 24 warnings in 19.07s ==============================

And here is the output when it works after downgrading pandas:

$ pytest -v test/automl/test_forecast.py -k test_numpy
=============================================== test session starts ================================================
platform linux -- Python 3.10.14, pytest-7.4.0, pluggy-1.0.0 -- /home/jagukelb/opt/miniconda3/envs/flaml-test/bin/python
cachedir: .pytest_cache
rootdir: /home/jagukelb/src/experiments/FLAML
configfile: pyproject.toml
collected 9 items / 7 deselected / 2 selected

test/automl/test_forecast.py::test_numpy PASSED                                                              [ 50%]
test/automl/test_forecast.py::test_numpy_large PASSED                                                        [100%]

================================================= warnings summary =================================================
test/automl/test_forecast.py::test_numpy
  /home/jagukelb/opt/miniconda3/envs/flaml-test/lib/python3.10/site-packages/prophet/models.py:16: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
    import pkg_resources

test/automl/test_forecast.py: 22 warnings
  /home/jagukelb/opt/miniconda3/envs/flaml-test/lib/python3.10/site-packages/pandas/core/dtypes/cast.py:1641: DeprecationWarning: np.find_common_type is deprecated.  Please use `np.result_type` or `np.promote_types`.
  See https://numpy.org/devdocs/release/1.25.0-notes.html and the docs for more information.  (Deprecated NumPy 1.25)
    return np.find_common_type(types, [])

test/automl/test_forecast.py::test_numpy
test/automl/test_forecast.py::test_numpy
test/automl/test_forecast.py::test_numpy
test/automl/test_forecast.py::test_numpy
test/automl/test_forecast.py::test_numpy
test/automl/test_forecast.py::test_numpy
test/automl/test_forecast.py::test_numpy_large
test/automl/test_forecast.py::test_numpy_large
  /home/jagukelb/opt/miniconda3/envs/flaml-test/lib/python3.10/site-packages/lightgbm/basic.py:696: UserWarning: Usage of np.ndarray subset (sliced data) is not recommended due to it will double the peak memory cost in LightGBM.
    _log_warning("Usage of np.ndarray subset (sliced data) is not recommended "

test/automl/test_forecast.py::test_numpy
test/automl/test_forecast.py::test_numpy
test/automl/test_forecast.py::test_numpy
  /home/jagukelb/opt/miniconda3/envs/flaml-test/lib/python3.10/site-packages/statsmodels/tsa/base/tsa_model.py:473: ValueWarning: No frequency information was provided, so inferred frequency MS will be used.
    self._init_dates(dates, freq)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
================================== 2 passed, 7 deselected, 34 warnings in 20.20s ===================================

pipenv-fails.txt pipenv-works.txt
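The failing traceback shows `pd.infer_freq` receiving a float64 array whose first value, 1.3885344e+09, is the Unix timestamp in seconds of 2014-01-01. The sketch below reproduces that kind of input and shows that converting it back to a `DatetimeIndex` before inference yields a frequency. It is an illustration of the failure mode only: it does not establish where FLAML produces the float values, and the helper names are hypothetical.

```python
import numpy as np
import pandas as pd


def month_starts_as_epoch_seconds(start: str, stop: str) -> np.ndarray:
    """Month-start timestamps as float64 seconds since the Unix epoch."""
    months = np.arange(start, stop, dtype="datetime64[M]")
    return months.astype("datetime64[s]").astype("int64").astype("float64")


def infer_freq_from_epoch_seconds(seconds: np.ndarray):
    """Convert epoch seconds to a DatetimeIndex, then infer its frequency."""
    index = pd.to_datetime(seconds, unit="s")
    return pd.infer_freq(index)


if __name__ == "__main__":
    secs = month_starts_as_epoch_seconds("2014-01", "2015-01")
    try:
        # Per the traceback above, pandas 2.2.2 rejects numeric input here;
        # other pandas versions may behave differently.
        print("raw floats:", pd.infer_freq(secs))
    except TypeError as exc:
        print("raw floats:", exc)
    # After conversion the month-start pattern is recognisable again.
    print("converted:", infer_freq_from_epoch_seconds(secs))
```

The frequency inferred after conversion is `MS`, the same value statsmodels reports in the passing run's warnings.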

yareyaredesuyo commented 1 week ago

A similar error occurs in both the Kaggle and Colab environments.

numpy: 1.25.2 pandas: 2.0.3 flaml: 2.1.2

1.25.2
2.0.3
2.1.2
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-3-97dd957c5800> in <cell line: 13>()
     11 y_train = np.random.random(size=84)
     12 automl = AutoML()
---> 13 automl.fit(
     14     X_train=X_train[:84],  # a single column of timestamp
     15     y_train=y_train,  # value for each timestamp

2 frames
/usr/local/lib/python3.10/dist-packages/flaml/automl/time_series/ts_data.py in __init__(self, train_data, time_col, target_names, time_idx, test_data)
     56 
     57         self.frequency = pd.infer_freq(train_data[time_col].unique())
---> 58         assert self.frequency is not None, "Only time series of regular frequency are currently supported."
     59 
     60         float_cols = list(train_data.select_dtypes(include=["floating"]).columns)

AssertionError: Only time series of regular frequency are currently supported.
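The assertion in this traceback fires whenever `pd.infer_freq` returns `None` instead of raising. A minimal sketch of how `None` arises for genuinely irregular timestamps, and of a pre-flight check one could run before calling `automl.fit`, is given below. The helper `require_regular_frequency` is hypothetical, and whether irregular data is actually the cause in this report is not established by the thread.

```python
import pandas as pd


def require_regular_frequency(timestamps) -> str:
    """Return the inferred frequency, or raise if none can be inferred."""
    index = pd.DatetimeIndex(timestamps)
    freq = pd.infer_freq(index)
    if freq is None:
        raise ValueError("timestamps do not follow a single regular frequency")
    return freq


if __name__ == "__main__":
    # Evenly spaced daily timestamps: a frequency can be inferred.
    regular = pd.date_range("2024-01-01", periods=5, freq="D")
    print(require_regular_frequency(regular))

    # Gaps of 1, 2 and 5 days: no single frequency fits.
    irregular = pd.to_datetime(
        ["2024-01-01", "2024-01-02", "2024-01-04", "2024-01-09"]
    )
    try:
        require_regular_frequency(irregular)
    except ValueError as exc:
        print("rejected:", exc)
```

Running a check like this on the time column before fitting separates a data problem (irregular timestamps) from a library problem (regular timestamps that still fail inside FLAML).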