unit8co / darts

A python library for user-friendly forecasting and anomaly detection on time series.
https://unit8co.github.io/darts/
Apache License 2.0
8.11k stars, 884 forks

Raise warning in `from_group_dataframe()` for monotonically increasing index. #1607

Closed (dennisbader closed 1 year ago)

dennisbader commented 1 year ago

See #1606.

In TimeSeries.from_group_dataframe():

halstonblim commented 4 weeks ago

Not sure if this is the right place for this, but it seems the UserWarning is raised even when time_col is specified, as long as the input df is sorted by time (df[time_col].is_monotonic_increasing is True), even if all series have overlapping time indices. However, when df is not sorted by time, the UserWarning is not raised. Shouldn't the UserWarning be raised only when time_col is not specified, regardless of how df is sorted?

E.g., running the sample code from https://unit8co.github.io/darts/examples/15-static-covariates.html with and without sorting by time leads to a warning vs. no warning:

import numpy as np
import pandas as pd
from darts import TimeSeries

df = pd.DataFrame(
    data={
        "dates": [
            "2020-01-01",
            "2020-01-02",
            "2020-01-03",
            "2020-01-01",
            "2020-01-02",
            "2020-01-03",
        ],
        "comp1": np.random.random((6,)),
        "comp2": np.random.random((6,)),
        "comp3": np.random.random((6,)),
        "ID": ["SERIES1", "SERIES1", "SERIES1", "SERIES2", "SERIES2", "SERIES2"],
        "var1": [0.5, 0.5, 0.5, 0.75, 0.75, 0.75],
    }
)
print("Input DataFrame")
print(df)
df = df.sort_values(["dates", "ID"])  # <== sorting df by time causes the UserWarning to be raised

series_multi = TimeSeries.from_group_dataframe(
    df,
    time_col="dates",
    group_cols="ID",  # individual time series are extracted by grouping `df` by `group_cols`
    static_cols=[
        "var1"
    ],  # also extract these additional columns as static covariates (without grouping)
    value_cols=[
        "comp1",
        "comp2",
        "comp3",
    ],  # optionally, specify the time varying columns
)

Sorting by time leads to UserWarning: The (time) index from `df` is monotonically increasing. This results in time series groups with non-overlapping (time) index. You can ignore this warning if the index represents the actual index of each individual time series group.
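A minimal sketch of the difference between the two row orders, using plain pandas only. It assumes the check relies on pandas' `is_monotonic_increasing` property evaluated on the time column of the whole `df`; that is an assumption about the internals, not confirmed darts code.

```python
import pandas as pd

# Same dates as in the example above, in the original (grouped) row order.
dates = pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"] * 2)
ids = ["SERIES1"] * 3 + ["SERIES2"] * 3
df = pd.DataFrame({"dates": dates, "ID": ids})

# Grouped order: dates go 01, 02, 03, 01, ... so they are not monotonic.
grouped_is_monotonic = df["dates"].is_monotonic_increasing

# Sorted by time: dates go 01, 01, 02, 02, ... which is monotonic (non-strict).
sorted_df = df.sort_values(["dates", "ID"])
sorted_is_monotonic = sorted_df["dates"].is_monotonic_increasing

print(grouped_is_monotonic, sorted_is_monotonic)  # False True
```

Under that assumption, only the sorted frame would trip the check, which matches the behaviour described above.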

dennisbader commented 4 weeks ago

Hi @halstonblim, we do the check regardless of whether you pass time_col specifically or not. It's a sanity check where we want to avoid that users run into pitfalls down the line.
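To illustrate the kind of pitfall the warning text describes, here is a small sketch with plain pandas (not the darts internals). When the default integer index of `df` is used as the time index, it runs through all rows, so each group ends up with its own, non-overlapping slice of it.

```python
import pandas as pd

# Two series stacked in one DataFrame with the default RangeIndex 0..5.
df = pd.DataFrame(
    {
        "ID": ["SERIES1"] * 3 + ["SERIES2"] * 3,
        "comp1": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    }
)

# The index of the whole frame is monotonically increasing ...
frame_is_monotonic = df.index.is_monotonic_increasing
print(frame_is_monotonic)  # True

# ... so grouping by ID gives each series a different index range.
group_indices = {key: group.index.tolist() for key, group in df.groupby("ID")}
print(group_indices)  # {'SERIES1': [0, 1, 2], 'SERIES2': [3, 4, 5]}
```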

We check whether the index is monotonically increasing (i.e. each index value must be larger than or equal to the previous one, assuming a sorted index) because pandas has a built-in property for this. It would be better to check whether it's strictly monotonically increasing (i.e. each index value must be larger than, and not equal to, the previous one, assuming a sorted index). But pandas doesn't have a property for that, and I don't think it's necessary to add this logic for the sanity check.
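As a sketch of what a stricter check could look like (this is not what darts does, and `is_strictly_increasing` is a hypothetical helper), pandas' public `is_monotonic_increasing` and `is_unique` properties can be combined: an index that is non-decreasing and has no duplicates is strictly increasing.

```python
import pandas as pd


def is_strictly_increasing(index: pd.Index) -> bool:
    """Hypothetical helper: True if every value is larger than the previous one."""
    # Non-decreasing and free of duplicates implies strictly increasing.
    return bool(index.is_monotonic_increasing and index.is_unique)


# Dates shared by two groups after sorting by time: monotonic, but with repeats.
shared_dates = pd.DatetimeIndex(
    ["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02"]
)
# A default integer index running through all rows.
range_index = pd.RangeIndex(6)

print(shared_dates.is_monotonic_increasing)  # True
print(is_strictly_increasing(shared_dates))  # False
print(is_strictly_increasing(range_index))  # True
```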

Your index is monotonically increasing but not strictly, so you have a valid index. We mention in the warning message that you can ignore it if it's actually a valid index.

I think it's fine to raise this warning but maybe we could improve the message to clarify some things (or add an ignore_warnings flag to the method). WDYT?
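Without an `ignore_warnings` flag, one possible workaround is to silence the warning locally with Python's standard `warnings` module. A minimal sketch: `build_series` is a hypothetical stand-in for the `TimeSeries.from_group_dataframe()` call, and it is assumed that the warning is emitted through `warnings.warn` as a `UserWarning` (if it were emitted through a logger instead, this filter would not catch it).

```python
import warnings


def build_series():
    # Hypothetical stand-in for TimeSeries.from_group_dataframe(...).
    warnings.warn(
        "The (time) index from `df` is monotonically increasing.", UserWarning
    )
    return "series"


# Filters set inside the context manager are undone when the block exits.
with warnings.catch_warnings():
    # Ignore only this UserWarning; `message` is a regex matched at the start
    # of the warning text, so the parentheses are escaped.
    warnings.filterwarnings(
        "ignore",
        message=r"The \(time\) index from",
        category=UserWarning,
    )
    result = build_series()

print(result)  # series
```

Scoping the filter to the message and category keeps other warnings from the same call visible.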

halstonblim commented 4 weeks ago

Thanks for clarifying @dennisbader. I think the ignore_warnings flag would be helpful!

Agreed, strictly monotonically increasing is a bit better. Probably not worth implementing, and I think there would still be the corner case of a single group, where you would expect a monotonic time index.