unit8co / darts

A python library for user-friendly forecasting and anomaly detection on time series.
https://unit8co.github.io/darts/
Apache License 2.0
7.92k stars 860 forks source link

Holidays not handled optimally #291

Closed aurelije closed 1 year ago

aurelije commented 3 years ago

Describe the bug Calling add_holidays on timeseries makes multivariate series.

Expected behavior Calling models that can handle holidays (eg. Prophet, ARIMA with exogenous series) would automatically use holidays series from timeseries object that would still be considered as univariate. Models that can't handle holidays could issue warning that holidays feature is not supported. Multivariate models, I guess, do not have any problem using holidays data as multivarate timeseries...

Not expected behavior Throwing exception that series is not univariate. That defeat the purpose of having Timeseries class that has information about what holiday should be considered. The details of how that info should be handled by each of specific models should be transparent, not thrown to us.

But as I could see ARIMA for example do not expose exogenous series functionality from stats models so even manually holidays info can't be used

Prophet model is exposing internal holidays functionality over constructor. What I see as a better option is instead of this in fit method:

# Input built-in country holidays
        if self.country_holidays is not None:
            self.model.add_country_holidays(self.country_holidays)

to have something like this:

        if series.country_holidays is not None:
            self.model.add_country_holidays(series.country_holidays)

System (please complete the following information):

tneuer commented 3 years ago

Hello @aurelije! Some functions like ARIMA, AutoARIMA and VARIMA will include arguments in the fit method to include exogenous variables in the next release (currently only on develop branch), where the user could include the holiday information.

Would you suggest to have a special marker on the holidays column and deal with it automatically? This might cause problems if someone changes for example the name and / or position of this column. I would argue it is easier to treat the holiday column as any other exogenous variable instead of treating it separately. If a model is used which can only handle univariate input I think it makes sense to raise an error for multivariate input as we don't want to guess which of the passed columns is the target column. But I think I see your point that currently the holiday column is not really used optimally as only Prophet and some multivariate model can use the information, but in the next release it should also be usable for all the ARIMA functions. So I think it makes sense like it is or would you disagree?

aurelije commented 3 years ago

@tneuer honestly I am just a poor newbie in this domain, so I am just exploring different options for doing time series analysis in Python because I am or at least used to be java developer that likes python, knows pandas, likes ML... So R is not preferred by me.

I know a bit of AutoARIMA and FBProphet but I am opened also to other algorithms and libraries. I see this library, sktime and hcrystalball as the most promising. Maybe I got it wrong but I taught that the purpose of this library is to make your time series analysis simpler, otherwise I already can do everything using directly pmdarima, statsmodels, fbprophet... But then no simple interface, data input different for each of them.... lot of boiler code to have comparison or ensemble of different algorithms.

Having specialized type for timeseries instead of plain Pandas dataframes goes into that direction. So telling from dumb user position, but also telling from object-oriented analysis and design point of view it make sense that all info about timeseries goes in that class (so also other attributes of TS as seasonality should be a part of TS object not of each model). Then that plain information will be used by each of possible algorithms in algorithm specific way.

Example: you create time series object. You know that it contains data for one specific market (let we say customer's demand for some product). Because you know region of market (France for example) you know or at least speculate that holidays have influence on demand. Time series should have a method that you would use to turn on holiday awareness for your series object. It could be just a simple field in ts object called use_holidays_for and then you set it to "FR" or other ISO code for state, possible another field for province... Then it is up to concrete algorithm to use that info.

Fb prophet wrapper will just use the value of that field to call add_country_holidays on embedded model object. Arima and company would at that point generate holidays 0/1 series (like add_holidays on timeseries is currently doing) and pass it as a external regressor.

Anyhow the concrete implementation of this feature is not my concern as long as it do the right thing for the end user. Also s should concrete model throw an exception or warning if holidays are not supported at all is something for discussion but I am always for flexibility given to the end user (maybe some config option to override default behavior?)

What would be the gain? Simple interface, one place to turn on and off holidays for all (or at least all that can support holiday no matter how), and rerun models with one line change and evaluate gain from holidays.

How did I got into this trouble? This is my use case: I was trying https://unit8co.github.io/darts/examples/01-darts-intro.html example notebook, and in the end I wanted to see if there is support for holidays. My expectation was that I can just call one method on ts object to announce that I would turn on holidays. I saw the method that looks like the thing I was looking for (add_holidays), first rerun of the notebook example and I saw no changes in accuracy of any model... Then I figured out that I have to get the output TS from add_holidays and use it in the models. Then Naive model failed, I commented it out, then next model failed, and next. I was really, really surprised when even FBProphet failed and it has a built-in holidays support!!! Then I looked at the implementation and came here to open this request.

hrzn commented 3 years ago

Hi @aurelije, you are right that this might be a little confusing, and I like the overall approach that you propose (embedding a sort of "automatic" holiday metadata inside TimeSeries - the country code etc - which can be re-used by whatever models which are set to take holidays into account). It will take a bit of work and we have quite a few other features we want to implement, but we'll put it on the roadmap.

hrzn commented 1 year ago

I think this is not so relevant anymore. Models accepting exogenous data can use holiday TimeSeries as covariates, and multivariate models can use holiday dimensions. I don't think there'd be a strong benefit in having some kind of hidden holiday dimension in TimeSeries.