sktime / sktime

A unified framework for machine learning with time series
https://www.sktime.net
BSD 3-Clause "New" or "Revised" License

Add Module for Statistical Tests #1175

Closed RNKuhns closed 7 months ago

RNKuhns commented 3 years ago

Is your feature request related to a problem? Please describe. Statistical tests have common use cases in timeseries analysis, including inspecting the properties of a timeseries (e.g. stationarity testing, checking normality) to guide modeling decisions, and also evaluating model output (including evaluating the quality of forecasts).

Adding an interface for statistical tests will allow sktime to add relevant functionality in this area. In addition to the tests themselves, this will help enable conditional transformers (think differencing a series if it is non-stationary, or applying a Box-Cox transform based on the results of a normality or conditional variance test) and post-hoc forecast evaluation/benchmarking (Diebold-Mariano and other tests of one set of forecasts against another).

Describe the solution you'd like An interface and module for statistical tests in sktime. The module's base would include a class that serves as the basis for all tests.

My thoughts on the interface are generally:

  1. Tests will be estimators (they need to be fitted)
  2. Instead of transform or predict they should have a report method that returns test results
    • Reported results should be standardized regardless of test; I think they should be the p-value, the test statistic, and whether the null was rejected. All tests would have a hyper-parameter report_detail that defaults to True and reports all three items; if it is set to False, only whether the null was rejected would be reported. Note that if a test doesn't have a p-value or test statistic, that part of the return will be None for that test.
  3. Plan to follow the logic of the forecaster refactor and keep the test logic in non-public methods, so the public methods stay standardized across sub-classes

The proposed BaseStatisticalTest is presented below.

# assuming sktime's BaseEstimator as the parent class
from sktime.base import BaseEstimator


class BaseStatisticalTest(BaseEstimator):

    def __init__(
        self,
        test_hyper_parameters...,
        hypothesis="two-sided",
        report_detail=True
    ):
        ...
        self.p_value = None
        self.test_statistic = None
        self.reject_null = None

    def _fit(self, Y, X=None):
        """Logic to fit each test."""
        ...
        # assume the values below are calculated in _fit above
        self.p_value = p_value
        self.test_statistic = test_statistic
        self.reject_null = reject_null
        return self

    def fit(self, Y, X=None):
        """Would remain the same in each test's class."""
        ...
        # Input checks, etc. happen above
        return self._fit(Y, X=X)

    def _report(self):
        """Logic to return the information reported by report.

        Returns
        -------
        The plan is to return just the boolean reject_null if the hyper-parameter
        `report_detail=False`. Otherwise the following are reported:
        p_value : float or None
            P-value associated with statistical test. If no p-value is 
            available for a test then will return None.
        test_statistic : float or None
            Test statistic from the statistical test. If no test 
            statistic is available for a test then will return None.
        reject_null : bool
            Whether the Test's Null Hypothesis was rejected.
        """
        ...
        if self.report_detail:
            return self.p_value, self.test_statistic, self.reject_null
        else:
            return self.reject_null

    def report(self):
        """Would be the same for every test."""
        self.check_is_fitted()
        ...
        return self._report()

    def fit_report(self, Y, X=None):
        """Would be the same for every test."""
        return self.fit(Y, X=X).report()

    def print_results(self):
        """Pretty-print the test hyper-parameters, the timeseries being tested, and the results."""
        ...
        return None

    def results_to_pandas(self):
        """Output results to a pd.DataFrame.

        Useful when you want to apply a test to many series and capture 
        the results.

        Returns
        -------
        results_df : pd.DataFrame
            DataFrame containing results in standardized format.
        """

        return results_df

    def results_to_excel(self, ...):
        """Output results to an Excel file.

        Useful when you want to apply a test to many series, capture the
        results, and store them on disk incrementally, so that results are
        saved even if the workflow gets stopped and you don't have to start
        over from the beginning.

        Returns
        -------
        None
        """
        self.results_to_pandas().to_excel(...)
        return None

    def results_to_csv(self, ...):
        """Output results to a CSV file.

        Useful when you want to apply a test to many series, capture the
        results, and store them on disk incrementally, so that results are
        saved even if the workflow gets stopped and you don't have to start
        over from the beginning.

        Returns
        -------
        None
        """
        self.results_to_pandas().to_csv(...)
        return None
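
For concreteness, here is a minimal sketch of what a concrete test subclass could look like under this design, assuming the base class above is filled in. The NormalityShapiro name, the alpha hyper-parameter, and the choice of scipy.stats.shapiro are purely illustrative, not part of the proposal:

from scipy import stats


class NormalityShapiro(BaseStatisticalTest):
    """Illustrative only: Shapiro-Wilk normality test under the proposed interface."""

    def __init__(self, alpha=0.05, hypothesis="two-sided", report_detail=True):
        # alpha is a hypothetical significance-level hyper-parameter
        self.alpha = alpha
        super().__init__(hypothesis=hypothesis, report_detail=report_detail)

    def _fit(self, Y, X=None):
        # scipy returns (test statistic, p-value); the null hypothesis is that the data are normal
        test_statistic, p_value = stats.shapiro(Y)
        self.test_statistic = test_statistic
        self.p_value = p_value
        self.reject_null = bool(p_value < self.alpha)
        return self


# Usage, fitting and reporting in one call:
# NormalityShapiro().fit_report(y)  # -> (p_value, test_statistic, reject_null)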

Note that I'm open on design details, particularly the naming conventions (I don't have strong feelings about the use of report, results, or something else), and likewise if we want to call this something other than statistical tests, that is fine too.

The main outstanding questions (other than general feedback) revolve around an interface for accepting different types of input that works across a range of tests.

This needs to cover:

  1. Univariate diagnostic tests of timeseries "properties" (e.g. normality, stationarity, auto-correlation, etc)
  2. Multivariate diagnostic tests (e.g. Granger causality or cointegration)
  3. Panel diagnostic tests (panel extension of stationarity tests, etc)
  4. Post-hoc tests of one set of forecasts (whether they be univariate, multivariate or potentially panel) against another ("Y_Other")

My initial thought for solving this would be for fit to accept a pd.Series, pd.DataFrame or NumPy array "Y", optionally accept exogenous data "X" (some tests will use this, others won't), and determine how to proceed based on the type of test.
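
As a rough illustration of that input handling (the _coerce_to_frame helper name and the coercion choices are my own placeholders, not settled design):

import numpy as np
import pandas as pd


def _coerce_to_frame(Y):
    """Illustrative input coercion: accept pd.Series, pd.DataFrame or np.ndarray
    and hand a pd.DataFrame to the downstream test logic."""
    if isinstance(Y, pd.Series):
        return Y.to_frame()
    if isinstance(Y, pd.DataFrame):
        return Y
    if isinstance(Y, np.ndarray):
        # 1D arrays become a single-column frame; 2D arrays keep their columns
        return pd.DataFrame(Y.reshape(-1, 1) if Y.ndim == 1 else Y)
    raise TypeError(f"Y must be a pd.Series, pd.DataFrame or np.ndarray, got {type(Y)}.")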

This leaves the last piece, which is how to accept the data for post-hoc tests. Note that these tests can often be applied to univariate data, while an extension allows them to be applied to multivariate data. I'd propose we don't want separate classes based on that distinction. Instead, I propose the following logic (a rough dispatch sketch follows the list):

  1. Optionally pass "Y_Other" in fit (kind of like how we handle y_train in performance metrics). If "Y_Other" is received, we check its dimensions against Y and assume we are doing a post-hoc comparison of Y against "Y_Other".
  2. If "Y_Other" is not passed and a pd.Series is received, then raise an error (you'd have nothing to compare the series against).
  3. If "Y_Other" is not passed and Y is a pd.DataFrame, then make assumptions about its structure and proceed with the test (e.g. if it has 2 columns, test column 1 against column 2).
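
A minimal sketch of that dispatch, following the three points above (the helper name and error messages are illustrative only):

import pandas as pd


def _resolve_posthoc_inputs(Y, Y_other=None):
    """Illustrative dispatch for post-hoc test inputs."""
    if Y_other is not None:
        # 1. explicit comparison: basic compatibility check of the two inputs
        if len(Y) != len(Y_other):
            raise ValueError("Y and Y_other must have the same length.")
        return Y, Y_other
    if isinstance(Y, pd.Series):
        # 2. a single series leaves nothing to compare against
        raise ValueError("Post-hoc tests require Y_other when Y is a pd.Series.")
    if isinstance(Y, pd.DataFrame) and Y.shape[1] >= 2:
        # 3. assume the first two columns are the forecast sets to compare
        return Y.iloc[:, 0], Y.iloc[:, 1]
    raise ValueError("Could not determine the inputs for a post-hoc comparison.")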

Note that I will edit this comment later to add a list of tests that can be interfaced (primarily from statsmodels) and a set of tests we'd need to write ourselves.

The plan would be to chunk this out into phases:

  1. Decide on the framework and implement BaseStatisticalTest and unit tests
  2. Open an issue with a checklist of good first issues for interfacing tests in statsmodels (and possibly elsewhere, if we can avoid adding unneeded additional dependencies)
  3. Create an issue with a checklist of tests we need to code ourselves (as of now these are mostly post-hoc tests and some boutique extensions of diagnostic tests)

Describe alternatives you've considered An alternative I've considered is to import and use tests from other packages (statsmodels) when available. But there are tests not in statsmodels that we should add (post-hoc forecast evaluation ones in particular). Having a common interface that can be used both to adapt statsmodels tests to our format and for our own statistical tests seems like the way to go for uniformity.
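
To make the adaptation idea concrete, here is a minimal sketch of mapping an existing statsmodels test (the ADF stationarity test) onto the standardized (p_value, test_statistic, reject_null) report; the function name and the alpha threshold are illustrative only:

import numpy as np
from statsmodels.tsa.stattools import adfuller


def adf_report(y, alpha=0.05, report_detail=True):
    """Illustrative adapter: run the ADF test and return the standardized report."""
    # adfuller returns (test statistic, p-value, ...); the null hypothesis is a unit root
    test_statistic, p_value, *_ = adfuller(np.asarray(y))
    reject_null = bool(p_value < alpha)
    if report_detail:
        return p_value, test_statistic, reject_null
    return reject_null


# Usage on white noise (stationary, so the unit-root null should be rejected):
# adf_report(np.random.default_rng(0).normal(size=200))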

Note that in terms of the interface, one consideration I've had is whether to have separate base classes for post-hoc tests and diagnostic tests. The main difference is the interface for fit, as diagnostic tests don't need to worry about "Y_Other".

TonyBagnall commented 3 years ago

this is a great idea, we need a few tests for evaluation, Wilcoxon signed-rank and a couple of others, and I prefer to have bespoke implementations.

TonyBagnall commented 3 years ago

just pinging this here, as it is dependent on hypothesis tests, which we can work into this package: https://github.com/alan-turing-institute/sktime/issues/1186

fkiraly commented 7 months ago

Abandoned and superseded by the parameter estimator module - which follows similar ideas but uses get_fitted_params instead of report.