johannfaouzi / pyts

A Python package for time series classification
https://pyts.readthedocs.io
BSD 3-Clause "New" or "Revised" License

WEASEL Implementation #77

Open arainboldt opened 4 years ago

arainboldt commented 4 years ago

I don't have a bug or issue to report, just some questions. First, the word-bank size for the WEASEL transformation is limited to 26. Apparently this is because there are 26 letters in the English alphabet, but this seems like a rather arbitrary limitation. Why was it designed this way? Secondly, is it possible to serialize a WEASEL transformer? Suppose I have a large time series that takes a while to model: shouldn't there be a warm-start process? It would be best if it were serializable to str or JSON, but even pickle would be nice. If it is, let me know!

Thanks for the great package!

Andrew

johannfaouzi commented 4 years ago

Hi,

Thanks for your kind words.

Regarding your first question, the limit on the alphabet size is indeed arbitrary. However, this limit is already much higher than what is used in practice. Citing the WEASEL paper:

We fix the alphabet size c to 4, as it has been shown in the context of BOP models that using a constant c = 4 is very robust over all TS considered

Let n_bins be the alphabet size and word_size be the number of symbols in each word. WEASEL considers single words and pairs of words, which means that, for a single window, the number of possible features is:

n_bins ** word_size + n_bins ** (2 * word_size)

And you do this for every window, which means that the maximum number of extracted features is:

n_windows * (n_bins ** word_size + n_bins ** (2 * word_size))

where n_windows is the number of sliding windows considered.
The alphabet size n_bins corresponds to the number of bins used to discretize the Fourier coefficients (see the figure below taken from BOSS paper):

[Figure from the BOSS paper: the Fourier coefficients are discretized into bins, each bin being mapped to a symbol of the alphabet.]

Obviously not all the possible words will actually be present, but if you increase the number of possible words, you increase the number of unique words, and the frequency of each word decreases. So you get words that are more specific to each time series; if the words are too specific, you are just overfitting the training set. Note that WEASEL is fitted on the training set and applied to both the training and test sets: you learn the vocabulary on the training set, then you compute the frequency of each word for each time series in the training and test sets.
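As a small illustration of this fit/transform split (using the GunPoint dataset shipped with pyts; the parameters are left at their defaults):

from pyts.datasets import load_gunpoint
from pyts.transformation import WEASEL

X_train, X_test, y_train, y_test = load_gunpoint(return_X_y=True)

weasel = WEASEL(sparse=False)
# The vocabulary is learned on the training set only
X_train_weasel = weasel.fit_transform(X_train, y_train)
# The test set is transformed with that same vocabulary
X_test_weasel = weasel.transform(X_test)

# Both sets share the same features (columns)
print(len(weasel.vocabulary_))
print(X_train_weasel.shape, X_test_weasel.shape)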

Regarding your second question, WEASEL is serializable to some extent. The transformation is performed independently for each window (characterized by the window_sizes and window_steps parameters). So, if you have a large dataset, you can perform the transformation for each window independently and save each estimator. However, no warm start is possible: WEASEL does not update coefficients like a linear model and does not add new estimators like a random forest or gradient tree boosting. I'm not an expert in serialization, but I know that for scikit-learn-like estimators it's usually done using joblib or pickle. I think it should also be serializable to JSON, since all you need to define a fitted estimator are its hyperparameters and its learned vocabulary.

Minimal working example:

from joblib import dump
import numpy as np
from pyts.datasets import load_gunpoint
from pyts.transformation import WEASEL

# Load the GunPoint dataset
X_train, X_test, y_train, y_test = load_gunpoint(return_X_y=True)

# Define the parameters
common_params = {'sparse': False, 'chi2_threshold': 1.2}
window_sizes = [0.1, 0.3, 0.5, 0.7, 0.9]

# Compute the transformation all at once
weasel = WEASEL(**common_params, window_sizes=window_sizes)
X_train_weasel_1 = weasel.fit_transform(X_train, y_train)
X_test_weasel_1 = weasel.transform(X_test)

# Compute the transformation for each window independently
X_train_weasel_2, X_test_weasel_2 = [], []
for window_size in window_sizes:
    weasel_ind = WEASEL(**common_params, window_sizes=[window_size])
    X_train_weasel_2.append(weasel_ind.fit_transform(X_train, y_train))
    X_test_weasel_2.append(weasel_ind.transform(X_test))

    # You can save the estimator (or both arrays if you prefer)
    dump(weasel_ind, f'weasel_{window_size}.joblib')

X_train_weasel_2 = np.concatenate(X_train_weasel_2, axis=1)
X_test_weasel_2 = np.concatenate(X_test_weasel_2, axis=1)

# Check that the arrays are identical
np.testing.assert_equal(X_train_weasel_1, X_train_weasel_2)
np.testing.assert_equal(X_test_weasel_1, X_test_weasel_2)
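Later on, you can reload the saved estimators and reuse them; a short sketch continuing the example above (assuming the files written by dump still exist):

from joblib import load

# Reload each saved estimator and redo the transformation
X_test_weasel_3 = []
for window_size in window_sizes:
    weasel_ind = load(f'weasel_{window_size}.joblib')
    X_test_weasel_3.append(weasel_ind.transform(X_test))
X_test_weasel_3 = np.concatenate(X_test_weasel_3, axis=1)

np.testing.assert_equal(X_test_weasel_1, X_test_weasel_3)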

Hope this helps you a bit.

avm19 commented 2 years ago

Dear @johannfaouzi ,

First, thanks for your implementation and for the recent review paper. My question is somewhat relevant to this discussion.

As far as I understand, in principle, n_bins (the alphabet size) can be set for each sliding window independently. And it looks like implementing a list-like n_bins would require only minor changes to the code. As long as the window sizes are unique, there is not even a need to resolve collisions like abc from a 3-letter alphabet vs. abc from a 4-letter alphabet. Am I missing anything? I could of course use two or more instances of WEASELMUSE for that and then combine the results.

Also, do you think it makes sense to have a finer quantisation for larger windows? Could you share your intuition on this? Thanks!

johannfaouzi commented 2 years ago

Hi,

As far as I understand, in principle, n_bins (the alphabet size) can be set for each sliding window independently. And it looks like implementing a list-like n_bins would require only minor changes to the code. As long as the window sizes are unique, there is not even a need to resolve collisions like abc from a 3-letter alphabet vs. abc from a 4-letter alphabet. Am I missing anything? I could of course use two or more instances of WEASELMUSE for that and then combine the results.

n_bins could indeed be set for each sliding window independently, because the features corresponding to each (pair of) word(s) are considered different for different sliding windows: aba with a sliding window of 12 is a different feature from aba with a sliding window of 16. However, the current implementation assumes a single value for n_bins, so this would indeed require unique values for the window sizes. A more general solution would be to include the value of n_bins in the feature name, in order to differentiate identical words with the same window size but a different alphabet size.

I would consider abc from a 3-letter alphabet different from abc from a 4-letter alphabet, because the corresponding quantiles will be different, so you should generally get different features.

With the current version of the code, you would have to do it "manually", by indeed creating several independent instances of WEASELMUSE and combining the results.
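A minimal sketch of that manual approach, using the univariate WEASEL for brevity (the same pattern applies to WEASELMUSE); the alphabet and window sizes are arbitrary illustrative values:

import numpy as np
from pyts.datasets import load_gunpoint
from pyts.transformation import WEASEL

X_train, X_test, y_train, y_test = load_gunpoint(return_X_y=True)

# One instance per alphabet size, with unique window sizes to avoid
# collisions between identical words from different alphabets
transformers = [
    WEASEL(n_bins=3, window_sizes=[12], sparse=False),
    WEASEL(n_bins=4, window_sizes=[16], sparse=False),
]

X_train_parts = [t.fit_transform(X_train, y_train) for t in transformers]
X_test_parts = [t.transform(X_test) for t in transformers]

# Combine the results into a single feature matrix
X_train_new = np.concatenate(X_train_parts, axis=1)
X_test_new = np.concatenate(X_test_parts, axis=1)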

Also, do you think it make sense to have a finer quantisation for larger windows? Could you share your intuition on this?

Well, it depends. Having larger sliding windows implies having a lower number of windows for each time series (and thus fewer samples). Finer quantization usually requires more samples, so that you can afford a good estimation of the quantiles. Moreover, the number of possible words increases exponentially with the alphabet size: for any window size, the theoretical maximum number of features is n_bins ** word_size + n_bins ** (2 * word_size). Therefore, in my opinion, if you want to increase the alphabet size, you need more windows (the total number of windows is equal to the number of time series times the number of windows extracted for each time series), that is, more time series and/or more windows extracted from each time series.

Otherwise you will get features that are extremely sparse (some words being possibly present in only one time series), which you don't want: if the words for the time series in the test set are never present in the training set, you will probably get really bad predictive performance. In my opinion, this is very related to overfitting/underfitting: you want to extract features that are actually useful (not too specific to each time series, which is overfitting, and not too generic, which is underfitting).

I'm not perfectly sure about the literature, but in most cases, I think that the authors set it to 4, which is probably a good default value given the size of most datasets. You can try more bins, but I strongly encourage you to check the sparsity of the matrix that you get at the end, e.g., the percentage of non-zero values for each feature: if the sparsity is too high (that is, if the percentage of non-zero values is too low), it means that the features are probably too specific to each time series, and I think that you will face overfitting in this case.
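For instance, a quick way to run this check (the dataset and parameter values are only placeholders):

import numpy as np
from pyts.datasets import load_gunpoint
from pyts.transformation import WEASEL

X_train, _, y_train, _ = load_gunpoint(return_X_y=True)

weasel = WEASEL(n_bins=4, window_sizes=[12, 24, 36], sparse=False)
X_weasel = weasel.fit_transform(X_train, y_train)

# Fraction of time series in which each word occurs at least once;
# very low values indicate words specific to a few time series
nonzero_rate = (X_weasel > 0).mean(axis=0)
print(np.sort(nonzero_rate)[:10])  # the ten rarest words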

Hope this helps you a bit.