samples with different length

thunderbug1 commented 3 years ago

If I understand the WEASEL+MUSE algorithm correctly, it should be possible to use it with samples of different lengths. This is currently not possible with the API of the WEASELMUSE class, which expects a 3D array of shape (n_samples, n_features, n_timestamps): in a NumPy array, all samples necessarily have the same shape.

I tried padding the time series of all samples with NaN values up to the length of the longest sample, but the input checks reject NaN values. Is there a way to use samples of different lengths?

johannfaouzi commented 3 years ago

Hi,

Sorry for the late reply. Variable-length data sets are unfortunately not supported at the moment.

Regarding WEASEL+MUSE, you can achieve this with the following process:

1. Create a data set for each unique length value (in each data set, all time series should have the same length).
2. Transform each data set using a separate instance of WEASELMUSE (set chi2_threshold to a very low positive value so that no feature selection is performed).
3. Concatenate the transformed data sets (the pandas package is handy for this).
4. Perform feature selection on the concatenated data set.
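
The first step above can be sketched as follows when the raw data is a list of arrays with different lengths rather than a single 3D array. This is only a minimal sketch: `group_by_length` is a hypothetical helper (it is not part of pyts), and the toy series and labels are made up for illustration.

```python
from collections import defaultdict

import numpy as np


def group_by_length(series_list, labels):
    """Group variable-length multivariate series into equal-length 3D arrays.

    series_list: list of arrays of shape (n_features, n_timestamps_i)
    labels: sequence of class labels, one per series
    Returns {length: (X, y, idx)} where X has shape
    (n_samples_with_that_length, n_features, length) and idx holds the
    positions of these samples in the original list.
    """
    groups = defaultdict(list)
    for i, series in enumerate(series_list):
        groups[series.shape[1]].append(i)

    labels = np.asarray(labels)
    out = {}
    for length, idx in groups.items():
        idx = np.asarray(idx)
        X = np.stack([series_list[i] for i in idx])
        out[length] = (X, labels[idx], idx)
    return out


# Toy data (hypothetical): 6 series with 3 features and lengths 80 or 100
rng = np.random.RandomState(0)
toy_lengths = [80, 100, 80, 100, 100, 80]
toy_series = [rng.randn(3, n) for n in toy_lengths]
toy_labels = ['a', 'b', 'a', 'b', 'a', 'b']

grouped = group_by_length(toy_series, toy_labels)
for length, (X, y, idx) in sorted(grouped.items()):
    print(length, X.shape, y.tolist(), idx.tolist())
```

Each `(X, y)` pair can then be passed to its own WEASELMUSE instance, and `idx` is kept so that the transformed rows can be mapped back to the original samples.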

The main downside of this approach is the high memory (RAM) usage, because feature selection is only performed at the last step. A possible solution (that would lead to the same results) is to loop over the window sizes: instead of passing a list of k window sizes to the window_sizes parameter, you write a for loop over the window sizes and provide a single window size inside the loop.
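
The reason a loop can give the same results is that the chi2 score of a feature depends only on that feature and the labels, so selecting features block by block keeps exactly the columns that selecting on the full concatenated matrix would keep. The toy check below illustrates this under stated assumptions: `chi2_scores` is a plain-NumPy stand-in written for this sketch (in practice `sklearn.feature_selection.chi2` would be the natural choice), and `blocks` are random count matrices standing in for the outputs obtained with one window size each.

```python
import numpy as np


def chi2_scores(X, y):
    """Chi-squared statistic between each non-negative feature and the class.

    Assumes every column of X has a strictly positive sum.
    """
    classes = np.unique(y)
    Y = (y[:, None] == classes[None, :]).astype(float)  # (n_samples, n_classes)
    observed = Y.T @ X                                   # (n_classes, n_features)
    expected = np.outer(Y.mean(axis=0), X.sum(axis=0))
    return ((observed - expected) ** 2 / expected).sum(axis=0)


rng = np.random.RandomState(0)
y = np.repeat([0, 1], 20)

# Hypothetical stand-ins for the word counts obtained with 3 window sizes;
# some columns are made class-dependent so that they pass the threshold
blocks = [
    rng.randint(1, 6, size=(40, 5)) + 3 * y[:, None] * (rng.rand(5) > 0.5)
    for _ in range(3)
]
threshold = 2.0

# (a) Feature selection on the full concatenated matrix (memory-hungry)
X_full = np.hstack(blocks)
X_selected_full = X_full[:, chi2_scores(X_full, y) > threshold]

# (b) Feature selection inside the loop: only the kept columns are stored
kept = []
for X_block in blocks:
    mask = chi2_scores(X_block, y) > threshold
    kept.append(X_block[:, mask])
X_selected_loop = np.hstack(kept)

print(np.array_equal(X_selected_full, X_selected_loop))
```

With variant (b), the unfiltered features of all window sizes never have to be held in memory at the same time.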

Here is an example (without the aforementioned optimization; I can modify the example to show it if needed):

import numpy as np
from pyts.datasets import load_basic_motions
from pyts.multivariate.transformation import WEASELMUSE
import pandas as pd
from sklearn.feature_selection import chi2

#######################
####### D A T A #######
#######################

# Toy dataset
X_train, X_test, y_train, y_test = load_basic_motions(return_X_y=True)
# X_train.shape = X_test.shape = (40, 6, 100)

# Sample 4 distinct random lengths in the interval [80, 100]
rng = np.random.RandomState(42)
lengths = 80 + rng.choice(21, size=4, replace=False)

# Assign 10 time series to each length
lengths_samples_train_idx = rng.permutation(40).reshape((4, 10))
lengths_samples_test_idx = rng.permutation(40).reshape((4, 10))

#######################
# P A R A M E T E R S #
#######################

# WEASEL+MUSE parameters
weasel_muse_params = {'word_size': 2, 'n_bins': 2, 'window_sizes': [12, 36],
                      'chi2_threshold': 1e-80}
transformer_list = [WEASELMUSE(**weasel_muse_params) for _ in range(4)]

#######################
### T R A I N I N G ###
#######################

X_weasel_train = []
for samples_idx, length, transformer in zip(lengths_samples_train_idx, lengths, transformer_list):
    X_weasel_train.append(transformer.fit_transform(X_train[samples_idx, :, :length], y_train[samples_idx]))

# Concatenate the arrays into a DataFrame and fill NA values with 0
df_weasel_train = pd.concat([
    pd.DataFrame.sparse.from_spmatrix(
        X, index=samples_idx, columns=np.vectorize(transformer.vocabulary_.get)(np.arange(X.shape[1]))
    )
    for X, samples_idx, transformer in zip(X_weasel_train, lengths_samples_train_idx, transformer_list)
]).fillna(0.)

# Perform feature selection using chi2 test
chi2_threshold = 2.
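# The rows of df_weasel_train follow the permuted sample order (see its index),
# so the labels are reordered with that index to stay aligned with the rows
chi2_statistics, _ = chi2(df_weasel_train, y_train[df_weasel_train.index.to_numpy()])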
features_idx_to_keep = np.where(chi2_statistics > chi2_threshold)[0]
features_to_keep = df_weasel_train.columns[features_idx_to_keep]
df_weasel_train = df_weasel_train[features_to_keep]

#######################
## I N F E R E N C E ##
#######################

X_weasel_test = []
for samples_idx, length, transformer in zip(lengths_samples_test_idx, lengths, transformer_list):
    X_weasel_test.append(transformer.transform(X_test[samples_idx, :, :length]))

# Concatenate the arrays into a DataFrame and fill NA values with 0
df_weasel_test = pd.concat([
    pd.DataFrame.sparse.from_spmatrix(
        X, index=samples_idx, columns=np.vectorize(transformer.vocabulary_.get)(np.arange(X.shape[1]))
    )
    for X, samples_idx, transformer in zip(X_weasel_test, lengths_samples_test_idx, transformer_list)
]).fillna(0.)[features_to_keep]
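
To see what the concatenation step does, here is a tiny self-contained sketch with made-up data. The word names, counts and labels are hypothetical. It shows that pd.concat aligns the DataFrames on their column names, that words unknown to a given transformer end up as NaN (hence the fillna), and that the rows keep the original sample indices, so any label array has to be indexed with the DataFrame index to match the row order.

```python
import numpy as np
import pandas as pd

# Hypothetical word counts from two transformers with different vocabularies;
# the index holds the position of each sample in the original data set
df_a = pd.DataFrame([[1, 0], [2, 1]], index=[3, 0], columns=['w1 aa', 'w1 ab'])
df_b = pd.DataFrame([[4, 1]], index=[1], columns=['w1 ab', 'w1 bb'])

# concat aligns on column names; words unseen by a transformer become NaN
df = pd.concat([df_a, df_b]).fillna(0.)

# Rows are in the order [3, 0, 1], so labels are reordered to match
y = np.array(['x', 'y', 'x', 'y'])
y_aligned = y[df.index.to_numpy()]

print(df)
print(y_aligned)
```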

Let me know if this helps you.

thunderbug1 commented 3 years ago

Oh wow, thanks for the extensive example. I wouldn't have considered using separate instances of WEASELMUSE, but it makes sense. I will give it a try.

johannfaouzi / pyts

samples with different length #104