Hi @ratmcv, if you want to use any of the Rocket models on a large dataset (defined by the authors as >10k samples), I'd suggest trying MiniRocket Pytorch (see tutorial). MiniRocket is faster and usually achieves better results than Rocket, and the Pytorch version doesn't have any size limitation: it calculates the features on the fly (for every batch). You can use it like any other tsai model:
dsid = 'ECGFiveDays'
X, y, splits = get_UCR_data(dsid, split_data=False)  # load a UCR dataset without pre-splitting
tfms = [None, TSClassification()]                    # no input transform; vectorized label encoding
batch_tfms = TSStandardize()                         # standardize the input batches
dls = get_ts_dls(X, y, splits=splits, tfms=tfms, batch_tfms=batch_tfms, bs=256)
learn = ts_learner(dls, MiniRocket, metrics=accuracy)
learn.fit_one_cycle(1, 1e-2)
TSClassification is a vectorized (faster) version of Categorize.
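For example, both transforms can be used in the same slot; TSClassification simply encodes the labels in a vectorized way:

tfms = [None, TSClassification()]  # vectorized label encoding (faster)
# is equivalent in effect to:
tfms = [None, Categorize()]        # fastai's original transform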
Thank you for your reply, @oguiza !
Now I am trying MiniRocket as you suggested. I have some additional questions:
1) In my case, which normalization method should I use? Your example code above does not seem to use any particular normalization. Or do you mean I should just use the default setting?
batch_tfms = TSStandardize()
2) My data is very large and also imbalanced. Do you suggest any options for this case? In other threads I found the 'weight' option; would that work for me?
3) Using metrics=accuracy always works, but when I tried metrics=F1Score, I got this error:
~/.local/lib/python3.8/site-packages/fastai/learner.py in accumulate(self, learn)
471 def accumulate(self, learn):
472 bs = find_bs(learn.yb)
--> 473 self.total += learn.to_detach(self.func(learn.pred, *learn.yb))*bs
474 self.count += bs
475 @property
TypeError: Exception occured in `Recorder` when calling event `after_batch`:
unsupported operand type(s) for *: 'AccumMetric' and 'int'
I'd like to use other metrics such as F1-score and precision. Which option is correct? (My case is binary or multi-class classification, not multi-label classification.)
Thank you in advance.
In my case, which normalization method should I use? Your example code above does not seem to use any particular normalization. Or do you mean I should just use the default setting?
The tutorial states this: "Online feature calculation: MiniRocket can also be used online, re-calculating the features of each minibatch. In this scenario, you do not calculate fixed features one time. The online mode is a bit slower than the offline scenario but offers more flexibility. Here are some potential uses: You can experiment with different scaling techniques (no standardization, standardize by sample, normalize, etc.)."
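For example, these are some of the scaling options you could try when building the dataloaders (a sketch; keep only one batch_tfms line active, and check the tsai docs for the exact defaults):

batch_tfms = TSStandardize()                # standardize using training statistics (default)
batch_tfms = TSStandardize(by_sample=True)  # standardize each sample independently
batch_tfms = TSNormalize()                  # scale values to a fixed range instead
batch_tfms = None                           # no scaling at all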
My data is very large and also imbalanced.
You could try passing a weight to the loss function. tsai provides the calculated weights as dls.cws (class weights). You can, for example, use them this way:
loss = nn.CrossEntropyLoss(weight=dls.cws)
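Putting it together, a minimal sketch (reusing the dls and MiniRocket setup from the example above):

loss_func = nn.CrossEntropyLoss(weight=dls.cws)  # class weights computed by tsai
learn = ts_learner(dls, MiniRocket, loss_func=loss_func, metrics=accuracy)
learn.fit_one_cycle(1, 1e-2)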
Using metrics=accuracy always works, but when I tried metrics=F1Score, an error message appeared.
You should take a look at fastai's metrics (docs). All the metrics you mention come from fastai; they are not tsai metrics.
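Note that these metrics are factories that must be instantiated: passing the bare name (metrics=F1Score) is what produces the AccumMetric error above. A sketch for a multi-class case (check the fastai docs for the averaging options):

learn = ts_learner(dls, MiniRocket,
                   metrics=[accuracy, F1Score(average='macro'), Precision(average='macro')])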
I am trying ROCKET on my own dataset, which consists of over 200K samples, on a GPU (Pytorch).
Following the tutorial notebook, I tried training and validation, but I'm confused about some parts of my code. Here is my code:
Let's assume the dataset is already prepared. The data and labels are originally divided into train and valid sets, respectively.
I created the splits variable:
splits = get_predefined_splits(data_train, data_valid)
and then I concatenated them to use the get_ts_dls function :)
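i.e., something like this (label_train and label_valid are my own variable names):

X = np.concatenate([data_train, data_valid])   # train samples first, so the splits indices line up
y = np.concatenate([label_train, label_valid])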
HERE Q1: What is the difference between 'Categorize()' and 'TSClassification()'?
tfms = [None, [Categorize()]]
HERE Q2: Because the number of my samples is so high, I cannot train with RidgeClassifierCV, so I'd like to use SGDClassifier, LogisticRegression, or another classifier. In this case, should I set by_var=True here? Or should I set by_sample=True here and additionally standardize by_var prior to calling the classifier?
batch_tfms = [TSStandardize(by_var=True)]
HERE Q3 (related to Q2): to use other classifiers (not Ridge), where should I normalize my data per feature? The tutorial is a little confusing. From the tutorial, the original data SHOULD be normalized by sample, and the transformed data (here, the rocket features X_train and X_valid) SHOULD also be normalized by variable (per feature), which can be done either with TSStandardize(by_var=True) or by calculating the mean and std and manually subtracting and dividing.
I am confused about which way is correct. Could you explain and guide me?
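To make my question concrete, here is my current understanding of the offline pipeline as a sketch (based on the tutorial; the variable names and details are my own guesses, please correct me if this is wrong):

# 1) standardize the raw series per sample when building the dataloaders
batch_tfms = [TSStandardize(by_sample=True)]
dls = get_ts_dls(X, y, splits=splits, tfms=tfms, batch_tfms=batch_tfms)

# 2) compute the ROCKET features offline
model = ROCKET(dls.vars, dls.len)              # c_in, seq_len
X_train = create_rocket_features(dls.train, model)
X_valid = create_rocket_features(dls.valid, model)

# 3) standardize the features per feature, using train statistics only
f_mean, f_std = X_train.mean(axis=0), X_train.std(axis=0) + 1e-8
X_train = (X_train - f_mean) / f_std
X_valid = (X_valid - f_mean) / f_std

# 4) fit a scalable linear classifier on the normalized features
from sklearn.linear_model import SGDClassifier
clf = SGDClassifier(loss='log_loss')           # logistic regression via SGD ('log' on older sklearn)
clf.fit(X_train, y[splits[0]])
print(clf.score(X_valid, y[splits[1]]))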
Thank you in advance.