Unwindowed datasets: some clarification required

Hi @oguiza

I am trying to solve a binary classification problem using tsai. My dataset is fairly large, and I cannot use apply_sliding_window on it directly because I run out of memory (OOM). That is why I am now trying the TSUnwindowedDataset[s] classes, but I have doubts on several points about whether I am using them correctly.

For the following example I took just a part of the full dataset, so the exact shape of this slice does not matter; it is shown only for reference.

df.shape
# (2358720, 7)

Now I extract the features and the target from the slice:

X = df.drop(columns=['time', 'target']).values
y = df['target'].values
type(X), type(y), X.shape, y.shape
# (numpy.ndarray, numpy.ndarray, (2358720, 5), (2358720,))
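For a sense of why materialising every window runs out of memory, here is a rough back-of-the-envelope sketch. It assumes float64 values and a stride of 1 (neither is stated above), and it uses NumPy's `sliding_window_view` only to illustrate that a lazy view needs no extra memory; it is not how tsai implements its unwindowed datasets.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Shape of the slice above; window size as used later in the post.
n_rows, n_features, window = 2_358_720, 5, 50
bytes_per_value = 8  # assumption: float64

n_windows = n_rows - window + 1  # assumption: stride 1
raw_gb = n_rows * n_features * bytes_per_value / 1e9
materialised_gb = n_windows * n_features * window * bytes_per_value / 1e9
print(f"raw array: {raw_gb:.2f} GB, all windows copied: {materialised_gb:.2f} GB")

# A strided view exposes the same windows without copying the data.
X_small = np.arange(20, dtype=np.float64).reshape(10, 2)
windows = sliding_window_view(X_small, window_shape=4, axis=0)
print(windows.shape)                        # (n_windows, n_features, window)
print(np.shares_memory(windows, X_small))   # the view reuses X_small's buffer
```

The copied windows are roughly `window` times larger than the raw array, which is the blow-up that building windows on demand avoids.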

Checking the target is binary:

pd.Series(y).value_counts()
# 0    1829729
# 1     528991
# Name: count, dtype: int64

Now the tsai library kicks in:

computer_setup()
# os              : Linux-5.9.16-050916-lowlatency-x86_64-with-glibc2.31
# python          : 3.12.3
# tsai            : 0.3.9
# fastai          : 2.7.15
# fastcore        : 1.5.38
# torch           : 2.2.2+cu121
# device          : 1 gpu (['NVIDIA GeForce GTX 1080 Ti'])
# cpu cores       : 16
# threads per cpu : 2
# RAM             : 31.31 GB
# GPU memory      : [11.0] GB

After I've created splits, I create instances of the TSUnwindowedDataset and TSUnwindowedDatasets classes:

WINDOW_SIZE = 50

def my_y_func(y_):
    return y_[:,-1] # I need only the last item from the window of targets

ds = TSUnwindowedDataset(X=X, y=y, y_func=my_y_func, window_size=WINDOW_SIZE, seq_first=True)

dsets = TSUnwindowedDatasets(ds, splits=splits)

dls = TSDataLoaders.from_dsets(dsets.train, dsets.valid, dsets[2], # including the test part of the dataset
                                  bs=256, 
                                  shuffle_train=False,
                                  batch_tfms=TSStandardize(by_sample=True)
                                  )
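To make the intent of `my_y_func` concrete, here is a small NumPy-only sketch. It assumes that the function receives a 2-D batch of target windows with shape (batch, window_size); the toy `y` and the window size of 3 are made up for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

WINDOW_SIZE = 3                       # toy value, not the 50 used above
y = np.array([0, 0, 1, 1, 0, 1])      # made-up targets

def my_y_func(y_):
    # Keep only the last target of every window in the batch.
    return y_[:, -1]

# Assumption: a batch of target windows has shape (batch, window_size).
y_windows = sliding_window_view(y, WINDOW_SIZE)
window_targets = my_y_func(y_windows)

print(y_windows)
print(window_targets)
```

With a stride of 1, the result equals `y[WINDOW_SIZE - 1:]`, i.e. each window is labelled with the target of its last row.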

And here is the first point:

dls.vars, dls.c
# (5, 1)
#     ^
#     expected 2 for binary classification

The class count is 1 instead of the expected 2 for binary classification. If I try to create a model and train it:

model = TST(dls.vars, dls.c, dls.len, dropout=0.3, fc_dropout=0.3)

cbs = [
    # does not matter
]

learn = Learner(dls, model, metrics=[RocAucBinary(), accuracy], cbs=cbs)
learn.lr_find()

I get the following error:

../aten/src/ATen/native/cuda/Loss.cu:240: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [1,0,0] Assertion `t >= 0 && t < n_classes` failed.
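The assertion says that every target index must satisfy 0 <= t < n_classes. The sketch below reproduces that precondition with a hand-written NumPy cross-entropy (not the PyTorch implementation): with a single output column, a target of 1 is out of range, while two columns accept both 0 and 1.

```python
import numpy as np

def cross_entropy(logits, targets):
    """Mean negative log-likelihood; targets are integer class indices."""
    n_classes = logits.shape[1]
    # Same precondition as the failing CUDA assertion: 0 <= t < n_classes.
    if targets.min() < 0 or targets.max() >= n_classes:
        raise ValueError(f"target out of range for {n_classes} class(es)")
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

rng = np.random.default_rng(0)
targets = np.array([0, 1, 1, 0])
logits_one = rng.normal(size=(4, 1))  # shape a model built with 1 output emits
logits_two = rng.normal(size=(4, 2))  # shape a model built with 2 outputs emits

try:
    cross_entropy(logits_one, targets)
    one_output_ok = True
except ValueError as err:
    one_output_ok = False
    print("1 output column:", err)

loss_two = cross_entropy(logits_two, targets)
print("2 output columns: loss =", loss_two)
```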

This approach differs from the sample notebook, where a transformation is used for the target:

# .....
tfms  = [None, [Categorize()]] # <---------- makes `dls.c` equal to `2`
dsets = TSDatasets(X, y, tfms=tfms, splits=splits)
# .....

However, TSUnwindowedDataset does not have such functionality.

How do I properly pass the target to the data loaders in that case?

As a temporary workaround, I have tried to train the model like this:

model = TST(dls.vars, max(2, dls.c), dls.len, dropout=0.3, fc_dropout=0.3)
#                            ^
cbs = [
    # does not matter
]

learn = Learner(dls, model, metrics=[RocAucBinary(), accuracy], cbs=cbs)
learn.fit_one_cycle(100, 1e-4)

This code trains the model, and I even get pretty good-looking charts at the end (screenshot: Xnip2024-05-31_19-32-13).

But here is the second point: I don't know how to properly interpret predictions.

probas, *_, labels = learn.get_preds(dl=dls.valid, with_decoded=True)
labels_ = probas.argmax(dim=1)
test_eq(labels_, labels)     # OK

As for my target, y[i] == 1 is good and 0 is bad. But what does labels[i] == 1 mean? It could mean the same as my target, but since the predictions are returned as probabilities of shape (N, 2), I suspect it might mean the opposite.

To check this, I wrote a helper function:
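For reference, with a cross-entropy style loss the integer targets act as column indices into the model output, so column 1 of the (N, 2) probabilities would correspond to target 1. That holds only under the assumption that the raw 0/1 targets reach the loss unchanged (no Categorize or other re-mapping). A NumPy sketch with made-up logits:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the class axis.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Made-up logits for three samples and two classes.
logits = np.array([[2.0, -1.0],
                   [0.1, 1.5],
                   [-0.3, 0.2]])

probas = softmax(logits)          # shape (N, 2), each row sums to 1
labels = probas.argmax(axis=1)    # index of the most probable column
p_class_1 = probas[:, 1]          # probability assigned to target value 1

print(probas)
print(labels)
```

In the two-class case, `labels[i] == 1` is the same statement as `p_class_1[i] > 0.5`.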

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix

def check_predictions(dls, learner, idx=None, invert:bool = False):
    dl = dls.valid if idx is None else dls[idx]
    probas, *_, labels = learner.get_preds(dl=dl, with_decoded=True)

    y_true = dl.y[dl.split]
    y_pred = labels.cpu().numpy()
    if invert:
        y_pred = 1 - y_pred

    print('ROC_AUC:   ', roc_auc_score(y_true, y_pred))
    print('F1:        ', f1_score(y_true, y_pred))
    print('Accuracy:  ', accuracy_score(y_true, y_pred))
    print('Precision: ', precision_score(y_true, y_pred))
    print(confusion_matrix(y_true, y_pred))
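One thing worth ruling out in `y_true = dl.y[dl.split]` is an offset between window indices and row indices. The sketch below assumes that split indices refer to window start positions with a stride of 1, which has not been verified against the tsai source; `valid_idx` and the toy `y` are hypothetical.

```python
import numpy as np

WINDOW_SIZE = 3                              # toy value, not the 50 used above
y = np.array([0, 0, 1, 1, 0, 1, 1, 0])       # made-up raw targets
valid_idx = np.array([3, 4, 5])              # hypothetical window indices

# Indexing the raw targets with window indices gives the label of each
# window's FIRST row.
y_first_row = y[valid_idx]

# A y_func that returns y_[:, -1] labels each window with its LAST row,
# which sits WINDOW_SIZE - 1 rows later.
y_last_row = y[valid_idx + WINDOW_SIZE - 1]

print(y_first_row)
print(y_last_row)
```

If the two arrays differ on the real data, metrics computed against the first-row labels would not match what the model was trained on.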

...and tried it both ways:

check_predictions(dls, learn, invert=False) # use predicted labels as they are
# ROC_AUC:    0.5123604793431208
# F1:         0.48825168804096614
# Accuracy:   0.41867109378016293
# Precision:  0.34949275421082937
# [[21967 80216]
#  [10126 43097]]

check_predictions(dls, learn, invert=True)
# ROC_AUC:    0.4876395206568792
# F1:         0.23737634206948285
# Accuracy:   0.5813289062198371
# Precision:  0.31552051849312934
# [[80216 21967]
#  [43097 10126]]

And here is the third point: I cannot reproduce a validation ROC AUC score anywhere near the one displayed on the chart. Either way, when I compare the predicted labels to my target on the validation subset, I get ROC AUC ~0.5, but the chart shows 0.75. Why does that happen? What am I missing?
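One possible contributor, independent of tsai: ROC AUC computed from hard 0/1 labels generally differs from ROC AUC computed from probabilities, because thresholding discards the ranking. An assumption worth checking against the fastai source is that RocAucBinary scores the class-1 probability rather than the argmax labels. The sketch uses a hand-written pairwise AUC and made-up numbers, not the data above.

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly;
    ties count as half a correct pair."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

# Made-up targets and class-1 probabilities.
y_true = np.array([0, 0, 1, 1, 0, 1])
p_class_1 = np.array([0.10, 0.40, 0.35, 0.80, 0.45, 0.48])
hard_labels = (p_class_1 > 0.5).astype(int)   # what argmax yields for 2 classes

auc_from_probas = pairwise_auc(y_true, p_class_1)
auc_from_labels = pairwise_auc(y_true, hard_labels)
print("AUC from probabilities:", auc_from_probas)
print("AUC from hard labels:  ", auc_from_labels)
```

Passing `probas[:, 1]` instead of the decoded labels to `roc_auc_score` in `check_predictions` would make the comparison with the chart like-for-like, under the assumption stated above.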

timeseriesAI / tsai

Unwindowed datasets: some clarification required #908