timeseriesAI / tsai

State-of-the-art Deep Learning library for Time Series and Sequences in Pytorch / fastai
https://timeseriesai.github.io/tsai/
Apache License 2.0

0.3.5 version error: exception occured in `Recorder` when calling event `after_batch` #703

Closed jrfackler closed 1 year ago

jrfackler commented 1 year ago

Hello,

Has anyone seen this error with the current version of tsai?

This error only occurs when using version 0.3.5; the previous version, 0.3.4, does not produce it. A quick Google search suggests this may be related to metrics, but I have metrics=None. I'm using a custom loss function.

n_epochs = n_epochs
tfms = [None, [TSMaskOut(magnitude=0.0)]]
dsets = TSDatasets(X, y, tfms=tfms, splits=splits, inplace=True)
dls = TSDataLoaders.from_dsets(dsets.train, dsets.valid, splits=splits, tfms=tfms,
                               inplace=True, bs=[96, val_size], batch_tfms=None)
del dsets
gc.collect()

model = TSTPlus(dls.vars, 25, dls.len, d_model=512, d_ff=512, n_layers=n_layers, n_heads=n_heads, dropout=dropout, pe=pe, fc_dropout=fc_dropout, attn_dropout=0.0)

learn = Learner(dls, model, loss_func=CustomLoss1(), metrics=None, opt_func=SGD, cbs=ShowGraphCallback2())

start = time.time()

with ContextManagers([learn.no_logging(), learn.no_bar()]):
    learn.fit_one_cycle(n_epochs, lr_max=lr_max)

This is the error stack:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Input In [37], in <cell line: 18>()
     16 start = time.time()
     18 with ContextManagers([learn.no_logging(), learn.no_bar()]): 
---> 19     learn.fit_one_cycle(n_epochs, lr_max=lr_max)
     20 print('\nElapsed time:', time.time() - start)
     21 learn.plot_metrics()

File /usr/local/lib/python3.9/dist-packages/fastai/callback/schedule.py:119, in fit_one_cycle(self, n_epoch, lr_max, div, div_final, pct_start, wd, moms, cbs, reset_opt, start_epoch)
    116 lr_max = np.array([h['lr'] for h in self.opt.hypers])
    117 scheds = {'lr': combined_cos(pct_start, lr_max/div, lr_max, lr_max/div_final),
    118           'mom': combined_cos(pct_start, *(self.moms if moms is None else moms))}
--> 119 self.fit(n_epoch, cbs=ParamScheduler(scheds)+L(cbs), reset_opt=reset_opt, wd=wd, start_epoch=start_epoch)

File /usr/local/lib/python3.9/dist-packages/fastai/learner.py:264, in Learner.fit(self, n_epoch, lr, wd, cbs, reset_opt, start_epoch)
    262 self.opt.set_hypers(lr=self.lr if lr is None else lr)
    263 self.n_epoch = n_epoch
--> 264 self._with_events(self._do_fit, 'fit', CancelFitException, self._end_cleanup)

File /usr/local/lib/python3.9/dist-packages/fastai/learner.py:199, in Learner._with_events(self, f, event_type, ex, final)
    198 def _with_events(self, f, event_type, ex, final=noop):
--> 199     try: self(f'before_{event_type}');  f()
    200     except ex: self(f'after_cancel_{event_type}')
    201     self(f'after_{event_type}');  final()

File /usr/local/lib/python3.9/dist-packages/fastai/learner.py:253, in Learner._do_fit(self)
    251 for epoch in range(self.n_epoch):
    252     self.epoch=epoch
--> 253     self._with_events(self._do_epoch, 'epoch', CancelEpochException)

File /usr/local/lib/python3.9/dist-packages/fastai/learner.py:199, in Learner._with_events(self, f, event_type, ex, final)
    198 def _with_events(self, f, event_type, ex, final=noop):
--> 199     try: self(f'before_{event_type}');  f()
    200     except ex: self(f'after_cancel_{event_type}')
    201     self(f'after_{event_type}');  final()

File /usr/local/lib/python3.9/dist-packages/fastai/learner.py:247, in Learner._do_epoch(self)
    246 def _do_epoch(self):
--> 247     self._do_epoch_train()
    248     self._do_epoch_validate()

File /usr/local/lib/python3.9/dist-packages/fastai/learner.py:239, in Learner._do_epoch_train(self)
    237 def _do_epoch_train(self):
    238     self.dl = self.dls.train
--> 239     self._with_events(self.all_batches, 'train', CancelTrainException)

File /usr/local/lib/python3.9/dist-packages/fastai/learner.py:199, in Learner._with_events(self, f, event_type, ex, final)
    198 def _with_events(self, f, event_type, ex, final=noop):
--> 199     try: self(f'before_{event_type}');  f()
    200     except ex: self(f'after_cancel_{event_type}')
    201     self(f'after_{event_type}');  final()

File /usr/local/lib/python3.9/dist-packages/fastai/learner.py:205, in Learner.all_batches(self)
    203 def all_batches(self):
    204     self.n_iter = len(self.dl)
--> 205     for o in enumerate(self.dl): self.one_batch(*o)

File /usr/local/lib/python3.9/dist-packages/tsai/learner.py:40, in one_batch(self, i, b)
     38 b_on_device = to_device(b, device=self.dls.device) if self.dls.device is not None else b
     39 self._split(b_on_device)
---> 40 self._with_events(self._do_one_batch, 'batch', CancelBatchException)

File /usr/local/lib/python3.9/dist-packages/fastai/learner.py:201, in Learner._with_events(self, f, event_type, ex, final)
    199 try: self(f'before_{event_type}');  f()
    200 except ex: self(f'after_cancel_{event_type}')
--> 201 self(f'after_{event_type}');  final()

File /usr/local/lib/python3.9/dist-packages/fastai/learner.py:172, in Learner.__call__(self, event_name)
--> 172 def __call__(self, event_name): L(event_name).map(self._call_one)

File /usr/local/lib/python3.9/dist-packages/fastcore/foundation.py:156, in L.map(self, f, *args, **kwargs)
--> 156 def map(self, f, *args, **kwargs): return self._new(map_ex(self, f, *args, gen=False, **kwargs))

File /usr/local/lib/python3.9/dist-packages/fastcore/basics.py:840, in map_ex(iterable, f, gen, *args, **kwargs)
    838 res = map(g, iterable)
    839 if gen: return res
--> 840 return list(res)

File /usr/local/lib/python3.9/dist-packages/fastcore/basics.py:825, in bind.__call__(self, *args, **kwargs)
    823     if isinstance(v,_Arg): kwargs[k] = args.pop(v.i)
    824 fargs = [args[x.i] if isinstance(x, _Arg) else x for x in self.pargs] + args[self.maxi+1:]
--> 825 return self.func(*fargs, **kwargs)

File /usr/local/lib/python3.9/dist-packages/fastai/learner.py:176, in Learner._call_one(self, event_name)
    174 def _call_one(self, event_name):
    175     if not hasattr(event, event_name): raise Exception(f'missing {event_name}')
--> 176     for cb in self.cbs.sorted('order'): cb(event_name)

File /usr/local/lib/python3.9/dist-packages/fastai/callback/core.py:62, in Callback.__call__(self, event_name)
     60     try: res = getcallable(self, event_name)()
     61     except (CancelBatchException, CancelBackwardException, CancelEpochException, CancelFitException, CancelStepException, CancelTrainException, CancelValidException): raise
---> 62     except Exception as e: raise modify_exception(e, f'Exception occured in `{self.__class__.__name__}` when calling event `{event_name}`:\n\t{e.args[0]}', replace=True)
     63 if event_name=='after_fit': self.run=True #Reset self.run to True at each end of fit
     64 return res

File /usr/local/lib/python3.9/dist-packages/fastai/callback/core.py:60, in Callback.__call__(self, event_name)
     58 res = None
     59 if self.run and _run: 
---> 60     try: res = getcallable(self, event_name)()
     61     except (CancelBatchException, CancelBackwardException, CancelEpochException, CancelFitException, CancelStepException, CancelTrainException, CancelValidException): raise
     62     except Exception as e: raise modify_exception(e, f'Exception occured in `{self.__class__.__name__}` when calling event `{event_name}`:\n\t{e.args[0]}', replace=True)

File /usr/local/lib/python3.9/dist-packages/fastai/learner.py:560, in Recorder.after_batch(self)
    558 if len(self.yb) == 0: return
    559 mets = self._train_mets if self.training else self._valid_mets
--> 560 for met in mets: met.accumulate(self.learn)
    561 if not self.training: return
    562 self.lrs.append(self.opt.hypers[-1]['lr'])

File /usr/local/lib/python3.9/dist-packages/fastai/learner.py:509, in AvgSmoothLoss.accumulate(self, learn)
    507 def accumulate(self, learn):
    508     self.count += 1
--> 509     self.val = torch.lerp(to_detach(learn.loss.mean()), self.val, self.beta)

File /usr/local/lib/python3.9/dist-packages/fastai/torch_core.py:372, in TensorBase.__torch_function__(cls, func, types, args, kwargs)
    370 if cls.debug and func.__name__ not in ('__str__','__repr__'): print(func, types, args, kwargs)
    371 if _torch_handled(args, cls._opt, func): types = (torch.Tensor,)
--> 372 res = super().__torch_function__(func, types, args, ifnone(kwargs, {}))
    373 dict_objs = _find_args(args) if args else _find_args(list(kwargs.values()))
    374 if issubclass(type(res),TensorBase) and dict_objs: res.set_meta(dict_objs[0],as_copy=True)

File /usr/local/lib/python3.9/dist-packages/torch/_tensor.py:1121, in Tensor.__torch_function__(cls, func, types, args, kwargs)
   1118     return NotImplemented
   1120 with _C.DisableTorchFunction():
-> 1121     ret = func(*args, **kwargs)
   1122     if func in get_default_nowrap_functions():
   1123         return ret

RuntimeError: Exception occured in `Recorder` when calling event `after_batch`:
    expected dtype double for `end` but got dtype float
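
For reference, the failing call in the last frames is Recorder's smoothed-loss tracker (AvgSmoothLoss) lerping each new batch loss into a float32 running value. A minimal sketch of the mismatch follows; the tensor values are assumptions and only the dtypes matter. The PyTorch version in the traceback raises here, though newer releases may type-promote instead:

import torch

batch_loss = torch.tensor(0.42, dtype=torch.float64)  # double, as from float64 X/y
smoothed   = torch.tensor(0.0)                        # float32 running value
torch.lerp(batch_loss, smoothed, 0.98)
# RuntimeError: expected dtype double for `end` but got dtype float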

oguiza commented 1 year ago

Hi @jrfackler, I'm not exactly sure what you are trying to achieve with tfms = [None, [TSMaskOut(magnitude=0.0)]], but that's an incorrect use of a batch transform. If your task is a classification task, you should use tfms=[None, TSClassification()]. That way the dependent variable (y) will be transformed into an int indicating one of the classes. You can pass a vocab if you need it. TSMaskOut is a batch transform, meant to be applied to the independent variables (X). Any transform with a magnitude of 0 has no effect. Please let me know if that solves your issue.
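
A sketch of the split described here, using the names from the thread (the nonzero magnitude and batch sizes are arbitrary assumptions):

tfms = [None, TSClassification()]        # item tfms: [X tfms, y tfms]; y becomes class ints
batch_tfms = [TSMaskOut(magnitude=0.1)]  # batch tfm, applied to X only
dsets = TSDatasets(X, y, tfms=tfms, splits=splits, inplace=True)
dls = TSDataLoaders.from_dsets(dsets.train, dsets.valid, bs=[96, 128],
                               batch_tfms=batch_tfms)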

jrfackler commented 1 year ago

Hi @oguiza,

Thanks for the reply and suggestion. That piece of code was an old relic: I think I used it to mask out variables in a non-classification problem, but it didn't improve the results, so I had set the magnitude to 0.0 and forgotten about it.

I removed that code and I still get the same error when running version 0.3.5. If I install 0.3.4 and then install 0.3.5 on top of it, there is no error.

But with just 0.3.5, I get the same error.

When I search for the error, the only suggestion I find is that it could be related to metrics. I'm not measuring accuracy, so I have metrics set to None and don't want any metrics. Could that be an issue with the new version of tsai?

This is the modified code with the tfms removed:

n_epochs = n_epochs
dsets = TSDatasets(X, y, splits=splits, inplace=True)
dls   = TSDataLoaders.from_dsets(dsets.train, dsets.valid, splits=splits, inplace=True, bs=[96, val_size])
del dsets
gc.collect()

model = TSTPlus(dls.vars, 25, dls.len, d_model=512, d_ff=512, n_layers=n_layers,
           n_heads=n_heads, dropout=dropout, pe=pe, fc_dropout=fc_dropout, attn_dropout=0.0)

learn = Learner(dls, model, loss_func=CustomLoss1(), metrics=None, opt_func=SGD, cbs=ShowGraphCallback2())

start = time.time()

with ContextManagers([learn.no_logging(), learn.no_bar()]): 
    learn.fit_one_cycle(n_epochs, lr_max=lr_max)

oguiza commented 1 year ago

Hi @jrfackler, I've tried to reproduce the error but failed. I'm not sure what the task is (classification, regression, forecasting, ...), and I don't know the X and y dtypes. But you are not using any transform. Can you please run check_data(X, y, splits, show_plot=False) and share the output?

jrfackler commented 1 year ago

Hi @oguiza,

This is the output from check_data(X, y, splits, show_plot=False):

X      - shape: [99 samples x 300 features x 35 timesteps]  type: ndarray  dtype:float64  isnan: 0
y      - shape: (99, 25)  type: ndarray  dtype:float64  isnan: 0
splits - n_splits: 2 shape: [79, 20]  overlap: False

I'm attaching a link to the X and y datasets (it doesn't seem like I can attach the datasets directly to this comment), and the code to reproduce the error is below, if that helps. It runs to completion with 0.3.4 but fails with 0.3.5.

link to folder with datasets

Thank you very much for looking at this.

#!pip install tsai==0.3.4
!pip install tsai==0.3.5

from tsai.all import *
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import gc
%matplotlib inline

my_setup()

X_dataset = np.load('/notebooks/X_dataset_test.npy')
Y_dataset = np.load('/notebooks/y_dataset_test.npy')
print(X_dataset.shape)

samples = X_dataset.shape[0]
n_split = round(0.8 * samples)
X_train = X_dataset[0:n_split, :, :]
y_train = Y_dataset[0:n_split]
X_valid = X_dataset[n_split:, :, :]
y_valid = Y_dataset[n_split:]

print(X_dataset.shape)
print(Y_dataset.shape)
print(X_train.shape)
print(y_train.shape)
print(X_valid.shape)
print(y_valid.shape)
X, y, splits = combine_split_data([X_train, X_valid], [y_train, y_valid])

class CustomLoss1(nn.Module):

    def __init__(self):
        super().__init__()

    def forward(self, input: Tensor, target: Tensor) -> Tensor:
        # Scale each sample's predictions so their absolute values sum to 1
        abs_input = torch.abs(input)
        sum_input = torch.sum(abs_input, dim=1, keepdim=True)
        scaled_input = torch.div(input, sum_input)
        # Per-sample score: dot product of scaled predictions and targets
        values = torch.mul(scaled_input, target)
        dp = torch.sum(values, dim=1)
        adp = torch.mean(dp)
        # Downside deviation over the negative samples
        neg_days = torch.lt(dp, 0.0)
        neg_dp = torch.mul(dp, neg_days)  # renamed from `np` to avoid shadowing numpy
        nps = torch.square(neg_dp)
        npa = torch.mean(nps)
        neg_stdev = torch.sqrt(npa)
        neg_stdev_percent = torch.mul(neg_stdev, 100.0)
        # Annualize the mean score (252 periods) and combine both terms
        adp_plus1 = torch.add(adp, 1.0)
        yearly_total = torch.pow(adp_plus1, 252.0)
        yp = torch.sub(yearly_total, 1.0)
        yearly_percent = torch.mul(yp, -100.0)
        yearly_neg_stdev = torch.pow(torch.mul(neg_stdev_percent, 19.105), 1.0)
        loss = torch.add(yearly_percent, yearly_neg_stdev)

        return loss
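
# Hedged sanity check (an added sketch, not from the thread): run the loss on
# random data with the shapes reported by check_data (n x 25). The 0.01 scale
# on the targets is an assumed, returns-like magnitude that keeps the
# pow(..., 252.0) term finite.
_inp = torch.randn(8, 25)
_tgt = torch.randn(8, 25) * 0.01
print(CustomLoss1()(_inp, _tgt))  # prints a scalar float32 tensor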

n_epochs = 10
n_layers = 2
n_heads = 8
dropout = 0.3
fc_dropout = 0.8
lr_max = 3.5e-7
pe = 'sincos'
val_size = X_valid.shape[0]

n_epochs = n_epochs
dsets = TSDatasets(X, y, splits=splits, inplace=True)
dls   = TSDataLoaders.from_dsets(dsets.train, dsets.valid, splits=splits, inplace=True, bs=[96, val_size])

model = TSTPlus(dls.vars, 25, dls.len, d_model=512, d_ff=512, n_layers=n_layers,
           n_heads=n_heads, dropout=dropout, pe=pe, fc_dropout=fc_dropout, attn_dropout=0.0)

learn = Learner(dls, model, loss_func=CustomLoss1(), metrics=None, opt_func=SGD, cbs=ShowGraphCallback2())

start = time.time()

check_data(X, y, splits, show_plot=False)

with ContextManagers([learn.no_logging(), learn.no_bar()]): 
    learn.fit_one_cycle(n_epochs, lr_max=lr_max)

oguiza commented 1 year ago

Hi @jrfackler, you are using a classification task with 25 classes, but y is a float; that is the issue. You need to convert y to int, which is handled by a tfm (TSClassification()). You can adapt (and simplify) your code using something like:

tfms = [None, TSClassification()]
arch_config = dict(d_model=512, d_ff=512, n_layers=n_layers,
           n_heads=n_heads, dropout=dropout, pe=pe, fc_dropout=fc_dropout, attn_dropout=0.0)
learn = TSClassifier(X, y, splits=splits, tfms=tfms, arch="TSTPlus", arch_config=arch_config, 
                     loss_func=CustomLoss1(), opt_func=SGD, metrics=None, cbs=ShowGraph())
with ContextManagers([learn.no_logging(), learn.no_bar()]): 
    learn.fit_one_cycle(n_epochs, lr_max=lr_max)
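
(Presumably the fix works because TSClassification() turns the float64 y into integer class labels, so the batch loss comes out float32 and matches the float32 running value in the lerp call from the traceback above.)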

jrfackler commented 1 year ago

Hi @oguiza,

Thank you. Your suggestion helped fix my training issue on the new version.

While it's not a classification task, I used your code with TSRegressor swapped in for TSClassifier, and the training worked. I had been using code I took from a tutorial notebook a couple of years ago, but I see that recent versions of tsai use a different format.
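
A sketch of that swap, assuming the same arch_config as in oguiza's snippet (TSRegression() is the regression counterpart of TSClassification() and formats y as a float target):

tfms = [None, TSRegression()]
learn = TSRegressor(X, y, splits=splits, tfms=tfms, arch="TSTPlus",
                    arch_config=arch_config, loss_func=CustomLoss1(),
                    opt_func=SGD, metrics=None, cbs=ShowGraph())
with ContextManagers([learn.no_logging(), learn.no_bar()]):
    learn.fit_one_cycle(n_epochs, lr_max=lr_max)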

One interesting thing about using TSRegressor with the new version of tsai is that the hyperparameters need to be adjusted to get results on the validation set similar to those from my previous code and the old version of tsai.

This created one more issue, though, that I can't find an answer to. When doing inference using the code: _, _, input = learn.get_X_preds(X_valid, with_decoded=True)

The variable input is now a fastai.torch_core.TitledTuple, whereas before I think it was just a Tensor, and when I try to use torch functions on it I get errors saying it needs to be a Tensor.

Do you know how to convert it back to a Tensor? Actually, I realized I can just use torch.tensor(tuple) to convert it, so no more questions. Thanks for your help!
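
A sketch of that conversion, mirroring the call from the thread (the variable names are assumptions):

_, _, decoded = learn.get_X_preds(X_valid, with_decoded=True)
preds = torch.tensor(decoded)  # TitledTuple of numbers -> plain Tensor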

oguiza commented 1 year ago

I'll close this issue based on feedback received.