timeseriesAI / tsai

State-of-the-art Deep Learning library for Time Series and Sequences in Pytorch / fastai
https://timeseriesai.github.io/tsai/
Apache License 2.0

Multilabel TSClassification Tutorial Notebook Example is Broken #847

Open cversek opened 9 months ago

cversek commented 9 months ago

When running notebook 01a_MultiClass_MultiLabel_TSClassification.ipynb, the cell in the MultiLabel section fails with a ValueError (screenshot omitted).

Here is the full traceback:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[28], line 2
      1 learn = ts_learner(dls, InceptionTimePlus, metrics=accuracy_multi)
----> 2 learn.lr_find()

File ~/mambaforge/envs/neurovep_data/lib/python3.11/site-packages/fastai/callback/schedule.py:293, in lr_find(self, start_lr, end_lr, num_it, stop_div, show_plot, suggest_funcs)
    291 n_epoch = num_it//len(self.dls.train) + 1
    292 cb=LRFinder(start_lr=start_lr, end_lr=end_lr, num_it=num_it, stop_div=stop_div)
--> 293 with self.no_logging(): self.fit(n_epoch, cbs=cb)
    294 if suggest_funcs is not None:
    295     lrs, losses = tensor(self.recorder.lrs[num_it//10:-5]), tensor(self.recorder.losses[num_it//10:-5])

File ~/mambaforge/envs/neurovep_data/lib/python3.11/site-packages/fastai/learner.py:264, in Learner.fit(self, n_epoch, lr, wd, cbs, reset_opt, start_epoch)
    262 self.opt.set_hypers(lr=self.lr if lr is None else lr)
    263 self.n_epoch = n_epoch
--> 264 self._with_events(self._do_fit, 'fit', CancelFitException, self._end_cleanup)

File ~/mambaforge/envs/neurovep_data/lib/python3.11/site-packages/fastai/learner.py:199, in Learner._with_events(self, f, event_type, ex, final)
    198 def _with_events(self, f, event_type, ex, final=noop):
--> 199     try: self(f'before_{event_type}');  f()
    200     except ex: self(f'after_cancel_{event_type}')
    201     self(f'after_{event_type}');  final()

File ~/mambaforge/envs/neurovep_data/lib/python3.11/site-packages/fastai/learner.py:253, in Learner._do_fit(self)
    251 for epoch in range(self.n_epoch):
    252     self.epoch=epoch
--> 253     self._with_events(self._do_epoch, 'epoch', CancelEpochException)

File ~/mambaforge/envs/neurovep_data/lib/python3.11/site-packages/fastai/learner.py:199, in Learner._with_events(self, f, event_type, ex, final)
    198 def _with_events(self, f, event_type, ex, final=noop):
--> 199     try: self(f'before_{event_type}');  f()
    200     except ex: self(f'after_cancel_{event_type}')
    201     self(f'after_{event_type}');  final()

File ~/mambaforge/envs/neurovep_data/lib/python3.11/site-packages/fastai/learner.py:247, in Learner._do_epoch(self)
    246 def _do_epoch(self):
--> 247     self._do_epoch_train()
    248     self._do_epoch_validate()

File ~/mambaforge/envs/neurovep_data/lib/python3.11/site-packages/fastai/learner.py:239, in Learner._do_epoch_train(self)
    237 def _do_epoch_train(self):
    238     self.dl = self.dls.train
--> 239     self._with_events(self.all_batches, 'train', CancelTrainException)

File ~/mambaforge/envs/neurovep_data/lib/python3.11/site-packages/fastai/learner.py:199, in Learner._with_events(self, f, event_type, ex, final)
    198 def _with_events(self, f, event_type, ex, final=noop):
--> 199     try: self(f'before_{event_type}');  f()
    200     except ex: self(f'after_cancel_{event_type}')
    201     self(f'after_{event_type}');  final()

File ~/mambaforge/envs/neurovep_data/lib/python3.11/site-packages/fastai/learner.py:205, in Learner.all_batches(self)
    203 def all_batches(self):
    204     self.n_iter = len(self.dl)
--> 205     for o in enumerate(self.dl): self.one_batch(*o)

File ~/gitwork/timeseriesAI/tsai/tsai/learner.py:40, in one_batch(self, i, b)
     38 b_on_device = to_device(b, device=self.dls.device) if self.dls.device is not None else b
     39 self._split(b_on_device)
---> 40 self._with_events(self._do_one_batch, 'batch', CancelBatchException)

File ~/mambaforge/envs/neurovep_data/lib/python3.11/site-packages/fastai/learner.py:199, in Learner._with_events(self, f, event_type, ex, final)
    198 def _with_events(self, f, event_type, ex, final=noop):
--> 199     try: self(f'before_{event_type}');  f()
    200     except ex: self(f'after_cancel_{event_type}')
    201     self(f'after_{event_type}');  final()

File ~/mambaforge/envs/neurovep_data/lib/python3.11/site-packages/fastai/learner.py:219, in Learner._do_one_batch(self)
    217 self('after_pred')
    218 if len(self.yb):
--> 219     self.loss_grad = self.loss_func(self.pred, *self.yb)
    220     self.loss = self.loss_grad.clone()
    221 self('after_loss')

File ~/mambaforge/envs/neurovep_data/lib/python3.11/site-packages/fastai/losses.py:54, in BaseLoss.__call__(self, inp, targ, **kwargs)
     52 if targ.dtype in [torch.int8, torch.int16, torch.int32]: targ = targ.long()
     53 if self.flatten: inp = inp.view(-1,inp.shape[-1]) if self.is_2d else inp.view(-1)
---> 54 return self.func.__call__(inp, targ.view(-1) if self.flatten else targ, **kwargs)

File ~/mambaforge/envs/neurovep_data/lib/python3.11/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File ~/mambaforge/envs/neurovep_data/lib/python3.11/site-packages/torch/nn/modules/loss.py:720, in BCEWithLogitsLoss.forward(self, input, target)
    719 def forward(self, input: Tensor, target: Tensor) -> Tensor:
--> 720     return F.binary_cross_entropy_with_logits(input, target,
    721                                               self.weight,
    722                                               pos_weight=self.pos_weight,
    723                                               reduction=self.reduction)

File ~/mambaforge/envs/neurovep_data/lib/python3.11/site-packages/torch/nn/functional.py:3146, in binary_cross_entropy_with_logits(input, target, weight, size_average, reduce, reduction, pos_weight)
   3110 r"""Function that measures Binary Cross Entropy between target and input
   3111 logits.
   3112 
   (...)
   3143      >>> loss.backward()
   3144 """
   3145 if has_torch_function_variadic(input, target, weight, pos_weight):
-> 3146     return handle_torch_function(
   3147         binary_cross_entropy_with_logits,
   3148         (input, target, weight, pos_weight),
   3149         input,
   3150         target,
   3151         weight=weight,
   3152         size_average=size_average,
   3153         reduce=reduce,
   3154         reduction=reduction,
   3155         pos_weight=pos_weight,
   3156     )
   3157 if size_average is not None or reduce is not None:
   3158     reduction_enum = _Reduction.legacy_get_enum(size_average, reduce)

File ~/mambaforge/envs/neurovep_data/lib/python3.11/site-packages/torch/overrides.py:1551, in handle_torch_function(public_api, relevant_args, *args, **kwargs)
   1545     warnings.warn("Defining your `__torch_function__ as a plain method is deprecated and "
   1546                   "will be an error in future, please define it as a classmethod.",
   1547                   DeprecationWarning)
   1549 # Use `public_api` instead of `implementation` so __torch_function__
   1550 # implementations can do equality/identity comparisons.
-> 1551 result = torch_func_method(public_api, types, args, kwargs)
   1553 if result is not NotImplemented:
   1554     return result

File ~/mambaforge/envs/neurovep_data/lib/python3.11/site-packages/fastai/torch_core.py:382, in TensorBase.__torch_function__(cls, func, types, args, kwargs)
    380 if cls.debug and func.__name__ not in ('__str__','__repr__'): print(func, types, args, kwargs)
    381 if _torch_handled(args, cls._opt, func): types = (torch.Tensor,)
--> 382 res = super().__torch_function__(func, types, args, ifnone(kwargs, {}))
    383 dict_objs = _find_args(args) if args else _find_args(list(kwargs.values()))
    384 if issubclass(type(res),TensorBase) and dict_objs: res.set_meta(dict_objs[0],as_copy=True)

File ~/mambaforge/envs/neurovep_data/lib/python3.11/site-packages/torch/_tensor.py:1295, in Tensor.__torch_function__(cls, func, types, args, kwargs)
   1292     return NotImplemented
   1294 with _C.DisableTorchFunctionSubclass():
-> 1295     ret = func(*args, **kwargs)
   1296     if func in get_default_nowrap_functions():
   1297         return ret

File ~/mambaforge/envs/neurovep_data/lib/python3.11/site-packages/torch/nn/functional.py:3163, in binary_cross_entropy_with_logits(input, target, weight, size_average, reduce, reduction, pos_weight)
   3160     reduction_enum = _Reduction.get_enum(reduction)
   3162 if not (target.size() == input.size()):
-> 3163     raise ValueError("Target size ({}) must be the same as input size ({})".format(target.size(), input.size()))
   3165 return torch.binary_cross_entropy_with_logits(input, target, weight, pos_weight, reduction_enum)

ValueError: Target size (torch.Size([384])) must be the same as input size (torch.Size([2304]))
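
For reference, the mismatch factors cleanly: assuming the default batch size of 64, the flattened multilabel target has 64 * 6 = 384 elements while the flattened prediction has 64 * 6 * 6 = 2304, i.e. the model head appears to emit an extra 6-wide dimension (see the analysis further down this thread).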

I'm running this commit locally (currently head of main branch) and the output of my_setup() is:

os              : Linux-6.2.0-34-generic-x86_64-with-glibc2.37
python          : 3.11.3
tsai            : 0.3.8
fastai          : 2.7.12
fastcore        : 1.5.29
torch           : 2.0.1
device          : 1 gpu (['NVIDIA GeForce RTX 3090'])
cpu cores       : 24
threads per cpu : 1
RAM             : 125.53 GB
GPU memory      : [24.0] GB

I would be happy to spend quite a bit more effort in figuring out the right way to do this example. Thanks!

cversek commented 8 months ago

@oguiza @williamsdoug As I said, I'm fairly motivated to help fix this issue. I went back and ran this notebook with v0.3.7 and v0.3.6 and got the same ValueError: Target size (torch.Size([384])) must be the same as input size (torch.Size([2304]))

But when I checked out v0.3.5 that bit of code ran without error! I will post back soon with any other diagnostic information that I discover. Thanks for creating this awesome package and example code.
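
As a stopgap for anyone blocked by this, pinning the last known-good release (e.g. pip install tsai==0.3.5) should let the MultiLabel section run, though see the next comment for a separate issue that turns up at the end of training.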

cversek commented 8 months ago

So, moving on from that line in v0.3.5, I do eventually run into an error at the end of the training loop. But I consider this great progress (retrogress?). Full stack trace (notebook screenshot omitted):

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[20], line 2
      1 learn = ts_learner(dls, InceptionTimePlus, metrics=[partial(accuracy_multi, by_sample=True), partial(accuracy_multi, by_sample=False)], cbs=ShowGraph())
----> 2 learn.fit_one_cycle(10, lr_max=1e-3)

File /opt/mambaforge/envs/neurovep_data/lib/python3.11/site-packages/fastai/callback/schedule.py:119, in fit_one_cycle(self, n_epoch, lr_max, div, div_final, pct_start, wd, moms, cbs, reset_opt, start_epoch)
    116 lr_max = np.array([h['lr'] for h in self.opt.hypers])
    117 scheds = {'lr': combined_cos(pct_start, lr_max/div, lr_max, lr_max/div_final),
    118           'mom': combined_cos(pct_start, *(self.moms if moms is None else moms))}
--> 119 self.fit(n_epoch, cbs=ParamScheduler(scheds)+L(cbs), reset_opt=reset_opt, wd=wd, start_epoch=start_epoch)

File /opt/mambaforge/envs/neurovep_data/lib/python3.11/site-packages/fastai/learner.py:264, in Learner.fit(self, n_epoch, lr, wd, cbs, reset_opt, start_epoch)
    262 self.opt.set_hypers(lr=self.lr if lr is None else lr)
    263 self.n_epoch = n_epoch
--> 264 self._with_events(self._do_fit, 'fit', CancelFitException, self._end_cleanup)

File /opt/mambaforge/envs/neurovep_data/lib/python3.11/site-packages/fastai/learner.py:201, in Learner._with_events(self, f, event_type, ex, final)
    199 try: self(f'before_{event_type}');  f()
    200 except ex: self(f'after_cancel_{event_type}')
--> 201 self(f'after_{event_type}');  final()

File /opt/mambaforge/envs/neurovep_data/lib/python3.11/site-packages/fastai/learner.py:172, in Learner.__call__(self, event_name)
--> 172 def __call__(self, event_name): L(event_name).map(self._call_one)

File /opt/mambaforge/envs/neurovep_data/lib/python3.11/site-packages/fastcore/foundation.py:156, in L.map(self, f, *args, **kwargs)
--> 156 def map(self, f, *args, **kwargs): return self._new(map_ex(self, f, *args, gen=False, **kwargs))

File /opt/mambaforge/envs/neurovep_data/lib/python3.11/site-packages/fastcore/basics.py:840, in map_ex(iterable, f, gen, *args, **kwargs)
    838 res = map(g, iterable)
    839 if gen: return res
--> 840 return list(res)

File /opt/mambaforge/envs/neurovep_data/lib/python3.11/site-packages/fastcore/basics.py:825, in bind.__call__(self, *args, **kwargs)
    823     if isinstance(v,_Arg): kwargs[k] = args.pop(v.i)
    824 fargs = [args[x.i] if isinstance(x, _Arg) else x for x in self.pargs] + args[self.maxi+1:]
--> 825 return self.func(*fargs, **kwargs)

File /opt/mambaforge/envs/neurovep_data/lib/python3.11/site-packages/fastai/learner.py:176, in Learner._call_one(self, event_name)
    174 def _call_one(self, event_name):
    175     if not hasattr(event, event_name): raise Exception(f'missing {event_name}')
--> 176     for cb in self.cbs.sorted('order'): cb(event_name)

File /opt/mambaforge/envs/neurovep_data/lib/python3.11/site-packages/fastai/callback/core.py:62, in Callback.__call__(self, event_name)
     60     try: res = getcallable(self, event_name)()
     61     except (CancelBatchException, CancelBackwardException, CancelEpochException, CancelFitException, CancelStepException, CancelTrainException, CancelValidException): raise
---> 62     except Exception as e: raise modify_exception(e, f'Exception occured in `{self.__class__.__name__}` when calling event `{event_name}`:\n\t{e.args[0]}', replace=True)
     63 if event_name=='after_fit': self.run=True #Reset self.run to True at each end of fit
     64 return res

File /opt/mambaforge/envs/neurovep_data/lib/python3.11/site-packages/fastai/callback/core.py:60, in Callback.__call__(self, event_name)
     58 res = None
     59 if self.run and _run: 
---> 60     try: res = getcallable(self, event_name)()
     61     except (CancelBatchException, CancelBackwardException, CancelEpochException, CancelFitException, CancelStepException, CancelTrainException, CancelValidException): raise
     62     except Exception as e: raise modify_exception(e, f'Exception occured in `{self.__class__.__name__}` when calling event `{event_name}`:\n\t{e.args[0]}', replace=True)

File ~/gitwork/timeseriesAI/tsai/tsai/callback/core.py:101, in ShowGraph.after_fit(self)
     99     plt.close(self.graph_ax.figure)
    100 if self.plot_metrics: 
--> 101     self.learn.plot_metrics(final_losses=self.final_losses, perc=self.perc)

File ~/gitwork/timeseriesAI/tsai/tsai/learner.py:231, in plot_metrics(self, **kwargs)
    228 @patch
    229 @delegates(subplots)
    230 def plot_metrics(self: Learner, **kwargs):
--> 231     self.recorder.plot_metrics(**kwargs)

File ~/gitwork/timeseriesAI/tsai/tsai/learner.py:218, in plot_metrics(self, nrows, ncols, figsize, final_losses, perc, **kwargs)
    216 else:
    217     color = '#ff7f0e'
--> 218     label = 'valid' if (m != [None] * len(m)).all() else None
    219     axs[ax_idx].grid(color='gainsboro', linewidth=.5)
    220 axs[ax_idx].plot(xs, m, color=color, label=label)

AttributeError: Exception occured in `ShowGraph` when calling event `after_fit`:
    'bool' object has no attribute 'all'
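
The immediate cause looks mundane: comparing a plain Python list against another list returns a single bool rather than an element-wise array, so there is no .all() to call. A minimal reproduction, plus a possible guard (my guess, not an official fix):

m = [0.9, 0.91, None]        # a recorded metric column, as a plain Python list
print(m != [None] * len(m))  # True -- a single bool, which has no .all()

# a guard that avoids assuming an element-wise comparison:
label = 'valid' if any(v is not None for v in m) else None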

I am going to guess that some of my dependencies are too new for this older tsai version. Here is my setup for the system I'm currently testing:

os              : Linux-6.2.0-34-generic-x86_64-with-glibc2.35
python          : 3.11.0
tsai            : 0.3.5
fastai          : 2.7.12
fastcore        : 1.5.29
torch           : 2.0.1
cpu cores       : 4
threads per cpu : 2
RAM             : 62.58 GB
GPU memory      : N/A

cversek commented 8 months ago

For the TSMultiLabelClassification case, comparing the output of learn.model for InceptionTimePlus between v0.3.5 and v0.3.6, everything is the same except for learn.model.head (head printouts omitted).

Clearly something has changed, and maybe it's not shaping the output properly?

For the Multi-class TSClassification, which works in both versions, the differences in learn.model.head are more subtle (printout omitted).

cversek commented 8 months ago

@oguiza I just noticed the similarity with some previously fixed issues.

Could be a regression; I will try to study what the fixes were there until someone more qualified can take over ;)

cversek commented 8 months ago

@oguiza By running git bisect between tags 0.3.5 (good) and 0.3.6 (bad) and re-checking the notebook example at each step, I was eventually able to hunt down the problematic commit, 9caff8f. There is some uncertainty about whether reverting that commit would cause regressions elsewhere in the code base, so I submitted a draft PR to fix the issue: https://github.com/timeseriesAI/tsai/pull/855

It would be awesome if the maintainers could help with implementing the fix. I have just about exhausted my ability to understand how the magic model auto-configuration system is supposed to work. Once a fix is decided upon and tested, I will happily close out the issue!

Munib5 commented 8 months ago

Having the same issue!

cversek commented 8 months ago

@oguiza @Munib5 Sorry, I thought I had a real fix, but I was mistaken again.

Tracing what is going wrong is a bit maddening. The problem starts with the creation of the DataLoaders; see the sketch below (original screenshot omitted).
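
This is roughly what that cell does, reconstructed from the tutorial (not verbatim; X, y, and splits come from earlier notebook cells):

from tsai.basics import *

# multilabel targets: each sample carries a set of labels
tfms = [None, TSMultiLabelClassification()]
dls = get_ts_dls(X, y, splits=splits, tfms=tfms)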

At this point the dls object has two attributes:

dls.c == 6
dls.d == 6

When the learner.ts_learner function is invoked as in:

learn = ts_learner(dls, InceptionTimePlus, metrics=accuracy_multi, verbose=True)

it delegates to models.utils.build_ts_model with c_out=None and d=None. Internally those parameters are obtained from the dls object (see source).

Still inside build_ts_model, a 'custom_head' argument gets tacked onto the kwargs dict (see source):

kwargs['custom_head'] = partial(kwargs['custom_head'], d=d)

where, in this case, d=6. The kwargs dict is later passed into the model constructor (see source):

model = arch(c_in, c_out, seq_len=seq_len, **arch_config, **kwargs).to(device=device)

This is indicated by the following printout when you set verbose=True in the ts_learner call:

arch: InceptionTimePlus(c_in=1 c_out=6 seq_len=140 arch_config={} kwargs={'custom_head': functools.partial(<class 'tsai.models.layers.lin_nd_head'>, d=6)})

Subsequently, when the lin_nd_head constructor is called (see source), the local shape and fd variables take on these values:

shape == [6,6]
fd == 6

Later in that constructor (see code), this problematically shaped layer is appended to the model:

else:
    if seq_len == 1:
        layers += [nn.AdaptiveAvgPool1d(1)]
    if not flatten and fd == seq_len:
        layers += [Transpose(1,2), nn.Linear(n_in, n_out)]
    else:
        layers += [Reshape(), nn.Linear(n_in * seq_len, n_out * fd)]
    layers += [Reshape(*shape)]

where the key variables take on these values:

n_in == 128
seq_len == 140
n_out == 6
fd == 6
shape == [6,6]
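
So, per sample, the head ends up as Reshape() -> nn.Linear(128 * 140, 6 * 6) -> Reshape(6, 6), i.e. a (batch, 6, 6) output. fastai's flattening loss wrapper then hands binary_cross_entropy_with_logits 64 * 6 * 6 = 2304 prediction elements against 64 * 6 = 384 target elements (assuming batch size 64), which is exactly the ValueError at the top of this thread.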

This shows a potential conflict between the property c, which is supposed to be the "number of classes/categories", and the property d, which apparently means some sort of output dimension (the source isn't very clear on these semantics).

Again, my limited understanding of this library's architecture is a major impediment to finding a fix that will satisfy everyone :) Here's hoping that the maintainers will take over!

cversek commented 8 months ago

@Munib5 With all that said, a temporary workaround might be to do something like:

learn = ts_learner(dls, InceptionTimePlus, metrics=accuracy_multi, verbose=True, d=1)

where we force the d property back to a value that produces an output shape compatible with the torch.binary_cross_entropy_with_logits function.
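
If you try this, a quick sanity check (hypothetical, untested) is to compare prediction and target shapes on one batch before training:

xb, yb = dls.one_batch()
preds = learn.model(xb)
print(preds.shape, yb.shape)  # the flattened element counts should now match (384 each with bs=64)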