timeseriesAI / tsai

State-of-the-art Deep Learning library for Time Series and Sequences in Pytorch / fastai
https://timeseriesai.github.io/tsai/
Apache License 2.0

Error when defining ShowGraph if resuming from previous epoch #631

Closed yangtzech closed 1 year ago

yangtzech commented 1 year ago

Define a learner:

learn = Learner(dls, model, loss_func=MSELossFlat(), metrics=[rmse],  cbs=[ShowGraphCallback2(), SaveModel(monitor='valid_loss', every_epoch=True, with_opt=True)])

Train for some epochs:

learn.fit_one_cycle(50, 1)

Then interrupt training at epoch 9.

Load the model saved at the previous epoch:

learn = learn.load('model_8')

Resume training by setting start_epoch:

learn.fit_one_cycle(300, 0.12, start_epoch=9)

The following error occurs:

~/anaconda3/envs/tsai_dev/lib/python3.7/site-packages/fastai/callback/schedule.py in fit_one_cycle(self, n_epoch, lr_max, div, div_final, pct_start, wd, moms, cbs, reset_opt, start_epoch)
    117     scheds = {'lr': combined_cos(pct_start, lr_max/div, lr_max, lr_max/div_final),
    118               'mom': combined_cos(pct_start, *(self.moms if moms is None else moms))}
--> 119     self.fit(n_epoch, cbs=ParamScheduler(scheds)+L(cbs), reset_opt=reset_opt, wd=wd, start_epoch=start_epoch)
    120 
    121 # %% ../../nbs/14_callback.schedule.ipynb 50

~/anaconda3/envs/tsai_dev/lib/python3.7/site-packages/fastai/learner.py in fit(self, n_epoch, lr, wd, cbs, reset_opt, start_epoch)
    254             self.opt.set_hypers(lr=self.lr if lr is None else lr)
    255             self.n_epoch = n_epoch
--> 256             self._with_events(self._do_fit, 'fit', CancelFitException, self._end_cleanup)
    257 
    258     def _end_cleanup(self): self.dl,self.xb,self.yb,self.pred,self.loss = None,(None,),(None,),None,None

~/anaconda3/envs/tsai_dev/lib/python3.7/site-packages/fastai/learner.py in _with_events(self, f, event_type, ex, final)
    191 
    192     def _with_events(self, f, event_type, ex, final=noop):
--> 193         try: self(f'before_{event_type}');  f()
    194         except ex: self(f'after_cancel_{event_type}')
    195         self(f'after_{event_type}');  final()
...
    427 
    428     result_ndim = arrays[0].ndim + 1

ValueError: Exception occured in `ShowGraph` when calling event `after_epoch`:
    all input arrays must have the same shape

No error occurs if ShowGraphCallback2() is removed from the callbacks before loading the learner from the previous epoch.
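The last frame shown in the traceback (`result_ndim = arrays[0].ndim + 1`) belongs to NumPy's `np.stack`, which raises this exact message when the arrays it receives differ in shape. The sketch below reproduces the message with NumPy alone. The `records` list is purely hypothetical: it assumes that epochs skipped via start_epoch leave shorter per-epoch entries than trained epochs, which may not be precisely what happens inside the callback.

```python
import numpy as np

# Hypothetical per-epoch records (not taken from fastai or tsai):
# the first two entries stand in for skipped epochs and are assumed
# to be empty, the last two stand in for fully trained epochs.
records = [[], [], [0.52, 0.61, 0.78], [0.47, 0.58, 0.76]]

try:
    np.stack(records)          # shapes (0,) and (3,) cannot be stacked
    message = None
except ValueError as err:
    message = str(err)

print(message)

# Stacking only the entries with a consistent shape works as expected.
stacked = np.stack(records[2:])
print(stacked.shape)
```

If the assumption holds, any plotting code that stacks per-epoch values would need to ignore the entries belonging to skipped epochs.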

oguiza commented 1 year ago

Hi @yangtzech, I've been able to replicate this issue. It comes from fastai. tsai's ShowGraph code is based on fastai's ShowGraphCallback, and the original design does not seem to take the start_epoch option into account. You can reproduce the same issue by replacing ShowGraph with ShowGraphCallback. It would be good if you could create an issue in the fastai repo.

yangtzech commented 1 year ago

Sorry, I thought ShowGraph caused it. I'll create an issue in fastai.

oguiza commented 1 year ago

@yangtzech, No problem. You could use a pure fastai code snippet like this to reproduce the issue:

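from fastai.test_utils import *
from fastai.callback.progress import ShowGraphCallback
from fastai.callback.tracker import SaveModelCallback  # fastai's counterpart of tsai's SaveModel
from fastai.callback.schedule import *  # adds Learner.fit_one_cycle
cbs = [ShowGraphCallback(),
       SaveModelCallback(monitor='valid_loss', every_epoch=True, with_opt=True)]
learn = synth_learner(cbs=cbs)
learn.fit(50)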

Stop training after epoch 10 for any reason. Then run:

learn = learn.load('model_10')
learn.fit_one_cycle(50, start_epoch=11)

yangtzech commented 1 year ago

Thanks, @oguiza! It's opened here.