timeseriesAI / tsai

State-of-the-art Deep Learning library for Time Series and Sequences in Pytorch / fastai
https://timeseriesai.github.io/tsai/
Apache License 2.0

Beginner: learn.lr_find returns IndexError, and learn.fit_one_cycle returns nan for both train_loss and valid_loss #639

Closed · qixing0375 closed this issue 1 year ago

qixing0375 commented 1 year ago
[Screenshot 2022-12-08 16:10:19]

dls and model are exactly the same as in the tutorial.

[Screenshots 2022-12-08 16:12:25 and 16:12:40]

I have checked that there is no NA in my input dataset. Does anyone know the problem here? Thanks!!

qixing0375 commented 1 year ago

When I use a sample size of less than 2000, everything runs normally. But once I increase the sample size, all the problems shown above appear.
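One way to localize this kind of failure is to train for a single short epoch on growing slices of the data and note where the loss first turns nan. A rough sketch; `make_learner` here is a hypothetical helper that builds the dls/model/Learner exactly as in the snippet further down:

```python
import numpy as np

def first_bad_size(X, y, make_learner, sizes=(1000, 2000, 4000, 8000)):
    """Return the first slice size n for which training on X[:n] produces nan loss."""
    for n in sizes:
        learn = make_learner(X[:n], y[:n])  # hypothetical: builds dls + model + Learner
        learn.fit_one_cycle(1, 1e-3)        # one cheap epoch is enough to surface nan
        if np.isnan(learn.recorder.values[-1][0]):  # index 0 = last train loss
            return n
    return None
```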

oguiza commented 1 year ago

I'm sorry @qixing0375, but I cannot debug the code if I cannot reproduce the issue. Can you please provide a code snippet that reproduces it? It doesn't matter if the data is random as long as it has the same shape. I also need you to paste the full stack trace between ``` marks.
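For reference, random stand-in data of the right shape can be generated like this (shapes taken from the next comment; the binary target is an assumption):

```python
import numpy as np

# Same shape as the dataset reported below: (samples, variables, time steps)
X = np.random.randn(7970, 33, 36).astype(np.float32)
y = np.random.randint(0, 2, 7970)  # assumed binary target
```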

qixing0375 commented 1 year ago
```python
X.shape, y.shape
# ((7970, 33, 36), (7970,))

splits = get_splits(y, valid_size=0.2, random_state=23, shuffle=True)
tfms = [None, [Categorize()]]
dsets = TSDatasets(X, y, tfms=tfms, splits=splits, inplace=True)
dls = TSDataLoaders.from_dsets(dsets.train, dsets.valid, bs=[64, 128],
                               batch_tfms=[TSStandardize(by_var=True)], num_workers=0)
model = InceptionTime(dls.vars, dls.c)
learn = Learner(dls, model, metrics=accuracy)
learn.save('stage0')
learn.load('stage0')
learn.lr_find()
```

```
16.00% [4/25 00:13<01:10]
epoch  train_loss  valid_loss  accuracy  time
0      nan         nan         0.958595  00:03
1      nan         nan         0.958595  00:03
2      nan         nan         0.958595  00:03
3      nan         nan         0.958595  00:03

 4.04% [4/99 00:00<00:03 nan]
```

```
KeyboardInterrupt                         Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_13020\4198329076.py in <module>
----> 1 learn.fit_one_cycle(25, lr_max=1e-3)
      2 learn.save('stage1')

c:\Users\qixin\anaconda3\lib\site-packages\fastai\callback\schedule.py in fit_one_cycle(self, n_epoch, lr_max, div, div_final, pct_start, wd, moms, cbs, reset_opt, start_epoch)
    117     scheds = {'lr': combined_cos(pct_start, lr_max/div, lr_max, lr_max/div_final),
    118               'mom': combined_cos(pct_start, *(self.moms if moms is None else moms))}
--> 119     self.fit(n_epoch, cbs=ParamScheduler(scheds)+L(cbs), reset_opt=reset_opt, wd=wd, start_epoch=start_epoch)
    120
    121 # %% ../../nbs/14_callback.schedule.ipynb 50

c:\Users\qixin\anaconda3\lib\site-packages\fastai\learner.py in fit(self, n_epoch, lr, wd, cbs, reset_opt, start_epoch)
    254     self.opt.set_hypers(lr=self.lr if lr is None else lr)
    255     self.n_epoch = n_epoch
--> 256     self._with_events(self._do_fit, 'fit', CancelFitException, self._end_cleanup)
    257
    258 def _end_cleanup(self): self.dl,self.xb,self.yb,self.pred,self.loss = None,(None,),(None,),None,None

c:\Users\qixin\anaconda3\lib\site-packages\fastai\learner.py in _with_events(self, f, event_type, ex, final)
    191
    192 def _with_events(self, f, event_type, ex, final=noop):
--> 193     try: self(f'before_{event_type}'); f()
    194     except ex: self(f'after_cancel_{event_type}')
...
--> 160     sqr_avg.mul_(sqr_mom).addcmul_(p.grad.data, p.grad.data, value=damp)
    161     return {'sqr_avg': sqr_avg}
    162

KeyboardInterrupt:
```

```
  arch    hyperparams  total params  train loss  valid loss  accuracy  time
0 FCN     {}           292865        NaN         NaN         0.958595  6
1 ResNet  {}           494721        NaN         NaN         0.958595  12
XResNet
```

```
20.00% [2/10 00:06<00:26]
epoch  train_loss  valid_loss  accuracy  time
0      nan         nan         0.958595  00:03
1      nan         nan         0.958595  00:03

39.39% [39/99 00:01<00:01 nan]
```

```
KeyboardInterrupt                         Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_13020\63561938.py in <module>
     10 learn = Learner(dls, model, metrics=accuracy)
     11 start = time.time()
---> 12 learn.fit_one_cycle(10, 1e-3)
     13 elapsed = time.time() - start
     14 vals = learn.recorder.values[-1]

c:\Users\qixin\anaconda3\lib\site-packages\fastai\callback\schedule.py in fit_one_cycle(self, n_epoch, lr_max, div, div_final, pct_start, wd, moms, cbs, reset_opt, start_epoch)
    117     scheds = {'lr': combined_cos(pct_start, lr_max/div, lr_max, lr_max/div_final),
    118               'mom': combined_cos(pct_start, *(self.moms if moms is None else moms))}
--> 119     self.fit(n_epoch, cbs=ParamScheduler(scheds)+L(cbs), reset_opt=reset_opt, wd=wd, start_epoch=start_epoch)
    120
    121 # %% ../../nbs/14_callback.schedule.ipynb 50

c:\Users\qixin\anaconda3\lib\site-packages\fastai\learner.py in fit(self, n_epoch, lr, wd, cbs, reset_opt, start_epoch)
    254     self.opt.set_hypers(lr=self.lr if lr is None else lr)
    255     self.n_epoch = n_epoch
--> 256     self._with_events(self._do_fit, 'fit', CancelFitException, self._end_cleanup)
    257
    258 def _end_cleanup(self): self.dl,self.xb,self.yb,self.pred,self.loss = None,(None,),(None,),None,None

c:\Users\qixin\anaconda3\lib\site-packages\fastai\learner.py in _with_events(self, f, event_type, ex, final)
    191
...
--> 197     Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    198         tensors, grad_tensors_, retain_graph, create_graph, inputs,
    199         allow_unreachable=True, accumulate_grad=True)

KeyboardInterrupt:
```

```
IndexError                                Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_13680\2529580247.py in <module>
      1 learn.load('stage0')
----> 2 learn.lr_find()

c:\Users\qixin\anaconda3\lib\site-packages\fastai\callback\schedule.py in lr_find(self, start_lr, end_lr, num_it, stop_div, show_plot, suggest_funcs)
    302     for func in tuplify(suggest_funcs):
    303         nms.append(func.__name__ if not isinstance(func, partial) else func.func.__name__)  # deal with partials
--> 304         _suggestions.append(func(lrs, losses, num_it))
    305
    306 SuggestedLRs = collections.namedtuple('SuggestedLRs', nms)

c:\Users\qixin\anaconda3\lib\site-packages\fastai\callback\schedule.py in valley(lrs, losses, num_it)
    229     idx = max_start + int(sections) + int(sections/2)
    230
--> 231     return float(lrs[idx]), (float(lrs[idx]), losses[idx])
    232
    233 # %% ../../nbs/14_callback.schedule.ipynb 81

IndexError: index 0 is out of bounds for dimension 0 with size 0
```
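As far as I can tell, the IndexError is a downstream symptom of the nan losses: `lr_find` stops as soon as the loss diverges or turns nan, and it trims the warm-up and tail points off the recorded losses before calling the suggesters, so the `valley` suggester ends up indexing an empty tensor. A minimal illustration of that failure mode:

```python
import torch

# With only a handful of nan batches recorded, the trimmed losses tensor that
# reaches the valley suggester is empty, and indexing it raises the same error:
losses = torch.tensor([])
losses[0]  # IndexError: index 0 is out of bounds for dimension 0 with size 0
```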

```python
learn.fit_one_cycle(25, lr_max=1e-3)
learn.save('stage1')
```
[Screenshot 2022-12-08 20:26:11]
```python
import time
import pandas as pd
from IPython.display import clear_output, display

archs = [(FCN, {}), (ResNet, {}), (xresnet1d34, {}), (ResCNN, {}),
         (LSTM, {'n_layers':1, 'bidirectional': False}), (LSTM, {'n_layers':2, 'bidirectional': False}), (LSTM, {'n_layers':3, 'bidirectional': False}),
         (LSTM, {'n_layers':1, 'bidirectional': True}), (LSTM, {'n_layers':2, 'bidirectional': True}), (LSTM, {'n_layers':3, 'bidirectional': True}),
         (LSTM_FCN, {}), (LSTM_FCN, {'shuffle': False}), (InceptionTime, {}), (XceptionTime, {}), (OmniScaleCNN, {}), (mWDN, {'levels': 4})]

results = pd.DataFrame(columns=['arch', 'hyperparams', 'total params', 'train loss', 'valid loss', 'accuracy', 'time'])
for i, (arch, k) in enumerate(archs):
    model = create_model(arch, dls=dls, **k)
    print(model.__class__.__name__)
    learn = Learner(dls, model, metrics=accuracy)
    start = time.time()
    learn.fit_one_cycle(10, 1e-3)
    elapsed = time.time() - start
    vals = learn.recorder.values[-1]  # [train loss, valid loss, accuracy] of the last epoch
    results.loc[i] = [arch.__name__, k, count_parameters(model), vals[0], vals[1], vals[2], int(elapsed)]
    results.sort_values(by='accuracy', ascending=False, ignore_index=True, inplace=True)
    clear_output()
    display(results)
```

I have also tried other models, but they showed similar results.

I tried with a randomly generated np.array of the same shape, and the code worked. But it somehow didn't work with my own array. I have checked that there is no NA in my array, and the batch visualization looks okay: all variables are standardized. Do you have any idea?

Thanks! Qixing
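A side note on NA checks like the one above: it is worth running the check directly on the raw float array (rather than on a DataFrame built from it), and checking for Inf as well, since both produce nan losses. A quick numpy sketch:

```python
import numpy as np

Xf = X.astype(np.float32)  # make sure we are checking a float dtype
print(np.isnan(Xf).any(), np.isinf(Xf).any())  # any NaN anywhere? any +/-Inf?

# Which samples are affected, and how many:
bad = np.isnan(Xf).any(axis=(1, 2)) | np.isinf(Xf).any(axis=(1, 2))
print(bad.sum(), np.where(bad)[0][:10])  # count and first few offending indices
```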

qixing0375 commented 1 year ago

My npy file is about 75 MB, which is a bit large. I tested a bit more: with about 6000 samples the code works, but if I increase the sample size to 9000, the error appears. Do you still want to try to reproduce the bug? If so, how should I share the npy file with you?
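One low-effort way to share a reproducer is to save only a slice that already triggers the failure, which keeps the file well under the full 75 MB (a sketch, using the sizes mentioned above):

```python
import numpy as np

# ~6000 samples reportedly work and ~9000 fail, so a 9000-sample slice
# should be enough to reproduce the issue at a fraction of the file size.
np.save('X_repro.npy', X[:9000])
np.save('y_repro.npy', y[:9000])
```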

oguiza commented 1 year ago

Hi @qixing0375, I tried to reproduce the issue based on the information you have provided but have failed: everything seems to work well even if I create a dataset with 50k samples (see gist). Can you please run `my_setup()` and `check_data(X, y, splits)` and paste the output here?
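Both helpers ship with tsai; a minimal way to run them, assuming the usual star import (the comments describe roughly what they report):

```python
from tsai.all import *  # exports my_setup and check_data among others

my_setup()                # environment report: python / tsai / torch versions, GPU, etc.
check_data(X, y, splits)  # sanity report on X, y and the splits (shapes, dtypes, nans, ...)
```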

qixing0375 commented 1 year ago

Problem solved. There were NAs mixed into the middle of my samples, which is why the error only appeared once I increased the sample size. Thank you so much!
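For anyone landing on this issue with the same symptom, once the offending samples are located they can either be dropped or imputed before building the datasets. A minimal sketch, assuming `X` is a float array:

```python
import numpy as np

bad = np.isnan(X).any(axis=(1, 2))  # samples containing at least one NaN

# Option 1: drop the affected samples entirely.
X_clean, y_clean = X[~bad], y[~bad]

# Option 2: keep every sample and replace NaNs with a constant.
X_filled = np.nan_to_num(X, nan=0.0)
```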