jantic / DeOldify

A Deep Learning based project for colorizing and restoring old images (and video!)

Training stop iteration #245

Closed gary-kaitung closed 4 years ago

gary-kaitung commented 4 years ago

Hi,

while running learn_gen.fit_one_cycle(1, pct_start=0.8, max_lr=slice(1e-3))

I encountered the error shown below. My setup and my changes to the notebook are as follows.

My fastai version is 1.0.51

My pytorch version is 1.0.1.post2

I changed the paths to `path = Path('data/manga')`, `path_hr = path`, `path_lr = path/'bandw'`,

and added about 20 pictures to that folder and to the `bandw` folder.

I also tried adding `num_workers=0` as suggested in https://forums.fast.ai/t/brokenpipeerror-using-jupyter-notebook-lesson-1/41090/11, although the error in that thread is a BrokenPipeError:

`def get_data(bs:int, sz:int, keep_pct:float): return get_colorize_data(sz=sz, bs=bs, crappy_path=path_lr, good_path=path_hr, random_seed=None, keep_pct=keep_pct, num_workers=0)`
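
In case it helps with debugging, here is a small sanity check on the resulting DataBunch (a hypothetical diagnostic cell, not part of the notebook, using the standard fastai v1 DataBunch attributes):

```python
# Hypothetical diagnostic: build the data at a small size and count samples/batches.
data = get_data(bs=4, sz=64, keep_pct=1.0)

print(len(data.train_ds), len(data.valid_ds))   # images in each split
print(len(data.train_dl), len(data.valid_dl))   # batches per epoch; 0 means bs is too large for the split
```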

I am running the notebook on Paperspace with a very limited free GPU as a trial run. Could this be a memory issue? But if so, wouldn't the error say it's about memory?

```
---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
<ipython-input> in <module>
----> 1 learn_gen.fit_one_cycle(1, pct_start=0.8, max_lr=slice(1e-3))

/opt/conda/envs/fastai/lib/python3.7/site-packages/fastai/train.py in fit_one_cycle(learn, cyc_len, max_lr, moms, div_factor, pct_start, final_div, wd, callbacks, tot_epochs, start_epoch)
     20     callbacks.append(OneCycleScheduler(learn, max_lr, moms=moms, div_factor=div_factor, pct_start=pct_start,
     21                                        final_div=final_div, tot_epochs=tot_epochs, start_epoch=start_epoch))
---> 22     learn.fit(cyc_len, max_lr, wd=wd, callbacks=callbacks)
     23
     24 def lr_find(learn:Learner, start_lr:Floats=1e-7, end_lr:Floats=10, num_it:int=100, stop_div:bool=True, wd:float=None):

/opt/conda/envs/fastai/lib/python3.7/site-packages/fastai/basic_train.py in fit(self, epochs, lr, wd, callbacks)
    192         if not getattr(self, 'opt', False): self.create_opt(lr, wd)
    193         else: self.opt.lr,self.opt.wd = lr,wd
--> 194         callbacks = [cb(self) for cb in self.callback_fns] + listify(callbacks)
    195         if defaults.extra_callbacks is not None: callbacks += defaults.extra_callbacks
    196         fit(epochs, self, metrics=self.metrics, callbacks=self.callbacks+callbacks)

/opt/conda/envs/fastai/lib/python3.7/site-packages/fastai/basic_train.py in <listcomp>(.0)
    192         if not getattr(self, 'opt', False): self.create_opt(lr, wd)
    193         else: self.opt.lr,self.opt.wd = lr,wd
--> 194         callbacks = [cb(self) for cb in self.callback_fns] + listify(callbacks)
    195         if defaults.extra_callbacks is not None: callbacks += defaults.extra_callbacks
    196         fit(epochs, self, metrics=self.metrics, callbacks=self.callbacks+callbacks)

/opt/conda/envs/fastai/lib/python3.7/site-packages/fastai/callbacks/tensorboard.py in __init__(self, learn, base_dir, name, loss_iters, hist_iters, stats_iters, visual_iters)
    176                  visual_iters:int=100):
    177         super().__init__(learn=learn, base_dir=base_dir, name=name, loss_iters=loss_iters, hist_iters=hist_iters,
--> 178                          stats_iters=stats_iters)
    179         self.visual_iters = visual_iters
    180         self.img_gen_vis = ImageTBWriter()

/opt/conda/envs/fastai/lib/python3.7/site-packages/fastai/callbacks/tensorboard.py in __init__(self, learn, base_dir, name, loss_iters, hist_iters, stats_iters)
     36         self.data = None
     37         self.metrics_root = '/metrics/'
---> 38         self._update_batches_if_needed()
     39
     40     def _get_new_batch(self, ds_type:DatasetType)->Collection[Tensor]:

/opt/conda/envs/fastai/lib/python3.7/site-packages/fastai/callbacks/tensorboard.py in _update_batches_if_needed(self)
     48         if not update_batches: return
     49         self.data = self.learn.data
---> 50         self.trn_batch = self._get_new_batch(ds_type=DatasetType.Train)
     51         self.val_batch = self._get_new_batch(ds_type=DatasetType.Valid)
     52

/opt/conda/envs/fastai/lib/python3.7/site-packages/fastai/callbacks/tensorboard.py in _get_new_batch(self, ds_type)
     40     def _get_new_batch(self, ds_type:DatasetType)->Collection[Tensor]:
     41         "Retrieves new batch of DatasetType, and detaches it."
---> 42         return self.learn.data.one_batch(ds_type=ds_type, detach=True, denorm=False, cpu=False)
     43
     44     def _update_batches_if_needed(self)->None:

/opt/conda/envs/fastai/lib/python3.7/site-packages/fastai/basic_data.py in one_batch(self, ds_type, detach, denorm, cpu)
    166         w = self.num_workers
    167         self.num_workers = 0
--> 168         try:     x,y = next(iter(dl))
    169         finally: self.num_workers = w
    170         if detach: x,y = to_detach(x,cpu=cpu),to_detach(y,cpu=cpu)

StopIteration:
```

Looking forward to anyone's help. Thank you very much.
gary-kaitung commented 4 years ago

Oh, my bad, I forgot to set bs to a smaller value. Problem solved.
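
For anyone who finds this later: with only ~20 images and a large bs, the training DataLoader ends up with zero full batches, so `next(iter(dl))` raises StopIteration. A minimal standalone sketch of the same failure in plain PyTorch (toy tensors, illustrative sizes only):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset smaller than the batch size (e.g. ~20 images vs bs=64).
ds = TensorDataset(torch.randn(20, 3, 64, 64), torch.randn(20, 3, 64, 64))

# fastai v1's training DataLoader typically drops the last incomplete batch,
# so with bs > len(ds) it yields zero batches per epoch.
dl = DataLoader(ds, batch_size=64, drop_last=True)

print(len(dl))           # 0
x, y = next(iter(dl))    # raises StopIteration, just like the traceback above
```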

kroslaniec commented 3 years ago

> Oh, my bad, I forgot to set bs to a smaller value. Problem solved.

I know it's an old issue, but I'm running into the same thing. Could you explain what bs and sz mean, and what values you set?

jantic commented 3 years ago

`bs` is the batch size (the number of images trained on in a single iteration of training), and `sz` is the side length that images are resized to for training. `sz=64`, for example, means the images are resized to 64x64 pixel squares.

Generally I try to maximize the batch size within the GPU's memory constraints. This version of DeOldify was developed on a single 11GB video card, so if you have something smaller than that, you'll have to adjust accordingly.
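
To make that concrete, here is a rough sketch of how bs and sz feed into the calls from this thread. The values are illustrative, not the notebook's defaults; tune them to your own GPU and dataset.

```python
# Illustrative values only -- adjust to your GPU memory and dataset size.
bs = 8          # batch size: must fit in GPU memory and not exceed the number of training images
sz = 64         # side length: train on 64x64 images first; later cycles can raise this
keep_pct = 1.0  # fraction of the dataset to keep; 1.0 = use everything

# Rebuild the data and swap it into the learner (fastai v1 lets you reassign learn.data),
# then run the same training call from the top of this issue.
learn_gen.data = get_data(bs=bs, sz=sz, keep_pct=keep_pct)
learn_gen.fit_one_cycle(1, pct_start=0.8, max_lr=slice(1e-3))
```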