Closed: DmitriyG228 closed this issue 2 years ago
Hey @DmitriyG228, in fastai 2.0 the callback is now included with fastai itself.
That's great. I had picked up an outdated tutorial while browsing links from the official site.
The other problem with WandbCallback is that it seems to conflict with the following custom callback:
class ReloadCallback(Callback):
    def before_epoch(self):
        print(learn.n)
        if learn.n % 6 == 0:
            t_dl, v_dl = q.get()  # blocks until the parallel process returns
            print('got')
            learn.dls = DataLoaders(t_dl, v_dl).cuda()
            print('submitted')
            learn.prep_dls_proccess = Process(target=prep_dls, args=(q,))
            learn.prep_dls_proccess.start()  # start parallel process (dataloaders prep)
        learn.n += 1
The above callback never starts its job in the presence of WandbCallback.
The following code starts with pre-built data loaders dls, which are used to begin training, and launches the parallel dls-preparation process. ReloadCallback then manages dataloader preparation and submission later in the training loop.
learn = Learner(dls, model, CrossEntropyLossFlat(), opt_func=opt_func, metrics=[accuracy]).to_fp32()
learn.n = 0
q = multiprocessing.Queue()
learn.prep_dls_proccess = Process(target=prep_dls, args=(q,))
learn.prep_dls_proccess.start()
learn.add_cb(ReloadCallback())
learn.add_cb(WandbCallback())
learn.fit_one_cycle(100, lr_max=lr)
The script is stuck on the last line forever in the presence of WandbCallback and works as expected without it.
Any ideas on what I should change in my code to be able to use wandb with it?
Thank you!
@borisdayma is the original author of this callback, any ideas?
Hi @DmitriyG228
Can you let me know where you found the original example so we make sure to tag it appropriately? For info the best page to see fastai documentation and examples is here.
Also do you have a full example of the issue you have or at least the full traceback?
The one place where I can think of a potential issue is when WandbCallback tries to get sample predictions, but that may be fixed simply by setting the order of the callbacks.
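To make the ordering idea concrete, here is a stdlib-only sketch of how fastai schedules callbacks by their `order` attribute. The stub classes and numbers below are illustrative stand-ins, not fastai's actual values:

```python
# fastai fires callbacks for each event in ascending `order`, so giving
# a custom callback a lower order than WandbCallback makes it run first.
# These stubs only demonstrate the sorting; real code would set
# `order = WandbCallback.order - 1` on the custom callback class.

class Callback:
    order = 0

class WandbCallbackStub(Callback):
    order = 20  # logging callbacks typically sit late in the chain

class ReloadCallbackStub(Callback):
    order = WandbCallbackStub.order - 1  # force it to fire first

cbs = sorted([WandbCallbackStub(), ReloadCallbackStub()],
             key=lambda cb: cb.order)
print([type(cb).__name__ for cb in cbs])
# → ['ReloadCallbackStub', 'WandbCallbackStub']
```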
You may also be interested in using the native Distributed DataLoaders as they have several optimizations under the hood.
Hi @borisdayma, the above ReloadCallback is my own code, used to supply tabular data to the training loop in chunks, since the full dataset does not fit in memory. It prepares the dataloaders in parallel with the training, so I don't do any distributed or parallel learning with this code.
There is no traceback since there is no error as such, but the script waits forever on learn.fit_one_cycle, apparently due to some conflict between the two callbacks.
I can prepare a reproducible example tomorrow, if that makes sense, but I should probably refactor my code to avoid conflicts with well-established software.
No worries. Sometimes just by pressing Ctrl + C you can see where the script is hanging.
Also try to turn off certain features: WandbCallback(log=None, log_preds=False, log_model=False, log_dataset=False)
Hi, I have prepared a reproducible example, and it actually has nothing to do with either of the callbacks themselves; rather, some problem occurs with multiprocessing once wandb is initialized. Please run the following Colab notebook.
I would mention again that my callback is used to feed chunked tabular data into the training loop, while preparing dls for the next epoch at the same time.
The full traceback:
---------------------------------------------------------------------------
KeyboardInterrupt Traceback (most recent call last)
<ipython-input-14-38cc6af58017> in <module>
----> 1 learn.fit_one_cycle(100)
~/anaconda3/envs/ab/lib/python3.7/site-packages/fastcore/logargs.py in _f(*args, **kwargs)
54 init_args.update(log)
55 setattr(inst, 'init_args', init_args)
---> 56 return inst if to_return else f(*args, **kwargs)
57 return _f
~/anaconda3/envs/ab/lib/python3.7/site-packages/fastai/callback/schedule.py in fit_one_cycle(self, n_epoch, lr_max, div, div_final, pct_start, wd, moms, cbs, reset_opt)
111 scheds = {'lr': combined_cos(pct_start, lr_max/div, lr_max, lr_max/div_final),
112 'mom': combined_cos(pct_start, *(self.moms if moms is None else moms))}
--> 113 self.fit(n_epoch, cbs=ParamScheduler(scheds)+L(cbs), reset_opt=reset_opt, wd=wd)
114
115 # Cell
~/anaconda3/envs/ab/lib/python3.7/site-packages/fastcore/logargs.py in _f(*args, **kwargs)
54 init_args.update(log)
55 setattr(inst, 'init_args', init_args)
---> 56 return inst if to_return else f(*args, **kwargs)
57 return _f
~/anaconda3/envs/ab/lib/python3.7/site-packages/fastai/learner.py in fit(self, n_epoch, lr, wd, cbs, reset_opt)
205 self.opt.set_hypers(lr=self.lr if lr is None else lr)
206 self.n_epoch = n_epoch
--> 207 self._with_events(self._do_fit, 'fit', CancelFitException, self._end_cleanup)
208
209 def _end_cleanup(self): self.dl,self.xb,self.yb,self.pred,self.loss = None,(None,),(None,),None,None
~/anaconda3/envs/ab/lib/python3.7/site-packages/fastai/learner.py in _with_events(self, f, event_type, ex, final)
153
154 def _with_events(self, f, event_type, ex, final=noop):
--> 155 try: self(f'before_{event_type}') ;f()
156 except ex: self(f'after_cancel_{event_type}')
157 finally: self(f'after_{event_type}') ;final()
~/anaconda3/envs/ab/lib/python3.7/site-packages/fastai/learner.py in _do_fit(self)
195 for epoch in range(self.n_epoch):
196 self.epoch=epoch
--> 197 self._with_events(self._do_epoch, 'epoch', CancelEpochException)
198
199 @log_args(but='cbs')
~/anaconda3/envs/ab/lib/python3.7/site-packages/fastai/learner.py in _with_events(self, f, event_type, ex, final)
153
154 def _with_events(self, f, event_type, ex, final=noop):
--> 155 try: self(f'before_{event_type}') ;f()
156 except ex: self(f'after_cancel_{event_type}')
157 finally: self(f'after_{event_type}') ;final()
~/anaconda3/envs/ab/lib/python3.7/site-packages/fastai/learner.py in __call__(self, event_name)
131 def ordered_cbs(self, event): return [cb for cb in sort_by_run(self.cbs) if hasattr(cb, event)]
132
--> 133 def __call__(self, event_name): L(event_name).map(self._call_one)
134
135 def _call_one(self, event_name):
~/anaconda3/envs/ab/lib/python3.7/site-packages/fastcore/foundation.py in map(self, f, gen, *args, **kwargs)
224 def range(cls, a, b=None, step=None): return cls(range_of(a, b=b, step=step))
225
--> 226 def map(self, f, *args, gen=False, **kwargs): return self._new(map_ex(self, f, *args, gen=gen, **kwargs))
227 def argwhere(self, f, negate=False, **kwargs): return self._new(argwhere(self, f, negate, **kwargs))
228 def filter(self, f=noop, negate=False, gen=False, **kwargs):
~/anaconda3/envs/ab/lib/python3.7/site-packages/fastcore/basics.py in map_ex(iterable, f, gen, *args, **kwargs)
541 res = map(g, iterable)
542 if gen: return res
--> 543 return list(res)
544
545 # Cell
~/anaconda3/envs/ab/lib/python3.7/site-packages/fastcore/basics.py in __call__(self, *args, **kwargs)
531 if isinstance(v,_Arg): kwargs[k] = args.pop(v.i)
532 fargs = [args[x.i] if isinstance(x, _Arg) else x for x in self.pargs] + args[self.maxi+1:]
--> 533 return self.func(*fargs, **kwargs)
534
535 # Cell
~/anaconda3/envs/ab/lib/python3.7/site-packages/fastai/learner.py in _call_one(self, event_name)
135 def _call_one(self, event_name):
136 assert hasattr(event, event_name), event_name
--> 137 [cb(event_name) for cb in sort_by_run(self.cbs)]
138
139 def _bn_bias_state(self, with_bias): return norm_bias_params(self.model, with_bias).map(self.opt.state)
~/anaconda3/envs/ab/lib/python3.7/site-packages/fastai/learner.py in <listcomp>(.0)
135 def _call_one(self, event_name):
136 assert hasattr(event, event_name), event_name
--> 137 [cb(event_name) for cb in sort_by_run(self.cbs)]
138
139 def _bn_bias_state(self, with_bias): return norm_bias_params(self.model, with_bias).map(self.opt.state)
~/anaconda3/envs/ab/lib/python3.7/site-packages/fastai/callback/core.py in __call__(self, event_name)
42 (self.run_valid and not getattr(self, 'training', False)))
43 res = None
---> 44 if self.run and _run: res = getattr(self, event_name, noop)()
45 if event_name=='after_fit': self.run=True #Reset self.run to True at each end of fit
46 return res
<ipython-input-6-0fe932cc3fdd> in before_epoch(self)
4 print(learn.n)
5 if learn.n %6 == 0:
----> 6 t_dl,v_dl = q.get()
7 print('got')
8 learn.dls = DataLoaders(t_dl,v_dl).cuda()
~/anaconda3/envs/ab/lib/python3.7/multiprocessing/queues.py in get(self, block, timeout)
92 if block and timeout is None:
93 with self._rlock:
---> 94 res = self._recv_bytes()
95 self._sem.release()
96 else:
~/anaconda3/envs/ab/lib/python3.7/multiprocessing/connection.py in recv_bytes(self, maxlength)
214 if maxlength is not None and maxlength < 0:
215 raise ValueError("negative maxlength")
--> 216 buf = self._recv_bytes(maxlength)
217 if buf is None:
218 self._bad_message_length()
~/anaconda3/envs/ab/lib/python3.7/multiprocessing/connection.py in _recv_bytes(self, maxsize)
405
406 def _recv_bytes(self, maxsize=None):
--> 407 buf = self._recv(4)
408 size, = struct.unpack("!i", buf.getvalue())
409 if maxsize is not None and size > maxsize:
~/anaconda3/envs/ab/lib/python3.7/multiprocessing/connection.py in _recv(self, size, read)
377 remaining = size
378 while remaining > 0:
--> 379 chunk = read(handle, remaining)
380 n = len(chunk)
381 if n == 0:
KeyboardInterrupt:
Thanks @DmitriyG228, I confirm the bug. On a side note, I wonder if you could directly customize the Dataloaders getters to avoid your out-of-memory problem.
I updated your notebook a bit so that we can easily reproduce the issue: colab
As a summary: combining a multiprocessing.Queue with wandb.init makes the process hang. I'm wondering if there's a conflict with the monitoring of computer resources. We'll investigate. Thanks again for reporting it.
@borisdayma Can this be closed?
Yes, I think so
Hi, fastai now does not work, with this error:
They have recently released 2.0, which most probably causes problems with the callback.