wandb / examples

Example deep learning projects that use wandb's features.
http://wandb.ai

fastai integration does not work #53

Closed DmitriyG228 closed 2 years ago

DmitriyG228 commented 3 years ago

Hi, the fastai integration does not work; it fails with this error:


ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-8-d0cab4097689> in <module>
----> 1 import wandb.fastai

~/anaconda3/envs/ab/lib/python3.7/site-packages/wandb/fastai/__init__.py in <module>
      6 """
      7 
----> 8 from wandb.integration.fastai import WandbCallback
      9 
     10 __all__ = ['WandbCallback']

~/anaconda3/envs/ab/lib/python3.7/site-packages/wandb/integration/fastai/__init__.py in <module>
     38 import wandb
     39 import fastai
---> 40 from fastai.callbacks import TrackerCallback
     41 from pathlib import Path
     42 import random

ModuleNotFoundError: No module named 'fastai.callbacks'

fastai recently released 2.0, which most probably breaks the callback.

vanpelt commented 3 years ago

Hey @DmitriyG228, in fastai 2.0 the callback is now included with fastai itself.
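
For reference, a minimal sketch of the fastai 2.x usage (the project name and the MNIST sample dataset here are placeholders, not from this thread):

import wandb
from fastai.vision.all import *
from fastai.callback.wandb import WandbCallback  # bundled with fastai >= 2.0

wandb.init(project="fastai-demo")  # placeholder project name

path = untar_data(URLs.MNIST_SAMPLE)
dls = ImageDataLoaders.from_folder(path)
learn = cnn_learner(dls, resnet18, metrics=accuracy, cbs=WandbCallback())
learn.fit_one_cycle(1)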

DmitriyG228 commented 3 years ago

That's great. I had picked up an outdated tutorial while browsing links from the official site.

The other problem with WandbCallback is that it seems to conflict with the following custom callback:


class ReloadCallback(Callback):

    def before_epoch(self):
        print(learn.n)
        if learn.n % 6 == 0:
            t_dl, v_dl = q.get()  # fetch the dataloaders built by the parallel process
            print('got')
            learn.dls = DataLoaders(t_dl, v_dl).cuda()
            print('submitted')
            # kick off preparation of the next dataloaders in parallel
            learn.prep_dls_process = Process(target=prep_dls, args=(q,))
            learn.prep_dls_process.start()
        learn.n += 1

The above callback never starts its job in the presence of WandbCallback.

The following code starts with pre-built data loaders dls, which are used to initiate training, and launches a parallel process that prepares the next dls. ReloadCallback then manages dataloader preparation and submission throughout the training loop.

learn = Learner(dls, model, CrossEntropyLossFlat(), opt_func=opt_func, metrics=[accuracy]).to_fp32()
learn.n = 0
q = multiprocessing.Queue()  # carries freshly prepared (train_dl, valid_dl) pairs
learn.prep_dls_process = Process(target=prep_dls, args=(q,))
learn.prep_dls_process.start()  # start preparing dataloaders in parallel with training
learn.add_cb(ReloadCallback())
learn.add_cb(WandbCallback())

learn.fit_one_cycle(100, lr_max=lr)

The script is stuck on the last line forever in the presence of WandbCallback, and works as expected without it.

Any ideas on what I should change in my code to be able to use wandb with it?

Thank you!

vanpelt commented 3 years ago

@borisdayma is the original author of this callback, any ideas?

borisdayma commented 3 years ago

Hi @DmitriyG228

Can you let me know where you found the original example so we can make sure to tag it appropriately? For info, the best page for fastai documentation and examples is here.

Also, do you have a full example of the issue, or at least the full traceback? The one place where I can think of a potential issue is when WandbCallback tries to get sample predictions, but that may be fixed just by setting the order of the callbacks.
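
For context, fastai fires callbacks in ascending order of their order attribute, so forcing ReloadCallback to run first could look like this (a sketch of the ordering mechanism, not a confirmed fix for this issue):

from fastai.callback.core import Callback
from fastai.callback.wandb import WandbCallback

class ReloadCallback(Callback):
    # a lower order than WandbCallback means fastai calls this callback first,
    # so the new DataLoaders are already in place when WandbCallback runs
    order = WandbCallback.order - 1

    def before_epoch(self):
        ...  # swap the DataLoaders as in the snippet above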

You may also be interested in using the native Distributed DataLoaders as they have several optimizations under the hood.

DmitriyG228 commented 3 years ago

Hi @borisdayma, the above ReloadCallback is my own code; it supplies tabular data to the training loop in chunks, since the full dataset doesn't fit in memory. It prepares the dataloaders in parallel with the training, so I don't do any distributed or parallel learning with this code.

There is no traceback since there is no error as such; the script just waits forever on learn.fit_one_cycle, obviously due to some conflict between the two callbacks.

I can prepare a reproducible example tomorrow, if it makes sense, but I should probably refactor my code to avoid conflicts with well-established software.

borisdayma commented 3 years ago

No worries. Sometimes just by hitting Ctrl + C you can see where the script is hanging.

borisdayma commented 3 years ago

Also try to turn off certain features: WandbCallback(log=None, log_preds=False, log_model=False, log_dataset=False)

DmitriyG228 commented 3 years ago

Hi, I have prepared a reproducible example, and it actually has nothing to do with the callbacks themselves: some problem occurs with multiprocessing once wandb is initialized. Please run the following Colab notebook.

I would mention again that my callback is used to feed chunked tabular data into the training loop while preparing the dls for the next epoch in parallel.

The full traceback:

---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-14-38cc6af58017> in <module>
----> 1 learn.fit_one_cycle(100)

~/anaconda3/envs/ab/lib/python3.7/site-packages/fastcore/logargs.py in _f(*args, **kwargs)
     54         init_args.update(log)
     55         setattr(inst, 'init_args', init_args)
---> 56         return inst if to_return else f(*args, **kwargs)
     57     return _f

~/anaconda3/envs/ab/lib/python3.7/site-packages/fastai/callback/schedule.py in fit_one_cycle(self, n_epoch, lr_max, div, div_final, pct_start, wd, moms, cbs, reset_opt)
    111     scheds = {'lr': combined_cos(pct_start, lr_max/div, lr_max, lr_max/div_final),
    112               'mom': combined_cos(pct_start, *(self.moms if moms is None else moms))}
--> 113     self.fit(n_epoch, cbs=ParamScheduler(scheds)+L(cbs), reset_opt=reset_opt, wd=wd)
    114 
    115 # Cell

~/anaconda3/envs/ab/lib/python3.7/site-packages/fastcore/logargs.py in _f(*args, **kwargs)
     54         init_args.update(log)
     55         setattr(inst, 'init_args', init_args)
---> 56         return inst if to_return else f(*args, **kwargs)
     57     return _f

~/anaconda3/envs/ab/lib/python3.7/site-packages/fastai/learner.py in fit(self, n_epoch, lr, wd, cbs, reset_opt)
    205             self.opt.set_hypers(lr=self.lr if lr is None else lr)
    206             self.n_epoch = n_epoch
--> 207             self._with_events(self._do_fit, 'fit', CancelFitException, self._end_cleanup)
    208 
    209     def _end_cleanup(self): self.dl,self.xb,self.yb,self.pred,self.loss = None,(None,),(None,),None,None

~/anaconda3/envs/ab/lib/python3.7/site-packages/fastai/learner.py in _with_events(self, f, event_type, ex, final)
    153 
    154     def _with_events(self, f, event_type, ex, final=noop):
--> 155         try:       self(f'before_{event_type}')       ;f()
    156         except ex: self(f'after_cancel_{event_type}')
    157         finally:   self(f'after_{event_type}')        ;final()

~/anaconda3/envs/ab/lib/python3.7/site-packages/fastai/learner.py in _do_fit(self)
    195         for epoch in range(self.n_epoch):
    196             self.epoch=epoch
--> 197             self._with_events(self._do_epoch, 'epoch', CancelEpochException)
    198 
    199     @log_args(but='cbs')

~/anaconda3/envs/ab/lib/python3.7/site-packages/fastai/learner.py in _with_events(self, f, event_type, ex, final)
    153 
    154     def _with_events(self, f, event_type, ex, final=noop):
--> 155         try:       self(f'before_{event_type}')       ;f()
    156         except ex: self(f'after_cancel_{event_type}')
    157         finally:   self(f'after_{event_type}')        ;final()

~/anaconda3/envs/ab/lib/python3.7/site-packages/fastai/learner.py in __call__(self, event_name)
    131     def ordered_cbs(self, event): return [cb for cb in sort_by_run(self.cbs) if hasattr(cb, event)]
    132 
--> 133     def __call__(self, event_name): L(event_name).map(self._call_one)
    134 
    135     def _call_one(self, event_name):

~/anaconda3/envs/ab/lib/python3.7/site-packages/fastcore/foundation.py in map(self, f, gen, *args, **kwargs)
    224     def range(cls, a, b=None, step=None): return cls(range_of(a, b=b, step=step))
    225 
--> 226     def map(self, f, *args, gen=False, **kwargs): return self._new(map_ex(self, f, *args, gen=gen, **kwargs))
    227     def argwhere(self, f, negate=False, **kwargs): return self._new(argwhere(self, f, negate, **kwargs))
    228     def filter(self, f=noop, negate=False, gen=False, **kwargs):

~/anaconda3/envs/ab/lib/python3.7/site-packages/fastcore/basics.py in map_ex(iterable, f, gen, *args, **kwargs)
    541     res = map(g, iterable)
    542     if gen: return res
--> 543     return list(res)
    544 
    545 # Cell

~/anaconda3/envs/ab/lib/python3.7/site-packages/fastcore/basics.py in __call__(self, *args, **kwargs)
    531             if isinstance(v,_Arg): kwargs[k] = args.pop(v.i)
    532         fargs = [args[x.i] if isinstance(x, _Arg) else x for x in self.pargs] + args[self.maxi+1:]
--> 533         return self.func(*fargs, **kwargs)
    534 
    535 # Cell

~/anaconda3/envs/ab/lib/python3.7/site-packages/fastai/learner.py in _call_one(self, event_name)
    135     def _call_one(self, event_name):
    136         assert hasattr(event, event_name), event_name
--> 137         [cb(event_name) for cb in sort_by_run(self.cbs)]
    138 
    139     def _bn_bias_state(self, with_bias): return norm_bias_params(self.model, with_bias).map(self.opt.state)

~/anaconda3/envs/ab/lib/python3.7/site-packages/fastai/learner.py in <listcomp>(.0)
    135     def _call_one(self, event_name):
    136         assert hasattr(event, event_name), event_name
--> 137         [cb(event_name) for cb in sort_by_run(self.cbs)]
    138 
    139     def _bn_bias_state(self, with_bias): return norm_bias_params(self.model, with_bias).map(self.opt.state)

~/anaconda3/envs/ab/lib/python3.7/site-packages/fastai/callback/core.py in __call__(self, event_name)
     42                (self.run_valid and not getattr(self, 'training', False)))
     43         res = None
---> 44         if self.run and _run: res = getattr(self, event_name, noop)()
     45         if event_name=='after_fit': self.run=True #Reset self.run to True at each end of fit
     46         return res

<ipython-input-6-0fe932cc3fdd> in before_epoch(self)
      4         print(learn.n)
      5         if learn.n %6 == 0:
----> 6             t_dl,v_dl = q.get()
      7             print('got')
      8             learn.dls =  DataLoaders(t_dl,v_dl).cuda()

~/anaconda3/envs/ab/lib/python3.7/multiprocessing/queues.py in get(self, block, timeout)
     92         if block and timeout is None:
     93             with self._rlock:
---> 94                 res = self._recv_bytes()
     95             self._sem.release()
     96         else:

~/anaconda3/envs/ab/lib/python3.7/multiprocessing/connection.py in recv_bytes(self, maxlength)
    214         if maxlength is not None and maxlength < 0:
    215             raise ValueError("negative maxlength")
--> 216         buf = self._recv_bytes(maxlength)
    217         if buf is None:
    218             self._bad_message_length()

~/anaconda3/envs/ab/lib/python3.7/multiprocessing/connection.py in _recv_bytes(self, maxsize)
    405 
    406     def _recv_bytes(self, maxsize=None):
--> 407         buf = self._recv(4)
    408         size, = struct.unpack("!i", buf.getvalue())
    409         if maxsize is not None and size > maxsize:

~/anaconda3/envs/ab/lib/python3.7/multiprocessing/connection.py in _recv(self, size, read)
    377         remaining = size
    378         while remaining > 0:
--> 379             chunk = read(handle, remaining)
    380             n = len(chunk)
    381             if n == 0:

KeyboardInterrupt: 

borisdayma commented 3 years ago

Thanks @DmitriyG228, I can confirm the bug. On a side note, I wonder if you could directly customize the DataLoaders getters to avoid your out-of-memory problem.
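
On that side note, here is a minimal sketch of the idea using a plain PyTorch IterableDataset that streams the table in chunks, so nothing has to be reloaded mid-training (the file name, chunk size, and target column are placeholder assumptions):

import pandas as pd
import torch
from torch.utils.data import DataLoader, IterableDataset

class ChunkedCSV(IterableDataset):
    "Stream a large CSV in fixed-size chunks so the full table never sits in memory."
    def __init__(self, path, chunksize=10_000):
        self.path, self.chunksize = path, chunksize

    def __iter__(self):
        for chunk in pd.read_csv(self.path, chunksize=self.chunksize):
            xs = torch.tensor(chunk.drop(columns="target").values, dtype=torch.float32)
            ys = torch.tensor(chunk["target"].values, dtype=torch.long)
            yield from zip(xs, ys)  # one (features, label) pair at a time

dl = DataLoader(ChunkedCSV("data.csv"), batch_size=64)  # "data.csv" is hypothetical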

I updated your notebook a bit so that we can easily reproduce the issue: colab

As a summary: even without WandbCallback, the script hangs on q.get() as soon as wandb is initialized.

I'm wondering if there's a conflict with the monitoring of computer resources. We'll investigate. Thanks again for reporting it.
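
In the meantime, two things might be worth trying; both are suggestions rather than confirmed fixes (WANDB_DISABLE_STATS is wandb's documented switch for resource monitoring, and 'spawn' avoids fork-inheriting wandb's background threads):

import os
import multiprocessing

# Suggestion: disable wandb's system-resource monitoring before wandb.init(),
# in case the stats collection is what conflicts with the worker process.
os.environ["WANDB_DISABLE_STATS"] = "true"

# Suggestion: 'spawn' starts workers fresh instead of fork-inheriting wandb's
# background threads and locks, which is a common cause of queue deadlocks.
if __name__ == "__main__":
    multiprocessing.set_start_method("spawn", force=True)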

scottire commented 2 years ago

@borisdayma Can this be closed?

borisdayma commented 2 years ago

Yes, I think so