timeseriesAI / tsai

State-of-the-art Deep Learning library for Time Series and Sequences in Pytorch / fastai
https://timeseriesai.github.io/tsai/
Apache License 2.0

Serializing Time Series Forecasts #258

Closed DonRomaniello closed 2 years ago

DonRomaniello commented 2 years ago

I've been getting great results on forecasting a multistep horizon for a multivariate time series, but am having a lot of trouble exporting or saving the model to use on other machines or even in the same Jupyter Notebook.

I create the learner with ts_learner and train it, but when I use learner.save or learner.export, the imported model doesn't produce the same predictions.

Any help would be appreciated.

vrodriguezf commented 2 years ago

How are you doing the inference?

DonRomaniello commented 2 years ago

Thank you for responding, and I hope this is what you're looking for:

I have a pandas dataset, and I feed sixty time steps and ask for the next 30 for one column. Here is the code:

columnNumber = 409

get_y = docks.columns[columnNumber]

lookAhead = 30

window = 60

X, y = SlidingWindow(window, stride=None, horizon=lookAhead, seq_first=True, get_y=get_y)(docks)

validationPercent = .3

splits = get_splits(y, valid_size=validationPercent, stratify=False, random_state=42, shuffle=True)

tfms  = [None, [TSRegression()]]
batch_tfms = TSStandardize(by_sample=True, by_var=True)
dls = get_ts_dls(X,
                 y,
                 splits=splits,
                 tfms=tfms,
                 batch_tfms=batch_tfms,
                 bs=[int(np.round((len(X) / 4))), int(np.round(((len(X) / 4) * validationPercent)))])

learn = ts_learner(dls, arch=TCN, metrics=[mae, rmse], cbs=[ShowGraph()])
learningRate = learn.lr_find()[0]

learn.fit_one_cycle(300, learningRate)

The predictions are really great, but getting them ready for the road is the challenge I'm having. I've tried different architectures, to no avail.
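For readers following along, the windowing step in the code above can be pictured with a small NumPy sketch. This is not tsai's SlidingWindow implementation: the helper name `sliding_windows` and its `target_col` parameter are hypothetical, and the sketch assumes that `stride=None` means non-overlapping windows (stride equal to the window length). It only illustrates the shapes involved: inputs of shape (samples, features, window) and targets of shape (samples, horizon).

```python
import numpy as np

def sliding_windows(data, window, horizon, stride=None, target_col=-1):
    """Split a (time, features) array into input windows and multistep targets.

    Illustrative only. Returns X with shape (samples, features, window)
    and y with shape (samples, horizon).
    """
    # Assumption: stride=None means the windows do not overlap.
    stride = window if stride is None else stride
    X, y = [], []
    last_start = len(data) - window - horizon
    for start in range(0, last_start + 1, stride):
        end = start + window
        X.append(data[start:end].T)                    # (features, window)
        y.append(data[end:end + horizon, target_col])  # next `horizon` steps of one column
    return np.stack(X), np.stack(y)

# Toy data: 200 time steps, 3 features (values are arbitrary).
data = np.arange(600, dtype=np.float64).reshape(200, 3)
X, y = sliding_windows(data, window=60, horizon=30, stride=1, target_col=2)
print(X.shape, y.shape)  # (111, 3, 60) (111, 30)
```

Each target row starts at the time step immediately after its input window ends, which is what "feed sixty time steps and ask for the next 30" amounts to.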

vrodriguezf commented 2 years ago

That code is for the training. What about the inference?

DonRomaniello commented 2 years ago

Ah, yes of course.

PATH = Path('./models/Regression.pkl')
PATH.parent.mkdir(parents=True, exist_ok=True)
learn.export(PATH)

del learn

PATH = Path('./models/Regression.pkl')
learn = load_learner(PATH, cpu=True)

a, b, c = learn.get_X_preds(X)

vrodriguezf commented 2 years ago

so, if you do learn.get_X_preds(X) before exporting/importing, you have different results than doing the same after exporting/importing?

oguiza commented 2 years ago

Hi @DonRomaniello, I've created a quick test and the code seems to work well. Here's the snippet I've created:

X, y, splits = get_regression_data('Covid3Month', split_data=False)
y_multistep = y.reshape(-1,1).repeat(3, 1) # repeat steps to simulate a 3 step forecast
tfms  = [None, TSRegression()]
batch_tfms = TSStandardize(by_sample=True, by_var=True)
dls = get_ts_dls(X, y_multistep, splits=splits, tfms=tfms, batch_tfms=batch_tfms)
learn = ts_learner(dls, arch=TCN, metrics=[mae, rmse], cbs=[ShowGraph()])
learn.fit_one_cycle(2)
p, *_ = learn.get_X_preds(X)
print(p.shape)

torch.Size([201, 3]) # output

PATH = Path('./models/Regression.pkl')
PATH.parent.mkdir(parents=True, exist_ok=True)
learn.export(PATH)
del learn
PATH = Path('./models/Regression.pkl')
learn = load_learner(PATH, cpu=True)
p2, *_ = learn.get_X_preds(X)
print(p2.shape)
torch.equal(p, p2)

torch.Size([201, 3]) # output
True

I'm not sure if you are following a different process, but this is working well.

DonRomaniello commented 2 years ago

so, if you do learn.get_X_preds(X) before exporting/importing, you have different results than doing the same after exporting/importing?

@vrodriguezf, correct.

I'm not sure if you are following a different process, but this is working well.

@oguiza

The code you shared works for the example you provided. However, when I applied it to my code, it didn't have the same effect.

oguiza commented 2 years ago

It’d be good if you can find the difference between your code and the one I shared. I’m not sure where the issue is coming from.

DonRomaniello commented 2 years ago

@oguiza

The only thing I can think of is that I am using SlidingWindow and get_splits, but if the dataloaders stay the same shouldn't the model have similar predictions?

oguiza commented 2 years ago

That shouldn’t have an impact on the saved learner. Could you please use check_data(X, y, splits) and share the output?

DonRomaniello commented 2 years ago

Sure thing:

X      - shape: [718 samples x 1582 features x 60 timesteps]  type: ndarray  dtype:float64  isnan: 0
y      - shape: (718, 30)  type: ndarray  dtype:float64  isnan: 0
splits - n_splits: 2 shape: [503, 215]  overlap: [False]

oguiza commented 2 years ago

I don’t see anything strange. I’m sorry but I don’t know how to help.

DonRomaniello commented 2 years ago

Thank you, and to be honest I am relieved that I wasn't missing something.

I'll try manually creating the sliding windows and see if that does anything.

DonRomaniello commented 2 years ago

Actually, I wonder if this sheds any light:

When I load the model and then try to fit_one_cycle, I get this:

ZeroDivisionError                         Traceback (most recent call last)
/tmp/ipykernel_1923/2899536512.py in <module>
----> 1 learn.fit_one_cycle(10)
      2 beep()

~/.local/lib/python3.8/site-packages/fastai/callback/schedule.py in fit_one_cycle(self, n_epoch, lr_max, div, div_final, pct_start, wd, moms, cbs, reset_opt)
    114     scheds = {'lr': combined_cos(pct_start, lr_max/div, lr_max, lr_max/div_final),
    115               'mom': combined_cos(pct_start, *(self.moms if moms is None else moms))}
--> 116     self.fit(n_epoch, cbs=ParamScheduler(scheds)+L(cbs), reset_opt=reset_opt, wd=wd)
    117 
    118 # Cell

~/.local/lib/python3.8/site-packages/fastai/learner.py in fit(self, n_epoch, lr, wd, cbs, reset_opt)
    219             self.opt.set_hypers(lr=self.lr if lr is None else lr)
    220             self.n_epoch = n_epoch
--> 221             self._with_events(self._do_fit, 'fit', CancelFitException, self._end_cleanup)
    222 
    223     def _end_cleanup(self): self.dl,self.xb,self.yb,self.pred,self.loss = None,(None,),(None,),None,None

~/.local/lib/python3.8/site-packages/fastai/learner.py in _with_events(self, f, event_type, ex, final)
    161 
    162     def _with_events(self, f, event_type, ex, final=noop):
--> 163         try: self(f'before_{event_type}');  f()
    164         except ex: self(f'after_cancel_{event_type}')
    165         self(f'after_{event_type}');  final()

~/.local/lib/python3.8/site-packages/fastai/learner.py in _do_fit(self)
    210         for epoch in range(self.n_epoch):
    211             self.epoch=epoch
--> 212             self._with_events(self._do_epoch, 'epoch', CancelEpochException)
    213 
    214     def fit(self, n_epoch, lr=None, wd=None, cbs=None, reset_opt=False):

~/.local/lib/python3.8/site-packages/fastai/learner.py in _with_events(self, f, event_type, ex, final)
    161 
    162     def _with_events(self, f, event_type, ex, final=noop):
--> 163         try: self(f'before_{event_type}');  f()
    164         except ex: self(f'after_cancel_{event_type}')
    165         self(f'after_{event_type}');  final()

~/.local/lib/python3.8/site-packages/fastai/learner.py in _do_epoch(self)
    204 
    205     def _do_epoch(self):
--> 206         self._do_epoch_train()
    207         self._do_epoch_validate()
    208 

~/.local/lib/python3.8/site-packages/fastai/learner.py in _do_epoch_train(self)
    196     def _do_epoch_train(self):
    197         self.dl = self.dls.train
--> 198         self._with_events(self.all_batches, 'train', CancelTrainException)
    199 
    200     def _do_epoch_validate(self, ds_idx=1, dl=None):

~/.local/lib/python3.8/site-packages/fastai/learner.py in _with_events(self, f, event_type, ex, final)
    161 
    162     def _with_events(self, f, event_type, ex, final=noop):
--> 163         try: self(f'before_{event_type}');  f()
    164         except ex: self(f'after_cancel_{event_type}')
    165         self(f'after_{event_type}');  final()

~/.local/lib/python3.8/site-packages/fastai/learner.py in __call__(self, event_name)
    139 
    140     def ordered_cbs(self, event): return [cb for cb in self.cbs.sorted('order') if hasattr(cb, event)]
--> 141     def __call__(self, event_name): L(event_name).map(self._call_one)
    142 
    143     def _call_one(self, event_name):

~/.local/lib/python3.8/site-packages/fastcore/foundation.py in map(self, f, gen, *args, **kwargs)
    153     def range(cls, a, b=None, step=None): return cls(range_of(a, b=b, step=step))
    154 
--> 155     def map(self, f, *args, gen=False, **kwargs): return self._new(map_ex(self, f, *args, gen=gen, **kwargs))
    156     def argwhere(self, f, negate=False, **kwargs): return self._new(argwhere(self, f, negate, **kwargs))
    157     def argfirst(self, f, negate=False): return first(i for i,o in self.enumerate() if f(o))

~/.local/lib/python3.8/site-packages/fastcore/basics.py in map_ex(iterable, f, gen, *args, **kwargs)
    696     res = map(g, iterable)
    697     if gen: return res
--> 698     return list(res)
    699 
    700 # Cell

~/.local/lib/python3.8/site-packages/fastcore/basics.py in __call__(self, *args, **kwargs)
    681             if isinstance(v,_Arg): kwargs[k] = args.pop(v.i)
    682         fargs = [args[x.i] if isinstance(x, _Arg) else x for x in self.pargs] + args[self.maxi+1:]
--> 683         return self.func(*fargs, **kwargs)
    684 
    685 # Cell

~/.local/lib/python3.8/site-packages/fastai/learner.py in _call_one(self, event_name)
    143     def _call_one(self, event_name):
    144         if not hasattr(event, event_name): raise Exception(f'missing {event_name}')
--> 145         for cb in self.cbs.sorted('order'): cb(event_name)
    146 
    147     def _bn_bias_state(self, with_bias): return norm_bias_params(self.model, with_bias).map(self.opt.state)

~/.local/lib/python3.8/site-packages/fastai/callback/core.py in __call__(self, event_name)
     43                (self.run_valid and not getattr(self, 'training', False)))
     44         res = None
---> 45         if self.run and _run: res = getattr(self, event_name, noop)()
     46         if event_name=='after_fit': self.run=True #Reset self.run to True at each end of fit
     47         return res

~/.local/lib/python3.8/site-packages/fastai/callback/progress.py in before_train(self)
     23         if getattr(self, 'mbar', False): self.mbar.update(self.epoch)
     24 
---> 25     def before_train(self):    self._launch_pbar()
     26     def before_validate(self): self._launch_pbar()
     27     def after_train(self):     self.pbar.on_iter_end()

~/.local/lib/python3.8/site-packages/fastai/callback/progress.py in _launch_pbar(self)
     32 
     33     def _launch_pbar(self):
---> 34         self.pbar = progress_bar(self.dl, parent=getattr(self, 'mbar', None), leave=False)
     35         self.pbar.update(0)
     36 

~/.local/lib/python3.8/site-packages/fastprogress/fastprogress.py in __init__(self, gen, total, display, leave, parent, master, comment)
     17     def __init__(self, gen, total=None, display=True, leave=True, parent=None, master=None, comment=''):
     18         self.gen,self.parent,self.master,self.comment = gen,parent,master,comment
---> 19         self.total = len(gen) if total is None else total
     20         self.last_v = 0
     21         if parent is None: self.leave,self.display = leave,display

~/.local/lib/python3.8/site-packages/fastai/data/load.py in __len__(self)
     92         if self.n is None: raise TypeError
     93         if self.bs is None: return self.n
---> 94         return self.n//self.bs + (0 if self.drop_last or self.n%self.bs==0 else 1)
     95 
     96     def get_idxs(self):

ZeroDivisionError: integer division or modulo by zero

vrodriguezf commented 2 years ago

It seems that the batch size has been lost somehow. Try setting it manually (learn.dls.train.bs and learn.dls.valid.bs) and see if that helps
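The last frame of the traceback shows where the division by zero comes from: the dataloader length is computed as n//bs plus a remainder term, so it fails whenever bs is 0. The sketch below reproduces only that expression in plain Python. The function name `n_batches` is made up for illustration, the 503 training samples come from the check_data output earlier in the thread, and the batch size of 126 is arbitrary.

```python
def n_batches(n, bs, drop_last=False):
    """Number of batches, following the expression in the final traceback frame."""
    return n // bs + (0 if drop_last or n % bs == 0 else 1)

print(n_batches(503, 126))                  # 4: three full batches plus one partial batch
print(n_batches(503, 126, drop_last=True))  # 3: the partial batch is dropped

try:
    n_batches(503, 0)  # a batch size of 0 reproduces the failure
except ZeroDivisionError:
    print("bs=0 raises ZeroDivisionError")
```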

oguiza commented 2 years ago

When you save or export a Learner object the dataset is not serialized. That's why you can't train it further. To do it you'd need to recreate the dataloaders.

I'm curious: when you say the predictions are different, what do you mean? Are they still created, but with different values? Could you please re-run the code I sent you before with your X, y and splits and share the output?

DonRomaniello commented 2 years ago

@vrodriguezf

<It seems that the batch size has been lost somehow. Try setting it manually (learn.dls.train.bs and learn.dls.valid.bs) and see if that helps>

It helped push the problem down the road a little...

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_2592/2743683231.py in <module>
----> 1 learn.fit_one_cycle(100)
      2 beep()

~/.local/lib/python3.8/site-packages/fastai/callback/schedule.py in fit_one_cycle(self, n_epoch, lr_max, div, div_final, pct_start, wd, moms, cbs, reset_opt)
    114     scheds = {'lr': combined_cos(pct_start, lr_max/div, lr_max, lr_max/div_final),
    115               'mom': combined_cos(pct_start, *(self.moms if moms is None else moms))}
--> 116     self.fit(n_epoch, cbs=ParamScheduler(scheds)+L(cbs), reset_opt=reset_opt, wd=wd)
    117 
    118 # Cell

~/.local/lib/python3.8/site-packages/fastai/learner.py in fit(self, n_epoch, lr, wd, cbs, reset_opt)
    219             self.opt.set_hypers(lr=self.lr if lr is None else lr)
    220             self.n_epoch = n_epoch
--> 221             self._with_events(self._do_fit, 'fit', CancelFitException, self._end_cleanup)
    222 
    223     def _end_cleanup(self): self.dl,self.xb,self.yb,self.pred,self.loss = None,(None,),(None,),None,None

~/.local/lib/python3.8/site-packages/fastai/learner.py in _with_events(self, f, event_type, ex, final)
    161 
    162     def _with_events(self, f, event_type, ex, final=noop):
--> 163         try: self(f'before_{event_type}');  f()
    164         except ex: self(f'after_cancel_{event_type}')
    165         self(f'after_{event_type}');  final()

~/.local/lib/python3.8/site-packages/fastai/learner.py in _do_fit(self)
    210         for epoch in range(self.n_epoch):
    211             self.epoch=epoch
--> 212             self._with_events(self._do_epoch, 'epoch', CancelEpochException)
    213 
    214     def fit(self, n_epoch, lr=None, wd=None, cbs=None, reset_opt=False):

~/.local/lib/python3.8/site-packages/fastai/learner.py in _with_events(self, f, event_type, ex, final)
    163         try: self(f'before_{event_type}');  f()
    164         except ex: self(f'after_cancel_{event_type}')
--> 165         self(f'after_{event_type}');  final()
    166 
    167     def all_batches(self):

~/.local/lib/python3.8/site-packages/fastai/learner.py in __call__(self, event_name)
    139 
    140     def ordered_cbs(self, event): return [cb for cb in self.cbs.sorted('order') if hasattr(cb, event)]
--> 141     def __call__(self, event_name): L(event_name).map(self._call_one)
    142 
    143     def _call_one(self, event_name):

~/.local/lib/python3.8/site-packages/fastcore/foundation.py in map(self, f, gen, *args, **kwargs)
    153     def range(cls, a, b=None, step=None): return cls(range_of(a, b=b, step=step))
    154 
--> 155     def map(self, f, *args, gen=False, **kwargs): return self._new(map_ex(self, f, *args, gen=gen, **kwargs))
    156     def argwhere(self, f, negate=False, **kwargs): return self._new(argwhere(self, f, negate, **kwargs))
    157     def argfirst(self, f, negate=False): return first(i for i,o in self.enumerate() if f(o))

~/.local/lib/python3.8/site-packages/fastcore/basics.py in map_ex(iterable, f, gen, *args, **kwargs)
    696     res = map(g, iterable)
    697     if gen: return res
--> 698     return list(res)
    699 
    700 # Cell

~/.local/lib/python3.8/site-packages/fastcore/basics.py in __call__(self, *args, **kwargs)
    681             if isinstance(v,_Arg): kwargs[k] = args.pop(v.i)
    682         fargs = [args[x.i] if isinstance(x, _Arg) else x for x in self.pargs] + args[self.maxi+1:]
--> 683         return self.func(*fargs, **kwargs)
    684 
    685 # Cell

~/.local/lib/python3.8/site-packages/fastai/learner.py in _call_one(self, event_name)
    143     def _call_one(self, event_name):
    144         if not hasattr(event, event_name): raise Exception(f'missing {event_name}')
--> 145         for cb in self.cbs.sorted('order'): cb(event_name)
    146 
    147     def _bn_bias_state(self, with_bias): return norm_bias_params(self.model, with_bias).map(self.opt.state)

~/.local/lib/python3.8/site-packages/fastai/callback/core.py in __call__(self, event_name)
     43                (self.run_valid and not getattr(self, 'training', False)))
     44         res = None
---> 45         if self.run and _run: res = getattr(self, event_name, noop)()
     46         if event_name=='after_fit': self.run=True #Reset self.run to True at each end of fit
     47         return res

~/.local/lib/python3.8/site-packages/tsai/callback/core.py in after_epoch(self)
     88         x_bounds = (0, len(rec.losses))
     89         if self.epoch == 0:
---> 90             y_min = min((min(rec.losses), min(val_losses)))
     91             y_max = max((max(rec.losses), max(val_losses)))
     92         else:

ValueError: min() arg is an empty sequence

@oguiza <When you save or export a Learner object the dataset is not serialized. That's why you can't train it further. To do it you'd need to recreate the dataloaders.>

I've tried recreating the learner and then simply replacing the model with

learn2 = load_learner(PATH, cpu=True)
learn.model = learn2.model

but the predictions have different values from the same input data.

I've rerun the code you provided on my data, here is the output:

p, *_ = learn.get_X_preds(X)
print(p.shape)

torch.Size([479, 30])

PATH = Path('./models/Regression.pkl')
PATH.parent.mkdir(parents=True, exist_ok=True)
learn.export(PATH)
del learn

PATH = Path('./models/Regression.pkl')
learn = load_learner(PATH, cpu=True)
p2, *_ = learn.get_X_preds(X)
print(p2.shape)
torch.equal(p, p2)

torch.Size([479, 30])
False

Same problem even when manually setting the dataloaders with @vrodriguezf's method.

Seriously, thank you for helping with this.

oguiza commented 2 years ago

I'm afraid I'm unable to help. It'd be good if you can recreate the issue with some dummy data.

DonRomaniello commented 2 years ago

@oguiza

Same problem with dummy data:

emptyArray1 = []

# Use `start` as the loop variable so the built-in `range` is not shadowed
for start in np.arange(0, .9, .001):
  emptyArray1.append(np.arange(start, (start + .1), .0001))

dummy = pd.DataFrame(emptyArray1)

dummy[(len(dummy.columns) - 1)] = dummy.mean(axis=1)

columnNumber = (len(dummy.columns) - 1)
get_y = dummy.columns[columnNumber]
lookAhead = 5
window = 10
X, y = SlidingWindow(window, stride=(lookAhead + window), horizon=lookAhead, seq_first=True, get_y=get_y)(dummy)

validationPercent = .3
splits = get_splits(y, valid_size=validationPercent, stratify=False, random_state=42, shuffle=False)
check_data(X, y, splits)

tfms  = [None, [TSRegression()]]
batch_tfms = TSStandardize(by_sample=True, by_var=True)

train_batch = int(np.round((len(X))))

valid_batch = int(np.round(((len(X)) * validationPercent)))

dls = get_ts_dls(X,
                 y,
                 splits=splits,
                 tfms=tfms,
                 batch_tfms=batch_tfms,
                 bs=[train_batch, valid_batch])

optimizer = Adam

learn = ts_learner(dls, arch=InceptionTimePlus, metrics=[mae, rmse], cbs=[ShowGraph()], opt_func=optimizer)
learningRate = learn.lr_find()[0]

learn.fit_one_cycle(200, learningRate)

p, *_ = learn.get_X_preds(X)
print(p.shape, skm.mean_squared_error(y, p, squared=False))

torch.Size([60, 5]) 0.2932287153601242

PATH = Path('./models/Regression.pkl')
PATH.parent.mkdir(parents=True, exist_ok=True)
learn.export(PATH)
del learn

PATH = Path('./models/Regression.pkl')
learn = load_learner(PATH, cpu=True)
p2, *_ = learn.get_X_preds(X)
print(p2.shape, skm.mean_squared_error(y, p2, squared=False))
torch.equal(p, p2)

torch.Size([60, 5]) 0.2932296804365857
False

Although the RMSE is much closer than with my data.

oguiza commented 2 years ago

Ok, I've tried it and while it's true that there's a difference between the predictions, it's minor. I did this:

torch.max(p - p2)

and the max diff is tensor(1.2338e-05). I don't know where this comes from. Sometimes this is due to conversion between types.

Edit: I've also tried it with the data and code I sent you before and the difference torch.max(torch.abs(p - p2)) = tensor(2.9802e-08)
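Since torch.equal demands bit-for-bit identity, a tolerance-based comparison is often more informative in a case like this. The sketch below uses NumPy stand-ins for p and p2 (the values are invented for illustration, not taken from this thread) and a hypothetical helper `compare_predictions`. It also shows why the absolute value matters: the maximum of the signed difference can understate the largest deviation.

```python
import numpy as np

def compare_predictions(p, p2, atol=1e-5):
    """Summarize how far two prediction arrays are from each other."""
    p = np.asarray(p, dtype=np.float64)
    p2 = np.asarray(p2, dtype=np.float64)
    return {
        "identical": bool(np.array_equal(p, p2)),           # bit-for-bit equality
        "max_abs_diff": float(np.max(np.abs(p - p2))),      # largest deviation, either sign
        "close": bool(np.allclose(p, p2, rtol=0.0, atol=atol)),
    }

# Invented stand-ins for predictions before and after export/import.
p = np.array([[0.10, 0.20], [0.30, 0.40]])
p2 = p + np.array([[0.0, 3e-6], [-1e-6, 0.0]])

result = compare_predictions(p, p2)
print(result)
print("signed max:", float(np.max(p - p2)))  # smaller than max_abs_diff in this example
```

With real predictions, torch.allclose(p, p2, atol=1e-5) performs a similar check directly on tensors.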

oguiza commented 2 years ago

I've found the root cause. There is a difference because the learner initially creates the predictions on the GPU. When you load the model with cpu=True, it creates them on the CPU. If you change it to cpu=False, then there's no difference. There must be a difference in how PyTorch computes on the CPU and on the GPU (CUDA).

DonRomaniello commented 2 years ago

@oguiza

I've found the root cause.

OK, so it sounds like if I want to deploy this on a CPU, I have to train it on a CPU?

DonRomaniello commented 2 years ago

Well... I did a little test, and am not sure if the GPU to CPU change is the issue:

p, *_ = learn.get_X_preds(X)
print(p.shape, skm.mean_squared_error(y, p, squared=False))

torch.Size([2999, 5]) 0.256418886837684

learn.model = learn.model.cpu()
p2, *_ = learn.get_X_preds(X)
print(p2.shape, skm.mean_squared_error(y, p2, squared=False))
torch.equal(p, p2)

torch.Size([2999, 5]) 0.256418886837684
True

PATH = Path('./models/Regression.pkl')
PATH.parent.mkdir(parents=True, exist_ok=True)
learn.export(PATH)
del learn
PATH = Path('./models/Regression.pkl')
learn = load_learner(PATH, cpu=True)
p3, *_ = learn.get_X_preds(X)
print(p3.shape, skm.mean_squared_error(y, p3, squared=False))
torch.equal(p2, p3)

torch.Size([2999, 5]) 0.2570435682650654
False

Obviously the difference is very small, but it is interesting that the issue seems to happen somewhere in here:

PATH = Path('./models/Regression.pkl')
PATH.parent.mkdir(parents=True, exist_ok=True)
learn.export(PATH)
del learn
PATH = Path('./models/Regression.pkl')
learn = load_learner(PATH, cpu=True)

Edit: Hold on, are the dataloaders also on the GPU?
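One plausible reason the dataloaders could matter (an assumption, not something confirmed in this thread) is that batch transforms such as TSStandardize run on whatever device the dataloaders use, and standardization relies on mean and standard-deviation reductions that are sensitive to precision. The NumPy sketch below is not tsai's implementation. The helper name `standardize_by_sample_and_var` and the `eps` value are hypothetical. It standardizes each (sample, variable) series over its time axis and compares the result in FP32 against FP64.

```python
import numpy as np

def standardize_by_sample_and_var(x, eps=1e-8):
    """Standardize each (sample, variable) series over its time axis.

    x has shape (samples, variables, timesteps). The eps value is an assumption.
    """
    mean = x.mean(axis=2, keepdims=True)
    std = x.std(axis=2, keepdims=True)
    return (x - mean) / (std + eps)

# Random stand-in data: 4 samples, 3 variables, 60 time steps.
rng = np.random.default_rng(0)
x64 = rng.normal(loc=50.0, scale=5.0, size=(4, 3, 60))
x32 = x64.astype(np.float32)

z64 = standardize_by_sample_and_var(x64)
z32 = standardize_by_sample_and_var(x32)

# Same data, same formula, different precision: the results agree only approximately.
print("max abs difference FP32 vs FP64:", float(np.max(np.abs(z64 - z32))))
```

Any small discrepancy introduced at this stage is then fed through the whole network, so it can show up in the final predictions.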

DonRomaniello commented 2 years ago

I've isolated the issue: it's the dataloaders going from GPU to CPU.

p, *_ = learn.get_X_preds(X)
print(p.shape, skm.mean_squared_error(y, p, squared=False))

torch.Size([2999, 5]) 0.5691404370332382

learn.model = learn.model.cpu()
learn.dls = learn.dls.cpu()

p2, *_ = learn.get_X_preds(X)
print(p2.shape, skm.mean_squared_error(y, p2, squared=False))
torch.equal(p, p2)

torch.Size([2999, 5]) 0.5692150288534804
False

DonRomaniello commented 2 years ago

I will try training the model on GPU with the dataloaders on CPU this evening when EC2 capacity is available and will report back.

Thank you @oguiza and @vrodriguezf for all the help.

williamsdoug commented 2 years ago

Hi @DonRomaniello. An issue that can occur when going between CPU and GPU is sensitivity to the order of floating-point operations, particularly in summations. Below is a simple example:

>>> import numpy as np
>>> A = np.array([1/3], dtype=np.float32)
>>> B  = np.array([100000/3], dtype=np.float32)

>>> A - B + B
array([0.33203125], dtype=float32)
>>> A + (-B + B) 
array([0.33333334], dtype=float32)

>>> B - B + A
array([0.33333334], dtype=float32)
>>> B +(- B + A)
array([0.33203125], dtype=float32)

In theory the associativity principle should yield the same answer in all of the above cases, however, limited mantissa precision can result in differences in the least significant digits depending on the order of evaluation and disparity of value magnitude. GPUs and more advanced scalar compilers will reorder operations to enhance parallelism, so some of this variability in lower order digits is to be expected. Using increased floating point precision (e.g.: FP64) can increase CPU/GPU agreement but at the cost of performance and memory consumption.
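The effect of precision mentioned above can be checked with the same A and B values. This sketch (the helper name `roundtrip_error` is hypothetical) evaluates A - B + B left to right in FP32 and in FP64 and reports how far the result lands from A.

```python
import numpy as np

def roundtrip_error(dtype):
    """Return |(A - B + B) - A| evaluated left to right in the given dtype."""
    a = np.array([1 / 3], dtype=dtype)
    b = np.array([100000 / 3], dtype=dtype)
    return float(np.abs((a - b + b) - a)[0])

err32 = roundtrip_error(np.float32)
err64 = roundtrip_error(np.float64)

print(f"float32 error: {err32:.3e}")  # roughly 1.3e-03, i.e. 0.33333334 - 0.33203125
print(f"float64 error: {err64:.3e}")  # many orders of magnitude smaller
```

The FP64 error does not vanish. It simply drops far below anything visible in the predictions, which is the trade-off against speed and memory described above.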

DonRomaniello commented 2 years ago

@williamsdoug

Thank you for the breakdown.

So, if you're willing to trade speed for removing this artifact, I found that moving the dataloaders onto the CPU before training allows an export and import without any change in the predictions:

learn.dls = learn.dls.cpu()

Running this before training led to identical results after exporting and importing.

oguiza commented 2 years ago

Hi @DonRomaniello, moving the dataloaders to the CPU is not a practical solution, as the training would then happen on the CPU. If you want to get exactly the same predictions as in training, the only thing you need to do is set cpu=False when loading the learner:

learn = load_learner(PATH, cpu=False)

If for any reason you can't do that, you need to understand that there'll be a very minor difference between the predictions made during training and those made after loading. The maximum difference is usually less than 1e-5.

I think we have debated this and found the root cause of the difference and the way to avoid it. This is clearly not a tsai-related issue. Are you ok if I close this issue? Or should I move it to discussions?

DonRomaniello commented 2 years ago

@oguiza

I agree that the issue is not tsai-related, but could we move it to discussions? Even though the issue is outside of tsai, it might be interesting to keep pursuing this.

I'm wondering if I can find a way to do most of the training on the GPU, move it to the CPU, then run a few more cycles to try to tune it better.

On the dummy data the differences were pretty small, but on my dataset the differences end up washing out some pretty significant trends that had been spot on when training and inference both ran on the CPU or both ran on the GPU.