Closed DonRomaniello closed 2 years ago
How are you doing the inference?
Thank you for responding, and I hope this is what you're looking for:
I have a pandas dataset, and I feed sixty time steps and ask for the next 30 for one column. Here is the code:
columnNumber = 409
get_y = docks.columns[columnNumber]
lookAhead = 30
window = 60
X, y = SlidingWindow(window, stride=None, horizon=lookAhead, seq_first=True, get_y=get_y)(docks)
validationPercent = .3
splits = get_splits(y, valid_size=validationPercent, stratify=False, random_state=42, shuffle=True)
tfms = [None, [TSRegression()]]
batch_tfms = TSStandardize(by_sample=True, by_var=True)
dls = get_ts_dls(X,
y,
splits=splits,
tfms=tfms,
batch_tfms=batch_tfms,
bs=[int(np.round((len(X) / 4))), int(np.round(((len(X) / 4) * validationPercent)))])
learn = ts_learner(dls, arch=TCN, metrics=[mae, rmse], cbs=[ShowGraph()])
learningRate = learn.lr_find()[0]
learn.fit_one_cycle(300, learningRate)
The predictions are really great, but getting them ready for the road is the challenge I'm having. I've tried different architectures, to no avail.
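For readers unfamiliar with the windowing described above (60 input steps, then the next 30 steps of one column), here is a minimal NumPy sketch of the idea. It is not tsai's `SlidingWindow` implementation. The function name `sliding_windows` and the toy array are hypothetical, and the samples x features x timesteps layout is assumed from the `check_data` output shown later in the thread.

```python
import numpy as np

def sliding_windows(data, target_col, window, horizon, stride=1):
    """Cut a (time, features) array into input windows and multi-step targets.

    Illustrative only: this is not tsai's SlidingWindow.
    """
    X, y = [], []
    last_start = len(data) - window - horizon
    for start in range(0, last_start + 1, stride):
        end = start + window
        X.append(data[start:end].T)                     # (features, window)
        y.append(data[end:end + horizon, target_col])   # next `horizon` steps of one column
    return np.stack(X), np.stack(y)

# Toy data: 100 time steps, 3 features
data = np.arange(100 * 3, dtype=float).reshape(100, 3)
X, y = sliding_windows(data, target_col=2, window=60, horizon=30)
print(X.shape, y.shape)  # (11, 3, 60) (11, 30)
```

Each target row starts at the time step immediately after its input window ends, so no target value appears in that sample's own input window.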
That code is for the training. What about the inference?
Ah, yes of course.
PATH = Path('./models/Regression.pkl')
PATH.parent.mkdir(parents=True, exist_ok=True)
learn.export(PATH)
del learn
PATH = Path('./models/Regression.pkl')
learn = load_learner(PATH, cpu=True)
a, b, c = learn.get_X_preds(X)
So, if you call learn.get_X_preds(X) before exporting/importing, do you get different results than when you call it after exporting/importing?
Hi @DonRomaniello, I've created a quick test and the code seems to work well. Here's the snippet I've created:
X, y, splits = get_regression_data('Covid3Month', split_data=False)
y_multistep = y.reshape(-1,1).repeat(3, 1) # repeat steps to simulate a 3 step forecast
tfms = [None, TSRegression()]
batch_tfms = TSStandardize(by_sample=True, by_var=True)
dls = get_ts_dls(X, y_multistep, splits=splits, tfms=tfms, batch_tfms=batch_tfms)
learn = ts_learner(dls, arch=TCN, metrics=[mae, rmse], cbs=[ShowGraph()])
learn.fit_one_cycle(2)
p, *_ = learn.get_X_preds(X)
print(p.shape)
torch.Size([201, 3]) # output
PATH = Path('./models/Regression.pkl')
PATH.parent.mkdir(parents=True, exist_ok=True)
learn.export(PATH)
del learn
PATH = Path('./models/Regression.pkl')
learn = load_learner(PATH, cpu=True)
p2, *_ = learn.get_X_preds(X)
print(p2.shape)
torch.equal(p, p2)
torch.Size([201, 3]) # output True
I'm not sure if you are following a different process, but this is working well.
So, if you call learn.get_X_preds(X) before exporting/importing, do you get different results than when you call it after exporting/importing?
@vrodriguezf, correct.
I'm not sure if you are following a different process, but this is working well.
@oguiza
The code you shared works for the example you provided; however, when I apply it to my code, it doesn't have the same effect.
It’d be good if you can find the difference between your code and the one I shared. I’m not sure where the issue is coming from.
@oguiza
The only thing I can think of is that I am using SlidingWindow and get_splits, but if the dataloaders stay the same shouldn't the model have similar predictions?
That shouldn’t have an impact on the saved learner. Could you please use check_data(X, y, splits) and share the output?
Sure thing:
X - shape: [718 samples x 1582 features x 60 timesteps] type: ndarray dtype:float64 isnan: 0
y - shape: (718, 30) type: ndarray dtype:float64 isnan: 0
splits - n_splits: 2 shape: [503, 215] overlap: [False]
I don’t see anything strange. I’m sorry but I don’t know how to help.
Thank you, and to be honest I am relieved that I wasn't missing something.
I'll try manually creating the sliding windows and see if that does anything.
Actually, I wonder if this sheds any light:
When I load the model and then try to fit_one_cycle, I get this:
ZeroDivisionError Traceback (most recent call last)
/tmp/ipykernel_1923/2899536512.py in <module>
----> 1 learn.fit_one_cycle(10)
2 beep()
~/.local/lib/python3.8/site-packages/fastai/callback/schedule.py in fit_one_cycle(self, n_epoch, lr_max, div, div_final, pct_start, wd, moms, cbs, reset_opt)
114 scheds = {'lr': combined_cos(pct_start, lr_max/div, lr_max, lr_max/div_final),
115 'mom': combined_cos(pct_start, *(self.moms if moms is None else moms))}
--> 116 self.fit(n_epoch, cbs=ParamScheduler(scheds)+L(cbs), reset_opt=reset_opt, wd=wd)
117
118 # Cell
~/.local/lib/python3.8/site-packages/fastai/learner.py in fit(self, n_epoch, lr, wd, cbs, reset_opt)
219 self.opt.set_hypers(lr=self.lr if lr is None else lr)
220 self.n_epoch = n_epoch
--> 221 self._with_events(self._do_fit, 'fit', CancelFitException, self._end_cleanup)
222
223 def _end_cleanup(self): self.dl,self.xb,self.yb,self.pred,self.loss = None,(None,),(None,),None,None
~/.local/lib/python3.8/site-packages/fastai/learner.py in _with_events(self, f, event_type, ex, final)
161
162 def _with_events(self, f, event_type, ex, final=noop):
--> 163 try: self(f'before_{event_type}'); f()
164 except ex: self(f'after_cancel_{event_type}')
165 self(f'after_{event_type}'); final()
~/.local/lib/python3.8/site-packages/fastai/learner.py in _do_fit(self)
210 for epoch in range(self.n_epoch):
211 self.epoch=epoch
--> 212 self._with_events(self._do_epoch, 'epoch', CancelEpochException)
213
214 def fit(self, n_epoch, lr=None, wd=None, cbs=None, reset_opt=False):
~/.local/lib/python3.8/site-packages/fastai/learner.py in _with_events(self, f, event_type, ex, final)
161
162 def _with_events(self, f, event_type, ex, final=noop):
--> 163 try: self(f'before_{event_type}'); f()
164 except ex: self(f'after_cancel_{event_type}')
165 self(f'after_{event_type}'); final()
~/.local/lib/python3.8/site-packages/fastai/learner.py in _do_epoch(self)
204
205 def _do_epoch(self):
--> 206 self._do_epoch_train()
207 self._do_epoch_validate()
208
~/.local/lib/python3.8/site-packages/fastai/learner.py in _do_epoch_train(self)
196 def _do_epoch_train(self):
197 self.dl = self.dls.train
--> 198 self._with_events(self.all_batches, 'train', CancelTrainException)
199
200 def _do_epoch_validate(self, ds_idx=1, dl=None):
~/.local/lib/python3.8/site-packages/fastai/learner.py in _with_events(self, f, event_type, ex, final)
161
162 def _with_events(self, f, event_type, ex, final=noop):
--> 163 try: self(f'before_{event_type}'); f()
164 except ex: self(f'after_cancel_{event_type}')
165 self(f'after_{event_type}'); final()
~/.local/lib/python3.8/site-packages/fastai/learner.py in __call__(self, event_name)
139
140 def ordered_cbs(self, event): return [cb for cb in self.cbs.sorted('order') if hasattr(cb, event)]
--> 141 def __call__(self, event_name): L(event_name).map(self._call_one)
142
143 def _call_one(self, event_name):
~/.local/lib/python3.8/site-packages/fastcore/foundation.py in map(self, f, gen, *args, **kwargs)
153 def range(cls, a, b=None, step=None): return cls(range_of(a, b=b, step=step))
154
--> 155 def map(self, f, *args, gen=False, **kwargs): return self._new(map_ex(self, f, *args, gen=gen, **kwargs))
156 def argwhere(self, f, negate=False, **kwargs): return self._new(argwhere(self, f, negate, **kwargs))
157 def argfirst(self, f, negate=False): return first(i for i,o in self.enumerate() if f(o))
~/.local/lib/python3.8/site-packages/fastcore/basics.py in map_ex(iterable, f, gen, *args, **kwargs)
696 res = map(g, iterable)
697 if gen: return res
--> 698 return list(res)
699
700 # Cell
~/.local/lib/python3.8/site-packages/fastcore/basics.py in __call__(self, *args, **kwargs)
681 if isinstance(v,_Arg): kwargs[k] = args.pop(v.i)
682 fargs = [args[x.i] if isinstance(x, _Arg) else x for x in self.pargs] + args[self.maxi+1:]
--> 683 return self.func(*fargs, **kwargs)
684
685 # Cell
~/.local/lib/python3.8/site-packages/fastai/learner.py in _call_one(self, event_name)
143 def _call_one(self, event_name):
144 if not hasattr(event, event_name): raise Exception(f'missing {event_name}')
--> 145 for cb in self.cbs.sorted('order'): cb(event_name)
146
147 def _bn_bias_state(self, with_bias): return norm_bias_params(self.model, with_bias).map(self.opt.state)
~/.local/lib/python3.8/site-packages/fastai/callback/core.py in __call__(self, event_name)
43 (self.run_valid and not getattr(self, 'training', False)))
44 res = None
---> 45 if self.run and _run: res = getattr(self, event_name, noop)()
46 if event_name=='after_fit': self.run=True #Reset self.run to True at each end of fit
47 return res
~/.local/lib/python3.8/site-packages/fastai/callback/progress.py in before_train(self)
23 if getattr(self, 'mbar', False): self.mbar.update(self.epoch)
24
---> 25 def before_train(self): self._launch_pbar()
26 def before_validate(self): self._launch_pbar()
27 def after_train(self): self.pbar.on_iter_end()
~/.local/lib/python3.8/site-packages/fastai/callback/progress.py in _launch_pbar(self)
32
33 def _launch_pbar(self):
---> 34 self.pbar = progress_bar(self.dl, parent=getattr(self, 'mbar', None), leave=False)
35 self.pbar.update(0)
36
~/.local/lib/python3.8/site-packages/fastprogress/fastprogress.py in __init__(self, gen, total, display, leave, parent, master, comment)
17 def __init__(self, gen, total=None, display=True, leave=True, parent=None, master=None, comment=''):
18 self.gen,self.parent,self.master,self.comment = gen,parent,master,comment
---> 19 self.total = len(gen) if total is None else total
20 self.last_v = 0
21 if parent is None: self.leave,self.display = leave,display
~/.local/lib/python3.8/site-packages/fastai/data/load.py in __len__(self)
92 if self.n is None: raise TypeError
93 if self.bs is None: return self.n
---> 94 return self.n//self.bs + (0 if self.drop_last or self.n%self.bs==0 else 1)
95
96 def get_idxs(self):
ZeroDivisionError: integer division or modulo by zero
It seems that the batch size has been lost somehow. Try setting it manually (learn.dls.train.bs and learn.dls.valid.bs) and see if that helps.
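The last frame of the traceback above can be reproduced in isolation. The sketch below copies the length expression shown for `fastai/data/load.py` into a hypothetical standalone function, `n_batches`, and shows that it raises whenever the batch size is 0. It does not explain why the batch size ended up as 0 after loading; the sample count and batch size used here are illustrative.

```python
def n_batches(n, bs, drop_last=False):
    """Same arithmetic as the __len__ line shown in the traceback."""
    return n // bs + (0 if drop_last or n % bs == 0 else 1)

# A sensible batch size gives a batch count.
print(n_batches(718, 179))  # 5

# A batch size of 0 triggers the error from the traceback.
raised = False
try:
    n_batches(718, 0)
except ZeroDivisionError:
    raised = True
print(raised)  # True
```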
When you save or export a Learner object the dataset is not serialized. That's why you can't train it further. To do it you'd need to recreate the dataloaders.
I'm curious: when you say the predictions are different, what do you mean? Are they still created, but with different values? Could you please re-run the code I sent you before with your X, y and splits and share the output?
@vrodriguezf
<It seems that the batch size has been lost somehow. Try setting it manually (learn.dls.train.bs and learn.dls.valid.bs) and see if that helps>
It helped push the problem down the road a little...
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/tmp/ipykernel_2592/2743683231.py in <module>
----> 1 learn.fit_one_cycle(100)
2 beep()
~/.local/lib/python3.8/site-packages/fastai/callback/schedule.py in fit_one_cycle(self, n_epoch, lr_max, div, div_final, pct_start, wd, moms, cbs, reset_opt)
114 scheds = {'lr': combined_cos(pct_start, lr_max/div, lr_max, lr_max/div_final),
115 'mom': combined_cos(pct_start, *(self.moms if moms is None else moms))}
--> 116 self.fit(n_epoch, cbs=ParamScheduler(scheds)+L(cbs), reset_opt=reset_opt, wd=wd)
117
118 # Cell
~/.local/lib/python3.8/site-packages/fastai/learner.py in fit(self, n_epoch, lr, wd, cbs, reset_opt)
219 self.opt.set_hypers(lr=self.lr if lr is None else lr)
220 self.n_epoch = n_epoch
--> 221 self._with_events(self._do_fit, 'fit', CancelFitException, self._end_cleanup)
222
223 def _end_cleanup(self): self.dl,self.xb,self.yb,self.pred,self.loss = None,(None,),(None,),None,None
~/.local/lib/python3.8/site-packages/fastai/learner.py in _with_events(self, f, event_type, ex, final)
161
162 def _with_events(self, f, event_type, ex, final=noop):
--> 163 try: self(f'before_{event_type}'); f()
164 except ex: self(f'after_cancel_{event_type}')
165 self(f'after_{event_type}'); final()
~/.local/lib/python3.8/site-packages/fastai/learner.py in _do_fit(self)
210 for epoch in range(self.n_epoch):
211 self.epoch=epoch
--> 212 self._with_events(self._do_epoch, 'epoch', CancelEpochException)
213
214 def fit(self, n_epoch, lr=None, wd=None, cbs=None, reset_opt=False):
~/.local/lib/python3.8/site-packages/fastai/learner.py in _with_events(self, f, event_type, ex, final)
163 try: self(f'before_{event_type}'); f()
164 except ex: self(f'after_cancel_{event_type}')
--> 165 self(f'after_{event_type}'); final()
166
167 def all_batches(self):
~/.local/lib/python3.8/site-packages/fastai/learner.py in __call__(self, event_name)
139
140 def ordered_cbs(self, event): return [cb for cb in self.cbs.sorted('order') if hasattr(cb, event)]
--> 141 def __call__(self, event_name): L(event_name).map(self._call_one)
142
143 def _call_one(self, event_name):
~/.local/lib/python3.8/site-packages/fastcore/foundation.py in map(self, f, gen, *args, **kwargs)
153 def range(cls, a, b=None, step=None): return cls(range_of(a, b=b, step=step))
154
--> 155 def map(self, f, *args, gen=False, **kwargs): return self._new(map_ex(self, f, *args, gen=gen, **kwargs))
156 def argwhere(self, f, negate=False, **kwargs): return self._new(argwhere(self, f, negate, **kwargs))
157 def argfirst(self, f, negate=False): return first(i for i,o in self.enumerate() if f(o))
~/.local/lib/python3.8/site-packages/fastcore/basics.py in map_ex(iterable, f, gen, *args, **kwargs)
696 res = map(g, iterable)
697 if gen: return res
--> 698 return list(res)
699
700 # Cell
~/.local/lib/python3.8/site-packages/fastcore/basics.py in __call__(self, *args, **kwargs)
681 if isinstance(v,_Arg): kwargs[k] = args.pop(v.i)
682 fargs = [args[x.i] if isinstance(x, _Arg) else x for x in self.pargs] + args[self.maxi+1:]
--> 683 return self.func(*fargs, **kwargs)
684
685 # Cell
~/.local/lib/python3.8/site-packages/fastai/learner.py in _call_one(self, event_name)
143 def _call_one(self, event_name):
144 if not hasattr(event, event_name): raise Exception(f'missing {event_name}')
--> 145 for cb in self.cbs.sorted('order'): cb(event_name)
146
147 def _bn_bias_state(self, with_bias): return norm_bias_params(self.model, with_bias).map(self.opt.state)
~/.local/lib/python3.8/site-packages/fastai/callback/core.py in __call__(self, event_name)
43 (self.run_valid and not getattr(self, 'training', False)))
44 res = None
---> 45 if self.run and _run: res = getattr(self, event_name, noop)()
46 if event_name=='after_fit': self.run=True #Reset self.run to True at each end of fit
47 return res
~/.local/lib/python3.8/site-packages/tsai/callback/core.py in after_epoch(self)
88 x_bounds = (0, len(rec.losses))
89 if self.epoch == 0:
---> 90 y_min = min((min(rec.losses), min(val_losses)))
91 y_max = max((max(rec.losses), max(val_losses)))
92 else:
ValueError: min() arg is an empty sequence
@oguiza <When you save or export a Learner object the dataset is not serialized. That's why you can't train it further. To do it you'd need to recreate the dataloaders.>
I've tried recreating the learner and then simply replacing the model with
learn2 = load_learner(PATH, cpu=True)
learn.model = learn2.model
but the predictions have different values from the same input data.
I've rerun the code you provided on my data, here is the output:
p, *_ = learn.get_X_preds(X)
print(p.shape)
torch.Size([479, 30])
PATH = Path('./models/Regression.pkl')
PATH.parent.mkdir(parents=True, exist_ok=True)
learn.export(PATH)
del learn
PATH = Path('./models/Regression.pkl')
learn = load_learner(PATH, cpu=True)
p2, *_ = learn.get_X_preds(X)
print(p2.shape)
torch.equal(p, p2)
torch.Size([479, 30]) False
Same problem even when manually setting the dataloaders with @vrodriguezf's method.
Seriously, thank you for helping with this.
I'm afraid I'm unable to help. It'd be good if you can recreate the issue with some dummy data.
@oguiza
Same problem with dummy data:
emptyArray1 = []
for start in np.arange(0, .9, .001):  # renamed from `range` to avoid shadowing the builtin
    emptyArray1.append(np.arange(start, (start + .1), .0001))
dummy = pd.DataFrame(emptyArray1)
dummy[(len(dummy.columns) - 1)] = dummy.mean(axis=1)
columnNumber = (len(dummy.columns) - 1)
get_y = dummy.columns[columnNumber]
lookAhead = 5
window = 10
X, y = SlidingWindow(window, stride=(lookAhead + window), horizon=lookAhead, seq_first=True, get_y=get_y)(dummy)
validationPercent = .3
splits = get_splits(y, valid_size=validationPercent, stratify=False, random_state=42, shuffle=False)
check_data(X, y, splits)
tfms = [None, [TSRegression()]]
batch_tfms = TSStandardize(by_sample=True, by_var=True)
train_batch = int(np.round((len(X))))
valid_batch = int(np.round(((len(X)) * validationPercent)))
dls = get_ts_dls(X,
y,
splits=splits,
tfms=tfms,
batch_tfms=batch_tfms,
bs=[train_batch, valid_batch])
optimizer = Adam
learn = ts_learner(dls, arch=InceptionTimePlus, metrics=[mae, rmse], cbs=[ShowGraph()], opt_func=optimizer)
learningRate = learn.lr_find()[0]
learn.fit_one_cycle(200, learningRate)
p, *_ = learn.get_X_preds(X)
print(p.shape, skm.mean_squared_error(y, p, squared=False))
torch.Size([60, 5]) 0.2932287153601242
PATH = Path('./models/Regression.pkl')
PATH.parent.mkdir(parents=True, exist_ok=True)
learn.export(PATH)
del learn
PATH = Path('./models/Regression.pkl')
learn = load_learner(PATH, cpu=True)
p2, *_ = learn.get_X_preds(X)
print(p2.shape, skm.mean_squared_error(y, p2, squared=False))
torch.equal(p, p2)
torch.Size([60, 5]) 0.2932296804365857 False
Although the RMSE (mean_squared_error with squared=False) is much closer than with my data.
OK, I've tried it, and while it's true that there's a difference between the predictions, it's minor. I did this:
torch.max(p - p2)
and the max diff is tensor(1.2338e-05). I don't know where this comes from. Sometimes this is due to conversion between types.
Edit: I've also tried it with the data and code I sent you before and the difference torch.max(torch.abs(p - p2)) = tensor(2.9802e-08)
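Since `torch.equal` requires bit-for-bit equality, a tolerance-based comparison may be more informative for this kind of check. The sketch below uses NumPy and assumes both prediction sets are CPU tensors or arrays that `np.asarray` can convert. The helper name `compare_predictions`, the `1e-5` tolerance and the toy values are illustrative choices, not part of tsai.

```python
import numpy as np

def compare_predictions(p, p2, atol=1e-5):
    """Report exact equality, closeness within `atol`, and the largest absolute gap."""
    a = np.asarray(p, dtype=np.float64)
    b = np.asarray(p2, dtype=np.float64)
    return {
        "identical": bool(np.array_equal(a, b)),
        "close": bool(np.allclose(a, b, rtol=0.0, atol=atol)),
        "max_abs_diff": float(np.max(np.abs(a - b))),
    }

# Toy predictions that differ by roughly 1e-6
p = np.array([[0.10, 0.20], [0.30, 0.40]], dtype=np.float32)
p2 = p + np.float32(1e-6)
report = compare_predictions(p, p2)
print(report)
```

Taking the absolute value before the maximum matters: `max(p - p2)` alone would miss a large negative difference.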
I've found the root cause. There is a difference because the learner initially creates the predictions on the GPU. When you load the model with cpu=True, it creates them on the CPU. If you change it to cpu=False, then there's no difference. There must be a PyTorch difference between computations on the CPU and on the GPU (CUDA).
@oguiza
I've found the root cause.
OK, so it sounds like if I want to deploy this on a CPU, I have to train it on a CPU?
Well... I did a little test, and am not sure if the GPU to CPU change is the issue:
p, *_ = learn.get_X_preds(X)
print(p.shape, skm.mean_squared_error(y, p, squared=False))
torch.Size([2999, 5]) 0.256418886837684
learn.model = learn.model.cpu()
p2, *_ = learn.get_X_preds(X)
print(p2.shape, skm.mean_squared_error(y, p2, squared=False))
torch.equal(p, p2)
torch.Size([2999, 5]) 0.256418886837684 True
PATH = Path('./models/Regression.pkl')
PATH.parent.mkdir(parents=True, exist_ok=True)
learn.export(PATH)
del learn
PATH = Path('./models/Regression.pkl')
learn = load_learner(PATH, cpu=True)
p3, *_ = learn.get_X_preds(X)
print(p3.shape, skm.mean_squared_error(y, p3, squared=False))
torch.equal(p2, p3)
torch.Size([2999, 5]) 0.2570435682650654 False
Obviously the difference is very small, but it is interesting that the issue seems to happen somewhere in here:
PATH = Path('./models/Regression.pkl')
PATH.parent.mkdir(parents=True, exist_ok=True)
learn.export(PATH)
del learn
PATH = Path('./models/Regression.pkl')
learn = load_learner(PATH, cpu=True)
Edit: Hold on, are the dataloaders also on the GPU?
I've isolated the issue, it's the dataloaders going from GPU to CPU.
p, *_ = learn.get_X_preds(X)
print(p.shape, skm.mean_squared_error(y, p, squared=False))
torch.Size([2999, 5]) 0.5691404370332382
learn.model = learn.model.cpu()
learn.dls = learn.dls.cpu()
p2, *_ = learn.get_X_preds(X)
print(p2.shape, skm.mean_squared_error(y, p2, squared=False))
torch.equal(p, p2)
torch.Size([2999, 5]) 0.5692150288534804
False
I will try training the model on GPU with the dataloaders on CPU this evening when EC2 capacity is available and will report back.
Thank you @oguiza and @vrodriguezf for all the help.
Hi @DonRomaniello. An issue that can occur when going between CPU and GPU is ordering sensitivity for floating point numbers, particularly with respect to summation operations. Below is a simple example:
>>> import numpy as np
>>> A = np.array([1/3], dtype=np.float32)
>>> B = np.array([100000/3], dtype=np.float32)
>>> A - B + B
array([0.33203125], dtype=float32)
>>> A + (-B + B)
array([0.33333334], dtype=float32)
>>> B - B + A
array([0.33333334], dtype=float32)
>>> B +(- B + A)
array([0.33203125], dtype=float32)
In theory, associativity should yield the same answer in all of the above cases. However, limited mantissa precision can result in differences in the least significant digits, depending on the order of evaluation and the disparity in magnitude between the values. GPUs and more advanced scalar compilers will reorder operations to enhance parallelism, so some of this variability in the low-order digits is to be expected. Using increased floating point precision (e.g., FP64) can increase CPU/GPU agreement, but at the cost of performance and memory consumption.
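The point about FP64 can be illustrated with a small sketch in which the same three values are added in two different orders, once in float32 and once in float64. The helper `ordered_sum` and the chosen values are hypothetical. Real CPU/GPU kernels reorder far larger reductions, so this only shows the mechanism, not the magnitude seen in the thread.

```python
import numpy as np

def ordered_sum(values, dtype):
    """Add values left to right, rounding to `dtype` after every step."""
    total = dtype(0)
    for v in values:
        total = dtype(total + dtype(v))
    return float(total)

order_a = [1e8, 1.0, -1e8]   # the small term is absorbed by the large one first
order_b = [1e8, -1e8, 1.0]   # the large terms cancel before the small one is added

gap32 = abs(ordered_sum(order_a, np.float32) - ordered_sum(order_b, np.float32))
gap64 = abs(ordered_sum(order_a, np.float64) - ordered_sum(order_b, np.float64))
print(gap32, gap64)  # 1.0 0.0
```

In float32 the spacing between representable numbers near 1e8 is 8, so adding 1.0 is lost to rounding. float64 has enough mantissa bits to keep it.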
@williamsdoug
Thank you for the breakdown.
So, if you're willing to trade speed for removing this artifact: I found that moving the dataloaders onto the CPU before training allows an export and import without any change in the predictions.
learn.dls = learn.dls.cpu()
Running this before training led to the same results after exporting and importing.
Hi @DonRomaniello, moving the dataloaders to the CPU is not a practical solution, as the training would then happen on the CPU. If you want to get exactly the same predictions as in training, the only thing you need to do is set cpu=False when loading the learner:
learn = load_learner(PATH, cpu=False)
If for any reason you can't do that, you need to understand that there'll be a very minor difference between training and your predictions. Max difference usually less than 1e-5.
I think we have debated this and found the root cause of the difference and the way to avoid it. This is clearly not a tsai-related issue. Are you ok if I close this issue? Or should I move it to discussions?
@oguiza
I agree that the issue is not tsai-related, but could we move it to discussions? Even though the issue is outside of tsai, it might be interesting to keep pursuing this.
I'm wondering if I can find a way to do most of the training on the GPU, move it to the CPU, then run a few more cycles to try to tune it better.
On the dummy data the differences were pretty small, but on my dataset the differences end up washing out some pretty significant trends that had been spot on when CPU-CPU or GPU-GPU.
I've been getting great results on forecasting a multistep horizon for a multivariate time series, but am having a lot of trouble exporting or saving the model to use on other machines or even in the same Jupyter Notebook.
I create the learner with ts_learner, train it, but when I use learner.save or learner.export, the imported model doesn't have the same predictions.
Any help would be appreciated.