timeseriesAI / tsai

State-of-the-art Deep Learning library for Time Series and Sequences in Pytorch / fastai
https://timeseriesai.github.io/tsai/
Apache License 2.0

Calling learner.feature_importance on larger than memory dataset causes OOM #310

Closed · scottcha · closed 2 years ago

scottcha commented 2 years ago

Repro steps:

  1. Train a model on a larger-than-memory dataset
  2. Call learn.feature_importance()

Expected result: feature importance is shown for each feature.
Actual result: OOM. Full repro and notebook here: https://github.com/scottcha/TsaiOOMRepro/blob/main/TsaiOOMRepro.ipynb (a minimal sketch of the setup is shown below).
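For reference, the setup looks roughly like this (a minimal sketch with hypothetical paths, shapes and task; the linked notebook is the authoritative repro):

```python
# Minimal sketch only: paths, shapes and the regression task are assumptions,
# not taken from the linked notebook.
import numpy as np
import zarr
from tsai.all import *

# On-disk array too large for RAM, laid out as [samples x variables x steps],
# matching the shape reported in the MemoryError below.
X = zarr.open('X.zarr', mode='r')     # e.g. (60000, 978, 1441), float32
y = np.load('y.npy')                  # matching targets

splits = get_splits(y, valid_size=0.2)
tfms = [None, TSRegression()]         # or TSClassification(), depending on the task
dls = get_ts_dls(X, y, splits=splits, tfms=tfms, bs=32, inplace=False)

learn = ts_learner(dls, InceptionTime, metrics=rmse)
learn.fit_one_cycle(1)

# The call that triggers the OOM: the whole validation split is loaded
# (and shuffled feature by feature) in memory.
learn.feature_importance()
```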

os       : Linux-5.4.0-91-generic-x86_64-with-glibc2.17
python   : 3.8.11
tsai     : 0.2.24
fastai   : 2.5.3
fastcore : 1.3.26
zarr     : 2.10.0
torch    : 1.9.1+cu102
n_cpus   : 24
device   : cuda (GeForce GTX 1080 Ti)

Stack Trace:

MemoryError                               Traceback (most recent call last)
/tmp/ipykernel_3968/3713785271.py in <module>
----> 1 learn.feature_importance()

~/miniconda3/envs/tsai/lib/python3.8/site-packages/tsai/learner.py in feature_importance(self, feature_names, key_metric_idx, show_chart, save_df_path, random_state)
    337                 value = self.get_X_preds(X_valid, y_valid, with_loss=True)[-1].mean().item()
    338             else:
--> 339                 output = self.get_X_preds(X_valid, y_valid)
    340                 value = metric(output[0], output[1]).item()
    341             print(f"{k:3} feature: {COLS[k]:20} {metric_name}: {value:8.6f}")

~/miniconda3/envs/tsai/lib/python3.8/site-packages/tsai/inference.py in get_X_preds(self, X, y, bs, with_input, with_decoded, with_loss)
     16         print("cannot find loss as y=None")
     17         with_loss = False
---> 18     dl = self.dls.valid.new_dl(X, y=y)
     19     if bs: setattr(dl, "bs", bs)
     20     else: assert dl.bs, "you need to pass a bs != 0"

~/miniconda3/envs/tsai/lib/python3.8/site-packages/tsai/data/core.py in new_dl(self, X, y)
    486     assert X.ndim == 3, "You must pass an X with 3 dimensions [batch_size x n_vars x seq_len]"
    487     if y is not None and not is_array(y) and not is_listy(y): y = [y]
--> 488     new_dloader = self.new(self.dataset.add_dataset(X, y=y))
    489     return new_dloader
    490

~/miniconda3/envs/tsai/lib/python3.8/site-packages/tsai/data/core.py in add_dataset(self, X, y, inplace)
    422 @patch
    423 def add_dataset(self:NumpyDatasets, X, y=None, inplace=True):
--> 424     return add_ds(self, X, y=y, inplace=inplace)
    425
    426 @patch

~/miniconda3/envs/tsai/lib/python3.8/site-packages/tsai/data/core.py in add_ds(dsets, X, y, inplace)
    413         tls = dsets.tls if with_labels else dsets.tls[:dsets.n_inp]
    414         new_tls = L([tl._new(item, split_idx=1) for tl,item in zip(tls, items)])
--> 415         return type(dsets)(tls=new_tls)
    416     elif isinstance(dsets, TfmdLists):
    417         new_tl = dsets._new(items, split_idx=1)

~/miniconda3/envs/tsai/lib/python3.8/site-packages/tsai/data/core.py in __init__(self, X, y, items, sel_vars, sel_steps, tfms, tls, n_inp, dl_type, inplace, **kwargs)
    378         if len(self.tls) > 0 and len(self.tls[0]) > 0:
    379             self.typs = [type(tl[0]) if isinstance(tl[0], torch.Tensor) else self.typs[i] for i,tl in enumerate(self.tls)]
--> 380         self.ptls = L([typ(stack(tl[:]))[...,self.sel_vars, self.sel_steps] if i==0 else typ(stack(tl[:])) \
    381                        for i,(tl,typ) in enumerate(zip(self.tls,self.typs))]) if inplace and len(tls[0]) != 0 else tls
    382

~/miniconda3/envs/tsai/lib/python3.8/site-packages/tsai/data/core.py in <listcomp>(.0)
    378         if len(self.tls) > 0 and len(self.tls[0]) > 0:
    379             self.typs = [type(tl[0]) if isinstance(tl[0], torch.Tensor) else self.typs[i] for i,tl in enumerate(self.tls)]
--> 380         self.ptls = L([typ(stack(tl[:]))[...,self.sel_vars, self.sel_steps] if i==0 else typ(stack(tl[:])) \
    381                        for i,(tl,typ) in enumerate(zip(self.tls,self.typs))]) if inplace and len(tls[0]) != 0 else tls
    382

~/miniconda3/envs/tsai/lib/python3.8/site-packages/tsai/data/core.py in __getitem__(self, it)
    243     def subset(self, i, **kwargs): return type(self)(self.items, splits=self.splits[i], split_idx=i, do_setup=False, types=self.types, **kwargs)
    244     def __getitem__(self, it):
--> 245         if hasattr(self.items, 'oindex'): return self.items.oindex[self._splits[it]]
    246         else: return self.items[self._splits[it]]
    247     def __len__(self): return len(self._splits)

~/miniconda3/envs/tsai/lib/python3.8/site-packages/zarr/indexing.py in __getitem__(self, selection)
    602         selection = ensure_tuple(selection)
    603         selection = replace_lists(selection)
--> 604         return self.array.get_orthogonal_selection(selection, fields=fields)
    605
    606     def __setitem__(self, selection, value):

~/miniconda3/envs/tsai/lib/python3.8/site-packages/zarr/core.py in get_orthogonal_selection(self, selection, out, fields)
    939         indexer = OrthogonalIndexer(selection, self)
    940
--> 941         return self._get_selection(indexer=indexer, out=out, fields=fields)
    942
    943     def get_coordinate_selection(self, selection, out=None, fields=None):

~/miniconda3/envs/tsai/lib/python3.8/site-packages/zarr/core.py in _get_selection(self, indexer, out, fields)
   1107         # setup output array
   1108         if out is None:
-> 1109             out = np.empty(out_shape, dtype=out_dtype, order=self._order)
   1110         else:
   1111             check_array_shape('out', out, out_shape)

MemoryError: Unable to allocate 315. GiB for an array with shape (60000, 978, 1441) and data type float32

oguiza commented 2 years ago

Hi @scottcha, thanks for taking the time to report this bug. The issue arises when each feature is shuffled: to shuffle the data, it has to be loaded into memory, and by default feature_importance uses all the data in the validation split. That makes it usable only with in-memory datasets (see the sketch below). There are a few alternatives to fix this:

  1. Add X and y as optional arguments. Then feature importance will be measured on the X and y you pass instead of the entire dataset.
  2. Add partial_n as an optional argument (int or float, like in the dataloaders). This way, you could indicate either a fixed number of samples with an int (e.g. 1000 samples) or a percentage of the validation set with a float.
  3. Add X, y, and partial_n, so that you can use X & y or partial_n.

I think option 3 would probably cover most scenarios as it's the most flexible. What do you think, Scott?
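For context on why the full split ends up in memory, permutation feature importance boils down to a loop of this shape (a simplified sketch, not tsai's actual code; metric, X_valid and y_valid stand in for whatever the learner uses internally):

```python
import numpy as np

# Simplified sketch of permutation feature importance (not tsai's actual code).
# X_valid: [n_samples x n_vars x seq_len], fully loaded so it can be shuffled.
baseline = metric(*learn.get_X_preds(X_valid, y_valid)[:2]).item()

for k in range(X_valid.shape[1]):            # one pass per feature/variable
    X_shuffled = X_valid.copy()              # full copy of the validation split
    np.random.shuffle(X_shuffled[:, k, :])   # destroy the information in feature k
    value = metric(*learn.get_X_preds(X_shuffled, y_valid)[:2]).item()
    print(k, baseline - value)               # metric drop ~ importance of feature k
```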

scottcha commented 2 years ago

@oguiza I agree option 3 is the most flexible. I tried out option 1 as a workaround, but I ran into a separate memory issue in the loop that does the feature-importance calculations:

My entire Chrome session running Jupyter crashes with this error:

ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).

Each of my samples is about 0.5 MB to 1 MB on disk, and I hit this error even when computing with only 100 samples. Since I have ~900 features, it goes through the calculation that many times, but it seems to fail around iteration 50.

Monitoring my system RAM shows it growing aggressively during this, at roughly 1 GB per iteration of the feature-importance calculation, while my GPU RAM appears constant; clearly something is leaking or growing out of control. My guess is that it's related to some GPU-allocated objects not getting freed, but I wasn't sure how to debug that.

Also, FWIW, I ran this outside of Jupyter in the VS Code Python debugger and got the same error, with one additional piece of information: it indicates "Dataloader Worker (PID(s) 1618) Exited Unexpectedly".
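For what it's worth, one way to check whether GPU-side objects are accumulating across iterations is PyTorch's built-in memory counters (a generic sketch, not tsai-specific):

```python
import gc
import torch

def report_cuda_memory(tag=""):
    # Currently allocated and peak GPU memory on the default device, in GB.
    alloc = torch.cuda.memory_allocated() / 1e9
    peak = torch.cuda.max_memory_allocated() / 1e9
    print(f"{tag} allocated: {alloc:.2f} GB, peak: {peak:.2f} GB")

# Call between iterations, then force cleanup and compare the two readings:
report_cuda_memory("before cleanup")
gc.collect()
torch.cuda.empty_cache()
report_cuda_memory("after cleanup")
```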

Thanks

oguiza commented 2 years ago

Hi @scottcha, thanks for providing more details on your issue. I've now updated feature_importance and get_X_preds to ensure that as much non-required data as possible is removed (using gc.collect). Please try it again if you can, and let me know if you still have issues.
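The general pattern is roughly the following (a sketch of the cleanup idea, not the exact library code):

```python
import gc
import torch

# Sketch: keep only the scalar metric, drop the large prediction tensors,
# then collect before the next per-feature iteration.
output = learn.get_X_preds(X_valid, y_valid)
value = metric(output[0], output[1]).item()

del output                   # release references to the large tensors
gc.collect()                 # let Python reclaim them
torch.cuda.empty_cache()     # optionally hand cached GPU blocks back to the driver
```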

scottcha commented 2 years ago

I tried out the new implementation. Here are a couple of notes:

  1. When I provide my own smaller X and y parameters, I still get the crash at about the 50th iteration of the feature-importance calculation, as well as high system memory usage.
  2. The current logic to slice X doesn't seem to work with native zarr arrays. I believe that when X is a zarr array, the right way to slice it by a set of random indices is X = X.get_orthogonal_selection((rand_idxs, slice(None), slice(None))); see the example below.
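For reference, a small self-contained example of that slicing pattern (array shape and sample count are made up):

```python
import numpy as np
import zarr

# Toy zarr array with the [samples x variables x steps] layout used above.
X = zarr.zeros((1000, 978, 144), chunks=(100, 978, 144), dtype='float32')

# Pick a random subset of samples without loading the whole array first.
rand_idxs = np.sort(np.random.choice(X.shape[0], size=100, replace=False))
X_subset = X.get_orthogonal_selection((rand_idxs, slice(None), slice(None)))

print(X_subset.shape)   # (100, 978, 144), returned as a plain numpy array
```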

oguiza commented 2 years ago

Hi @scottcha, I need to adapt feature_importance to work with zarr arrays, as you mention. I'll fix it within the next few days. But I'm not exactly sure what's causing the issue in your first bullet point. Could you please try to export the learner once it's trained and reload it using load_learner? Reloaded that way, it will contain no data. You can then pass a smaller array and see if the issue persists. That'll give us a hint at what the root cause might be.
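Something along these lines (the file name and the X_small/y_small arrays are placeholders):

```python
from tsai.all import *                       # tsai/fastai classes needed for unpickling
from fastai.learner import load_learner

# Export the trained learner, then reload it with no data attached.
learn.export('learner.pkl')
learn = load_learner('learner.pkl', cpu=False)

# Any memory growth now has to come from the arrays passed in explicitly.
learn.feature_importance(X=X_small, y=y_small)
```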

scottcha commented 2 years ago

Sorry it took me a bit to get back to this. I refreshed my environment with the latest version and reran my use case (large zarr file, sliced before calling feature_importance), and I was able to complete the run without hitting the OOM or the shared-memory errors. So I'd say the issues I called out are now resolved or no longer reproducible, with the exception that feature_importance may not natively handle zarr arrays, though that's easy to work around.

Thanks!

oguiza commented 2 years ago

Ok, I'm glad to hear that, Scott. I had forgotten to fix the indexing for zarr arrays; I've added it now in the GitHub repo. Zarr arrays work when you pass partial_n (int or float), since the data doesn't fit in memory; if you pass an X, it needs to be a numpy array. If you have a chance, it'd be good if you could test it (use pip install -Uqq git+https://github.com/timeseriesAI/tsai.git).
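With that change, usage would look roughly like this (the sample counts and X_small/y_small are placeholders):

```python
# pip install -Uqq git+https://github.com/timeseriesAI/tsai.git

# Larger-than-memory (e.g. zarr-backed) validation data: subsample via partial_n.
learn.feature_importance(partial_n=1000)   # a fixed number of validation samples
learn.feature_importance(partial_n=0.05)   # or a fraction of the validation split

# Alternatively, pass an explicit in-memory subset; X must be a numpy array.
learn.feature_importance(X=X_small, y=y_small)
```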

oguiza commented 2 years ago

I'll close this issue since the requested fix has already been implemented. Please reopen it if necessary.