timeseriesAI / tsai

State-of-the-art Deep Learning library for Time Series and Sequences in Pytorch / fastai
https://timeseriesai.github.io/tsai/
Apache License 2.0

ROCKET out of memory for create_rocket_features #39

Closed geoHeil closed 3 years ago

geoHeil commented 3 years ago

When trying to compute the ROCKET features, it fails for me with a CUDA out of memory error:

X_train, y_train = create_rocket_features(dls.train, model)
X_valid, y_valid = create_rocket_features(dls.valid, model)
X_train.shape, X_valid.shape

RuntimeError: CUDA out of memory. Tried to allocate 2.46 GiB (GPU 0; 15.90 GiB total capacity; 2.46 GiB already allocated; 2.46 GiB free; 3.41 GiB reserved in total by PyTorch)

for ROCKET on an NVIDIA P100.

The data is loaded using:

dls   = TSDataLoaders.from_dsets(dsets.train, dsets.valid, bs=[64, 128], batch_tfms=[TSStandardize(by_var=True)], num_workers=0, drop_last=False, shuffle_train=False)# images defined

model = create_model(ROCKET, dls=dls, n_kernels=500, kss=[7]) # n_kernels=10_000, kss=[7, 9, 11] by default, 

and I have already tried reducing the number of kernels and kss, but it still fails and keeps running out of memory, even when reducing the batch size to 8/16.

NOTICE: on disk the numpy array is approximately 70GB in size.

Maybe I am creating too many windows? From a dataframe with approx. 50 columns and 14 million records (panel data), about 6 GB in size according to pandas, I use:

window_length = 48
get_x = ['x1', ... 'x50']
get_y = 'target'

def y_func(o): return (o.sum(axis=1) > 0).astype(int)
X, y = SlidingWindowPanel(window_length, ['panel_device_id'], stride=5, get_x=get_x, get_y=get_y, y_func=y_func,
 horizon=0, seq_first=True, sort_by=['hour'], ascending=True, check_leakage=True, return_key=False, verbose=True)(df)

to generate sliding windows of 48 hours in length, sliding forward 5 hours at a time. Perhaps I should decrease the number of windows? But I find it strange that neither reducing the batch size nor reducing the number of features helped to solve the problem.
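
For reference, a rough back-of-the-envelope estimate (my own numbers, assuming float64 and ignoring per-device boundaries) of why the windowed array gets so large: with stride 5, every record ends up in roughly 48 / 5 ≈ 10 windows:

n_records, stride, window_length, n_vars = 14_000_000, 5, 48, 50
n_windows = n_records // stride                         # ~2.8M windows
size_gb = n_windows * n_vars * window_length * 8 / 1e9  # 8 bytes per float64 value
print(f"~{size_gb:.0f} GB")                             # ~54 GB, same order as the ~70 GB seen on disk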

geoHeil commented 3 years ago

Indeed, using a sliding window with less overlap and much less data works fine. I guess this issue should potentially remain open as a feature request to support mini-batches for the ROCKET feature creation step.

oguiza commented 3 years ago

Hi @geoHeil, There's a way to handle large amounts of data if you use np.memmap arrays and build a DL model to process the calculated features. I don't have much time to fully detail the process now, but this is what you'd need to do at a high level:

  1. Use SlidingWindow to prepare your data.
  2. Store your output as an np.memmap array so that you can work with your data on disk instead of in memory (see the sketch after this list). You can get more details in this tutorial nb: https://github.com/timeseriesAI/tsai/blob/master/tutorial_nbs/00_How_to_efficiently_work_with_very_large_numpy_arrays.ipynb
  3. Build a DL model that processes a batch at a time, as demonstrated in this other tutorial nb: https://github.com/timeseriesAI/tsai/blob/master/tutorial_nbs/02_ROCKET_a_new_SOTA_classifier.ipynb
  4. That allows you to handle very large datasets, as you only load individual batches into memory.
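
As a rough sketch of step 2 (the array shape, file name, and the compute_windows helper below are only placeholders, not tsai API):

import numpy as np

# persist the windowed data as a memory-mapped array so it lives on disk, not in RAM
n_windows, n_vars, seq_len = 2_800_000, 50, 48  # illustrative sizes
X_memmap = np.memmap('X_on_disk.dat', dtype='float32', mode='w+',
                     shape=(n_windows, n_vars, seq_len))

# fill it chunk by chunk instead of materializing everything in memory at once
chunk = 10_000
for start in range(0, n_windows, chunk):
    end = min(start + chunk, n_windows)
    X_memmap[start:end] = compute_windows(start, end)  # hypothetical helper returning (end-start, n_vars, seq_len)
X_memmap.flush()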

If you try this approach, please let me know how it works for you. I use np.memmap arrays all the time with my data and they work great.

geoHeil commented 3 years ago

I have to check. Currently, I needed to reduce the kernels to 1000 - but then a prediction using e.g. xgboost barely predicts any labels of the minority class, if at all.

Regarding your suggestions: do I understand correctly that you mean to use create_rocket_features (your custom PyTorch implementation) with mini-batches, i.e. in a custom implementation of that function? My machine has 512 GB RAM - so far regular RAM is not a problem at all; rather, it's the P100 with only 16 GB that crashes with OOM. If I understand it correctly, the first part about using the memory-mapped file could thus be skipped and only the mini-batches (step 3 in your answer) would be needed.

That allows you to handle very large datasets, as you only upload to memory individual batches.

So do you mean that I would need to decrease the batch size further, i.e. lower than bs=[64, 128]?

oguiza commented 3 years ago

Hi, I've noticed there was an issue with create_rocket_features that I've now fixed in the GitHub repo. Sorry about that. I've also modified the API, and I believe it's now even simpler to use. So far I've only updated it in the repo; I will include it in the next release when it becomes available. Here's a minimal example of how you may use it:

X, y, splits = get_UCR_data('OliveOil', split_data=False)
tfms = [None, TSRegression()]
batch_tfms = TSStandardize(by_var=True)
dls = get_ts_dls(X, y, splits=splits, tfms=tfms, batch_tfms=batch_tfms)
X_train, y_train = create_rocket_features(dls.train, n_kernels=10_000)
X_valid, y_valid = create_rocket_features(dls.valid, n_kernels=10_000)
X_train.shape, X_valid.shape
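
Downstream, ROCKET features are typically fed to a simple linear model; a rough sketch with scikit-learn's RidgeClassifierCV (not part of the snippet above, assuming the X_train/X_valid arrays it produces):

import numpy as np
from sklearn.linear_model import RidgeClassifierCV

# ROCKET features are usually paired with a ridge classifier
clf = RidgeClassifierCV(alphas=np.logspace(-3, 3, 10))
clf.fit(X_train, y_train)
print(clf.score(X_valid, y_valid))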

Please, let me know if it works well now.

oguiza commented 3 years ago

I forgot to mention that, with the large dataset you have, you may consider fully eliminating overlap, or setting it to 50% of the window length. That’d significantly reduce the dataset size and shouldn’t impact results too much. Those are commonly used settings.
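
For example, reusing the SlidingWindowPanel call from earlier in the thread (only the stride changes, all other arguments as posted above):

# non-overlapping windows: stride == window_length (48); for 50% overlap use stride=24
X, y = SlidingWindowPanel(window_length, ['panel_device_id'], stride=window_length,
                          get_x=get_x, get_y=get_y, y_func=y_func, horizon=0,
                          seq_first=True, sort_by=['hour'], ascending=True,
                          check_leakage=True, return_key=False, verbose=True)(df)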

geoHeil commented 3 years ago

When testing it:

X_train, y_train = create_rocket_features(dls.train, n_kernels=10_000)
X_valid, y_valid = create_rocket_features(dls.valid, n_kernels=10_000)
X_train.shape, X_valid.shape
...............................

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-8-a76fb33126ed> in <module>
----> 1 X_train, y_train = create_rocket_features(dls.train, n_kernels=10_000)
      2 X_valid, y_valid = create_rocket_features(dls.valid, n_kernels=10_000)
      3 X_train.shape, X_valid.shape

~/development/conda_envs/my_env/lib/python3.8/site-packages/tsai/models/ROCKET.py in create_rocket_features(dl, n_kernels, kss, device)
    134         _x_out = model(xb).detach().cpu().numpy()
    135         _y_out = yb.detach().cpu().numpy()
--> 136         x_out = _x_out if i == 0 else torch.cat([x_out, _x_out])
    137         y_out = _y_out if i == 0 else torch.cat([y_out, _y_out])
    138     return x_out, y_out

TypeError: expected Tensor as element 0 in argument 0, but got numpy.ndarray

though I was using:

splits = get_splits(y, valid_size=.2, stratify=True, random_state=47, shuffle=False)
tfms  = [None, [Categorize()]]
dsets = TSDatasets(X, y, tfms=tfms, splits=splits)

and not:

dls = get_ts_dls(X, y, splits=splits, tfms=tfms, batch_tfms=batch_tfms)

but I would guess it should give an equivalent result. Indeed, this fails as well with the same error.

oguiza commented 3 years ago

Oh, there was a mistake in the code. Please, paste this function and try it again:

def create_rocket_features(dl, n_kernels=10_000, kss=[7, 9, 11], device=None):
    """Args:

        dl        : single TSDataLoader (for example dls.train or dls.valid)
        n_kernels : number of kernels created in ROCKET
        kss       : filter sizes used by ROCKET    
    """
    model = ROCKET(dl.vars, dl.len, n_kernels=n_kernels, kss=kss, device=device)
    for i, (xb, yb) in enumerate(progress_bar(dl)):
        # move each batch's output to the CPU right away so only one batch lives on the GPU
        _x_out = model(xb).detach().cpu()
        _y_out = yb.detach().cpu()
        x_out = _x_out if i == 0 else torch.cat([x_out, _x_out])
        y_out = _y_out if i == 0 else torch.cat([y_out, _y_out])
    return x_out.numpy(), y_out.numpy()

geoHeil commented 3 years ago

Interesting: I am operating on a 24-hour sliding window (no overlaps) with approx. 16 GB of data on disk (after applying the sliding window operator). Your previous ROCKET was rather fast, but now the ROCKET progress bar is showing me 9 hours of runtime. I will report whether the OOMs are fixed now, though. Maybe I should increase the batch size; for now I will leave it running with 64/128 batch sizes.

geoHeil commented 3 years ago

I switched to a higher batch size of 1024. The new function is much more memory efficient! I guess far larger batch sizes should also work now.

oguiza commented 3 years ago

IMPORTANT

Hi @geoHeil, I realized this morning there is a critical bug in the create_rocket_features I sent you. The issue is that the model is created within the function, which means it will be different for train and valid, thus creating random features every time. I've now fixed this and updated it on GitHub. In addition to that, I've made a few other changes.
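
The key change, sketched roughly (the exact signature is in the repo), is that the model is now built once outside the function and passed to both calls, so train and valid share the same random kernels:

# build the ROCKET model once so train and valid use the same random kernels
model = build_ts_model(ROCKET, dls=dls)
X_train, y_train = create_rocket_features(dls.train, model)
X_valid, y_valid = create_rocket_features(dls.valid, model)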

geoHeil commented 3 years ago

Many thanks!

I have used it like:

dls   = TSDataLoaders.from_dsets(dsets.train, dsets.valid, bs=[8192, 16384], batch_tfms=[TSStandardize(by_var=True)], num_workers=0, drop_last=False, shuffle_train=False)# images defined
model = build_ts_model(ROCKET, dls=dls) # this will create the model outside the function, and you can save it if necessary

Notice: 73bab0012ad7cf5db9702138ece7537f40dbe047 was used - so it should already include your latest (and fixed) version of the function.

However:

X_train, y_train = create_rocket_features(dls.train, model) 
X_valid, y_valid = create_rocket_features(dls.valid, model)

Fails with:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-8-3d70312f15b5> in <module>
----> 1 X_train, y_train = create_rocket_features(dls.train, model)
      2 X_valid, y_valid = create_rocket_features(dls.valid, model)

~/development/conda_envs/foo/lib/python3.8/site-packages/tsai/models/ROCKET.py in create_rocket_features(dl, n_kernels, kss, device)
    130         kss       : filter sizes used by ROCKET
    131     """
--> 132     model = ROCKET(dl.vars, dl.len, n_kernels=n_kernels, kss=kss, device=device)
    133     for i,(xb,yb) in enumerate(progress_bar(dl)):
    134         _x_out = model(xb).detach().cpu().numpy()

~/development/conda_envs/foo/lib/python3.8/site-packages/tsai/models/ROCKET.py in __init__(self, c_in, seq_len, n_kernels, kss, device)
     97         kss = [ks for ks in kss if ks < seq_len]
     98         convs = nn.ModuleList()
---> 99         for i in range(n_kernels):
    100             ks = np.random.choice(kss)
    101             dilation = 2**np.random.uniform(0, np.log2((seq_len - 1) // (ks - 1)))

TypeError: 'ROCKET' object cannot be interpreted as an integer

geoHeil commented 3 years ago

Wait - something must not have worked with the update - it looks like it is still referring to the old function. Let me double-check this.

geoHeil commented 3 years ago

Indeed, this is computing the features now. I can use bs=[8192, 16384] successfully with approx. 4 GB of memory utilization. With 100_000 kernels it requires approx. 12 GB and takes approx. 15 minutes per batch - and thus approx. 1 hour to create the ROCKET features - but it works great without any OOM now.