Closed geoHeil closed 3 years ago
Indeed, using a less overlapping sliding window with much less data works fine. I guess this issue should potentially remain open as a feature request to support mini batches for the ROCKET feature creation step.
Hi @geoHeil, There's a way to handle large amounts of data if you use np.memmap arrays and build a DL model to process the calculated features. I don't have much time to fully detail the process now, but this is what you'd need to do at a high level:
If you try this approach, please let me know how it works for you. I use np.memmap arrays all the time with my data and they work great.
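As a rough illustration of the np.memmap idea, here is a minimal sketch that writes features to a disk-backed array one batch at a time and then reopens it read-only. The sizes, the file name, and the compute_features placeholder are assumptions made up for this example; they are not part of tsai.

```python
import os
import tempfile

import numpy as np

# Hypothetical sizes, kept small so the sketch runs quickly.
n_samples, n_features, batch_size = 1_000, 64, 256

path = os.path.join(tempfile.mkdtemp(), "features.dat")

# Disk-backed output array: it is never held fully in RAM.
out = np.memmap(path, dtype=np.float32, mode="w+", shape=(n_samples, n_features))


def compute_features(start, stop):
    # Placeholder for the real feature computation (e.g. a ROCKET forward pass).
    rng = np.random.default_rng(start)
    return rng.standard_normal((stop - start, n_features)).astype(np.float32)


# Fill the array batch by batch, so only one batch is in memory at a time.
for start in range(0, n_samples, batch_size):
    stop = min(start + batch_size, n_samples)
    out[start:stop] = compute_features(start, stop)

out.flush()  # make sure everything is written to disk
del out

# Reopen read-only; slices are read from disk on demand.
features = np.memmap(path, dtype=np.float32, mode="r", shape=(n_samples, n_features))
print(features.shape, float(features[:batch_size].mean()))
```

The same pattern should apply to the raw input windows: store them in a memmap and let the dataloader pull individual batches from it.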
I have to check. For now, I had to reduce the number of kernels to 1000, but then a prediction using e.g. xgboost barely predicts any labels of the minority class, if any at all.
Regarding your suggestions: do I understand correctly that you mean to use create_rocket_features (your custom PyTorch implementation) with mini-batches in a custom implementation of that function? My machine has 512 GB RAM - and so far regular RAM is not a problem at all, rather the P100 with only 15GB is crashing with OOM. If I understand it correctly the first part of using the memory-mapped file could thus be skipped and only the mini-batches (3) in your answer would be needed.
That allows you to handle very large datasets, as you only upload to memory individual batches.
So do you mean that I would need to decrease the batch size further? I.e. lower than: bs=[64, 128]
Hi,
I've noticed there was an issue with create_rocket_features that I've now fixed in the GitHub repo. Sorry about that.
I've also modified the API, and I believe it's now even simpler to use. So far I've only updated it in the repo. I will include it in the next release when it becomes available.
Here's a minimal example of how you may use it.
X, y, splits = get_UCR_data('OliveOil', split_data=False)
tfms = [None, TSRegression()]
batch_tfms = TSStandardize(by_var=True)
dls = get_ts_dls(X, y, splits=splits, tfms=tfms, batch_tfms=batch_tfms)
X_train, y_train = create_rocket_features(dls.train, n_kernels=10_000)
X_valid, y_valid = create_rocket_features(dls.valid, n_kernels=10_000)
X_train.shape, X_valid.shape
Please, let me know if it works well now.
I forgot to mention that with the large dataset you have, you may consider eliminating the overlap entirely, or setting it to 50% of the window length. That'd significantly reduce the dataset size, and shouldn't impact results too much. Those are commonly used settings.
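To see how much the stride changes the number of windows, here is a small NumPy sketch. The make_windows helper and the toy series are hypothetical and only for illustration; this is not the tsai sliding-window function.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # requires NumPy >= 1.20


def make_windows(series, window_len, stride):
    """Turn (n_steps, n_vars) into windows of shape (n_windows, n_vars, window_len)."""
    # sliding_window_view returns every possible window (stride 1) as a view...
    windows = sliding_window_view(series, window_len, axis=0)
    # ...so keep only every `stride`-th window.
    return windows[::stride]


# Toy series: 1,000 time steps, 3 variables.
series = np.arange(1_000 * 3, dtype=np.float32).reshape(1_000, 3)
window_len = 48

# Stride 5 (heavy overlap), 24 (50% overlap), 48 (no overlap).
for stride in (5, 24, 48):
    w = make_windows(series, window_len, stride)
    print(f"stride={stride:2d} -> {w.shape[0]} windows of shape {w.shape[1:]}")
```

The window count is (n_steps - window_len) // stride + 1, so going from a stride of 5 to a stride equal to the window length cuts the dataset by roughly a factor of ten in this toy case.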
When testing it:
X_train, y_train = create_rocket_features(dls.train, n_kernels=10_000)
X_valid, y_valid = create_rocket_features(dls.valid, n_kernels=10_000)
X_train.shape, X_valid.shape
...............................
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-8-a76fb33126ed> in <module>
----> 1 X_train, y_train = create_rocket_features(dls.train, n_kernels=10_000)
2 X_valid, y_valid = create_rocket_features(dls.valid, n_kernels=10_000)
3 X_train.shape, X_valid.shape
~/development/conda_envs/my_env/lib/python3.8/site-packages/tsai/models/ROCKET.py in create_rocket_features(dl, n_kernels, kss, device)
134 _x_out = model(xb).detach().cpu().numpy()
135 _y_out = yb.detach().cpu().numpy()
--> 136 x_out = _x_out if i == 0 else torch.cat([x_out, _x_out])
137 y_out = _y_out if i == 0 else torch.cat([y_out, _y_out])
138 return x_out, y_out
TypeError: expected Tensor as element 0 in argument 0, but got numpy.ndarray
though I was using:
splits = get_splits(y, valid_size=.2, stratify=True, random_state=47, shuffle=False)
tfms = [None, [Categorize()]]
dsets = TSDatasets(X, y, tfms=tfms, splits=splits)
and not:
dls = get_ts_dls(X, y, splits=splits, tfms=tfms, batch_tfms=batch_tfms)
but I would guess it should give an equal result. Indeed, this fails as well with the same error.
Oh, there was a mistake in the code. Please, paste this function and try it again:
def create_rocket_features(dl, n_kernels=10_000, kss=[7, 9, 11], device=None):
    """Args:
        dl        : single TSDataLoader (for example dls.train or dls.valid)
        n_kernels : number of kernels created in ROCKET
        kss       : filter sizes used by ROCKET
    """
    model = ROCKET(dl.vars, dl.len, n_kernels=n_kernels, kss=kss, device=device)
    for i,(xb,yb) in enumerate(progress_bar(dl)):
        _x_out = model(xb).detach().cpu()
        _y_out = yb.detach().cpu()
        x_out = _x_out if i == 0 else torch.cat([x_out, _x_out])
        y_out = _y_out if i == 0 else torch.cat([y_out, _y_out])
    return x_out.numpy(), y_out.numpy()
Interesting: I am operating on a 24 hour sliding window (no overlaps) with approx 16GB of data on disk (after applying the sliding window operator). Your previous ROCKET was rather fast. But now, the ROCKET progress bar is showing me 9 hours of runtime. I will report whether the OOMs are fixed now, though. Maybe I should increase the batch size. For now I will leave it running with batch sizes of 64/128.
I switched to a higher batch size of 1024. The new function is much more memory efficient! I guess far larger batch sizes should also work now.
IMPORTANT
Hi @geoHeil, I realized this morning there is a critical bug in the create_rocket_features I sent you. The issue is that the model is created within the function, which means its random kernels will be different for train and valid, so the two feature sets are not comparable. I've fixed this now and updated it on GitHub. In addition to that, I've made a few other changes:
X, y, splits = get_UCR_data('OliveOil', split_data=False)
tfms = [None, TSRegression()] # TSRegression for regression, TSForecasting for forecasting, TSClassification for classification
batch_tfms = TSStandardize(by_var=True) # as indicated by the authors
dls = get_ts_dls(X, y, splits=splits, tfms=tfms, batch_tfms=batch_tfms, shuffle_train=False, drop_last=False) # there's no need to shuffle data and you want to use all data
model = build_ts_model(ROCKET, dls=dls) # this will create the model outside the function, and you can save it if necessary
X_train, y_train = create_rocket_features(dls.train, model)
X_valid, y_valid = create_rocket_features(dls.valid, model)
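To illustrate why the model has to be shared between train and valid, here is a simplified NumPy stand-in for random convolutional kernels. It is only a sketch: make_kernels and transform are hypothetical helpers, and it omits the dilation and padding that the real ROCKET uses. It keeps the two pooled features per kernel (max and proportion of positive values).

```python
import numpy as np


def make_kernels(n_kernels, kernel_len, rng):
    """Draw random weights and biases (a simplified stand-in for ROCKET kernels)."""
    weights = rng.standard_normal((n_kernels, kernel_len))
    weights -= weights.mean(axis=1, keepdims=True)  # mean-center each kernel
    biases = rng.uniform(-1, 1, n_kernels)
    return weights, biases


def transform(X, kernels):
    """Map univariate series (n_samples, seq_len) to (n_samples, 2 * n_kernels)."""
    weights, biases = kernels
    feats = []
    for w, b in zip(weights, biases):
        # Valid cross-correlation of every series with one kernel, plus bias.
        out = np.stack([np.correlate(x, w, mode="valid") for x in X]) + b
        feats.append(out.max(axis=1))         # max pooling
        feats.append((out > 0).mean(axis=1))  # proportion of positive values
    return np.stack(feats, axis=1)


rng = np.random.default_rng(0)
X_train = rng.standard_normal((8, 50))
X_valid = rng.standard_normal((4, 50))

# Correct: draw the kernels once and reuse them for every split.
kernels = make_kernels(20, 9, rng)
F_train = transform(X_train, kernels)
F_valid = transform(X_valid, kernels)

# Buggy pattern: fresh kernels per call, so feature columns no longer line up.
F_valid_bad = transform(X_valid, make_kernels(20, 9, rng))

print(F_train.shape, F_valid.shape)
print("same features with fresh kernels?", np.allclose(F_valid, F_valid_bad))
```

With shared kernels, column j means the same thing in F_train and F_valid; with freshly drawn kernels, a classifier trained on F_train sees unrelated columns at validation time.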
Many thanks!
I have used it like:
dls = TSDataLoaders.from_dsets(dsets.train, dsets.valid, bs=[8192, 16384], batch_tfms=[TSStandardize(by_var=True)], num_workers=0, drop_last=False, shuffle_train=False)# images defined
model = build_ts_model(ROCKET, dls=dls) # this will create the model outside the function, and you can save it if necessary
Notice: commit 73bab0012ad7cf5db9702138ece7537f40dbe047 was used, so it should already include your latest (and fixed) version of the function.
However:
X_train, y_train = create_rocket_features(dls.train, model)
X_valid, y_valid = create_rocket_features(dls.valid, model)
Fails with:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-8-3d70312f15b5> in <module>
----> 1 X_train, y_train = create_rocket_features(dls.train, model)
2 X_valid, y_valid = create_rocket_features(dls.valid, model)
~/development/conda_envs/foo/lib/python3.8/site-packages/tsai/models/ROCKET.py in create_rocket_features(dl, n_kernels, kss, device)
130 kss : filter sizes used by ROCKET
131 """
--> 132 model = ROCKET(dl.vars, dl.len, n_kernels=n_kernels, kss=kss, device=device)
133 for i,(xb,yb) in enumerate(progress_bar(dl)):
134 _x_out = model(xb).detach().cpu().numpy()
~/development/conda_envs/foo/lib/python3.8/site-packages/tsai/models/ROCKET.py in __init__(self, c_in, seq_len, n_kernels, kss, device)
97 kss = [ks for ks in kss if ks < seq_len]
98 convs = nn.ModuleList()
---> 99 for i in range(n_kernels):
100 ks = np.random.choice(kss)
101 dilation = 2**np.random.uniform(0, np.log2((seq_len - 1) // (ks - 1)))
TypeError: 'ROCKET' object cannot be interpreted as an integer
Wait - something must have not worked with regards to the update - it looks like it is still referring to the old function. Let me double check this.
Indeed, this is computing the features now. I can use bs=[8192, 16384] successfully with approx 4GB memory utilization. 100_000 requires approx 12GB and takes approx 15 minutes for a batch, and thus approx 1 hour to create the ROCKET features, but it works great without any OOM now.
When trying to compute the ROCKET features, it fails for me with a CUDA out of memory error:
for ROCKET on an NVIDIA P100.
The data is loaded using:
and I have already tried reducing the number of kernels and kss. But it still runs out of memory, even when reducing the batch size to 8/16.
NOTICE: on disk the numpy array is approximately 70GB in size.
Maybe I am creating too many windows? On a dataframe with approx 50 columns, 14. million records (panel data), 6GB in size according to pandas, I use:
to generate a sliding window of length 48 hours with a stride of 5 hours. Perhaps I should decrease the number of windows? But I find it strange that neither reducing the batch size nor reducing the number of features helped to solve the problem.