timeseriesAI / tsai
State-of-the-art deep learning library for time series and sequences in PyTorch / fastai
https://timeseriesai.github.io/tsai/
Apache License 2.0

Memory Error with Very Large Dataset #126

Closed — HariWu1995 closed this issue 3 years ago

HariWu1995 commented 3 years ago

Dear team,

I am stuck when converting a very large numpy array to your TSDatasets. These are the things I have tried to fix the issue:

Is there any way to overcome this issue? Or do I have to use only your models and design my own pipeline around a TensorFlow data generator?

oguiza commented 3 years ago

I'm sorry, but I'm not familiar with tensorflow.keras.preprocessing.timeseries_dataset_from_array. I do know this function is used to generate one batch at a time. I don't know what pipeline you have built, but I can try to explain what you need to create a new dataset in tsai. You need an X containing all the samples. If you have a single time series, you may need to use SlidingWindow to split it into subsequences. You may also need a y (unless you're doing unsupervised training). X needs to be an np.ndarray (note that np.memmap arrays are np.ndarrays). Lastly, you need to pass splits to create the train and valid datasets. I've tested this approach with larger-than-memory data, and it worked out well.
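
For reference, a minimal sketch of what that looks like (paths, names and shapes are illustrative; if you start from a single long series, SlidingWindow would first produce X and y — check the tsai docs for the exact signatures):

from tsai.all import *
import numpy as np

# X: (samples, variables, sequence length); np.memmap behaves like an np.ndarray but stays on disk
X = np.memmap('data/X.dat', dtype='float32', mode='r', shape=(100_000, 3, 60))
y = np.load('data/y.npy')                    # one target per sample (omit for unsupervised training)
splits = get_splits(y, valid_size=0.2)       # train/valid index lists
tfms = [None, [TSRegression()]]              # or the appropriate target transform for your task
dsets = TSDatasets(X, y, tfms=tfms, splits=splits)
dls = TSDataLoaders.from_dsets(dsets.train, dsets.valid, bs=64)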

TKassis commented 3 years ago

I'm having a similar issue. If I create an np.memmap or use something like the Zarr arrays below, it seems that TSDatasets() is creating a copy in memory. I went through the class, but I don't see where this is happening.

X = zarr.open('data/X_temp.zarr', mode='r')
y = zarr.open('data/y_temp.zarr', mode='r')

# Set up data loaders
tfms  = [None, [TSRegression()]]
batch_tfms = [TSNormalize(by_var=True), TSStandardize(by_var=True)]
dsets = TSDatasets(X, y, tfms=tfms, splits=splits, inplace=False)
dls = TSDataLoaders.from_dsets(dsets.train, dsets.valid, bs=64, batch_tfms=batch_tfms)

I can do any numpy operations on X and y without them being loaded into memory, but TSDatasets() is creating an in-memory copy for some reason. I'm also not sure what the inplace argument does, as the behavior seems to be the same with either True or False.

TKassis commented 3 years ago

I think the problem is here:

class NumpyDatasets(Datasets):
    "A dataset that creates tuples from X (and y) and applies `tfms` of type item_tfms"
    _xtype, _ytype = NumpyTensor, None # Expected X and y output types (must have a show method)
    def __init__(self, X=None, y=None, items=None, tfms=None, tls=None, n_inp=None, dl_type=None, inplace=True, **kwargs):
        if X is not None and not isinstance(X, (np.ndarray, torch.Tensor)): X = np.asarray(X)
        if y is not None and not isinstance(y, (np.ndarray, torch.Tensor)): y = np.asarray(y)
        self.inplace = inplace
        if tls is None:
            X = itemify(X, tup_id=0)
            y = itemify(y, tup_id=0) if y is not None else y
            items = tuple((X,)) if y is None else tuple((X,y))
            self.tfms = L(ifnone(tfms,[None]*len(ifnone(tls,items))))
        self.tls = L(tls if tls else [TfmdLists(item, t, **kwargs) for item,t in zip(items,self.tfms)])
        self.n_inp = (1 if len(self.tls)==1 else len(self.tls)-1) if n_inp is None else n_inp
        if 'split' in kwargs: self.split_idxs = kwargs['split']
        elif 'splits' in kwargs:  self.split_idxs = kwargs['splits']
        else: self.split_idxs = L(np.arange(len(self.tls[0]) if len(self.tls[0]) > 0 else len(self.tls)).tolist())
        if len(self.tls[0]) > 0:
            self.types = L([ifnone(_typ, type(tl[0]) if isinstance(tl[0], torch.Tensor) else tensor) for tl,_typ in zip(self.tls, [self._xtype, self._ytype])])
            self.ptls = L([tl if not self.inplace else tl[:] if type(tl[0]).__name__ == 'memmap' else tensor(stack(tl[:])) for tl in self.tls])

Specifically with:

        if X is not None and not isinstance(X, (np.ndarray, torch.Tensor)): X = np.asarray(X)
        if y is not None and not isinstance(y, (np.ndarray, torch.Tensor)): y = np.asarray(y)

This copies the array into memory if it is not already an np.ndarray or torch.Tensor.
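
A quick way to see the effect (illustrative):

import numpy as np
import zarr

z = zarr.open('data/X_temp.zarr', mode='r')  # lazy: chunks stay on disk until sliced
x = np.asarray(z)                            # __array__ is called, so the whole array is read into RAM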

Update: I think this will be hard to solve as the Fastai Datasets class seems to convert things to np.ndarray as well. I think our best bet is to write a custom dataset and dataloader class.

oguiza commented 3 years ago

Hi @TKassis, Thanks for adding more details to this issue. I believe there are at least 2 reasons why you may be getting OOM errors with large datasets.

  1. The input is always converted into np.ndarrays. This means that on-disk data is brought into memory, which causes the issue.
  2. Even if the data structure were preserved, applying the splits you pass also brings the data into memory.

I'd like to fix this issue to allow the use of larger than memory datasets (I intended to do it some time ago, and actually started working on it, but couldn't finalize the fix). Here's what I'd like to implement:

  1. An updated custom dataset and dataloader class that preserves the input type until it's sliced to generate the batch.
  2. Avoid splitting the data before the batch is created. This would be done by calculating the slice relative to the original input (using the split for training, validation, etc).

I'd like to extend support beyond np.ndarrays/ np.memmap to other array-like libraries (like zarr, xarray, dask.array, ...). I'll test this approach and see if it works.
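
To make the intent concrete, here's a bare-bones sketch of the idea (not the actual tsai implementation):

import torch
from torch.utils.data import Dataset

class OnDiskTSDataset(Dataset):
    "Keeps X/y in their original (possibly on-disk) format; slices only when an item is requested."
    def __init__(self, X, y, split_idxs):
        self.X, self.y, self.split_idxs = X, y, split_idxs  # X/y: np.memmap, zarr, dask, xarray, ...
    def __len__(self):
        return len(self.split_idxs)
    def __getitem__(self, i):
        j = self.split_idxs[i]                               # index relative to the original array
        return torch.as_tensor(self.X[j]), torch.as_tensor(self.y[j])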

oguiza commented 3 years ago

Hi @TKassis, I've further investigated this issue and have found a way to keep the zarr array on disk until batch creation. However, I've found an issue that I don't know how to fix. I'm not familiar with zarr arrays. I thought they worked similarly to memmap arrays, but it seems they don't. The issue is that tsai dataloaders create batches by passing all the indices at once. In the train dl these indices are shuffled. But it seems that zarr arrays can only take a single item or a slice (for a given region). Do you know if there's any way to slice a zarr array with a random items list (like [34, 5, 23, ...])? If this can't be done, the solution would be to have an option to create the batch by applying the indices one at a time. The results would then need to be concatenated, but this would be a slower option.

TKassis commented 3 years ago

I believe you can use what Zarr refers to as coordinate indexing. Something like this:

>>> z = zarr.array(np.arange(10))
>>> z[:]
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> z.get_coordinate_selection([1, 4])
array([1, 4])
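
For what it's worth, an arbitrary (unsorted) list of indices also seems to work with that method, which matches what the shuffled train dl needs:

>>> z.get_coordinate_selection([7, 2, 5])   # arbitrary order, no slice required
array([7, 2, 5])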
TKassis commented 3 years ago

I was able to get Zarr to work with the following Pytorch dataloader and Pytorch Lightning trainer with your TST implementation (unchanged). The downside is that it lacks a lot of the useful features you integrated into TSAI, such as MVP:

import zarr
import torch
import torch.nn.functional as F
import pytorch_lightning as pl
from torch.utils.data import Dataset, DataLoader, random_split
from tsai.models.TST import TST

class MSDataset(Dataset):
    def __init__(self, inputs_path, target_path, transform=None, target_transform=None ):
        super().__init__()
        self.inputs_path = inputs_path
        self.target_path = target_path
        self.X = zarr.open(self.inputs_path, mode='r')
        self.y = zarr.open(self.target_path, mode='r')
        self.transform = transform

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx].reshape(1,)

class TSTModel(pl.LightningModule):
    def __init__(self, **kwargs):
        super().__init__()
        self.model = TST(**kwargs)

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self.model(x)
        loss = F.l1_loss(y_hat, y)
        self.log('train_loss', loss)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self.model(x)
        loss = F.l1_loss(y_hat, y)
        self.log('val_loss', loss)
        return loss

    def test_step(self, batch, batch_idx):
        return None

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        return optimizer

if __name__ == '__main__':

    print('Creating datasets...')
    data = MSDataset(inputs_path='data/X_temp.zarr', target_path='data/y.zarr')
    print(f'Data size: {len(data)}')

    train_set, val_set = torch.utils.data.random_split(data, [2500, 622])

    train_dataloader = DataLoader(train_set, batch_size=64, shuffle=True)
    val_dataloader = DataLoader(val_set, batch_size=64, shuffle=True)

    model = TSTModel(c_in=50000, c_out=1, seq_len=300)

    trainer = pl.Trainer(log_every_n_steps=25,
                        auto_lr_find=True,
                        gpus=1
                        )

    trainer.fit(model, train_dataloader, val_dataloader)
oguiza commented 3 years ago

Thanks for sharing this, @TKassis. I can see this solution preserves the data on disk in its original format until it is sliced. That's great. The biggest complexity (and benefit) with fastai comes from transforms. Transforms are necessary, for example, in some classification tasks where the input is of type str. I believe there's a way to use a similar approach in tsai while maintaining the rest of the functionality. I've already created a solution that I need to test. You would be able to choose whether to apply transforms in place (when your data fits in memory) or on the fly (when it doesn't). The former would be a bit faster but requires more memory, while the latter would be useful for larger-than-memory datasets. What I'm not sure about is whether I'll be able to complete this fix by tomorrow. Otherwise, I won't be able to work on it until 2 weeks from Monday.

oguiza commented 3 years ago

Hi @TKassis, @HariWu1995, @xanderdunn,

First of all, I'm sorry for the delay in fixing this long-standing issue. I've just made a major update to the core tsai module. I've fixed some bugs that were preventing us from training with larger-than-memory datasets. I've tested the new functionality with some dummy data and it seems to work well. You can take a look at the notebook I've used for testing.

In addition to that, and based on @TKassis's request, I've also extended the input formats that are supported. You should now be able to use the following:

The ones you can use on disk are the last 4.

In terms of performance, np.memmap, dask.array and xarray are pretty close. However, I've noticed that zarr is much slower. The issue seems to be the way in which the array is sliced. It'd be good if any of you has an idea of how this could be improved.

If you test tsai with large datasets, please, let me know how it works.

TKassis commented 3 years ago

This is great! I'll test it out soon.

With regards to Zarr, yes, I noticed that too. I think it depends on how one initially chunks the Zarr array. For the purposes of randomly selecting samples of the data with the TSAI dataloader, users should chunk it as (1, vars, seq_len) when initially creating the array.
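
For example, something along these lines when creating the array (path and sizes are illustrative):

import zarr

n_samples, n_vars, seq_len = 100_000, 10, 1000
X = zarr.open('data/X.zarr', mode='w', dtype='float32',
              shape=(n_samples, n_vars, seq_len),
              chunks=(1, n_vars, seq_len))   # one chunk per sample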

oguiza commented 3 years ago

This is great! I'll test it out soon.

That'd be great!

For the purposes of randomly selecting samples of the data with the TSAI dataloader, users should chunk it as (1, vars, seq_len) when initially creating the array.

Thanks for pointing this out. I've run some additional tests and these are the results: [screenshot: Screen Shot 2021-08-29 at 12 57 28]

It looks like zarr arrays are even faster than memmap arrays. However, as to the chunking, the best alternative for randomly indexing samples seems to be (n_samples, 1, 1). You may want to double-check this when you test the new functionality. 2 findings:

  1. Indexing a large zarr array on disk seems to be faster than indexing a np.memmap array of the same size.
  2. There are multiple ways to define chunks in a zarr array. chunks=(1, vars, seq_len) seems to be the fastest one (as @TKassis mentioned in his previous message).
TKassis commented 3 years ago

Oh, that's interesting. I just assumed (1, vars, seq_len) would be faster than (n_samples, 1, 1). I'll try it out and confirm. Looking at the data above, though, am I mistaken in thinking that (1, vars, seq_len) is actually faster (217 ms per loop vs 897 ms)?

It seems Zarr is at least 2x faster than memory-mapped numpy arrays. Just out of curiosity since you already have this set up, how do the above numbers compare to reading from a memory-loaded array? Also, are you using an SSD drive or a HDD?

oguiza commented 3 years ago

I'm sorry @TKassis! I was wrong in my last post. I'll edit it. The results in the uploaded image are correct; the text isn't! You were completely right: (1, vars, seq_len) is faster. I cannot compare the numbers to data in memory, as the array is 37 GB while RAM is only 12 GB. I can only compare it to np.memmap, and indexing seems to be 7x faster (217 ms vs 1.5 s). I've used Google Colab Pro, so I don't know the hardware details.

oguiza commented 3 years ago

BTW, I've noticed there's an even faster way to index a zarr array if you set the chunks to (n_samples, n_vars, seq_len):

[screenshot: Screen Shot 2021-08-29 at 21 33 44]

This is 30x faster than memmap arrays (57 ms vs 1.61 s). However, the issue is how to prepare such an array with the expected data without it taking too long. It'd be interesting to investigate whether there's an easy way to do it. Training turns out to be about 6x faster (10 min per epoch for 1M samples).

TKassis commented 3 years ago

Wow, this is great!

It's a bit strange that (n_samples, n_vars, seq_len) is faster than (1, n_vars, seq_len). I have a limited understanding of chunking, but I thought that using chunks that are similar to how the data will be read should provide faster read results.

TKassis commented 3 years ago

Could it be that with (n_samples, n_vars, seq_len), given there were 10 loops, the total time is actually 10 x 57.2 ms rather than 57.2 ms?

oguiza commented 3 years ago

No, time is measured per loop.

TKassis commented 3 years ago

Here are my results. I have an SSD drive and 64 GB of RAM:

[image: benchmark results]

I should note that one very practical advantage of storing an array as Zarr instead of a Numpy .npy file is that it is much smaller, since Zarr uses compression. For example, a 187 GB array as .npy is only 5 GB with Zarr (at least for the arrays I'm working with).

oguiza commented 3 years ago

Looks great!! It confirms the previous findings. This means that zarr arrays with the proper chunks setting are close to having the data in memory, and much faster than memmap arrays. (Obviously, if the data fits in memory, that's the preferred option.) tsai is prepared to work with zarr arrays. The critical thing, though, is to be able to prepare zarr arrays with the right chunks quickly. If we find a way to do this, it would be a great solution for large datasets.

TKassis commented 3 years ago

Just FYI, I was trying to do a quick write-speed test (assigning a numpy array to the Zarr array), and it seems writing all samples without chunking is not feasible for large arrays.

[image: write-speed test results]
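
One workaround (a sketch, not benchmarked on the arrays above) is to populate the chunked array block by block, so that neither the source nor the destination has to be handled in a single assignment:

import numpy as np
import zarr

n, v, s = 1_000_000, 10, 300                         # illustrative sizes
Z = zarr.open('data/X.zarr', mode='w', dtype='float32',
              shape=(n, v, s), chunks=(1, v, s))
block = 10_000                                       # samples written per iteration
for start in range(0, n, block):
    stop = min(start + block, n)
    Z[start:stop] = np.random.rand(stop - start, v, s).astype('float32')  # replace with your real data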

oguiza commented 3 years ago

I've continued investigating how to speed up training and have found 2-3 things that help.

  1. Sorting the indices. While sorting the indices doesn't help with numpy arrays, it helps a lot when using Zarr arrays. This is because, when sorted, many indices fall in the same chunk, which makes the lookup faster. Here are some new tests I've run:

[screenshot: Screen Shot 2021-08-30 at 15 37 20]

  2. Setting the dataloaders' num_workers to the number of CPUs available (see the sketch right after this list). A key advantage of zarr over numpy arrays is that zarr has been designed for parallel computation, which means it can be read and written in parallel by multiple workers. In my experience, setting num_workers > 0 tends to degrade performance when using numpy arrays or np.memmap, but it improves performance when using zarr arrays. Here's a training time benchmark:

    • numpy with num_workers=0: 59 min/epoch
    • zarr (chunk=(1,-1,-1)) with num_workers=0: 18 min/epoch
    • zarr (chunk=(1,-1,-1)) with num_workers=2: 15 min/epoch
    • zarr (chunk=(-1,-1,-1)) with num_workers=0: 15 min/epoch

    One additional insight is that with numpy arrays the GPU is at 0% utilization most of the time (the CPU is the bottleneck), while with zarr arrays it is almost always working at 100% (the CPU is no longer a bottleneck).
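
A minimal sketch of the two tweaks above (array and path are illustrative; this is not the tsai internals):

import numpy as np
import zarr

X_zarr = zarr.open('data/X.zarr', mode='r')

# 1. sort the batch indices so consecutive reads tend to fall within the same chunk
idxs = np.sort(np.random.choice(len(X_zarr), 512, replace=False))
batch = X_zarr.oindex[idxs]        # orthogonal indexing along the sample axis

# 2. zarr stores can be read concurrently, so pass num_workers > 0 when building the
#    dataloaders, e.g. TSDataLoaders.from_dsets(..., num_workers=4)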

One additional benefit of zarr is compression. It not only reduces the file size but also increases the effective bandwidth.

So based on all this I believe the recommended approach to train large datasets at maximum speed would be the following:

And that's it. It'd be interesting to compare the usual training time using np.memmap array or even an in-memory np.array.

It'd be good if some of you confirm these findings. If all this is confirmed, it means we'll be able to significantly improve training speed.

TKassis commented 3 years ago

Incredible work! I can test this out in the next 2-3 days since I already have a Zarr dataset pipeline. I'll compare training speeds with a memmap array and a memory-loaded array. Are these changes on the master branch?

TKassis commented 3 years ago

create a zarr array with chunks=(-1,1,1). In this way each sample will correspond to a chunk.

Just a small correction, you mean chunks=(1, -1, -1), correct?

oguiza commented 3 years ago

This must be some type of dyslexia... 😅 Yes, you are right: 1, -1, -1. I've updated the message.

I'll compare training speeds with a memmap array and a memory-loaded array. Are these changes on the master branch?

That'd be great!! Thanks @TKassis ! I'll make the change to the master branch later today. I'll let you know when it's uploaded.

oguiza commented 3 years ago

I've just pushed the changes required to make zarr work efficiently; they are now available in the GitHub master branch.

oguiza commented 3 years ago

(cc:@TKassis)

I have some new insights I'd like to share.

I've prepared a new tutorial nb on how to efficiently train large datasets with tsai using zarr arrays.

Best practices:

How to define n? A good rule of thumb seems to be this:

If you are interested, here are the results for different shapes and batch sizes (auto, i.e. using chunks_calculator, is almost as good as unchunked, i.e. a single chunk):

shape: (1000, 10, 1000)
    batch size: 64
        memmap               chunks: N/A                 :   0.000589
        zarr_default         chunks: (250, 3, 250)       :   0.008723
        zarr_1               chunks: (1, 10, 1000)       :   0.015477
        zarr_auto            chunks: (1000, 10, 1000)    :   0.000913
        zarr_unchunked       chunks: (1000, 10, 1000)    :   0.000917
    batch size: 512
        memmap               chunks: N/A                 :   0.020232
        zarr_default         chunks: (250, 3, 250)       :   0.028686
        zarr_1               chunks: (1, 10, 1000)       :   0.137249
        zarr_auto            chunks: (1000, 10, 1000)    :   0.017940
        zarr_unchunked       chunks: (1000, 10, 1000)    :   0.017983

shape: (10000, 10, 1000)
    batch size: 64
        memmap               chunks: N/A                 :   0.000882
        zarr_default         chunks: (1250, 2, 125)      :   0.034200
        zarr_1               chunks: (1, 10, 1000)       :   0.015522
        zarr_auto            chunks: (10000, 10, 1000)   :   0.001104
        zarr_unchunked       chunks: (10000, 10, 1000)   :   0.001080
    batch size: 512
        memmap               chunks: N/A                 :   0.020877
        zarr_default         chunks: (1250, 2, 125)      :   0.059502
        zarr_1               chunks: (1, 10, 1000)       :   0.134743
        zarr_auto            chunks: (10000, 10, 1000)   :   0.017995
        zarr_unchunked       chunks: (10000, 10, 1000)   :   0.018066

shape: (100000, 10, 1000)
    batch size: 64
        memmap               chunks: N/A                 :   0.100249
        zarr_default         chunks: (6250, 1, 125)      :   0.146941
        zarr_1               chunks: (1, 10, 1000)       :   0.018990
        zarr_auto            chunks: (26757, 10, 1000)   :   0.003746
        zarr_unchunked       chunks: (100000, 10, 1000)  :   0.003110
    batch size: 512
        memmap               chunks: N/A                 :   0.601841
        zarr_default         chunks: (6250, 1, 125)      :   0.164207
        zarr_1               chunks: (1, 10, 1000)       :   0.138349
        zarr_auto            chunks: (26757, 10, 1000)   :   0.018485
        zarr_unchunked       chunks: (100000, 10, 1000)  :   0.017794

shape: (1000000, 10, 1000)
    batch size: 64
        memmap               chunks: N/A                 :   0.211840
        zarr_default         chunks: (31250, 1, 63)      :   0.460088
        zarr_1               chunks: (1, 10, 1000)       :   0.052718
        zarr_auto            chunks: (27028, 10, 1000)   :   0.034241
    batch size: 512
        memmap               chunks: N/A                 :   1.694527
        zarr_default         chunks: (31250, 1, 63)      :   0.593565
        zarr_1               chunks: (1, 10, 1000)       :   0.173024
        zarr_auto            chunks: (27028, 10, 1000)   :   0.053061

and here is the code in case you want to replicate these results:

# assumes: from tsai.all import *  (provides create_empty_array and chunks_calculator),
# import zarr, import numpy as np, and path = Path('./data') pointing to a local data directory
for n in np.asarray([1e3, 1e4, 1e5, 1e6], dtype=int):
    shape = (n, 10, 1000)
    dtype = 'float32'
    X_memmap = create_empty_array(shape, fname=f'X_{n}', mode='r+', dtype=dtype)
    # use a separate store per chunking scheme so they don't overwrite each other
    X_zarr_default = zarr.open(path/f'X_{n}_default.zarr', mode='w', shape=shape, dtype=dtype)
    X_zarr_1 = zarr.open(path/f'X_{n}_1.zarr', mode='w', shape=shape, dtype=dtype, chunks=(1, None, None))
    X_zarr_auto = zarr.open(path/f'X_{n}_auto.zarr', mode='w', shape=shape, dtype=dtype, chunks=chunks_calculator(shape, dtype))
    if n <= 100_000:
        X_zarr_unchunked = zarr.open(path/f'X_{n}_unchunked.zarr', mode='w', shape=shape, dtype=dtype, chunks=False)
    else:
        X_zarr_unchunked = None

    print(f'\nshape: {shape}')
    for bs in [64, 512]:
        print(f'    batch size: {bs}')
        for name, arr in zip(['memmap', 'zarr_default', 'zarr_1', 'zarr_auto', 'zarr_unchunked'],
                             [X_memmap, X_zarr_default, X_zarr_1, X_zarr_auto, X_zarr_unchunked]):
            if arr is None: continue
            if hasattr(arr, 'oindex'):  # zarr arrays: orthogonal indexing with sorted random indices
                res = %timeit -o -q arr.oindex[np.sort(np.random.choice(len(arr), bs, False))]
            else:                       # np.memmap: plain fancy indexing
                res = %timeit -o -q arr[np.sort(np.random.choice(len(arr), bs, False))]
            print(f'        {name:20} chunks: {str(arr.chunks) if hasattr(arr, "chunks") else "N/A":20}: {res.best:10.6f}')
TKassis commented 3 years ago

This is great insight! Thanks for all your work on this! Also, sorry I haven't been able to run any tests on my end comparing training speeds, as my GPU has been occupied with a hyperparameter tuning job for the past two weeks, but I will get to it soon.

oguiza commented 3 years ago

sorry I haven't been able to run any tests on my end comparing training speeds, as my GPU has been occupied with a hyperparameter tuning job for the past two weeks, but I will get to it soon.

No problem. It'd be great if you can measure training speed, but if you can't that is also fine. I'd never tried zarr arrays before, but I'm planning to start using them now. I'm really happy with their performance.

BTW, dnth and I are preparing a hyperparameter optimization notebook using Optuna that we'll release very soon. Do you have any experience with it?

TKassis commented 3 years ago

I've never used Optuna, unfortunately. I usually just use W&B Sweeps, which is really easy to use and works really well with the WandbCallback() you have.

oguiza commented 3 years ago

(I'm copying you, @TKassis , because you told me you might try this approach).

I got a new insight when using zarr arrays. Compression plays a key role.

In the examples I shared, all arrays were initialized with zeros. That's fine for memmap arrays, but since zarr arrays are compressed, when all the data are zeros the compression ratio is huge. This makes loading the data into memory incredibly fast (as shown in the tutorial nb). However, this is not the type of improvement we should expect with 'real data'.

Real data cannot have such a high compression ratio. This means it takes longer to load data into memory.

I still see a reduction in training time with large datasets (37 GB). In the tutorial nb, I saw a reduction in total training time of about 65% with data equal to zeros. With random data (compression ratio 1:1) the reduction is about 30%. So I still think it makes sense to try zarr arrays when using large datasets. Real datasets can usually be compressed to some extent, so I would expect the performance improvement to fall between 30-65% compared to np.memmap. An easy way to know how much your data has been compressed is to print:

X_zarr.info

and look at 'Storage ratio'.
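
For instance (the compressor below is from zarr's default Blosc family and is only an example; adjust it to your data):

import zarr
from numcodecs import Blosc

X_zarr = zarr.open('data/X.zarr', mode='w', dtype='float32',
                   shape=(100_000, 10, 1000),
                   chunks=(1, None, None),   # one sample per chunk, i.e. the (1, -1, -1) setting
                   compressor=Blosc(cname='zstd', clevel=3, shuffle=Blosc.BITSHUFFLE))
print(X_zarr.info)                           # check the 'Storage ratio' line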

One thing to take into account based on this new learning is that chunks should be set to (1, -1, -1), as originally proposed by @TKassis, so we should forget about the chunks_calculator. Sorry about the full workaround!

I'll update the tutorial nb later this week.

TKassis commented 3 years ago

@oguiza thanks for the update! I'll try some real world data training with Zarr. I'll be out on vacation until next week, but can do it when I'm back.

oguiza commented 3 years ago

@oguiza thanks for the update! I'll try some real world data training with Zarr. I'll be out on vacation until next week, but can do it when I'm back.

Ok, no worries. Enjoy your vacation!!

oguiza commented 3 years ago

I'll close this issue since it's already fixed.

@TKassis, if you ever get a chance to test the updated functionality, I'd appreciate it if you would quickly let me know how it performs :)

TKassis commented 3 years ago

@oguiza of course! Still on my to-do list! :-) Just got really busy with work after coming back from vacation. I'm installing an additional GPU on my work desktop next week so will be able to run a few tests in the background without interrupting my usual stuff.