Closed HariWu1995 closed 3 years ago
I'm sorry, but I'm not familiar with tensorflow.keras.preprocessing.timeseries_dataset_from_array. But I know this function is used to generate a batch at a time. I don't know what pipeline you have built, but I can try to explain what you need to create a new dataset in tsai
.
You need an X containing all the samples. If you have a single time series, you may need to use SlidingWindow to split it in subsequences. You may need to use y (unless unsupervised training). X need to be an np.ndarray (but np.memmap arrays are np.ndarrays). Lastly, you need to pass splits, to create the train and valid datasets.
I've tested this approach with larger than memory data, and it worked out well.
I'm having a similar issue. If I create an np.memmap or use something like the below from Zarr it seems the TSDatasets() is creating a copy in memory. I tried going through the class but I don't see where this is happening.
X = zarr.open('data/X_temp.zarr', mode='r')
y = zarr.open('data/y_temp.zarr', mode='r')
# Set up data loaders
tfms = [None, [TSRegression()]]
batch_tfms = [TSNormalize(by_var=True), TSStandardize(by_var=True)]
dsets = TSDatasets(X, y, tfms=tfms, splits=splits, inplace=False)
dls = TSDataLoaders.from_dsets(dsets.train, dsets.valid, bs=64, batch_tfms=batch_tfms)
I can do any numpy tasks with X and y without it being loaded into memory, but TSDatasets() for some reason is creating a memory copy. Not sure what the inplace argument does as it seems to be the same with either True or False.
I think the problem is here:
class NumpyDatasets(Datasets):
"A dataset that creates tuples from X (and y) and applies `tfms` of type item_tfms"
_xtype, _ytype = NumpyTensor, None # Expected X and y output types (must have a show method)
def __init__(self, X=None, y=None, items=None, tfms=None, tls=None, n_inp=None, dl_type=None, inplace=True, **kwargs):
if X is not None and not isinstance(X, (np.ndarray, torch.Tensor)): X = np.asarray(X)
if y is not None and not isinstance(y, (np.ndarray, torch.Tensor)): y = np.asarray(y)
self.inplace = inplace
if tls is None:
X = itemify(X, tup_id=0)
y = itemify(y, tup_id=0) if y is not None else y
items = tuple((X,)) if y is None else tuple((X,y))
self.tfms = L(ifnone(tfms,[None]*len(ifnone(tls,items))))
self.tls = L(tls if tls else [TfmdLists(item, t, **kwargs) for item,t in zip(items,self.tfms)])
self.n_inp = (1 if len(self.tls)==1 else len(self.tls)-1) if n_inp is None else n_inp
if 'split' in kwargs: self.split_idxs = kwargs['split']
elif 'splits' in kwargs: self.split_idxs = kwargs['splits']
else: self.split_idxs = L(np.arange(len(self.tls[0]) if len(self.tls[0]) > 0 else len(self.tls)).tolist())
if len(self.tls[0]) > 0:
self.types = L([ifnone(_typ, type(tl[0]) if isinstance(tl[0], torch.Tensor) else tensor) for tl,_typ in zip(self.tls, [self._xtype, self._ytype])])
self.ptls = L([tl if not self.inplace else tl[:] if type(tl[0]).__name__ == 'memmap' else tensor(stack(tl[:])) for tl in self.tls])
Specifically with:
if X is not None and not isinstance(X, (np.ndarray, torch.Tensor)): X = np.asarray(X)
if y is not None and not isinstance(y, (np.ndarray, torch.Tensor)): y = np.asarray(y)
This is copying the array if it is not an np.ndarray or torch.Tensor
Update: I think this will be hard to solve as the Fastai Datasets class seems to convert things to np.ndarray as well. I think our best bet is to write a custom dataset and dataloader class.
Hi @TKassis, Thanks for adding more details to this issue. I believe there are at least 2 reasons why you may be getting OOM errors with large datasets.
I'd like to fix this issue to allow the use of larger than memory datasets (I intended to do it some time ago, and actually started working on it, but couldn't finalize the fix). Here's what I'd like to implement:
I'd like to extend support beyond np.ndarrays/ np.memmap to other array-like libraries (like zarr, xarray, dask.array, ...). I'll test this approach and see if it works.
Hi @TKassis,
I've further investigated this issue and have found a way to maintain the zarr array on disk until batch creation. However, I've found an issue that I don't know how to fix. I'm not familiar with zarr arrays. It thought they worked similarly to memmap arrays, but it seems they don't. The issue is that tsai
dataloaders create batches passing the entire indices. In the train dl these incides are shuffled. But it seems that zarr arrays can only take a single item or a slice (for a given region). Do you know if there's any way to slice a zarr array with a random items list (like[34, 5, 23. ...])?
If this can't be done, the solution would be to have an option to create the batch by applying indices one at a time. The result would then need to be concatenated. But this would be a slower option.
I believe you can what Zarr refers to as coordinate indexing. Something like this:
>>> z = zarr.array(np.arange(10))
>>> z[:]
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> z.get_coordinate_selection([1, 4])
array([1, 4])
I was able to get Zarr to work with the following Pytorch dataloader and Pytorch Lightning trainer with your TST implementation (unchanged). The downside is it lacks a lot of the useful features you integrated into TSAI such as MVP:
import zarr
import torch
import torch.nn.functional as F
import pytorch_lightning as pl
from torch.utils.data import random_split, DataLoader
from torch.utils.data import Dataset
from helpers.custom_losses import SMAPELoss
from torchvision import transforms
from tsai.models import TST
from helpers.lds import prepare_weights
class MSDataset(Dataset):
def __init__(self, inputs_path, target_path, transform=None, target_transform=None ):
super().__init__()
self.inputs_path = inputs_path
self.target_path = target_path
self.X = zarr.open(self.inputs_path, mode='r')
self.y = zarr.open(self.target_path, mode='r')
self.transform = transform
def __len__(self):
return len(self.X)
def __getitem__(self, idx):
return self.X[idx], self.y[idx].reshape(1,)
class TSTModel(pl.LightningModule):
def __init__(self, **kwargs):
super().__init__()
self.model = TST(**kwargs)
def training_step(self, batch, batch_idx):
x, y = batch
y_hat = self.model(x)
loss = F.l1_loss(y_hat, y)
self.log('train_loss', loss)
return loss
def validation_step(self, batch, batch_idx):
x, y = batch
y_hat = self.model(x)
loss = F.l1_loss(y_hat, y)
self.log('val_loss', loss)
return loss
def test_step(self, batch, batch_idx):
return None
def configure_optimizers(self):
optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
return optimizer
if __name__ == '__main__':
print('Creating datasets...')
data = MSDataset(inputs_path='data/X_temp.zarr', target_path=f'data/y.zarr')
print(f'Data size: {len(data)}')
train_set, val_set = torch.utils.data.random_split(data, [2500, 622])
train_dataloader = DataLoader(train_set, batch_size=64, shuffle=True)
val_dataloader = DataLoader(val_set, batch_size=64, shuffle=True)
model = TSTModel(c_in=50000, c_out=1, seq_len=300)
trainer = pl.Trainer(log_every_n_steps=25,
auto_lr_find=True,
gpus=1
)
trainer.fit(model, train_dataloader, val_dataloader)
Thanks for sharing this @TKassis.
I can see this solution preserves the data on disk in its original format until sliced. That's great.
The biggest complexity (and benefit) with fastai
comes with transforms. Transforms are necessary for example with some classification tasks where the input is of type str
.
I believe there's an option to use a similar approach in tsai
, maintaining the rest of the functionality.
I've already created a solution that I need to test. You would be able to choose if you want to apply transforms in place (when your data fits in memory) or on the fly (when data doesn't fit). The former would be a bit faster but requires more memory, while the latter would be useful for larger than memory datasets.
What I'm not sure is if I'll be able to complete this fix by tomorrow. Otherwise, I won't be able to work on it until 2 weeks from Monday.
Hi @TKassis, @HariWu1995, @xanderdunn,
First of all, I'm sorry for the delay in fixing this long-standing issue.
I've just made a major update to the core tsai
module. I've fixed some bugs that were preventing us from training with larger-than-memory datasets. I've tested the new functionality with some dummy data and it seems to work well. You can take a look at the notebook I've used for testing.
In addition to that, and based on @TKassis 's request, I've also extended the input formats that are supported. You should now be able to use the following:
The ones you can use on disk are the last 4.
In terms of performance, np.memmap, dask.array and xarray are pretty close. However, I've noticed that zarr is much slower. The issue seems to be the way in which the array is sliced. It'd get good if any of you have an idea how this could be improved.
If you test tsai
with large datasets, please, let me know how it works.
This is great! I'll test it out soon.
With regards to Zarr, yes I noticed that too. I think it depends how one initially chunks the Zarr array. For the purposes of randomly selecting samples of the data with the TSAI dataloader, users should chunk it as (1, vars, seq_len) when initially creating the array.
This is great! I'll test it out soon.
That'd be great!
For the purposes of randomly selecting samples of the data with the TSAI dataloader, users should chunk it as (1, vars, seq_len) when initially creating the array.
Thanks for pointing this out. I've run some additional tests and these are the results:
It looks like zarr arrays aare ever faster than memmap arrays. However, as to the chunking, the best alternative for randomly indexing samples seems to be (n_samples, 1, 1). You may want to double check this when you test the new functionality.
2 findings:
Oh, that's interesting. I just assumed (1, vars, seq_len) would be faster than (n_samples, 1, 1). I'll try it out and confirm. Looking at the data above though, am I mistaken to think that (1, vars, seq_len) is actually faster? (217 ms per loop vs 897 ms)?
It seems Zarr is at least 2x faster than memory-mapped numpy arrays. Just out of curiosity since you already have this set up, how do the above numbers compare to reading from a memory-loaded array? Also, are you using an SSD drive or a HDD?
I'm sorry @TKassis! I was wrong in my last post. I'll edit it. The results in the uploaded image are correct. The text isn't! You were completely right: (1, vars, seq_len) is faster. I cannot compare the numbers to data in memory as the array has 37GB, while RAM is only 12GB. I can only compare it to np.memmap, and indexing seems to be 7x faster (217ms vs 1.5s). I've used Google Colab Pro. I don't know the hardware details.
BTW, I've noticed there's an even faster way to index a zarr array if you set the chunks to (n_samples, n_vars, seq_len):
This is 30x faster than memmap arrays(57ms vs 1.61s). However, the issue is how to prepare such array with the expected data without taking too long. It'd be interesting to investigate if there's an easy way to do it. Training turns out to be about 6x faster (10 min per epoch for 1M samples).
Wow, this is great!
It's a bit strange that (n_samples, n_vars, seq_len) is faster than (1, n_vars, seq_len). I have a limited understanding of chunking, but I thought that using chunks that are similar to how the data will be read should provide faster read results.
Could it be that with (n_samples, n_vars, seq_len) considering there were 10 loops so total time is actually 10x57.2 ms and not 57.2 ms?
No, time is measured per loop.
Here are my results. I have an SSD drive and 64 GB of RAM:
I should note that one very practical advantage of storing an array as Zarr instead of Numpy .npy is that it is way smaller in size as Zarr uses compression. For example, a 187 GB array as .npy is only 5 GB with Zarr (at least the arrays I'm working with).
Looks great!! It confirms previous findings. This means that zar array with the proper chunks setting are close to having data in memory, and much faster than memmap arrays. (Obviously if data fits in memory that’s the preferred option.) ‘tsai’ is prepared to work with zarr arrays. The critical thing though is to able to prepare zarr arrays with the right chunks quickly. If we find a way to do this, this would be a great solution for large datasets.
Just FYI, I was trying to do a quick write speed test (assigning a numpy array to the Zarr array) and it seems writing all samples without chunking is not feasible for large arrays.
I've continued investigating how to speed up training and have found 2-3 things that help.
setting the dataloaders num_workers=#cpus available. A key advantage of zarr vs numpy arrays is that they have been designed for parallel computation. This means that they can be read and written in parallel by multiple workers. In my experience, setting num_workers > 0 tends to degrade performance when using numpy arrays or np.memmap, but it improves when using zarr arrays. Here's a training time benchmark:
One additional insight, is that with numpy arrays the GPU is at 0% most of the time (the cpu is the bottleneck) while with zarr arrays is almost always working at a 100% (cpu is no longer a bottleneck).
One additional benefit of zarr is compression. It not only reduces the file size, but increases the effective bandwidth.
So based on all this I believe the recommended approach to train large datasets at maximum speed would be the following:
tsai
to ensure the shuffled indices per batch are sorted. I'll try to make the change later today. So there's nothing you need to do to sort the indices.And that's it. It'd be interesting to compare the usual training time using np.memmap array or even an in-memory np.array.
It'd be good if some of you confirm these findings. If all this is confirmed, it means we'll be able to significantly improve training speed.
Incredible work! I can test this out in the next 2-3 days since I have Zarr dataset pipeline already. I'll compare training speeds with memmap array and memory-loaded array. Are these changes on the Master branch?
create a zarr array with chunks=(-1,1,1). In this way each sample will correspond to a chunk.
Just a small correction, you mean chunks=(1, -1, -1), correct?
This must be some type of dyslexia... 😅 Yes, you are right: 1, -1, -1. I've updated the message.
I'll compare training speeds with memmap array and memory-loaded array. Are these changes on the Master branch?
That'd be great!! Thanks @TKassis ! I'll make the change to the master branch later today. I'll let you know when it's uploaded.
I've just made the required changes to make zarr work efficiently available in the GitHub master branch.
(cc:@TKassis)
I have some new insights I'd like to share.
I've prepared a new tutorial nb on how to efficiently train large datasets with tsai using zarr arrays.
Best practices:
How to define n? A good rule of thumb seems to be this:
If you are interested here are the results for different shapes and batch sizes (auto == using chunks_calculator is almost as good as unchunked - single chunk):
shape: (1000, 10, 1000)
batch size: 64
memmap chunks: N/A : 0.000589
zarr_default chunks: (250, 3, 250) : 0.008723
zarr_1 chunks: (1, 10, 1000) : 0.015477
zarr_auto chunks: (1000, 10, 1000) : 0.000913
zarr_unchunked chunks: (1000, 10, 1000) : 0.000917
batch size: 512
memmap chunks: N/A : 0.020232
zarr_default chunks: (250, 3, 250) : 0.028686
zarr_1 chunks: (1, 10, 1000) : 0.137249
zarr_auto chunks: (1000, 10, 1000) : 0.017940
zarr_unchunked chunks: (1000, 10, 1000) : 0.017983
shape: (10000, 10, 1000)
batch size: 64
memmap chunks: N/A : 0.000882
zarr_default chunks: (1250, 2, 125) : 0.034200
zarr_1 chunks: (1, 10, 1000) : 0.015522
zarr_auto chunks: (10000, 10, 1000) : 0.001104
zarr_unchunked chunks: (10000, 10, 1000) : 0.001080
batch size: 512
memmap chunks: N/A : 0.020877
zarr_default chunks: (1250, 2, 125) : 0.059502
zarr_1 chunks: (1, 10, 1000) : 0.134743
zarr_auto chunks: (10000, 10, 1000) : 0.017995
zarr_unchunked chunks: (10000, 10, 1000) : 0.018066
shape: (100000, 10, 1000)
batch size: 64
memmap chunks: N/A : 0.100249
zarr_default chunks: (6250, 1, 125) : 0.146941
zarr_1 chunks: (1, 10, 1000) : 0.018990
zarr_auto chunks: (26757, 10, 1000) : 0.003746
zarr_unchunked chunks: (100000, 10, 1000) : 0.003110
batch size: 512
memmap chunks: N/A : 0.601841
zarr_default chunks: (6250, 1, 125) : 0.164207
zarr_1 chunks: (1, 10, 1000) : 0.138349
zarr_auto chunks: (26757, 10, 1000) : 0.018485
zarr_unchunked chunks: (100000, 10, 1000) : 0.017794
shape: (1000000, 10, 1000)
batch size: 64
memmap chunks: N/A : 0.211840
zarr_default chunks: (31250, 1, 63) : 0.460088
zarr_1 chunks: (1, 10, 1000) : 0.052718
zarr_auto chunks: (27028, 10, 1000) : 0.034241
batch size: 512
memmap chunks: N/A : 1.694527
zarr_default chunks: (31250, 1, 63) : 0.593565
zarr_1 chunks: (1, 10, 1000) : 0.173024
zarr_auto chunks: (27028, 10, 1000) : 0.053061
and here is the code in case you want to replicate these results:
for n in np.asarray([1e3, 1e4, 1e5, 1e6], dtype=int):
shape = (n, 10, 1000)
dtype = 'float32'
X_memmap = create_empty_array(shape, fname=f'X_{n}', mode='r+', dtype=dtype)
X_zarr_default = zarr.open(path/f'X_{n}.zarr', mode='w', shape=shape, dtype=dtype)
X_zarr_1 = zarr.open(path/f'X_{n}.zarr', mode='w', shape=shape, dtype=dtype, chunks=(1, None, None))
X_zarr_auto = zarr.open(path/f'X_{n}.zarr', mode='w', shape=shape, dtype=dtype, chunks=chunks_calculator(shape, dtype))
if n <= 100_000:
X_zarr_unchunked = zarr.open(path/f'X_{n}.zarr', mode='w', shape=shape, dtype=dtype, chunks=False)
else:
X_zarr_unchunked = None
print(f'\nshape: {shape}')
for bs in [64, 512]:
print(f' batch size: {bs}')
for name,arr in zip(['memmap', 'zarr_default', 'zarr_1', 'zarr_auto', 'zarr_unchunked'], [X_memmap, X_zarr_default, X_zarr_1, X_zarr_auto, X_zarr_unchunked]):
if arr is None: continue
if hasattr(arr, 'oindex'):
res = %timeit -o -q arr.oindex[np.sort(np.random.choice(len(arr), bs, False))]
else:
res = %timeit -o -q arr[np.sort(np.random.choice(len(arr), bs, False))]
print(f' {name:20} chunks: {str(arr.chunks) if hasattr(arr, "chunks") else "N/A":20}: {res.best:10.6f}')
This is great insight! Thanks for all you work on this! Also, sorry I haven't been able to run any tests from my end comparing training speeds as my GPU has been occupied with a hyperparameter tuning job the past two weeks, but will get to it soon.
sorry I haven't been able to run any tests from my end comparing training speeds as my GPU has been occupied with a hyperparameter tuning job the past two weeks, but will get to it soon.
No problem. It'd be great if you can measure training speed, but if you can't that is also fine. I'd never tried zarr arrays before, but I'm planning to start using them now. I'm really happy with their performance.
BTW, dnth and I are preparing a hyperparameter optimization notebook using Optuna that we'll release very soon. Do you have any experience with it?
I've never used Optuna unfortunately, I usually just use W&B Sweeps which is really easy to use and works really well with the WandbCallback() you have.
(I'm copying you, @TKassis , because you told me you might try this approach).
I got a new insight when using zarr arrays. Compression plays a key role.
In the examples I shared, all arrays were initialized with zeros. That's ok for memmap arrays, but since zarr arrays are compressed, when all data are zeros the compression ratio is huge. This makes the process the load data in memory incredibly fast (like shown in the tutorial nb). However, this is not the type of improvement we should expect with 'real data'.
Real data cannot have such a high compression ratio. This means it takes longer to load data into memory.
I still see a reduction in training time using large datasets (37GB). In the tutorial nb, I saw a reduction in total training time of about 65% with data equal to zeros. With random data (compression rate 1:1) the reduction is about 30%. So I still think it makes sense to try zarr arrays when using large datasets. Real datasets may be able to be somewhat compressed. I would expect performance improvement to fall between 30-65% compared to np.memmap. An easy way to know how much your data has been compressed is to print:
X_zarr.info
and look at 'Storage ratio'.
One of the things to take into account based on this new learning is that chunks should be set to (1, -1, -1) as originally proposed by @TKassis. So we should forget about the chunks_calculator. Sorry about the full workaround!
I'll update the tutorial nb later this week.
@oguiza thanks for the update! I'll try some real world data training with Zarr. I'll be out on vacation until next week, but can do it when I'm back.
@oguiza thanks for the update! I'll try some real world data training with Zarr. I'll be out on vacation until next week, but can do it when I'm back.
Ok, no worries. Enjoy your vacation!!
I'll close this issue since it's already fixed.
@TKassis, if you ever get a chance to test the updated functionality, I'd appreciate it if you would quickly let me know how it performs :)
@oguiza of course! Still on my to-do list! :-) Just got really busy with work after coming back from vacation. I'm installing an additional GPU on my work desktop next week so will be able to run a few tests in the background without interrupting my usual stuff.
Dear team,
I am in stuck when convert very large numpy array to your TSDatasets. These are what I have tried to fix my issue:
dsets = TSDatasets(X_dl, y_dl, splits=splits, tfms=transformations, inplace=True)
Is there any way to overcome this issue? Or I have to use your model only and design my own pipeline to use Tensorflow Data Generator