Closed HasnainKhanNiazi closed 1 year ago
I have been exploring the documentation, and it looks like I need to return a Tuple(TSTensor, Data). I did that too, but it is still not working for me.
After exploring the documentation and different functions, I found the SlidingWindow function. I used it, but I run out of memory while loading.
import os
import numpy as np
import pandas as pd
from tsai.all import SlidingWindow

window_length = 32
stride = 1
horizon = 1
Global_X = None
Global_y = None
path = "./training."
j = 0
for file in os.listdir(path):
    data = pd.read_csv(os.path.join(path, file))
    data = data.drop(["timestamp_local"], axis=1)
    X, y = SlidingWindow(window_length, stride=stride, start=0, horizon=horizon, get_x=data.columns[:-1], get_y="Y")(data)
    if j == 0:
        Global_X = X
        Global_y = y
    else:
        # Each concatenate copies everything accumulated so far.
        Global_X = np.concatenate([Global_X, X])
        Global_y = np.concatenate([Global_y, y])
    j += 1
I have around 60-70 CSV files, each with around 20,000 data points. With stride 1, every file yields 19968 windows of 32 timesteps and 70 features, i.e. an array of shape (19968, 70, 32). If I concatenate the roughly 20,000 windows from every CSV file, memory will surely get out of hand. After concatenating all the windows, I should end up with an array of about this size: (1419153, 70, 32)
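For scale, here is a rough estimate of what an array of that shape occupies. This is only a sketch: `array_nbytes` is a hypothetical helper, and the dtypes are assumptions (pandas usually yields float64 for numeric CSV columns).

```python
def array_nbytes(shape, itemsize):
    """Return the number of bytes a dense array of `shape` occupies."""
    n = itemsize
    for dim in shape:
        n *= dim
    return n

shape = (1_419_153, 70, 32)  # final shape quoted above
for name, itemsize in (("float64", 8), ("float32", 4)):
    gib = array_nbytes(shape, itemsize) / 2**30
    print(f"{name}: {gib:.2f} GiB")
```

Note that np.concatenate inside a loop briefly holds both the old and the new array, so peak usage can approach twice the final size.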
Is there anything in tsai or NumPy for concatenating huge arrays like this?
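One NumPy-only option, sketched here under assumptions, is to preallocate the final array on disk and fill it file by file instead of concatenating. `count_windows` and `fill_windows` are hypothetical helpers, not tsai functions; the sketch covers only the inputs (no horizon or targets) and takes already-loaded 2D arrays, whereas a real pipeline would read each CSV inside the loop.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def count_windows(n_rows, window, stride=1):
    """Number of full windows of length `window` in a series of `n_rows` rows."""
    return 0 if n_rows < window else (n_rows - window) // stride + 1

def fill_windows(series_list, window, out_path, stride=1, dtype=np.float32):
    """Write all windows from every (rows, features) array into one .npy memmap.

    The result has shape (total_windows, features, window).
    """
    n_features = series_list[0].shape[1]
    total = sum(count_windows(len(s), window, stride) for s in series_list)
    out = np.lib.format.open_memmap(
        out_path, mode="w+", dtype=dtype, shape=(total, n_features, window)
    )
    pos = 0
    for s in series_list:
        if len(s) < window:
            continue  # too short for even one window
        # View of shape (n_windows, features, window); no copy is made here.
        view = sliding_window_view(s, window, axis=0)[::stride]
        out[pos:pos + len(view)] = view  # the only copy: into the output file
        pos += len(view)
    out.flush()
    return out
```

The file written this way can later be reopened with np.load(out_path, mmap_mode="r"), so only the slices that are actually indexed get read into RAM.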
I solved the TypeError: list indices must be integers or slices, not list problem. Because I was loading data from multiple files, idx in __getitem__(idx) was a list holding the indices of the files to load. How stupid of me to miss something so easy and straightforward. Anyway, that problem is solved, but I ran into another one. I changed the __getitem__() function to iterate over multiple files using the indices in idx, but now I am running out of CUDA memory: RuntimeError: CUDA out of memory. Tried to allocate 306.00 MiB (GPU 0; 11.91 GiB total capacity; 10.93 GiB already allocated; 113.00 MiB free; 10.95 GiB reserved in total by PyTorch). Details of the system are given below, along with the code.
System Specs
Column | Value |
---|---|
os | Linux-4.15.0-194-generic-x86_64-with-glibc2.10 |
python | 3.8.10 |
tsai | 0.3.4 |
fastai | 2.7.10 |
fastcore | 1.5.7 |
zarr | 2.13.3 |
torch | 1.9.0+cu111 |
device | 1 gpu (['TITAN Xp']) |
cpu cores | 8 |
threads per cpu | 1 |
RAM | 125.85 GB |
GPU memory | [11.91] GB |
Code
import os
import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset

class CustomDataset(Dataset):
    def __init__(self, data_folder, window_size, to_drop=None, targets=None):
        self.data_folder = data_folder
        self.window_size = window_size
        self.to_drop = to_drop or []
        self.targets = targets or []
        # Sort once so that a file index always refers to the same file.
        self.files = sorted(os.listdir(data_folder))

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        # idx arrives as a list of file indices.
        X = []
        y = []
        for file_index in idx:
            data = pd.read_csv(os.path.join(self.data_folder, self.files[file_index]))
            local_X = data.drop(self.to_drop + self.targets, axis=1).values
            # self.targets is already a list, so index with it directly.
            local_y = data[self.targets].values
            # One window per start position (stride 1), full windows only.
            for index in range(len(local_X) - self.window_size + 1):
                X.append(local_X[index: index + self.window_size])
                y.append(local_y[index: index + self.window_size])
        return torch.Tensor(np.array(X)), torch.Tensor(np.array(y))
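In the dataset above, __len__ counts files and __getitem__ returns every window of each requested file, so a single "sample" can hold close to 20,000 windows, which may be what exhausts GPU memory even at bs=1. A possible alternative, sketched here with a hypothetical `WindowIndex` helper (not part of tsai or PyTorch), is to make each window its own item by mapping a global window index to a file and a start row.

```python
from bisect import bisect_right
from itertools import accumulate

class WindowIndex:
    """Map a global window index to (file_position, start_row)."""

    def __init__(self, rows_per_file, window, stride=1):
        self.window = window
        self.stride = stride
        # Full windows available in each file; files that are too short give 0.
        counts = [max(0, (n - window) // stride + 1) for n in rows_per_file]
        self.cum = list(accumulate(counts))  # cumulative window counts

    def __len__(self):
        return self.cum[-1] if self.cum else 0

    def locate(self, i):
        if not 0 <= i < len(self):
            raise IndexError(i)
        # First file whose cumulative count exceeds i.
        file_pos = bisect_right(self.cum, i)
        prev = self.cum[file_pos - 1] if file_pos else 0
        return file_pos, (i - prev) * self.stride

index = WindowIndex(rows_per_file=[10, 3, 6], window=4)
print(len(index), index.locate(7))
```

A dataset built on this would return len(index) from __len__ and, in __getitem__, slice rows start_row to start_row + window from the located file, so the batch size controls how many windows reach the GPU at once.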
I am creating the DataLoaders like this.
train_dset = CustomDataset(data_folder="./training/", window_size=32, to_drop=["timestamp_local"], targets=["Y"])
valid_dset = CustomDataset(data_folder="./validation/", window_size=32, to_drop=["timestamp_local"], targets=["Y"])
dls = TSDataLoaders.from_dsets(train_dset, valid_dset, bs=[1, 1], device='cuda:0')
CUDA memory runs out when I try to train or find the learning rate.
learn = ts_learner(dls, mWDNPlus, metrics=[mae, rmse], cbs=ShowGraph())
learn.lr_find()
Did you try the approach in the following link? https://colab.research.google.com/github/timeseriesAI/tsai/blob/master/tutorial_nbs/11_How_to_train_big_arrays_faster_with_tsai.ipynb
I am about to run into a memory issue myself, and I am planning to test zarr as suggested in the link above.
Hi @HasnainKhanNiazi, the first thing you need to assess is whether all your data fits in memory. If it does, you may find TSMetaDatasets useful: https://github.com/timeseriesAI/tsai/blob/main/nbs/015_data.metadatasets.ipynb However, if it doesn't fit in memory, you'll need to load data from disk. One way to do that is to convert the data into a zarr array (it stays on disk and is read lazily: only the indices you pass are loaded into memory) and use the standard TSDatasets. (https://colab.research.google.com/github/timeseriesAI/tsai/blob/master/tutorial_nbs/11_How_to_train_big_arrays_faster_with_tsai.ipynb)
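To illustrate the lazy-read idea, here is a minimal sketch that uses a NumPy memory-mapped .npy file as a stand-in for zarr. It is not the zarr or tsai API, and `read_batch` is a hypothetical helper.

```python
import numpy as np

def read_batch(path, indices):
    """Load only the requested samples from an on-disk .npy array."""
    # mmap_mode="r" maps the file without reading its contents into RAM.
    arr = np.load(path, mmap_mode="r")
    # Sorted fancy indexing copies just the selected samples into memory.
    return np.asarray(arr[np.sort(indices)])
```

Calling read_batch(path, [5, 1]) returns samples 1 and 5 in that order; the rest of the file stays on disk.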
Thanks @amazonparrot and @oguiza for your response. I have been using Metadatasets and it is working fine for me. I will close the issue now.
First of all, thank you for creating this amazing repo. I am working on a problem where I have multiple CSV files, and I need to read them one by one with a sliding window. Let's assume one CSV file has 330 data points and the window size is 32: we should get 10 windows (10*32 = 320), and the last 10 points will be discarded.
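The non-overlapping split described above can be sketched in NumPy as follows. `split_non_overlapping` is a hypothetical helper, and the 71-feature input is taken from the EDIT note in this post.

```python
import numpy as np

def split_non_overlapping(data, window):
    """Cut a (rows, features) array into back-to-back windows, dropping the remainder."""
    n_windows = len(data) // window
    trimmed = data[: n_windows * window]  # discard the incomplete tail
    return trimmed.reshape(n_windows, window, *data.shape[1:])

rows = np.zeros((330, 71))
print(split_non_overlapping(rows, 32).shape)  # (10, 32, 71)
```

This produces (windows, timesteps, features); tsai generally expects (samples, features, timesteps), so a transpose of the last two axes would be needed before passing it on.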
I started making a dataset that looks like this but after spending too much time, I am not able to get it working. The current code looks like this,
Note: I can’t merge all these CSV files into one.
I am getting this error, TypeError: list indices must be integers or slices, not list
I am following this gist to create the new dataset for training.
Till now, my code looks like this,
I am calling dataset like this,
At dls.one_batch(), I am getting this error: TypeError: list indices must be integers or slices, not list. I am new to PyTorch, so any help will be appreciated.
EDIT: To give some more information about the dataset: each CSV has one target and 71 input features, for 72 columns in total.