timeseriesAI / tsai

State-of-the-art Deep Learning library for Time Series and Sequences in Pytorch / fastai
https://timeseriesai.github.io/tsai/
Apache License 2.0

Custom Dataloader with sliding window for multiple Input files #627

Closed HasnainKhanNiazi closed 1 year ago

HasnainKhanNiazi commented 1 year ago

First of all, I would like to say thanks for creating this amazing repo. I am working on a problem where I have multiple CSV files, and I need to read them one by one with a sliding window. Let's assume one CSV file has 330 data points and the window size is 32: we should then get 10 windows (10 * 32 = 320 points used) and the last 10 points will be discarded.
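
As a quick sanity check of that arithmetic (non-overlapping windows, as in the example above):

n_points, window_size = 330, 32
n_windows = n_points // window_size          # 10 non-overlapping windows
points_used = n_windows * window_size        # 10 * 32 = 320 points used
points_discarded = n_points - points_used    # the last 10 points are discarded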

I started building a dataset for this, but after spending a lot of time on it I am still not able to get it working.

Note: I can’t merge all these CSV files into one.

I am getting this error: TypeError: list indices must be integers or slices, not list

I am following this gist to create the new dataset for training.

So far, my code looks like this:

import os

import pandas as pd
import torch
from torch.utils import data as utils  # assuming `utils` refers to torch.utils.data


class CustomDataset(utils.Dataset):
    def __init__(self, data_folder, window_size, to_drop=None, targets=None):
        self.data_folder = data_folder
        self.window_size = window_size
        self.to_drop = to_drop
        self.targets = targets

    def __len__(self):
        # one item per CSV file in the folder
        return len(os.listdir(self.data_folder))

    def __getitem__(self, idx):
        X = []
        y = []

        # read one CSV and slice it into sliding windows (stride 1)
        data = pd.read_csv(os.path.join(self.data_folder, os.listdir(self.data_folder)[idx]))
        local_X = data.drop(self.to_drop + self.targets, axis=1).values
        local_y = data[self.targets].values  # works for one or more target columns
        for index in range(0, len(local_X)):
            if len(local_X) - index >= self.window_size:
                X.append(local_X[index: index + self.window_size])
                y.append(local_y[index: index + self.window_size])

        return torch.tensor(X), torch.tensor(y)

I am creating the datasets like this:

 train_dset = CustomDataset(data_folder="/training/", window_size=32, to_drop=["timestamp_local"], targets=["Y"])
 valid_dset = CustomDataset(data_folder="/validation/", window_size=32, to_drop=["timestamp_local"], targets=["Y"])
 dls = TSDataLoaders.from_dsets(train_dset, valid_dset, bs=[64, 128])
 dls.one_batch()

At dls.one_batch(), I am getting this error: TypeError: list indices must be integers or slices, not list.

I am new to PyTorch, so any help would be appreciated.

EDIT: To give some more information about the dataset: each CSV has 72 columns, i.e. 71 input features and one target.

HasnainKhanNiazi commented 1 year ago

I have been exploring the documentation, and it looks like I need to return a tuple (TSTensor, data). I tried that too, and it is still not working for me.

After exploring the documentation and different functions, I found the SlidingWindow function. I used it, but it runs out of memory while loading.

import os

import numpy as np
import pandas as pd
from tsai.all import SlidingWindow

window_length = 32
stride = 1
horizon = 1
Global_X = None
Global_y = None

path = "./training/"
j = 0

for file in os.listdir(path):
    data = pd.read_csv(os.path.join(path, file))
    data = data.drop(["timestamp_local"], axis=1)
    # slide a window over this file's rows; all columns except "Y" are inputs
    X, y = SlidingWindow(window_length, stride=stride, start=0, horizon=horizon,
                         get_x=data.columns[:-1], get_y="Y")(data)
    if j == 0:
        Global_X = X
        Global_y = y
    else:
        # every file's windows are concatenated in memory
        Global_X = np.concatenate([Global_X, X])
        Global_y = np.concatenate([Global_y, y])

    j += 1

I have around 60-70 CSV files, and each file has around 20,000 data points. With stride 1, each file gives 19,968 windows of 32 timesteps and 70 features, i.e. shape (19968, 32, 70). If I concatenate roughly 20,000 windows from every CSV, memory will surely get out of hand; after concatenating all the windows I should end up with an array of around this size: (1419153, 70, 32). Is there anything in tsai, or maybe in NumPy, to concatenate huge arrays like this?
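
One way to avoid holding every window in RAM is to write each file's windows to disk as they are produced. A rough sketch using zarr; the file names, chunk sizes and the precomputed n_windows_total below are placeholders, not values from this thread:

import os

import pandas as pd
import zarr
from tsai.all import SlidingWindow

path = "./training/"             # same folder as in the loop above
files = sorted(os.listdir(path))
n_windows_total = 1_419_153      # assumption: counted beforehand in a cheap first pass

X_disk, y_disk, i = None, None, 0
for file in files:
    data = pd.read_csv(os.path.join(path, file)).drop(["timestamp_local"], axis=1)
    X, y = SlidingWindow(32, stride=1, horizon=1, get_x=data.columns[:-1], get_y="Y")(data)
    if X_disk is None:
        # create the on-disk arrays once the per-window shape is known
        X_disk = zarr.open("X.zarr", mode="w", shape=(n_windows_total, *X.shape[1:]),
                           chunks=(512, *X.shape[1:]), dtype="float32")
        y_disk = zarr.open("y.zarr", mode="w", shape=(n_windows_total, *y.shape[1:]),
                           chunks=(4096, *y.shape[1:]), dtype="float32")
    X_disk[i:i + len(X)] = X     # only this slice is materialized in memory
    y_disk[i:i + len(y)] = y
    i += len(X)

Only the chunks being written are held in memory at any time, so the full (1419153, ...) array never has to exist in RAM.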

HasnainKhanNiazi commented 1 year ago

I solved the TypeError: list indices must be integers or slices, not list problem. Since I was loading data from multiple files, the idx passed to __getitem__(idx) was a list of the indices of the files from which the data needs to be loaded. How stupid of me to miss something so easy and straightforward. Anyway, that problem is solved now, but I ran into another one. I changed the __getitem__() function to iterate over multiple files using the indices in idx, but now I am running out of CUDA memory: RuntimeError: CUDA out of memory. Tried to allocate 306.00 MiB (GPU 0; 11.91 GiB total capacity; 10.93 GiB already allocated; 113.00 MiB free; 10.95 GiB reserved in total by PyTorch). Details of the system are given below along with the code.

System Specs

Column Value
os Linux-4.15.0-194-generic-x86_64-with-glibc2.10
python 3.8.10
tsai 0.3.4
fastai 2.7.10
fastcore 1.5.7
zarr 2.13.3
torch 1.9.0+cu111
device 1 gpu (['TITAN Xp'])
cpu cores 8
threads per cpu 1
RAM 125.85 GB
GPU memory [11.91] GB

Code

class CustomDataset(utils.Dataset):
    def __init__(self, data_folder, window_size, to_drop=None, targets=None):
        self.data_folder = data_folder
        self.window_size = window_size
        self.to_drop = to_drop
        self.targets = targets

    def __len__(self):
        return len(os.listdir(self.data_folder))

    def __getitem__(self, idx):
        X = []
        y = []

        # idx is a list of file indices; collect every window from every requested file
        for file_index in idx:
            data = pd.read_csv(os.path.join(self.data_folder, os.listdir(self.data_folder)[file_index]))
            local_X = data.drop(self.to_drop + self.targets, axis=1).values
            local_y = data[self.targets].values
            for index in range(0, len(local_X)):
                if len(local_X) - index >= self.window_size:
                    X.append(local_X[index: index + self.window_size])
                    y.append(local_y[index: index + self.window_size])

        return torch.Tensor(X), torch.Tensor(y)

I am making the DataLoaders like this:

    train_dset = CustomDataset(data_folder="./training/", window_size=32, to_drop=["timestamp_local"], targets=["Y"])
    valid_dset = CustomDataset(data_folder="./validation/", window_size=32, to_drop=["timestamp_local"], targets=["Y"])
    dls = TSDataLoaders.from_dsets(train_dset, valid_dset, bs=[1, 1], device='cuda:0')

CUDA memory runs out when I try to train or run lr_find.

  learn = ts_learner(dls, mWDNPlus, metrics=[mae, rmse], cbs=ShowGraph())
  learn.lr_find()
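
A likely reason even bs=[1, 1] runs out of memory: with the dataset above, a single item already contains every window of a file (roughly 20,000 windows of 32 x 71 values), so each "batch" pushed to cuda:0 is enormous. A rough sketch of an alternative that returns one window per index instead; this WindowDataset is a hypothetical helper, not part of tsai, and a real version would cache per-file window counts and rows rather than re-reading CSVs in __getitem__:

import bisect
import os

import pandas as pd
import torch
from torch.utils.data import Dataset


class WindowDataset(Dataset):
    # hypothetical per-window dataset: __getitem__ returns a single (window, target) pair
    def __init__(self, data_folder, window_size=32, to_drop=("timestamp_local",), target="Y"):
        self.data_folder = data_folder
        self.window_size = window_size
        self.to_drop = list(to_drop)
        self.target = target
        self.files = sorted(os.listdir(data_folder))
        # cumulative window counts so a global window index can be mapped to (file, offset)
        self.cum_counts = []
        total = 0
        for f in self.files:
            n_rows = len(pd.read_csv(os.path.join(data_folder, f)))
            total += max(n_rows - window_size + 1, 0)
            self.cum_counts.append(total)

    def __len__(self):
        return self.cum_counts[-1]

    def __getitem__(self, idx):
        file_idx = bisect.bisect_right(self.cum_counts, idx)
        offset = idx - (self.cum_counts[file_idx - 1] if file_idx > 0 else 0)
        data = pd.read_csv(os.path.join(self.data_folder, self.files[file_idx]))
        window = data.iloc[offset:offset + self.window_size]
        # transpose x to (features, timesteps), the layout tsai models generally expect
        x = torch.tensor(window.drop(self.to_drop + [self.target], axis=1).values.T, dtype=torch.float32)
        y = torch.tensor(window[self.target].values, dtype=torch.float32)
        return x, y

If the loader passes a list of indices per call, as observed above, __getitem__ would still need to handle that case (e.g. by stacking the individual windows).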

amazonparrot commented 1 year ago

Did you try the approach in the following link? https://colab.research.google.com/github/timeseriesAI/tsai/blob/master/tutorial_nbs/11_How_to_train_big_arrays_faster_with_tsai.ipynb

I am about to have memory issues as well; I am planning to test zarr as suggested in the above link.

oguiza commented 1 year ago

Hi @HasnainKhanNiazi,

The first thing you need to assess is whether all your data fits in memory or not.

If it does, you may find TSMetaDatasets useful: https://github.com/timeseriesAI/tsai/blob/main/nbs/015_data.metadatasets.ipynb

If it doesn't fit in memory, you'll need to load data from disk. One way to do it is to transform the data into a zarr array (it stays on disk and is read lazily - you only load into memory the indices you pass) and use the standard TSDatasets: https://colab.research.google.com/github/timeseriesAI/tsai/blob/master/tutorial_nbs/11_How_to_train_big_arrays_faster_with_tsai.ipynb
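
A rough sketch of that second (zarr) route, assuming X.zarr / y.zarr were written as in the earlier snippet; the split, transforms and batch size below are placeholders rather than values from this thread:

import zarr
from tsai.all import *  # get_ts_dls, TSRegression, ts_learner, mWDNPlus, mae, rmse, ...

# open the on-disk arrays built earlier; reads are lazy
X = zarr.open("X.zarr", mode="r")
y = zarr.open("y.zarr", mode="r")

# placeholder split: first 80% of windows for training, the rest for validation
n = len(X)
cut = int(0.8 * n)
splits = (list(range(cut)), list(range(cut, n)))

tfms = [None, TSRegression()]  # regression target
dls = get_ts_dls(X, y, splits=splits, tfms=tfms, bs=64)
# depending on array size you may also need inplace=False so data isn't pre-loaded into memory

learn = ts_learner(dls, mWDNPlus, metrics=[mae, rmse])
learn.lr_find()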

HasnainKhanNiazi commented 1 year ago

Thanks @amazonparrot and @oguiza for your responses. I have been using TSMetaDatasets and it is working fine for me. I will close the issue now.
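
For completeness, a rough sketch of the TSMetaDatasets route, loosely following the linked metadatasets notebook; the per-file arrays iterator and the regression transform below are assumptions, not code from this thread:

from tsai.all import *  # TSDatasets, TSMetaDataset, TSMetaDatasets, TimeSplitter, ...

# one TSDatasets object per CSV file (X_i: that file's windows, y_i: matching targets)
dsets = []
for X_i, y_i in per_file_arrays:  # per_file_arrays is a placeholder for your own iterator
    dsets.append(TSDatasets(X_i, y_i, tfms=[None, TSRegression()]))

metadataset = TSMetaDataset(dsets)
splits = TimeSplitter(show_plot=False)(metadataset)
metadatasets = TSMetaDatasets(metadataset, splits=splits)
dls = TSDataLoaders.from_dsets(metadatasets.train, metadatasets.valid, bs=[64, 128])
xb, yb = dls.one_batch()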