SimCLR model.fit() does not start training with custom dataset loading from RAM

schelmi1 commented 6 months ago

Hi,

I'm playing around with the SimCLR tutorial and trying to train on a custom dataset class using LightlyDataset.from_torch_dataset(). With the MNIST handwritten digits dataset loaded via LightlyDataset(input_dir=path_to_images), everything works fine: training starts and finishes without any issues.

With my custom PyTorch dataset class, I'm loading all the files and labels from a zip file into RAM, then using:

dataset_lightly = CustomImageDataset(dataset) 
dataset_test = CustomImageDataset(dataset) 
dataset_lightly.transform = transform
dataset_test.transform = test_transform
dataset_train_simclr_custom = LightlyDataset.from_torch_dataset(dataset_lightly)
dataset_test_custom = LightlyDataset.from_torch_dataset(dataset_test)

If I compare dataset_train_simclr_custom.dataset.__getitem__(1)

([tensor([[[-2.1179, -2.1179, -2.1179,  ..., -2.1179, -2.1179, -2.1179],
           [-2.1179, -2.1179, -2.1179,  ..., -2.1179, -2.1179, -2.1179],
           [-2.1179, -2.1179, -2.1179,  ..., -2.1179, -2.1179, -2.1179],
           ...,
           [-2.1179, -2.1179, -2.1008,  ..., -2.1179, -2.0665, -2.0323],
           [-2.1179, -2.1179, -2.1179,  ..., -2.1179, -2.1179, -2.1008],
           [-2.1179, -2.1179, -2.1179,  ..., -2.1179, -2.1179, -2.1179]],

          [[-2.0357, -2.0357, -2.0357,  ..., -2.0357, -2.0357, -2.0357],
           [-2.0357, -2.0357, -2.0357,  ..., -2.0357, -2.0357, -2.0357],
           [-2.0357, -2.0357, -2.0357,  ..., -2.0357, -2.0357, -2.0357],
           ...,
           [-2.0357, -2.0357, -2.0182,  ..., -2.0357, -1.9832, -1.9482],
           [-2.0357, -2.0357, -2.0357,  ..., -2.0357, -2.0357, -2.0182],
           [-2.0357, -2.0357, -2.0357,  ..., -2.0357, -2.0357, -2.0357]],

          [[-1.8044, -1.8044, -1.8044,  ..., -1.8044, -1.8044, -1.8044],
           [-1.8044, -1.8044, -1.8044,  ..., -1.8044, -1.8044, -1.8044],
           [-1.8044, -1.8044, -1.8044,  ..., -1.8044, -1.8044, -1.8044],
           ...,
           [-1.8044, -1.8044, -1.7870,  ..., -1.8044, -1.7522, -1.7173],
           [-1.8044, -1.8044, -1.8044,  ..., -1.8044, -1.8044, -1.7870],
           [-1.8044, -1.8044, -1.8044,  ..., -1.8044, -1.8044, -1.8044]]]),
  tensor([[[-2.0665, -2.0665, -2.0665,  ..., -2.0494, -2.0665, -2.0665],
           [-1.9638, -2.0152, -2.0665,  ..., -2.0494, -2.0665, -2.0665],
           [-1.9809, -2.0152, -2.0665,  ..., -2.0665, -2.0665, -2.0665],
           ...,
           [-2.0665, -2.0665, -2.0665,  ..., -2.0665, -2.0665, -2.0665],
           [-2.0665, -2.0665, -2.0665,  ..., -2.0665, -2.0665, -2.0665],
           [-2.0665, -2.0665, -2.0665,  ..., -2.0665, -2.0665, -2.0665]],

          [[-1.9832, -1.9832, -1.9832,  ..., -1.9657, -1.9832, -1.9832],
           [-1.8782, -1.9307, -1.9832,  ..., -1.9657, -1.9832, -1.9832],
           [-1.8957, -1.9307, -1.9832,  ..., -1.9832, -1.9832, -1.9832],
           ...,
           [-1.9832, -1.9832, -1.9832,  ..., -1.9832, -1.9832, -1.9832],
           [-1.9832, -1.9832, -1.9832,  ..., -1.9832, -1.9832, -1.9832],
           [-1.9832, -1.9832, -1.9832,  ..., -1.9832, -1.9832, -1.9832]],

          [[-1.7522, -1.7522, -1.7522,  ..., -1.7347, -1.7522, -1.7522],
           [-1.6476, -1.6999, -1.7522,  ..., -1.7347, -1.7522, -1.7522],
           [-1.6650, -1.6999, -1.7522,  ..., -1.7522, -1.7522, -1.7522],
           ...,
           [-1.7522, -1.7522, -1.7522,  ..., -1.7522, -1.7522, -1.7522],
           [-1.7522, -1.7522, -1.7522,  ..., -1.7522, -1.7522, -1.7522],
           [-1.7522, -1.7522, -1.7522,  ..., -1.7522, -1.7522, -1.7522]]])],
 0)

and dataset_train_simclr.dataset.__getitem__(1), they have the same structure (a list of two image tensors plus a label):

([tensor([[[-2.1179, -2.1179, -2.1179,  ..., -2.1179, -2.1179, -2.1179],
           [-2.1179, -2.1179, -2.0494,  ..., -2.1008, -2.1179, -2.1179],
           [-2.1179, -2.1179, -2.1008,  ..., -2.1008, -2.1179, -2.1179],
           ...,
           [-2.1179, -2.1179, -2.1179,  ..., -2.1179, -2.1179, -2.1179],
           [-2.1179, -2.1179, -2.1179,  ..., -2.1179, -2.1179, -2.1179],
           [-2.1179, -2.1179, -2.1179,  ..., -2.1179, -2.1179, -2.1179]],

          [[-2.0357, -2.0357, -2.0357,  ..., -2.0357, -2.0357, -2.0357],
           [-2.0357, -2.0357, -1.9657,  ..., -2.0182, -2.0357, -2.0357],
           [-2.0357, -2.0357, -2.0182,  ..., -2.0182, -2.0357, -2.0357],
           ...,
           [-2.0357, -2.0357, -2.0357,  ..., -2.0357, -2.0357, -2.0357],
           [-2.0357, -2.0357, -2.0357,  ..., -2.0357, -2.0357, -2.0357],
           [-2.0357, -2.0357, -2.0357,  ..., -2.0357, -2.0357, -2.0357]],

          [[-1.8044, -1.8044, -1.8044,  ..., -1.8044, -1.8044, -1.8044],
           [-1.8044, -1.8044, -1.7347,  ..., -1.7870, -1.8044, -1.8044],
           [-1.8044, -1.8044, -1.7870,  ..., -1.7870, -1.8044, -1.8044],
           ...,
           [-1.8044, -1.8044, -1.8044,  ..., -1.8044, -1.8044, -1.8044],
           [-1.8044, -1.8044, -1.8044,  ..., -1.8044, -1.8044, -1.8044],
           [-1.8044, -1.8044, -1.8044,  ..., -1.8044, -1.8044, -1.8044]]]),
  tensor([[[-2.1179, -2.1179, -2.1179,  ..., -2.1179, -2.1179, -2.1179],
           [-2.1179, -2.1179, -2.1179,  ..., -2.1179, -2.1179, -2.1179],
           [-2.1179, -2.1179, -2.1179,  ..., -2.1179, -2.1179, -2.1179],
           ...,
           [-2.1179, -2.1179, -2.1179,  ..., -0.2856, -0.8164, -1.3644],
           [-2.1179, -2.1179, -2.1179,  ..., -2.0152, -2.0152, -2.0665],
           [-2.1179, -2.1179, -2.1179,  ..., -2.1179, -2.1179, -2.1179]],

          [[-2.0357, -2.0357, -2.0357,  ..., -2.0357, -2.0357, -2.0357],
           [-2.0357, -2.0357, -2.0357,  ..., -2.0357, -2.0357, -2.0357],
           [-2.0357, -2.0357, -2.0357,  ..., -2.0357, -2.0357, -2.0357],
           ...,
           [-2.0357, -2.0357, -2.0357,  ..., -0.1625, -0.7052, -1.2654],
           [-2.0357, -2.0357, -2.0357,  ..., -1.9307, -1.9307, -1.9832],
           [-2.0357, -2.0357, -2.0357,  ..., -2.0357, -2.0357, -2.0357]],

          [[-1.8044, -1.8044, -1.8044,  ..., -1.8044, -1.8044, -1.8044],
           [-1.8044, -1.8044, -1.8044,  ..., -1.8044, -1.8044, -1.8044],
           [-1.8044, -1.8044, -1.8044,  ..., -1.8044, -1.8044, -1.8044],
           ...,
           [-1.8044, -1.8044, -1.8044,  ...,  0.0605, -0.4798, -1.0376],
           [-1.8044, -1.8044, -1.8044,  ..., -1.6999, -1.6999, -1.7522],
           [-1.8044, -1.8044, -1.8044,  ..., -1.8044, -1.8044, -1.8044]]])],
 0)

I pass the dataset to the DataLoader as in the tutorial:

dataloader_train_simclr = torch.utils.data.DataLoader(
    dataset_train_simclr_custom,
    batch_size=batch_size,
    shuffle=True,
    drop_last=True,
    num_workers=num_workers,
)

However, with the custom dataset, training never starts after calling trainer.fit(model, dataloader_train_simclr), in contrast to the dataset created by just passing the folder path.

RAM is never full, and I have already tried with far fewer images in the dataset as well.

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
C:\Users\Schelli\miniconda3\envs\lightly\lib\site-packages\pytorch_lightning\trainer\connectors\logger_connector\logger_connector.py:75: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `pytorch_lightning` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
You are using a CUDA device ('NVIDIA GeForce RTX 4060 Ti') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name            | Type                 | Params
---------------------------------------------------------
0 | backbone        | Sequential           | 11.2 M
1 | projection_head | SimCLRProjectionHead | 328 K 
2 | criterion       | NTXentLoss           | 0     
---------------------------------------------------------
11.5 M    Trainable params
0         Non-trainable params
11.5 M    Total params
46.022    Total estimated model params size (MB)
C:\Users\Schelli\miniconda3\envs\lightly\lib\site-packages\pytorch_lightning\trainer\connectors\data_connector.py:436: Consider setting `persistent_workers=True` in 'train_dataloader' to speed up the dataloader worker initialization.
C:\Users\Schelli\miniconda3\envs\lightly\lib\site-packages\pytorch_lightning\loops\fit_loop.py:298: The number of training batches (21) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.

My custom dataset class looks like this:

from PIL import Image
from torch.utils.data import Dataset


class CustomImageDataset(Dataset):
    def __init__(self, dataset, img_dir=None, transform=None, target_transform=None):
        # dataset maps each label to the list of image arrays held in RAM
        self.images = []
        self.img_labels = []
        print(len(dataset))
        # Flatten the mapping into two parallel lists: images and their labels
        for k, v in dataset.items():
            for img in v:
                self.images.append(img)
                self.img_labels.append(k)
        self.img_dir = img_dir
        self.transform = transform
        self.target_transform = target_transform

    def __len__(self):
        return len(self.img_labels)

    def __getitem__(self, idx):
        # Convert the in-memory array to a PIL image before applying transforms
        image = Image.fromarray(self.images[idx])
        label = self.img_labels[idx]
        if self.transform:
            image = self.transform(image)
        if self.target_transform:
            label = self.target_transform(label)
        return image, label

Any hint is appreciated.

guarin commented 6 months ago

Hi and thanks for using Lightly!

I guess this is not explained well in the docs, but you don't have to use a LightlyDataset; you can use your CustomImageDataset directly instead. Just make sure to set the transforms :)
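
To illustrate what "set the transforms" means for SimCLR: the transform has to return two augmented views per image, which matches the list of two tensors in the `__getitem__` output above. The following is only a minimal sketch of that idea. `TwoViewTransform` and `random_shift` are hypothetical names invented for this example, not part of Lightly or PyTorch, and the "image" is a plain list of floats so the snippet runs without extra dependencies.

```python
import random


class TwoViewTransform:
    """Hypothetical helper: apply one random augmentation twice to get two views."""

    def __init__(self, base_transform):
        self.base_transform = base_transform

    def __call__(self, image):
        # Each call to base_transform draws fresh randomness, so the views differ.
        return [self.base_transform(image), self.base_transform(image)]


def random_shift(pixels):
    """Toy augmentation (assumption): add one random offset to every value."""
    offset = random.uniform(-0.1, 0.1)
    return [p + offset for p in pixels]


if __name__ == "__main__":
    transform = TwoViewTransform(random_shift)
    views = transform([0.0, 0.5, 1.0])
    print(len(views), views)
```

A transform like this would be assigned the same way as in the original post, for example `dataset_lightly.transform = transform`, with the real augmentation pipeline from the tutorial in place of the toy one.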

schelmi1 commented 6 months ago

Thank you. It seems to be a pytorch_lightning problem then, as it does not work with the CustomImageDataset either.

guarin commented 6 months ago

Maybe try setting num_workers=0 to see if it works when the dataset is only in the main process. If it does, the issue might be related to starting the dataloader workers, as the dataset has to be copied to every worker process. You should be able to test this even outside PyTorch Lightning with:

from torch.utils.data import DataLoader

dataset = CustomImageDataset(...)
dataloader = DataLoader(dataset)
for batch in dataloader:
    print("got batch")
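
A rough way to gauge the cost of copying the dataset to each worker is to pickle it once and look at the size and time. This is only a sketch under assumptions: when worker processes are started with the spawn method (the default on Windows), the dataset object is pickled and sent to each worker, so a large in-RAM dataset can make startup slow. `pickled_size_mb` and the fake dataset below are hypothetical stand-ins, not part of Lightly or PyTorch.

```python
import pickle
import time


def pickled_size_mb(obj):
    """Pickle obj once and return (size in MiB, seconds taken)."""
    start = time.perf_counter()
    payload = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    elapsed = time.perf_counter() - start
    return len(payload) / 2**20, elapsed


if __name__ == "__main__":
    # Hypothetical stand-in for an in-RAM dataset: 1000 blank 64x64 RGB images as raw bytes.
    fake_dataset = {
        "images": [bytes(64 * 64 * 3) for _ in range(1000)],
        "labels": [0] * 1000,
    }
    size_mb, seconds = pickled_size_mb(fake_dataset)
    print(f"pickled size: {size_mb:.1f} MiB in {seconds:.3f} s")
```

Calling `pickled_size_mb(dataset)` on the real dataset object gives an estimate of what each worker has to receive before the first batch can be produced.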

schelmi1 commented 6 months ago

Finally got it! I'm now running it from the command line as a .py file, with the code inside if __name__ == "__main__":. It works with an arbitrary realistic number of workers and persistent_workers=True in the DataLoader args.

It's also extremely fast (compared to your tutorial with loading from disk, probably because it's loading directly from RAM?).
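
The setup described above can be sketched roughly as follows. This is a minimal, hypothetical reconstruction rather than the actual script: `InMemoryImageDataset`, `build_loader`, the random images, and the batch size are all assumptions made for illustration. The two points it tries to show are that the DataLoader iteration sits behind the `if __name__ == "__main__":` guard, which PyTorch requires on Windows when `num_workers > 0`, and that `persistent_workers=True` is only valid with at least one worker.

```python
import numpy as np
import torch
from PIL import Image
from torch.utils.data import DataLoader, Dataset


class InMemoryImageDataset(Dataset):
    """Hypothetical stand-in for CustomImageDataset: images are uint8 arrays in RAM."""

    def __init__(self, images, labels):
        self.images = images
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        image = Image.fromarray(self.images[idx])
        # Minimal PIL -> float tensor conversion with layout (C, H, W)
        array = np.asarray(image).copy()
        tensor = torch.from_numpy(array).permute(2, 0, 1).float() / 255.0
        return tensor, self.labels[idx]


def build_loader(num_workers):
    """Build a DataLoader over 64 random 32x32 RGB images (toy data)."""
    rng = np.random.default_rng(0)
    images = [rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8) for _ in range(64)]
    labels = [i % 10 for i in range(64)]
    return DataLoader(
        InMemoryImageDataset(images, labels),
        batch_size=16,
        shuffle=True,
        drop_last=True,
        num_workers=num_workers,
        # persistent_workers=True raises an error when num_workers is 0
        persistent_workers=num_workers > 0,
    )


def main():
    loader = build_loader(num_workers=2)
    for images, labels in loader:
        print(images.shape, labels.shape)


if __name__ == "__main__":
    # Worker processes re-import this module; the guard keeps them from re-running main().
    main()
```

In a real training script, the call to `trainer.fit(model, dataloader_train_simclr)` would take the place of the loop in `main()`.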

guarin commented 6 months ago

Great that you got it working! I'll close the issue for now.

> It's also extremely fast (compared to your tutorial with loading from disk, probably because it's loading directly from RAM?).

Yes, loading from RAM is usually much faster than from disk.

lightly-ai / lightly

SimCLR model.fit() does not start training with custom dataset loading from RAM #1518