libffcv / ffcv

FFCV: Fast Forward Computer Vision (and other ML workloads!)
https://ffcv.io
Apache License 2.0

Loader stuck indefinitely in constructor #113

Closed · bsmietanka closed this issue 2 years ago

bsmietanka commented 2 years ago

Hello, I tried to run a simple training script using the FFCV loader by adapting an example from your README (script attached below). I first converted my custom dataset. During execution, the program hangs indefinitely while consuming more and more memory and pinning one CPU core at 100%. I ran the script inside a Docker container built using the Dockerfile provided in this repository. When I interrupt the program, it looks like it is stuck in a loop in the MemoryManager base class.

Error

File "/workspace/train_ffcv.py", line 35, in <module>
    loader = Loader(write_path, batch_size=bs, num_workers=num_workers,
File "/opt/conda/envs/ffcv/lib/python3.9/site-packages/ffcv/loader/loader.py", line 146, in __init__
    self.memory_manager: MemoryManager = ProcessCacheManager(
File "/opt/conda/envs/ffcv/lib/python3.9/site-packages/ffcv/memory_managers/base.py", line 62, in __init__
    page_to_samples[pid].add(sid)
KeyboardInterrupt

Code used:

import numpy as np
import torch
from tqdm import tqdm
from torchvision.models import resnet18

from ffcv.loader import Loader, OrderOption
from ffcv.transforms import ToTensor, ToDevice, ToTorchImage, Cutout, NormalizeImage
from ffcv.fields.decoders import IntDecoder, RandomResizedCropRGBImageDecoder

epochs = 20
write_path = "data/ffcv/helmets.beton"
bs = 512
num_workers = 6

decoder = RandomResizedCropRGBImageDecoder((128, 128), (0.6, 1.))
normalizer = NormalizeImage(np.array([0.485, 0.456, 0.406]), np.array([0.229, 0.224, 0.225]), np.float32)

image_pipeline = [decoder, ToTensor(), ToTorchImage(), normalizer, ToDevice(0)]
label_pipeline = [IntDecoder(), ToTensor(), ToDevice(0)]

pipelines = {
    'image': image_pipeline,
    'label': label_pipeline
}

model = resnet18(pretrained=True)
model.fc = torch.nn.Linear(512, 2)
model.to("cuda:0")

loader = Loader(write_path, batch_size=bs, num_workers=num_workers,
                order=OrderOption.RANDOM, pipelines=pipelines)

optimizer = torch.optim.Adam(model.parameters())
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(epochs):
    for images, labels in tqdm(loader, desc=f"Epoch {epoch}:"):
        images = images.to("cuda:0")
        labels = labels.to("cuda:0")
        optimizer.zero_grad()        # reset gradients before each step
        preds = model(images)
        loss = loss_fn(preds, labels)
        loss.backward()
        optimizer.step()

GuillaumeLeclerc commented 2 years ago

Hello,

Thank you for the report! Is there a way you could upload a (potentially reduced-size) dataset so that we can reproduce and identify the problem? (If the data is private, you can just replace the images with white noise when you generate the .beton.) Your code is technically correct and should definitely not hang, and I want to understand what is going on. However, it seems you are using both RANDOM ordering and the process cache, which are not really designed to work together (see our tuning guides for more details; a sketch of the recommended pairings follows below). Just to help me debug, could you also try:

Thanks!
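
For context, here is a minimal sketch of the order/cache pairings the tuning guide points to. The keyword names (order, os_cache) are the ones used in the Loader constructor above and the variables are reused from the script in the first comment; the comments summarize the documented behaviour and may differ slightly between versions:

from ffcv.loader import Loader, OrderOption

# RANDOM needs access to the whole dataset each epoch, so it is usually
# paired with the OS page cache (os_cache=True).
loader_random = Loader(write_path, batch_size=bs, num_workers=num_workers,
                       order=OrderOption.RANDOM, os_cache=True,
                       pipelines=pipelines)

# The process cache (os_cache=False) is aimed at datasets larger than RAM
# and pairs with QUASI_RANDOM, which shuffles within a window so only a
# bounded set of pages needs to be resident at once.
loader_quasi = Loader(write_path, batch_size=bs, num_workers=num_workers,
                      order=OrderOption.QUASI_RANDOM, os_cache=False,
                      pipelines=pipelines)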

bsmietanka commented 2 years ago

I tried multiple combinations of Loader parameters: every OrderOption with both os_cache=True and os_cache=False. I got the same result in every scenario. I also tried loading just the smallest possible subset of my dataset by setting indices=[0], but with no success. I'll try to upload some version of the dataset tomorrow (either white noise or just a reduced size); for now, some information about it: the .beton file takes about 5.5 GB, the dataset is pretty small, and the images were saved with jpeg_quality=90 and max_resolution=256.

GuillaumeLeclerc commented 2 years ago

When you generate a .beton you can pass an array of indices you actually want to include in the dataset (sketched below). Could you try with just the minimum number of samples required to make the issue occur, so that uploading/downloading is not a problem? (I have a very limited internet connection right now.)
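
A minimal sketch of writing a small repro subset, assuming the DatasetWriter API shown in the FFCV README; the output filename, field configuration, and my_dataset placeholder are illustrative and would need adjusting to the actual dataset:

from ffcv.writer import DatasetWriter
from ffcv.fields import RGBImageField, IntField

# `my_dataset` is a placeholder for the indexed dataset used to build the
# original .beton (returns (PIL image, int label) per sample).
writer = DatasetWriter("data/ffcv/helmets_small.beton", {
    'image': RGBImageField(max_resolution=256, jpeg_quality=90),
    'label': IntField(),
}, num_workers=6)

# Write only the first few samples so the file stays small enough to share.
writer.from_indexed_dataset(my_dataset, indices=list(range(32)))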

bsmietanka commented 2 years ago

I generated a .beton file on a smaller subset of images, and this way I found out it was a problem with the FFCV dataset. During dataset creation I got an assertion error from the OpenCV resize function when it was applied to images with either height or width == 1. This probably corrupted the .beton file. After I deleted those images it worked. Maybe it's worth checking for such cases when saving a dataset, or validating the .beton file on Loader initialization.
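
As an aside, a quick pre-write sanity check along these lines can catch such degenerate images before they reach the writer; this is just an illustrative filter (image_paths is a placeholder), not part of FFCV:

from PIL import Image

def valid_indices(image_paths, min_side=2):
    """Return indices of images whose width and height are both >= min_side."""
    keep = []
    for i, path in enumerate(image_paths):
        with Image.open(path) as img:
            w, h = img.size
        if w >= min_side and h >= min_side:
            keep.append(i)
    return keep

# The resulting list can be passed as `indices` when writing the .beton.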

But this is not the end of my issues. The first thing was that this pipeline was working:

# Random resized crop
decoder = RandomResizedCropRGBImageDecoder((128, 128), (0.6, 1.))
normalizer = NormalizeImage(np.array([0.485, 0.456, 0.406]), np.array([0.229, 0.224, 0.225]), np.float32)

# Data decoding and augmentation
image_pipeline = [decoder, ToTensor(), ToTorchImage(), ToDevice(0), normalizer]

But this one was throwing the exception AssertionError: Can't be in JIT mode and on the GPU:

image_pipeline = [decoder, ToTensor(), ToTorchImage(), normalizer, ToDevice(0)]

The second ordering seemed more logical to me: send to the device as the last step.

The second issue is that using FFCV gave me a significant performance hit compared to pure PyTorch (time of one epoch on FFCV: ~5 minutes, PyTorch: ~3.5 minutes, averaged after the first warmup epoch). Could it be because of my dataset? It consists of many very small images.

Thanks for the quick replies.

GuillaumeLeclerc commented 2 years ago

Hello @bsmietanka,

Sorry for the delay. Both of your pipelines should work; they just do different things: in one case the normalization is performed on the GPU, in the other it is done on the CPU. I have seen similar issues with the original release. Could you try with the latest version to see if it helps? If not, could you post the complete code?
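
For reference, a sketch of the GPU-side normalization ordering used by the example configs in the FFCV repo; the non_blocking flag and the scaling of mean/std by 255 are borrowed from those examples and are assumptions here, not something stated in this thread:

import numpy as np
import torch
from ffcv.fields.decoders import RandomResizedCropRGBImageDecoder
from ffcv.transforms import ToTensor, ToDevice, ToTorchImage, NormalizeImage

decoder = RandomResizedCropRGBImageDecoder((128, 128), (0.6, 1.))

# The example configs scale mean/std by 255 because the decoder yields
# uint8 images in the 0-255 range; adjust if your data is already scaled.
mean = np.array([0.485, 0.456, 0.406]) * 255
std = np.array([0.229, 0.224, 0.225]) * 255

# NormalizeImage placed after ToDevice runs on the GPU; placed before it,
# it runs as JIT-compiled CPU code instead.
image_pipeline = [
    decoder,
    ToTensor(),
    ToDevice(torch.device('cuda:0'), non_blocking=True),
    ToTorchImage(),
    NormalizeImage(mean, std, np.float32),
]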

About the second problem, I suspect it is a configuration issue; in theory there is no situation where FFCV should be slower than PyTorch. Could you post the comparison between the two versions (PyTorch vs FFCV)?
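
To make that comparison concrete, a minimal timing harness along these lines keeps the measurement identical for both loaders; make_ffcv_loader and make_torch_loader are hypothetical placeholders for however the two loaders are actually built:

import time
import torch

def time_epoch(loader, device="cuda:0"):
    """Iterate one full epoch and return the wall-clock time in seconds."""
    start = time.perf_counter()
    for images, labels in loader:
        images = images.to(device, non_blocking=True)
        labels = labels.to(device, non_blocking=True)
    torch.cuda.synchronize()             # make sure queued GPU copies finish
    return time.perf_counter() - start

# Example usage (placeholders, not real helpers):
# ffcv_loader = make_ffcv_loader(...)
# torch_loader = make_torch_loader(...)
# _ = time_epoch(ffcv_loader)            # warmup epoch
# print("FFCV:   ", time_epoch(ffcv_loader))
# print("PyTorch:", time_epoch(torch_loader))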

GuillaumeLeclerc commented 2 years ago

Closing due to inactivity. Feel free to re-open when you have more information.