libffcv / ffcv

FFCV: Fast Forward Computer Vision (and other ML workloads!)
https://ffcv.io
Apache License 2.0

ffcv imagenet won't start #376

Open VarusJ opened 6 months ago

VarusJ commented 6 months ago

Hi! I am training a ResNet-50 with FFCV on ImageNet. Here is my config: [image]

I am having trouble getting training to start, as shown here: [image]. It stays stuck at 0 and never starts. I checked that CUDA is working fine. Please help! Thanks a lot!

VarusJ commented 6 months ago

I set some breakpoints, and it turns out that nothing is ever yielded by the for loop over the FFCV loader:

for ix, (images, target) in enumerate(train_loader):
    ...
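
When a loop like this produces nothing, it can help to tell a loader that hangs apart from one that is simply empty. Below is a minimal sketch that pulls the first batch with a timeout. It assumes the loader is an ordinary Python iterable; `probe_first_batch` is a hypothetical helper, not part of FFCV, and whether an FFCV `Loader` behaves the same when iterated from a background thread is an assumption that has not been checked here.

```python
import threading


def probe_first_batch(loader, timeout_s=30.0):
    """Try to fetch one batch from `loader` within `timeout_s` seconds.

    Returns a (status, payload) tuple where status is one of:
      "ok"      - a batch arrived; payload is the batch
      "empty"   - the loader finished without yielding; payload is None
      "error"   - iteration raised; payload is the exception
      "timeout" - nothing happened in time; payload is None
    """
    result = {}

    def _worker():
        try:
            result["value"] = ("ok", next(iter(loader)))
        except StopIteration:
            result["value"] = ("empty", None)
        except Exception as exc:  # surface any loader error to the caller
            result["value"] = ("error", exc)

    # Daemon thread: a stuck loader will not keep the process alive.
    thread = threading.Thread(target=_worker, daemon=True)
    thread.start()
    thread.join(timeout_s)
    return result.get("value", ("timeout", None))
```

A call such as `status, batch = probe_first_batch(train_loader, timeout_s=60)` would then report `"empty"` if the loader has no batches to give and `"timeout"` if it is blocked somewhere inside the pipeline.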

I define the train_loader as follows:

def create_train_loader(self, train_dataset, num_workers, batch_size,
                            distributed, in_memory):
        this_device = f'cuda:{self.gpu}'
        train_path = Path(train_dataset)
        assert train_path.is_file()

        res = self.get_resolution(epoch=0)
        self.decoder = RandomResizedCropRGBImageDecoder((res, res))
        gaussian_kernel_size = 5
        sigma = 2
        image_pipeline: List[Operation] = [
            self.decoder,
            RandomHorizontalFlip(),
            ToTensor(),
            transforms.RandomApply([transforms.GaussianBlur(gaussian_kernel_size, sigma)], p=0.5),
            ToDevice(ch.device(this_device), non_blocking=True),
            ToTorchImage(),
            NormalizeImage(IMAGENET_MEAN, IMAGENET_STD, np.float16)
        ]

        label_pipeline: List[Operation] = [
            IntDecoder(),
            ToTensor(),
            Squeeze(),
            ToDevice(ch.device(this_device), non_blocking=True)
        ]

        order = OrderOption.RANDOM if distributed else OrderOption.QUASI_RANDOM
        loader = Loader(train_dataset,
                        batch_size=batch_size,
                        num_workers=num_workers,
                        order=order,
                        os_cache=in_memory,
                        drop_last=True,
                        pipelines={
                            'image': image_pipeline,
                            'label': label_pipeline
                        },
                        distributed=distributed)

        print("loader: ", loader)

        return loader

Could really use some insights!!!
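
Since the loader above is created with `drop_last=True`, one thing worth ruling out is that each process sees fewer samples than `batch_size`: a loader that drops incomplete batches then yields nothing at all. Below is a rough sketch of the arithmetic. The function name is hypothetical and the even split across `world_size` processes is an assumption, not a statement about how FFCV shards data.

```python
import math


def expected_batches(num_samples, batch_size, world_size=1, drop_last=True):
    """Estimate how many batches each process should see per epoch.

    Assumes samples are divided evenly across `world_size` processes.
    """
    per_process = num_samples // world_size
    if drop_last:
        # Incomplete trailing batch is discarded.
        return per_process // batch_size
    # Incomplete trailing batch is kept.
    return math.ceil(per_process / batch_size)
```

For example, `expected_batches(1000, 512, world_size=8)` gives 0, because each process would hold only 125 samples. If the number comes out positive for the real dataset and batch size, the empty loop is more likely caused by something in the pipeline itself.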