libffcv / ffcv

FFCV: Fast Forward Computer Vision (and other ML workloads!)
https://ffcv.io
Apache License 2.0

Varying Beton file Size Issue!! #142

Closed ByungKwanLee closed 2 years ago

ByungKwanLee commented 2 years ago

Hello!

While examining the code to understand where the effectiveness of the changed dataloader comes from, I noticed something strange when converting datasets to the .beton format. Whenever I convert CIFAR10 (and likewise CIFAR100/SVHN/Tiny-ImageNet/ImageNet) to .beton, the resulting file size differs from run to run. Moreover, once I change the num_workers argument from 0 to 16, the size gap grows even larger. I would like to know why this randomness in the beton size occurs.

The following code is what I use to save each dataset in the beton format.

# Import built-in module
import os
import argparse

# fetch args
parser = argparse.ArgumentParser()

# parameter
parser.add_argument('--dataset', default='imagenet', type=str)
parser.add_argument('--gpu', default='0', type=str)
args = parser.parse_args()

# GPU configurations
os.environ["CUDA_VISIBLE_DEVICES"]=args.gpu

# init fast dataloader
from utils.fast_data_utils import save_data_for_beton
save_data_for_beton(dataset=args.dataset)

And the following is the definition of save_data_for_beton, the function used in the script above:


import torchvision

from ffcv.writer import DatasetWriter
from ffcv.fields import RGBImageField, IntField


def save_data_for_beton(dataset, root='../data'):
    # Select the source dataset
    if dataset == 'cifar10':
        trainset = torchvision.datasets.CIFAR10(root=root, train=True, download=True)
        testset = torchvision.datasets.CIFAR10(root=root, train=False, download=True)
    elif dataset == 'cifar100':
        trainset = torchvision.datasets.CIFAR100(root=root, train=True, download=True)
        testset = torchvision.datasets.CIFAR100(root=root, train=False, download=True)
    elif dataset == 'svhn':
        trainset = torchvision.datasets.SVHN(root=root, split='train', download=True)
        testset = torchvision.datasets.SVHN(root=root, split='test', download=True)
    elif dataset == 'tiny':
        trainset = torchvision.datasets.ImageFolder(root + '/tiny-imagenet-200/train')
        testset = torchvision.datasets.ImageFolder(root + '/tiny-imagenet-200/val')

    if dataset == 'imagenet':
        trainset = torchvision.datasets.ImageFolder('/imagenet_path')
        testset = torchvision.datasets.ImageFolder('/imagenet_path')

        # Large dataset: resize and (probabilistically) JPEG-compress images
        # to keep the .beton file size manageable
        datasets = {
            'train': trainset,
            'test': testset
        }
        for name, ds in datasets.items():
            writer = DatasetWriter(f'/mnt/hard1/lbk/{dataset}/{dataset}_{name}.beton', {
                'image': RGBImageField(write_mode='smart',
                                       max_resolution=256,
                                       compress_probability=0.50,
                                       jpeg_quality=90),
                'label': IntField(),
            }, num_workers=16)
            writer.from_indexed_dataset(ds, chunksize=100)
    else:
        # Small datasets: store raw RGB images
        datasets = {
            'train': trainset,
            'test': testset
        }
        for name, ds in datasets.items():
            writer = DatasetWriter(f'{root}/../ffcv_data/{dataset}/{dataset}_{name}.beton', {
                'image': RGBImageField(),
                'label': IntField(),
            }, num_workers=16)
            writer.from_indexed_dataset(ds)

GuillaumeLeclerc commented 2 years ago

This is expected. As with most multi-threaded applications, the execution order is non-deterministic. The only way to get a deterministic (and smallest possible) file is to use num_workers=0.

There is also a second factor impacting file size: .beton files are organized in "pages", big blocks of memory (default size = 8 MB). Each worker writes to its own page so that workers do not "disturb" each other. The catch is that at the end of the dataset, each worker will not completely fill its last page (on average it will be half full), so the expected empty space in the file is PAGE_SIZE / 2 * num_workers. It's up to you to decide the trade-off between wasted space and the speed of generating the dataset. On CIFAR it probably doesn't make sense to use too many cores; on ImageNet, a 400 MB overhead from 96 workers isn't much.
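The expected slack can be estimated with a quick back-of-the-envelope calculation based on the page layout described above (the page size and worker counts below are illustrative, not measured from a real file):

```python
PAGE_SIZE = 8 * 2**20  # default page size: 8 MB


def expected_overhead_bytes(num_workers: int, page_size: int = PAGE_SIZE) -> int:
    """Each worker's last page is, on average, half empty."""
    return page_size // 2 * num_workers


# 96 workers -> roughly 384 MB of expected empty space (the "400MB" figure above)
print(expected_overhead_bytes(96) / 2**20, "MB")

# A single-threaded write (num_workers=0) has no per-worker slack,
# which is why it also produces the smallest possible file
print(expected_overhead_bytes(0) / 2**20, "MB")
```

This also explains why file sizes vary between runs: how full each worker's final page ends up depends on the non-deterministic order in which samples are assigned to workers.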

JacobARose commented 2 years ago

There is also a second factor impacting file size: .beton files are organized in "pages", big blocks of memory (default size = 8 MB). Each worker writes to its own page so that workers do not "disturb" each other.

@GuillaumeLeclerc This explains why there is no mention of sharding afaik in the FFCV documentation, and I think should certainly be mentioned if not emphasized and touted. Formats like WebDataset and TFRecords almost always require additional engineering on the part of the user to decide beforehand what a reasonable number of shards might be.

However, it does seem to come with one drawback: relying on the successful read/write of one giant file rather than many smaller files is simply much less robust. Most significantly (imo), it prevents users from restarting an interrupted writing process without starting over from the beginning. I'm running into this issue while trying to convert the newish Herbarium 2022 competition dataset, which has something like 800k+ training images, adding up to at least 200+ GB that need to be serialized into a single file.

Without a user-facing API for inspecting/repairing incomplete beton files, or an alternative where it's made demonstrably simpler to produce sets of beton files to be used as shards of a single dataset, I imagine this framework will be limited in the space of big to very big data (which may not be on the developers' radar, completely valid).
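One way to approximate restartable writes today is to shard manually: split the sample indices into fixed-size chunks and write one .beton per chunk, so an interrupted conversion can resume at the first shard missing on disk. The helper below is a plain-Python sketch (shard_indices and the file names are my own illustrative assumptions, not an FFCV API); in a real script each chunk would be wrapped in torch.utils.data.Subset and passed to its own DatasetWriter:

```python
def shard_indices(n_samples: int, shard_size: int) -> list[range]:
    """Partition [0, n_samples) into contiguous index ranges of at most shard_size."""
    return [range(lo, min(lo + shard_size, n_samples))
            for lo in range(0, n_samples, shard_size)]


# ~800k Herbarium images in 100k-sample shards
shards = shard_indices(800_000, 100_000)
print(len(shards))   # 8
print(shards[-1])    # range(700000, 800000)

# For each shard i whose file does not yet exist on disk, one would do roughly:
#   writer = DatasetWriter(f'herbarium_train_{i:03d}.beton', fields, num_workers=16)
#   writer.from_indexed_dataset(Subset(trainset, list(shards[i])))
```

At load time each shard is an independent .beton, so a crash only costs the shard being written, at the price of managing multiple Loader instances yourself.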

GuillaumeLeclerc commented 2 years ago

I don't think the size of "shards" in FFCV is as important as it is in other formats. As long as a shard can fit a couple of samples, increasing its size brings no further benefit.

We have an FFCV version of YFCC100m and have not faced any issues generating it. Maybe you could open an issue describing what you are experiencing?