I am using ffcv loaders with Hugging Face's accelerate for single-node multi-GPU training on 8x A100 GPUs with 16GB of memory each.
With ffcv, my training loop is slower, I have to use a smaller batch size (32 rather than 48), and I also see what looks like a memory leak (CUDA OOM after two epochs).
Writing .beton
import numpy as np
import ffcv

class NumpyLabels:
    """Wraps a dataset so labels come out as int16 numpy arrays."""
    def __init__(self, dataset):
        self.dataset = dataset

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, i):
        img, labels = self.dataset[i]
        return (img, labels.numpy().astype(np.int16))

def main():
    dataset = NumpyLabels(HierarchicalImageFolder("/mnt/10tb/data/train"))
    writer = ffcv.writer.DatasetWriter(
        "/mnt/10tb/data/train.beton",
        {
            "image": ffcv.fields.RGBImageField(max_resolution=192),
            "label": ffcv.fields.NDArrayField(
                # int16 is fine for predicting up to 10,000 classes:
                # signed 16-bit max is 2^15 - 1 = 32,767
                shape=(7,),
                dtype=np.dtype("int16"),
            ),
        },
        num_workers=32,
    )
    writer.from_indexed_dataset(dataset, chunksize=1000)
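(For reference, here is a quick standalone check that int16 covers my label range; pure numpy, no ffcv needed:)

```python
import numpy as np

# Signed 16-bit integers span -2^15 .. 2^15 - 1
info = np.iinfo(np.int16)
print(info.max)  # 32767

# A (7,)-shaped label vector with class ids up to 10,000
# round-trips through int16 losslessly
labels = np.array([0, 1234, 9999, 10000, 5, 42, 7], dtype=np.int64)
as_i16 = labels.astype(np.int16)
assert np.array_equal(as_i16.astype(np.int64), labels)
```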
My GPU utilization fluctuates between 75 and 100%, and memory usage sits at only ~75% on GPUs 1-7 because it is much higher on GPU 0 (so I cannot increase the batch size).
Do you have any advice on how to improve performance?
Dataloader
My dataset is about 2.7M 192x192 images.