open-mmlab / mmdetection

OpenMMLab Detection Toolbox and Benchmark
https://mmdetection.readthedocs.io
Apache License 2.0

Proper Shared Memory support for large COCO Datasets #11403

Open h-fernand opened 5 months ago

h-fernand commented 5 months ago

Describe the feature
When training on a COCO format dataset, only one copy of the dataset annotations should be loaded into RAM by the primary process, and all other GPU process dataloaders should pull from this copy in shared memory.

Motivation
Duplicating a large COCO dataset's annotations in a distributed training environment quickly eats all available system RAM. This makes training on large datasets impossible even in systems with over 1TB of RAM.

PeterVennerstrom commented 5 months ago

Sounds like a fun project!

A quick workaround would be to serialize the annotations on a per-image basis and load them by index, just like images in the pipeline. Then the annotation file would only need to contain image paths and annotation file paths.
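A minimal sketch of what that conversion could look like, using pycocotools; the output directory, file names, and on-disk JSON layout here are just assumptions:

```python
# Sketch: split a monolithic COCO annotation file into per-image JSON files
# plus a slim index of (image path, annotation path) pairs.
# Paths and the index format are assumptions, not an MMDetection convention.
import json
import os

from pycocotools.coco import COCO

ann_file = 'annotations/instances_train.json'   # original COCO annotations
out_dir = 'annotations/per_image'               # one small JSON per image
os.makedirs(out_dir, exist_ok=True)

coco = COCO(ann_file)
index = []
for img_id in coco.getImgIds():
    img_info = coco.loadImgs([img_id])[0]
    anns = coco.loadAnns(coco.getAnnIds(imgIds=[img_id]))
    ann_path = os.path.join(out_dir, f'{img_id}.json')
    with open(ann_path, 'w') as f:
        json.dump({'image': img_info, 'annotations': anns}, f)
    index.append({'img_path': img_info['file_name'], 'ann_path': ann_path})

# The slim index is all the dataset needs to hold in memory.
with open('annotations/train_index.json', 'w') as f:
    json.dump(index, f)
```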

BaseDataset has a load_data_list method that should be overridden to handle a path to a serialized annotation file as the annotation. The COCO dataset in MMDet has an example of overriding this method to handle COCO-specific annotations.
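For reference, a rough sketch of such an override built on mmengine's BaseDataset, assuming the slim index format from the conversion sketch above; the LazyCocoDataset name and the 'ann_path' key are made up for illustration:

```python
# Sketch: a dataset whose load_data_list() only records paths, so the full
# annotations never sit in memory. Class name, index format and the
# 'ann_path' key are assumptions, not an MMDetection API.
import json

from mmengine.dataset import BaseDataset

from mmdet.registry import DATASETS


@DATASETS.register_module()
class LazyCocoDataset(BaseDataset):

    def load_data_list(self) -> list:
        with open(self.ann_file) as f:
            index = json.load(f)  # [{'img_path': ..., 'ann_path': ...}, ...]
        return [
            dict(img_path=item['img_path'], ann_path=item['ann_path'])
            for item in index
        ]

    def filter_data(self) -> list:
        # filter_data can no longer inspect annotations it does not hold in
        # memory; assuming unwanted images were already dropped when the
        # annotations were serialized, pass the list through unchanged.
        return self.data_list
```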

LoadAnnotations should be subclassed to first read in the annotation from disk, then proceed through the rest of the original LoadAnnotations methods as appropriate.
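Under the same assumptions about the per-image file layout, the subclass could look roughly like this (the exact 'instances' format LoadAnnotations expects may differ between MMDetection versions, so treat it as a sketch):

```python
# Sketch: read the per-image annotation from disk, convert it into the
# 'instances' layout that MMDet 3.x LoadAnnotations consumes, then run the
# normal bbox/label/mask loading. The 'ann_path' key and file format follow
# the hypothetical conversion script above.
import json

from mmdet.datasets.transforms import LoadAnnotations
from mmdet.registry import TRANSFORMS


@TRANSFORMS.register_module()
class LazyLoadAnnotations(LoadAnnotations):  # hypothetical name

    def transform(self, results: dict) -> dict:
        with open(results['ann_path']) as f:
            raw = json.load(f)
        instances = []
        for ann in raw['annotations']:
            x, y, w, h = ann['bbox']  # COCO stores xywh
            instances.append(dict(
                bbox=[x, y, x + w, y + h],  # MMDet expects x1y1x2y2
                # A real implementation would map category_id to a contiguous
                # label index the way CocoDataset.parse_data_info does.
                bbox_label=ann['category_id'],
                ignore_flag=ann.get('iscrowd', 0),
            ))
        results['instances'] = instances
        return super().transform(results)
```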

Make sure filter_data still behaves as it should.

h-fernand commented 5 months ago

I'm a little confused about how exactly you would go about implementing this. From looking at CocoDataset and BaseDetDataset, I would have assumed that I'd have to modify __getitem__ to deserialize each individual annotation file. I've never worked with the LoadAnnotations transform directly, and my understanding of exactly how MMDetection dataloaders work is a little shaky. Would __getitem__ simply fetch a filename and LoadAnnotations would actually read the annotation in? In which method would I read the annotations from disk? How would filter_data be able to filter out annotations without reading them?

h-fernand commented 4 months ago

I've been working on a solution for this as @PeterVennerstrom described. I assume this approach would also require reimplementing the coco_metric evaluator such that it also lazily loads annotations?

h-fernand commented 4 months ago

If anybody stumbles upon this thread looking for answers, it is a little more complicated than the initial reply makes it sound. You will have to do the following:

- Serialize the annotations on a per-image basis and build a slim index of image and annotation file paths.
- Override load_data_list in the dataset so it only loads that index rather than the full annotations.
- Subclass LoadAnnotations so each per-image annotation is read from disk inside the pipeline.
- Rework or bypass filter_data, since it can no longer inspect annotations that are not in memory.
- Deal with evaluation, since the coco_metric evaluator otherwise expects the full annotation file (see below).

I'm not sure if any other changes are necessary to make this work, but I'm trucking along regardless. It would be really cool if proper lazy loading could be implemented. My solution is very hacky and I'm sure the ability to work with very large datasets is not unique to my use case.

PeterVennerstrom commented 4 months ago

Apologies if I made this sound easier than it turned out to be.

On evaluation: unless the test set itself is too large to fit into memory, it can be performed as before, from a standard COCO dataset loaded into memory. The train/val/test sets are defined and created independently, and in 3.x the COCO metric evaluation object is separate from the dataset object. In any case, if COCO metrics are needed, I'd recommend calculating them with pycocotools.
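For completeness, the standard pycocotools evaluation loop looks like this (file paths are placeholders):

```python
# Evaluate detections dumped to a COCO-style results JSON with pycocotools.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO('annotations/instances_val.json')  # ground truth (placeholder path)
coco_dt = coco_gt.loadRes('results.bbox.json')    # detections (placeholder path)

coco_eval = COCOeval(coco_gt, coco_dt, iouType='bbox')
coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()
```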

Filtering GT could be applied separately from the training code, when converting the annotations to their serialized format. For COCO it filters out images based on height and width and on whether they contain at least one instance.
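If it helps, this is roughly the check that could run during conversion; the thresholds mirror the filter_cfg values commonly used with CocoDataset (filter_empty_gt=True, min_size=32), but verify against your config:

```python
# Sketch: apply CocoDataset-style filtering once, while serializing the
# annotations, so the lazy dataset's filter_data can stay a pass-through.
def keep_image(img_info: dict, anns: list,
               min_size: int = 32, filter_empty_gt: bool = True) -> bool:
    """Return True if the image should be kept in the serialized index."""
    if min(img_info['width'], img_info['height']) < min_size:
        return False
    if filter_empty_gt and not anns:
        return False
    return True
```

Call it inside the conversion loop and skip any image it rejects before writing that image's per-image file or index entry.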