voxel51 / fiftyone

The open-source tool for building high-quality datasets and computer vision models
https://fiftyone.ai
Apache License 2.0

[BUG] Memory leak when using `add_coco_labels` for instance segmentation with coco_id_field set #4407

Open h-fernand opened 1 month ago

h-fernand commented 1 month ago

Describe the problem

When I try to add COCO-format instance segmentation predictions to my dataset using `add_coco_labels`, the program rapidly consumes RAM until it eventually runs out of memory and crashes. This only happens when I set `coco_id_field` to `coco_id` so that my annotations sync up with my samples correctly. If I omit `coco_id_field` and let the function run with its default behavior, my annotations get mismatched, but the program uses far less RAM and actually finishes. The same erroneous behavior occurs if I pass `add_coco_labels` a view containing only the test split instead of the whole dataset.

Code to reproduce issue

import json

import fiftyone as fo
import fiftyone.utils.coco as fouc

dataset_name = "dataset"
splits = ['train', 'val', 'test']
dataset_root = '/path/to/dataset/root'
annotations_dir = 'annotations'
annfile_template = 'instances_{split}.json'

predictions_file = '/path/to/predictions/file.json'

combined_dataset = fo.Dataset(name=dataset_name, persistent=True)

for split in splits:
    print(f"Loading: {split} dataset")

    annfile = f"{dataset_root}/{annotations_dir}/{annfile_template.format(split=split)}"
    data_path = f"{dataset_root}/{split}"
    split_dataset_name = f"ground_truth_{split}"

    split_dataset = fo.Dataset.from_dir(
        data_path=data_path,
        labels_path=annfile,
        dataset_type=fo.types.COCODetectionDataset,
        name=split_dataset_name,
        include_id=True,
        persistent=True
    )
    split_dataset.tag_samples(split)
    combined_dataset.merge_samples(split_dataset)

with open(predictions_file, 'r') as f:
    prediction_data = json.load(f)

predictions = prediction_data['annotations']
classes = prediction_data['categories']
classes = [x['name'] for x in classes]

fouc.add_coco_labels(
    combined_dataset,
    "predictions",
    predictions,
    classes,
    label_type="segmentations",
    coco_id_field="coco_id",
)
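As a possible interim workaround (untested against this bug, and `chunked` is a hypothetical helper, not part of FiftyOne), the predictions could be added in fixed-size batches so that any per-call memory growth stays bounded:

```python
# Hypothetical workaround sketch: feed predictions to add_coco_labels in
# batches instead of one giant list. Assumes repeated calls with
# coco_id_field match samples the same way a single call would.

def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# for batch in chunked(predictions, 10_000):
#     fouc.add_coco_labels(
#         combined_dataset,
#         "predictions",
#         batch,
#         classes,
#         label_type="segmentations",
#         coco_id_field="coco_id",
#     )
```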

System information

Willingness to contribute

The FiftyOne Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the FiftyOne codebase?

h-fernand commented 1 month ago

As an update, the dramatic memory usage occurs regardless of how the function is invoked, whenever it runs to a correct result. The reason it did not exhaust RAM without `coco_id_field` set is that the annotations were matched in the wrong order; once the annotation order is fixed, the memory blow-up happens there too. I'm convinced this is a memory leak because my prediction annotation file is only 2 GB and there are only 1000 images in the test split I'm adding predictions to, yet the program ends up consuming all of the RAM on a system with 256 GB.
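For reference, one stdlib-only way to quantify that growth is to sample the process's peak resident set size around the call. This is a sketch, assuming Linux, where `ru_maxrss` is reported in kilobytes (macOS reports bytes):

```python
import resource

def peak_rss_mb():
    """Peak resident set size of this process, in MB (Linux reports KB)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

# before = peak_rss_mb()
# fouc.add_coco_labels(..., coco_id_field="coco_id")
# print(f"peak RSS grew by {peak_rss_mb() - before:.0f} MB")
```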

brimoor commented 1 month ago

@h-fernand this sounds similar to the issue reported in https://github.com/voxel51/fiftyone/issues/4293 which has been resolved in https://github.com/voxel51/fiftyone/pull/4354.

(FYI the above patch will be released in fiftyone==0.24.0 which is scheduled for next week)

h-fernand commented 1 month ago

That's great news. I'll try the patch out once it's released and will post an update in this thread on whether it resolves the issue.