roboflow / supervision

We write your reusable computer vision tools. 💜
https://supervision.roboflow.com
MIT License

[DetectionDataset] - enable lazy dataset loading #316

Open hardikdava opened 1 year ago

hardikdava commented 1 year ago

Search before asking

Bug

sv.DetectionDataset loads images unnecessarily. It would be preferable for it to load an image only when it is actually needed. This would make it possible to work with large datasets without keeping every image in memory.

Environment

No response

Minimal Reproducible Example

No response

Additional

No response

Are you willing to submit a PR?

SkalskiP commented 1 year ago

Hi @hardikdava 👋🏻!

Here is my idea. Let's create a set of separate methods sv.DetectionDataset.generate_from_*. Unlike sv.DetectionDataset.from_*, it would return a Python generator. What do you think?

sv.DetectionDataset.generate_from_yolo(
    images_directory_path: str,
    annotations_directory_path: str
) -> Generator[Tuple[str, np.ndarray, sv.Detections], None, None]:
    pass

for path, image, detections in sv.DetectionDataset.generate_from_yolo(...):
    pass

- sv.DetectionDataset.generate_from_yolo(...)
- sv.DetectionDataset.generate_from_coco(...)
- sv.DetectionDataset.generate_from_pascal_voc(...)
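For illustration, here is a standalone sketch of the generator idea. The function name and signature are hypothetical (they are not the supervision API), and the image loader is passed in as a callable so the example stays library-agnostic:

```python
import os
from typing import Any, Callable, Iterator, Tuple


def generate_from_directory(
    images_directory_path: str,
    load_image: Callable[[str], Any],
    extensions: Tuple[str, ...] = (".jpg", ".jpeg", ".png"),
) -> Iterator[Tuple[str, Any]]:
    """Yield (path, image) pairs one at a time, keeping one image in memory."""
    for name in sorted(os.listdir(images_directory_path)):
        if name.lower().endswith(extensions):
            path = os.path.join(images_directory_path, name)
            # The image is only read when the consumer advances the generator.
            yield path, load_image(path)
```

Because it is a generator, nothing is read from disk until iteration begins, which is what makes the approach memory-friendly for large datasets.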
hardikdava commented 1 year ago

@SkalskiP Is there any way that we could just modify the current APIs? Otherwise users will be confused between the sv.DetectionDataset.from_* and sv.DetectionDataset.generate_from_* methods.

hardikdava commented 1 year ago

@SkalskiP Is it possible to use a callback system for loading images? Then we would not have to worry about so many things.
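One hedged sketch of what a callback-based design could look like (all names here are hypothetical, not supervision API): the dataset stores only paths, and the caller supplies the function that turns a path into an image.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class CallbackDataset:
    """Hypothetical dataset that defers image loading to a user-supplied callback."""

    image_paths: List[str]
    load_image: Callable[[str], Any]  # e.g. cv2.imread in a real setup
    annotations: Dict[str, Any] = field(default_factory=dict)

    def __getitem__(self, index: int):
        path = self.image_paths[index]
        # The callback is only invoked here, at access time.
        return path, self.load_image(path), self.annotations.get(path)

    def __len__(self) -> int:
        return len(self.image_paths)
```

The trade-off SkalskiP alludes to below is real: the callback keeps the core class format-agnostic, but every caller now has to know how to construct a correct loader.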

SkalskiP commented 1 year ago

@hardikdava didn't you tell me a few weeks ago that callbacks make everything more complicated?

tfriedel commented 1 year ago

I also ran into this issue: https://github.com/autodistill/autodistill/issues/45

In this case the problem was with the ClassificationDataset. I would suggest keeping track of image paths instead of images, and loading them only when an image is accessed. A relatively easy way to implement this would be to replace the "images" dict that maps from str to ndarray with a kind of lazy-loading dict, where the setter just stores the filename but the getter loads the image. I'm not sure where these classes are used and whether they are performance-critical, e.g. during training of an image classification model. I'm assuming they aren't used for that case, but if they were, I'd probably resort to more efficient solutions like PyTorch datasets + dataloaders.

tfriedel commented 1 year ago

Example:

from __future__ import annotations

from collections.abc import MutableMapping
from dataclasses import dataclass
from typing import Dict, List, Tuple

import cv2
import numpy as np

# BaseDataset, Classifications, and train_test_split are assumed to come
# from the supervision package internals.

class LazyLoadDict(MutableMapping):
    def __init__(self, initial_data: Dict[str, str]):
        # Maps image name -> file path; images are only read on access.
        self._data = initial_data

    def __getitem__(self, key: str) -> np.ndarray:
        # Load the image from disk lazily, on every access.
        return cv2.imread(self._data[key])

    def __setitem__(self, key: str, value: str) -> None:
        self._data[key] = value

    def __delitem__(self, key: str) -> None:
        del self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)

@dataclass
class ClassificationDataset(BaseDataset):
    classes: List[str]
    images: LazyLoadDict
    annotations: Dict[str, Classifications]

    def __len__(self) -> int:
        return len(self.images)

    def split(self, split_ratio=0.8, random_state=None, shuffle: bool = True) -> Tuple[ClassificationDataset, ClassificationDataset]:
        image_names = list(self.images.keys())
        train_names, test_names = train_test_split(
            data=image_names,
            train_ratio=split_ratio,
            random_state=random_state,
            shuffle=shuffle,
        )
        train_dataset = ClassificationDataset(
            classes=self.classes,
            images=LazyLoadDict({name: self.images._data[name] for name in train_names}),
            annotations={name: self.annotations[name] for name in train_names},
        )
        test_dataset = ClassificationDataset(
            classes=self.classes,
            images=LazyLoadDict({name: self.images._data[name] for name in test_names}),
            annotations={name: self.annotations[name] for name in test_names},
        )
        return train_dataset, test_dataset

    # ... (rest of the methods, adjusted to use LazyLoadDict when needed)
hardikdava commented 1 year ago

Thanks @tfriedel for the suggestions. We will take a look at them soon. This might be the solution to our current issue.

tfriedel commented 1 year ago

I implemented this to be able to train on a 10,000+ image dataset on my machine. I did it both for ClassificationDataset and DetectionDataset. Additionally, I had to swap out the detections_map dict and replace it with a shelve (basically a dict that's stored on disk): the results were mostly segmentation masks, and those also consumed too much memory. The modifications were made both to the supervision package and to the autodistill base models. I'm not sure if this is enough; I could make a PR for those two bits, but you will probably want to extend it. I also don't think the shelve solution is the most elegant, but it solved my urgent need in the quickest way.
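For reference, a minimal sketch of the shelve idea using only the standard library. The key/value layout here is illustrative, not the actual change from the PRs:

```python
import os
import shelve
import tempfile

# A shelve behaves like a dict but persists values to disk, so large
# per-image results (e.g. segmentation masks) don't accumulate in RAM.
store_path = os.path.join(tempfile.mkdtemp(), "detections_map")

with shelve.open(store_path) as detections_map:
    # Values must be picklable; a nested list stands in for a mask here.
    detections_map["image_001.jpg"] = [[0, 0, 10, 10]]

# Reopening shows the data lives on disk rather than in process memory.
with shelve.open(store_path) as detections_map:
    boxes = detections_map["image_001.jpg"]
```

Note that shelve serializes the value on every write and deserializes on every read, so it trades CPU time for memory, which matches the "quick but not most elegant" framing above.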

hardikdava commented 1 year ago

@tfriedel Feel free to open a PR. Please read the contribution guide before making a PR.

tfriedel commented 1 year ago

I added two PRs: https://github.com/roboflow/supervision/pull/353 https://github.com/autodistill/autodistill/pull/48

Please feel free to make further changes to those.