hardikdava opened this issue 1 year ago
Hi @hardikdava 👋🏻!
Here is my idea: let's create a set of separate sv.DetectionDataset.generate_from_* methods. Unlike sv.DetectionDataset.from_*, they would return a Python generator. What do you think?
sv.DetectionDataset.generate_from_yolo(
    images_directory_path: str,
    annotations_directory_path: str
) -> Generator[Tuple[str, np.ndarray, sv.Detections], None, None]:
    pass

for path, image, detections in sv.DetectionDataset.generate_from_yolo(...):
    pass
- sv.DetectionDataset.generate_from_yolo(...)
- sv.DetectionDataset.generate_from_coco(...)
- sv.DetectionDataset.generate_from_pascal_voc(...)
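For concreteness, here is a minimal sketch of what such a generator could look like, assuming YOLO-format box labels (.txt files named after the images); generate_from_yolo is written as a free function purely for illustration, not as the final classmethod:

import os
from typing import Generator, Tuple

import cv2
import numpy as np

import supervision as sv


def generate_from_yolo(
    images_directory_path: str,
    annotations_directory_path: str,
) -> Generator[Tuple[str, np.ndarray, sv.Detections], None, None]:
    # Only one image is decoded at a time; nothing is held across iterations.
    for file_name in sorted(os.listdir(images_directory_path)):
        image_path = os.path.join(images_directory_path, file_name)
        image = cv2.imread(image_path)
        if image is None:  # skip non-image files
            continue
        h, w = image.shape[:2]

        # YOLO box labels: one .txt per image, lines of "class_id cx cy bw bh",
        # with coordinates normalized to [0, 1].
        label_path = os.path.join(
            annotations_directory_path, os.path.splitext(file_name)[0] + ".txt"
        )
        xyxy, class_ids = [], []
        if os.path.exists(label_path):
            with open(label_path) as f:
                for line in f:
                    class_id, cx, cy, bw, bh = map(float, line.split())
                    xyxy.append([
                        (cx - bw / 2) * w,
                        (cy - bh / 2) * h,
                        (cx + bw / 2) * w,
                        (cy + bh / 2) * h,
                    ])
                    class_ids.append(int(class_id))

        detections = sv.Detections(
            xyxy=np.array(xyxy, dtype=np.float32).reshape(-1, 4),
            class_id=np.array(class_ids, dtype=int),
        )
        yield image_path, image, detections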
@SkalskiP Is there any way that we could just modify the current APIs? Otherwise users will be confused between the sv.DetectionDataset.from_* and sv.DetectionDataset.generate_from_* methods.
@SkalskiP Is it possible to use a callback system for loading images? Then we would not have to worry about so many things.
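For concreteness, a hypothetical sketch of what a callback-based store could look like (CallbackImageStore and its load_image parameter are illustrative names, not an existing supervision API):

from typing import Callable, Dict

import cv2
import numpy as np


class CallbackImageStore:
    # Hypothetical: store only paths and defer pixel loading to a callback.
    def __init__(
        self,
        image_paths: Dict[str, str],
        load_image: Callable[[str], np.ndarray] = cv2.imread,
    ):
        self._paths = image_paths
        self._load_image = load_image

    def __getitem__(self, name: str) -> np.ndarray:
        # The callback decides how the image is read (cv2, PIL, remote, ...).
        return self._load_image(self._paths[name])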
@hardikdava didn't you tell me a few weeks ago that callbacks make everything more complicated?
I also ran into this issue: https://github.com/autodistill/autodistill/issues/45
In this case the problem was with the ClassificationDataset. I would suggest just keeping track of image paths instead of images, and loading them whenever an image is accessed. A relatively easy way to implement this would be to replace the "images" dict that maps from str to ndarray with a kind of lazy-loading dict, where the setter just sets file names but the getter loads the image. I'm not sure where these classes are used and whether performance is critical, e.g. during training of an image classification model. I'm assuming that's not the case, but if it were, I'd probably resort to more efficient solutions like PyTorch datasets + dataloaders.
Example:
from collections.abc import MutableMapping
from typing import Dict, Iterator

import cv2
import numpy as np


class LazyLoadDict(MutableMapping):
    """Dict-like mapping that stores image paths and loads pixels on access."""

    def __init__(self, initial_data: Dict[str, str]):
        # Maps image name -> image file path; no pixel data is held in memory.
        self._data = initial_data

    def __getitem__(self, key: str) -> np.ndarray:
        # The image is read from disk on every access.
        return cv2.imread(self._data[key])

    def __setitem__(self, key: str, value: str) -> None:
        # Note: the value being set is a file path, not an image array.
        self._data[key] = value

    def __delitem__(self, key: str) -> None:
        del self._data[key]

    def __iter__(self) -> Iterator[str]:
        return iter(self._data)

    def __len__(self) -> int:
        return len(self._data)
from __future__ import annotations

from dataclasses import dataclass
from typing import Dict, List, Tuple

# BaseDataset, Classifications and train_test_split are assumed to come from
# the surrounding supervision module, as in the original dataset code.


@dataclass
class ClassificationDataset(BaseDataset):
    classes: List[str]
    images: LazyLoadDict  # image name -> path; pixels are loaded on access
    annotations: Dict[str, Classifications]

    def __len__(self) -> int:
        return len(self.images)

    def split(
        self, split_ratio=0.8, random_state=None, shuffle: bool = True
    ) -> Tuple[ClassificationDataset, ClassificationDataset]:
        image_names = list(self.images.keys())
        train_names, test_names = train_test_split(
            data=image_names,
            train_ratio=split_ratio,
            random_state=random_state,
            shuffle=shuffle,
        )
        # Reuse the underlying path mapping so no images are loaded during the split.
        train_dataset = ClassificationDataset(
            classes=self.classes,
            images=LazyLoadDict({name: self.images._data[name] for name in train_names}),
            annotations={name: self.annotations[name] for name in train_names},
        )
        test_dataset = ClassificationDataset(
            classes=self.classes,
            images=LazyLoadDict({name: self.images._data[name] for name in test_names}),
            annotations={name: self.annotations[name] for name in test_names},
        )
        return train_dataset, test_dataset

    # ... (rest of the methods, adjusted to use LazyLoadDict when needed)
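For illustration, accessing an image through LazyLoadDict triggers the disk read only at that moment (the path below is hypothetical):

images = LazyLoadDict({"cat.jpg": "/data/images/cat.jpg"})  # stores only the path
image = images["cat.jpg"]  # cv2.imread runs here, on access

Note there is no caching, so each access re-reads the file; that trades some I/O for a memory footprint that no longer grows with dataset size.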
Thanks @tfriedel for the suggestions. We will take a look into it soon. This might be the solution to our current issue.
I implemented this to solve the issue of training on a 10,000+ image dataset on my machine. I did this for both ClassificationDataset and DetectionDataset. Additionally, I had to swap out the detections_map dict and replace it with a shelve (basically a dict that is stored on disk). The values were basically segmentation masks, and those also consumed too much memory. The modifications were made both to the supervision package and to the autodistill base models. I'm not sure if this is enough; I could make a PR for those two bits, but you will probably want to extend it. I also don't think the shelve solution is the most elegant, but it solved my urgent need in the quickest way.
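For reference, a minimal sketch of the shelve idea (the file path and key are hypothetical; this is not the exact code from the PRs):

import shelve

import numpy as np

# shelve behaves like a dict but pickles its values to a file on disk,
# so large payloads such as segmentation masks stay out of RAM.
with shelve.open("/tmp/detections_map") as detections_map:
    detections_map["image_0001.jpg"] = np.zeros((640, 640), dtype=bool)  # written to disk
    mask = detections_map["image_0001.jpg"]  # read back (unpickled) on access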
@tfriedel feel free to open a PR. Please review the contribution guide before you make one.
I added two PRs: https://github.com/roboflow/supervision/pull/353 https://github.com/autodistill/autodistill/pull/48
Please feel free to make further changes to those.
Search before asking
Bug
sv.DetectionDataset
is loading images unnecessarily. It would be preferable to load an image only when it is needed. This would make it possible to work with large datasets without keeping all images in memory.
Environment
No response
Minimal Reproducible Example
No response
Additional
No response
Are you willing to submit a PR?