**brunovollmer** opened this issue 3 years ago
Hi, could you please describe more precisely the operations you're trying to do and the expected results? What do you mean by "sample" exactly? I can suggest looking into filtering with a custom filter expression (`label == class`, `label != class`), the NDR transform, or splitting with the task-specific splitters.
Hey @zhiltsov-max
so the distribution of my dataset is the following:
Label distribution:
* bench: 418366 -- 2.5%
* bicycle: 215504 -- 1.3%
* bus: 144856 -- 0.9%
* car: 1478262 -- 8.9%
* chair: 2030188 -- 12.2%
* dog: 74188 -- 0.4%
* laptop: 92986 -- 0.6%
* person: 12008280 -- 72.4%
* phone: 121542 -- 0.7%
As you can see, the `person` class is heavily overrepresented. What I would like to do is to have an operation where Datumaro randomly picks a certain percentage of annotations/images from a class (in my case, `person`) and removes the rest. The result should then contain the same amount of bboxes for every other class and the reduced amount for the picked class.
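Independent of Datumaro, the idea can be sketched in plain Python (the annotation dicts and the `cap_label` helper here are illustrative, not part of any library): randomly keep at most a fixed number of annotations for the overrepresented label, and leave every other label untouched.

```python
import random
from collections import Counter

def cap_label(annotations, label, limit, seed=0):
    """Randomly keep at most `limit` annotations of `label`;
    annotations of all other labels are kept unchanged."""
    rng = random.Random(seed)
    target = [a for a in annotations if a["label"] == label]
    others = [a for a in annotations if a["label"] != label]
    kept = rng.sample(target, min(limit, len(target)))
    return others + kept

anns = [{"label": "person"}] * 10 + [{"label": "dog"}] * 3
capped = cap_label(anns, "person", 4)
counts = Counter(a["label"] for a in capped)
# counts has 4 'person' and 3 'dog' entries
```

Note that this caps by an absolute count; a percentage-based cap would compute `limit` from `len(target)` first.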
Probably, there is no ready-to-use solution for this now. AFAIK, datasets are typically formed to have nearly equal class distribution, or they need to be split into subsets with the initial distribution preserved. The latter case is already covered by the task-oriented splitters. Probably you could use a simple script to get what you want:
```python
from datumaro.components.dataset import Dataset
from datumaro.components.extractor import Transform, AnnotationType

class Sampler(Transform):
    def __iter__(self):
        person_anns = 0
        required_quantity = 10000
        person_label_idx = self._extractor.categories()[AnnotationType.label].find('person')[0]
        for item in self._extractor:
            new_anns = []
            for ann in item.annotations:
                if hasattr(ann, 'label') and ann.label == person_label_idx:
                    if person_anns >= required_quantity:
                        continue
                    else:
                        person_anns += 1
                new_anns.append(ann)
            if new_anns:
                yield item.wrap(annotations=new_anns)

dataset = Dataset.import_from('path/', 'format_name')
dataset.transform(Sampler)
dataset.export('new_path/', 'format', save_images=True)
```
Thanks for the extensive reply. I'm trying to integrate your code, but I'm having some problems with the Python version of Datumaro. Due to some previous operations, I have a Datumaro project that has the data (annotations + images) I want. Unfortunately, I can't load it as a dataset (error: no data of format coco at this path), and when I use a Datumaro `Project`, I can load it with `Project.load(path)`, but then I can't convert the project to a dataset, as the project does not have a variable called `working_tree`.

I guess I'm doing something wrong. Any suggestions?

And my second question was: is there an easy way to pass variables to the `Sampler` class from your example?
> Thanks for the extensive reply. I'm trying to integrate your code but I'm having some problems with the python version of datumaro. Due to some previous operations I have a datumaro project that has the data (annotations + images) I want. Unfortunately I can't load it as a dataset (Error: no data of format coco at this path) and when I use a datumaro `Project` I can load it with `Project.load(path)` but then I can't convert the project to a dataset as the project does not have a variable called `working_tree`.
I suppose you're using an outdated version of the library. The latest is 0.2; which version are you on? You can check with `datum --version`. In the previous versions it was:
```python
from datumaro.components.project import Project

project = Project.load(path)
dataset = project.make_dataset()
...
dataset = Dataset.from_extractors(dataset.transform(...))
```
> And my second question was if there is an easy way to pass variables to the `Sampler` class from your example?
```python
class Sampler(Transform):
    def __init__(self, extractor, option1='foo', option2=42):
        super().__init__(extractor)
        self._option1 = option1
        self._option2 = option2
    ...

dataset.transform(Sampler, option1='bar', option2=36)
```
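The pattern above is plain Python keyword-argument forwarding; a minimal standalone sketch (the `Pipeline` class here is a hypothetical stand-in for the dataset, not a Datumaro API) shows how the extra arguments reach the transform's `__init__`:

```python
class Pipeline:
    """Hypothetical stand-in for a dataset that forwards kwargs to a transform."""
    def transform(self, transform_cls, **kwargs):
        # any extra keyword arguments are passed straight to transform_cls.__init__
        return transform_cls(self, **kwargs)

class Sampler:
    def __init__(self, extractor, option1='foo', option2=42):
        self._extractor = extractor
        self._option1 = option1
        self._option2 = option2

sampler = Pipeline().transform(Sampler, option1='bar', option2=36)
```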
```python
import random

from datumaro.components.dataset import Dataset
from datumaro.components.extractor import Transform, AnnotationType
from datumaro.components.project import Project

class Sampler(Transform):
    def __init__(self, extractor, label=None, number=None):
        super().__init__(extractor)
        self._label = label
        self._number = number

    def __iter__(self):
        anns = 0
        label_idx = self._extractor.categories()[AnnotationType.label].find(self._label)[0]
        # shuffle the items so the kept annotations are randomly distributed
        items = random.sample(list(self._extractor), len(list(self._extractor)))
        for item in items:
            new_anns = []
            for ann in item.annotations:
                if hasattr(ann, 'label') and ann.label == label_idx:
                    if anns >= self._number:
                        continue
                    else:
                        anns += 1
                new_anns.append(ann)
            if new_anns:
                yield item.wrap(annotations=new_anns)

def main(args):
    project = Project.load(args.input)
    dataset = project.make_dataset()
    dataset = Dataset.from_extractors(dataset.transform(Sampler, label=args.label, number=args.number))
    dataset.export(args.output, 'yolo', save_images=False)

if __name__ == '__main__':
    main(parse_args())
```
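`parse_args()` is referenced in the script but not shown in the thread; a hypothetical `argparse` sketch consistent with the attributes `main()` reads (`input`, `output`, `label`, `number`) could look like:

```python
import argparse

def parse_args(argv=None):
    # hypothetical sketch: option names match the attributes used in main()
    parser = argparse.ArgumentParser(description='Subsample one label of a Datumaro project')
    parser.add_argument('input', help='path to the Datumaro project')
    parser.add_argument('output', help='directory for the exported dataset')
    parser.add_argument('--label', required=True, help="label name to subsample, e.g. 'person'")
    parser.add_argument('--number', type=int, required=True,
                        help='maximum number of annotations to keep for that label')
    return parser.parse_args(argv)
```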
This is the current version of my `Sampler`. When I run it, I receive this error:
```
Traceback (most recent call last):
  File "datumaro_sampler.py", line 59, in <module>
    main(parse_args())
  File "datumaro_sampler.py", line 56, in main
    dataset.export(args.output, 'yolo', save_images=False)
  File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/datumaro/util/__init__.py", line 203, in wrapped_func
    func(*args, **kwargs)
  File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/datumaro/components/dataset.py", line 774, in export
    converter.convert(self, save_dir=save_dir, **kwargs)
  File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/datumaro/components/converter.py", line 33, in convert
    return converter.apply()
  File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/datumaro/plugins/yolo_format/converter.py", line 92, in apply
    with open(annotation_path, 'w', encoding='utf-8') as f:
OSError: [Errno 9] Bad file descriptor: '/home/azureuser/cloudfiles/code/Users/mael/datasets/Objects365/objects_365_soundmap/objects_365_soundmap-map_subsets/sampled/obj_valid_data/images/train2017/objects365_v2_01826801.txt'
```
Hi, the script looks correct. From the error message, I can see that you're probably using a mounted directory to work with Azure cloud, is that correct? I think the error can be related to this; maybe there were too many I/O requests or something similar. Can you share some details about the drive mounting options (without personal data, of course)? We haven't tested such a scenario yet, so I suggest trying to export to a local filesystem and then copying manually to the cloud.
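The suggested workaround (export locally, then copy to the mount) can be sketched with the standard library; the `dataset.export` call from this thread is left commented out, and the file written below is just a stand-in for the exported data so the copy step can be demonstrated:

```python
import shutil
import tempfile
from pathlib import Path

# export to a fast local directory first...
local_dir = Path(tempfile.mkdtemp(prefix='datum_export_'))
# dataset.export(str(local_dir), 'yolo', save_images=True)  # as in the thread
(local_dir / 'obj.names').write_text('person\n')            # stand-in for the export

# ...then copy the finished result to the mounted cloud path in one pass
mounted_dir = Path(tempfile.mkdtemp(prefix='mounted_'))     # stand-in for the cloud mount
shutil.copytree(local_dir, mounted_dir / 'export')
```

Copying the finished tree in one pass avoids issuing many small writes directly against the mount.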
Hi, please check if #640 is useful for you.
Hi, just wanted to share my thoughts on the code above. I tried this code and noticed that it stops keeping labels of a given class once a certain threshold is reached. I don't think this leads to the wanted behavior, as it removes labels from images that contain other labels as well. Hence, we are left with images that are unlabeled w.r.t. a given class while the label should be there. Therefore, the model will be trained on images with missing labels and might be unfairly penalized by the loss function.

Do you agree, or am I missing something here?
Hi, yes, it is true. The solution can produce under-annotated images when there is more than one annotation per image. In the referenced PR (https://github.com/openvinotoolkit/datumaro/pull/640), we took a different approach, which works on the image level and doesn't have this problem. It may produce a different distribution of annotations than requested, depending on the data available, but the resulting images will contain all of their annotations.
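The image-level idea can be sketched without Datumaro (the item dicts and the `sample_images_by_label` helper are illustrative): pick whole images until roughly the requested number of annotations of the target label is collected, so every kept image keeps all of its annotations. As noted above, the final count may overshoot the request, since the last picked image contributes all of its annotations.

```python
import random

def sample_images_by_label(items, label, target_count, seed=0):
    """Keep whole images: images without `label` are kept as-is; images with
    `label` are shuffled and taken until ~target_count annotations of `label`
    are collected. No kept image loses any of its annotations."""
    rng = random.Random(seed)
    with_label = [i for i in items if label in i['labels']]
    without = [i for i in items if label not in i['labels']]
    rng.shuffle(with_label)
    picked, count = [], 0
    for item in with_label:
        if count >= target_count:
            break
        picked.append(item)
        count += item['labels'].count(label)
    return without + picked

items = [
    {'labels': ['person', 'dog']},
    {'labels': ['person', 'person']},
    {'labels': ['cat']},
]
sampled = sample_images_by_label(items, 'person', 1)
```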
Thank you very much for letting me know! Could you maybe provide an example of how I can use this class with my dataset via the API?
I am using the following line, and it does not seem to work, as the number of 'person' labels stays the same:

```python
seed = 1234
sampled_coco = coco_dataset.transform(LabelRandomSampler, label_counts={'person': 100000}, seed=seed)
```

Also, could you explain what the `count` argument is for?
Hey everybody,
I was wondering if there is a way to sample from a specific label. My situation is that I have a dataset where one class is heavily overrepresented, and I was wondering if there is an option to just sample from this label and keep everything else the same.
Thanks in advance,
Bruno