openvinotoolkit / datumaro

Dataset Management Framework, a Python library and a CLI tool to build, analyze and manage Computer Vision datasets.
https://openvinotoolkit.github.io/datumaro/
MIT License

Sample from a specific label #522

Open brunovollmer opened 3 years ago

brunovollmer commented 3 years ago

Hey everybody,

I was wondering if there is a way to sample from a specific label. My situation is that I have a dataset where one class is heavily overrepresented, and I was wondering if there is an option to sample only from this label and keep everything else the same.

Thanks in advance,

Bruno

zhiltsov-max commented 3 years ago

Hi, could you please describe more precisely the operations you're trying to do and the expected results? What exactly do you mean by "sample"? I suggest looking into filtering with a custom filter expression (label == class, label != class), the NDR transform, or splitting with the task-specific splitters.

brunovollmer commented 3 years ago

Hey @zhiltsov-max

so the distribution of my dataset is the following:

Label distribution:
* bench: 418366 -- 2.5%
* bicycle: 215504 -- 1.3%
* bus: 144856 -- 0.9%
* car: 1478262 -- 8.9%
* chair: 2030188 -- 12.2%
* dog: 74188 -- 0.4%
* laptop: 92986 -- 0.6%
* person: 12008280 -- 72.4%
* phone: 121542 -- 0.7%

As you can see, the person class is heavily overrepresented. What I would like is an operation where datumaro randomly picks a certain percentage of annotations/images from a class (in my case, person) and removes the rest. The result should then contain the same number of bboxes for every other class and the reduced number for the picked class.
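To make the request concrete, here is a minimal sketch of the operation in plain Python, without datumaro. The annotation dicts, the function name downsample_label and the keep_fraction parameter are all made up for illustration; this only shows the intended behavior (randomly thin out one class, leave the others untouched), not an existing datumaro feature.

```python
import random


def downsample_label(annotations, label, keep_fraction, seed=0):
    """Randomly keep `keep_fraction` of the annotations with `label`.

    Annotations with any other label are always kept.
    """
    rng = random.Random(seed)
    # Positions of all annotations that belong to the class being reduced
    target_idx = [i for i, ann in enumerate(annotations) if ann["label"] == label]
    n_keep = round(len(target_idx) * keep_fraction)
    kept = set(rng.sample(target_idx, n_keep))
    return [
        ann
        for i, ann in enumerate(annotations)
        if ann["label"] != label or i in kept
    ]


# Toy data: 100 'person' boxes and 10 'car' boxes
annotations = [{"image": f"img{i % 5}", "label": "person"} for i in range(100)]
annotations += [{"image": f"img{i % 5}", "label": "car"} for i in range(10)]

sampled = downsample_label(annotations, "person", keep_fraction=0.1)
```

With this toy data, the result keeps 10 of the 100 person boxes and all 10 car boxes.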

zhiltsov-max commented 3 years ago

Probably, there is no ready-to-use solution for this now. AFAIK, datasets are typically formed to have nearly equal class distribution, or they need to be split into subsets with the initial distribution preserved. The latter case is already covered by the task-oriented splitters. Probably you could use a simple script to get what you want:

from datumaro.components.dataset import Dataset
from datumaro.components.extractor import Transform, AnnotationType

class Sampler(Transform):
    def __iter__(self):
        person_anns = 0
        required_quantity = 10000
        person_label_idx = self._extractor.categories()[AnnotationType.label].find('person')[0]

        for item in self._extractor:
            new_anns = []
            for ann in item.annotations:
                if hasattr(ann, 'label') and ann.label == person_label_idx:
                    if person_anns >= required_quantity:
                        continue
                    else:
                        person_anns += 1

                new_anns.append(ann)
            if new_anns:
                yield item.wrap(annotations=new_anns)

dataset = Dataset.import_from('path/', 'format_name')
dataset.transform(Sampler)
dataset.export('new_path/', 'format', save_images=True)

brunovollmer commented 3 years ago

Thanks for the extensive reply. I'm trying to integrate your code but I'm having some problems with the python version of datumaro. Due to some previous operations I have a datumaro project that has the data (annotations + images) I want. Unfortunately I can't load it as a dataset (Error: no data of format coco at this path) and when I use a datumaro Project I can load it Project.load(path) but then I can't convert the project to a dataset as the project does not have a variable called working_tree.

I guess I'm doing something wrong. Any suggestions?

brunovollmer commented 3 years ago

And my second question was if there is an easy way to pass variables to the Sampler class from your example?

zhiltsov-max commented 3 years ago

> Thanks for the extensive reply. I'm trying to integrate your code but I'm having some problems with the python version of datumaro. Due to some previous operations I have a datumaro project that has the data (annotations + images) I want. Unfortunately I can't load it as a dataset (Error: no data of format coco at this path) and when I use a datumaro Project I can load it Project.load(path) but then I can't convert the project to a dataset as the project does not have a variable called working_tree.

I suppose you're using an outdated version of the library. The latest is 0.2; which version are you on? You can check with datum --version. In previous versions it was:

from datumaro.components.project import Project

project = Project.load(path)
dataset = project.make_dataset()
...
dataset = Dataset.from_extractors(dataset.transform(...))

> And my second question was if there is an easy way to pass variables to the Sampler class from your example?

class Sampler(Transform):
    def __init__(self, extractor, option1='foo', option2=42):
        super().__init__(extractor)
        self._option1 = option1
        self._option2 = option2

...

dataset.transform(Sampler, option1='bar', option2=36)

brunovollmer commented 3 years ago

class Sampler(Transform):
    def __init__(self, extractor, label=None, number=None):
        super().__init__(extractor)
        self._label = label
        self._number = number

    def __iter__(self):
        anns = 0
        label_idx = self._extractor.categories()[AnnotationType.label].find(self._label)[0]

        items = random.sample(list(self._extractor), len(list(self._extractor)))

        for item in items:
            new_anns = []
            for ann in item.annotations:
                if hasattr(ann, 'label') and ann.label == label_idx:
                    if anns >= self._number:
                        continue
                    else:
                        anns += 1

                new_anns.append(ann)
            if new_anns:
                yield item.wrap(annotations=new_anns)

def main(args):
    project = Project.load(args.input)
    dataset = project.make_dataset()
    dataset = Dataset.from_extractors(dataset.transform(Sampler, label=args.label, number=args.number))
    dataset.export(args.output, 'yolo', save_images=False)

if __name__ == '__main__':
    main(parse_args())

This is the current version of my Sampler. When I run it I receive this error:

Traceback (most recent call last):
  File "datumaro_sampler.py", line 59, in <module>
    main(parse_args())
  File "datumaro_sampler.py", line 56, in main
    dataset.export(args.output, 'yolo', save_images=False)
  File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/datumaro/util/__init__.py", line 203, in wrapped_func
    func(*args, **kwargs)
  File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/datumaro/components/dataset.py", line 774, in export
    converter.convert(self, save_dir=save_dir, **kwargs)
  File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/datumaro/components/converter.py", line 33, in convert
    return converter.apply()
  File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/datumaro/plugins/yolo_format/converter.py", line 92, in apply
    with open(annotation_path, 'w', encoding='utf-8') as f:
OSError: [Errno 9] Bad file descriptor: '/home/azureuser/cloudfiles/code/Users/mael/datasets/Objects365/objects_365_soundmap/objects_365_soundmap-map_subsets/sampled/obj_valid_data/images/train2017/objects365_v2_01826801.txt'

zhiltsov-max commented 3 years ago

Hi, the script looks correct. From the error message I can see that you're probably using a mounted directory to work with Azure cloud, is that correct? I think the error may be related to this; maybe there were too many I/O requests or something similar. Can you share some details about the drive mounting options (without personal data, of course)? We haven't tested such a scenario yet, so I suggest exporting to a local filesystem and then copying the result to the cloud manually.
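A minimal sketch of that workaround, assuming the mounted directory copes better with one bulk copy than with many small writes (which has not been verified here). The helper export_via_local_dir is hypothetical and not part of datumaro; it runs any export callable against a local temporary directory and copies the result afterwards. It needs Python 3.8+ for dirs_exist_ok.

```python
import shutil
import tempfile
from pathlib import Path


def export_via_local_dir(export_fn, final_dir):
    """Run export_fn(local_dir) on local disk, then copy the result to final_dir."""
    final_dir = Path(final_dir)
    with tempfile.TemporaryDirectory() as tmp:
        local_dir = Path(tmp) / "export"
        local_dir.mkdir()
        # The export writes its many small files to the local filesystem
        export_fn(str(local_dir))
        # One bulk copy to the (possibly mounted) destination
        shutil.copytree(local_dir, final_dir, dirs_exist_ok=True)
    return final_dir


# Possible usage with the script above (not executed here):
# export_via_local_dir(
#     lambda d: dataset.export(d, 'yolo', save_images=False),
#     args.output,
# )
```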

zhiltsov-max commented 2 years ago

Hi, please check if #640 is useful for you.

tdhooghe commented 2 years ago

Hi, just wanted to shed some light on the code above. I tried this code and noticed that it stops keeping labels of a given class once a certain threshold is reached. I don't think this leads to the desired behavior, as it removes labels from images that contain other labels as well. Hence, we are left with images that are unlabeled w.r.t. the given class even though the label should be there. Therefore, the model will be trained on images with missing labels and might be unfairly penalized by the loss function.

Do you agree, or am I missing something here?

zhiltsov-max commented 2 years ago

Hi, yes, that is true. The solution can produce under-annotated images when there is more than one annotation per image. In the referenced PR (https://github.com/openvinotoolkit/datumaro/pull/640), we took a different approach, which works at the image level and doesn't have this problem. Depending on the data available, it may produce a different distribution of annotations than requested, but the resulting images will contain all their annotations.
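To illustrate the difference, here is a rough plain-Python sketch of image-level sampling. It is not the implementation from the PR; the item layout and the function name sample_images_by_label are assumptions made for the example. Whole images are either kept or dropped, so a kept image never loses annotations, but the final count for the sampled class can overshoot the requested number.

```python
import random


def sample_images_by_label(items, label, max_count, seed=0):
    """Keep or drop whole images so that roughly max_count `label` annotations remain.

    Images without `label` are always kept. Images with `label` are visited in
    random order and kept until the quota is reached, then dropped entirely.
    """
    rng = random.Random(seed)
    order = list(range(len(items)))
    rng.shuffle(order)

    keep = set()
    seen = 0
    for i in order:
        n = items[i]["labels"].count(label)
        if n == 0:
            keep.add(i)
        elif seen < max_count:
            keep.add(i)
            seen += n  # may overshoot max_count on the last kept image
    # Preserve the original item order in the output
    return [items[i] for i in sorted(keep)]


items = [
    {"id": "a", "labels": ["person", "person", "car"]},
    {"id": "b", "labels": ["person", "dog"]},
    {"id": "c", "labels": ["chair"]},
    {"id": "d", "labels": ["person", "person", "person"]},
]
sampled_items = sample_images_by_label(items, "person", max_count=2)
```

Note the trade-off: dropping an image also removes its annotations of other classes (the car in image "a", for instance), which is one reason the resulting distribution can differ from the requested one.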

tdhooghe commented 2 years ago

Thank you very much for letting me know! Could you maybe provide an example of how I can use this class with my dataset with the API approach?

I am using the following lines, and they do not seem to work, as the number of 'person' labels stays the same:

seed = 1234
sampled_coco = coco_dataset.transform(LabelRandomSampler, label_counts={'person': 100000}, seed=seed)

Also, could you explain what the count argument is for?

zhiltsov-max commented 2 years ago

@tdhooghe, you can find the parameter descriptions here and API usage examples here.

Basically, the count parameter is applied to all classes, while counts for specific classes can be set explicitly with label_counts.
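One way to read this, sketched in plain Python: count acts as the default target for every label, and label_counts overrides it for the labels it names. The helper resolve_target_counts is hypothetical and only illustrates that reading; it is not datumaro code, and the real transform may treat edge cases differently.

```python
def resolve_target_counts(labels, count=0, label_counts=None):
    """Return the per-label target: label_counts[label] if given, else count."""
    label_counts = label_counts or {}
    return {label: label_counts.get(label, count) for label in labels}


targets = resolve_target_counts(
    ["person", "car", "chair"],
    count=500,
    label_counts={"person": 100000},
)
```

Here person gets its explicit target of 100000, while car and chair fall back to the shared count of 500.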