pythonlessons / mltu

Machine Learning Training Utilities (for TensorFlow and PyTorch)
MIT License

Augmentors are replacing original examples instead of adding more examples? #8

Closed seidnerj closed 1 year ago

seidnerj commented 1 year ago

I am looking at the process_data(self, batch_data) function in the DataProvider class. There you can see that, for each "batch data" item, i.e. a labeled example, all augmentors are applied in order and then all transformers are applied in order:

    # Then augment, transform and postprocess the batch data
    for objects in [self._augmentors, self._transformers]:
        for object in objects:
            data, annotation = object(data, annotation)

Isn't the purpose of augmentors to add more examples and thereby increase the training set? That is, for each example, shouldn't it add to the training set both the original example and an "augmented" variation, preferably multiple augmented versions per single example?

Am I misunderstanding?

It seems like process_data(self, batch_data) should look something like this:

def process_data(self, batch_data):
    """ Process data batch of data """
    if self._use_cache and batch_data[0] in self._cache:
        data, annotation = copy.deepcopy(self._cache[batch_data[0]])
    else:
        data, annotation = batch_data
        for preprocessor in self._data_preprocessors:
            data, annotation = preprocessor(data, annotation)

        if data is None or annotation is None:
            self.logger.warning("Data or annotation is None, marking for removal on epoch end.")
            self._on_epoch_end_remove.append(batch_data)
            return []  # return an empty list so __getitem__ can safely iterate over the result

        if self._use_cache and batch_data[0] not in self._cache:
            self._cache[batch_data[0]] = (copy.deepcopy(data), copy.deepcopy(annotation))

    # Then transform, augment and postprocess the batch data
    for transformer in self._transformers:
        data, annotation = transformer(data, annotation)

    augmented_data_list = []
    if len(self._augmentors) > 0:
        for _ in range(self._variation_count):  # generate multiple variations using the specified augmentors
            augmented_data, augmented_annotation = data, annotation  # start each variation from the untouched sample
            for augmentor in self._augmentors:
                augmented_data, augmented_annotation = augmentor(augmented_data, augmented_annotation)

            augmented_data_list.append((augmented_data, augmented_annotation))

    all_data_list = []
    for data, annotation in [(data, annotation)] + augmented_data_list:

        # Convert to numpy array if not already
        if not isinstance(data, np.ndarray):
            data = data.numpy()

        # Convert annotation to numpy array if not already
        # TODO: This is a hack, need to fix this
        if not isinstance(annotation, (np.ndarray, int, float, str, np.uint8)):
            annotation = annotation.numpy()

        all_data_list.append((data, annotation))

    return all_data_list

With __getitem__(self, index: int) looking something like this:

def __getitem__(self, index: int):
    """ Returns a batch of data by batch index"""
    dataset_batch = self.get_batch_annotations(index)

    # First read and preprocess the batch data
    batch_data, batch_annotations = [], []
    for batch in dataset_batch:
        for data, annotation in self.process_data(batch):
            if data is None or annotation is None:
                self.logger.warning("Data or annotation is None, skipping.")
                continue

            batch_data.append(data)
            batch_annotations.append(annotation)

    return np.array(batch_data), np.array(batch_annotations)
pythonlessons commented 1 year ago

Hey, thanks for the question, but the idea is different from what you think. When we are training our models, we don't want to change the number of data samples in our dataset. We do want to return both original and modified examples, but we do this randomly each training epoch. This is why the augmentors augment our data randomly: we choose the randomness coefficient, i.e. how often to return a modified example and how often the original one.
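
In pseudocode, the idea looks roughly like this (a minimal illustrative sketch; the class name, the random_chance parameter, and the brightness logic are placeholders, not necessarily the library's exact implementation):

    import random
    import numpy as np

    class RandomBrightnessSketch:
        """ Illustrative augmentor: modifies a sample only `random_chance` of the time """
        def __init__(self, random_chance: float = 0.5, delta: int = 30):
            self._random_chance = random_chance  # probability of returning a modified sample
            self._delta = delta                  # maximum brightness shift

        def __call__(self, data: np.ndarray, annotation):
            # Most of the time (1 - random_chance) the original sample passes through untouched
            if random.random() > self._random_chance:
                return data, annotation

            shift = random.randint(-self._delta, self._delta)
            augmented = np.clip(data.astype(np.int16) + shift, 0, 255).astype(np.uint8)
            return augmented, annotation

Over many epochs each image is therefore seen both in its original form and in randomly modified forms, while the dataset size stays constant.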

seidnerj commented 1 year ago

I get what is going on in the code, but why would we want to randomly change our original examples without even using the original examples? As far as I understand, the purpose of augmentation is to "artificially increase the training set by creating modified copies of a dataset using existing data" in order to:

  1. To prevent models from overfitting.
  2. To compensate for an initial training set that is too small.
  3. To improve the model accuracy.
  4. To reduce the operational cost of labeling and cleaning the raw dataset.

(Source for the above: https://www.datacamp.com/tutorial/complete-guide-data-augmentation)

Thoughts?

pythonlessons commented 1 year ago

Who said we are not using the original examples? If we have 1000 images and the augmentor has a 50% chance, then roughly 500 images will be original and 500 modified in each epoch.
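
As a quick sanity check of that split (a hypothetical simulation, assuming a single augmentor applied independently with probability 0.5):

    import random

    random.seed(0)
    epoch = ["augmented" if random.random() < 0.5 else "original" for _ in range(1000)]
    print(epoch.count("original"), epoch.count("augmented"))  # roughly 500 of each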

seidnerj commented 1 year ago

Yes, you're correct, but why not use 100% of the original examples (that are in the training set, of course) and then "augment" that data set with additional examples? I gather this is the purpose of augmentation?

pythonlessons commented 1 year ago

You are not training the model for just 1 epoch; you will probably train it for at least 50 epochs. Because we augment randomly picked images, the original photos will still be used about 50% of the time across all these epochs. So what is the problem? Want to use fewer augmented photos? Then set the augmentor's random chance to 30%, and so on.
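
To put a number on the "across all these epochs" point (a back-of-the-envelope calculation, assuming a single augmentor applied independently each epoch with a 50% chance): the probability that a given image is never presented in its original form shrinks geometrically with the number of epochs.

    chance = 0.5  # probability that a sample is augmented in any given epoch
    for epochs in (1, 5, 10, 50):
        never_original = chance ** epochs
        print(f"{epochs:>2} epochs: P(never seen un-augmented) = {never_original:.2e}")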

Also, if, for example, you are using 10 different augmentors, then from 10k images you would get 100k images, and what happens if you store all of them in RAM? You may run out of RAM. So there is no reason to hold the original images plus the augmented ones in one place (in a single list, for example), because you are still going to use a batch size that fits your model. My solution is efficient, simple, and expandable. I hope you understand :)
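
For a rough sense of scale behind the RAM argument (the image resolution here is a hypothetical example, purely for illustration):

    images = 10_000
    augmented_copies = 10                     # one variant per augmentor, as in the comment above
    total_images = images * augmented_copies  # the "100k images" mentioned above
    bytes_per_image = 640 * 480 * 3           # hypothetical 640x480 RGB image stored as uint8
    total_gib = total_images * bytes_per_image / 2**30
    print(f"~{total_gib:.0f} GiB of raw pixels")  # roughly 86 GiB before the originals are even counted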

seidnerj commented 1 year ago

Yes, thanks a lot for the explanation! Closing this.