[Feature Request] Multiplier-Based Image Selection Over Repeat Counts - Improved Subject Balancing

a-l-e-x-d-s-9 commented 12 months ago

I'm used to the EveryDream2 trainer, which lets users select a portion of the images in a folder using a multiplier instead of a repeat count. For instance, with a 0.5 multiplier on a folder, the trainer randomly selects half of its images for each epoch. The system isn't limited to fractions; users can set multipliers such as 1.6 or any other positive value. A negative multiplier, on the other hand, excludes the folder.

This multiplier approach retains the repeat functionality users are accustomed to while offering a significant improvement in balancing training across different subjects without having to delete any images.

To illustrate, consider three subjects: A with 1000 images, B with 200 images, and C with 100 images. If the goal is to train on 25 images per epoch, we can set multipliers of 0.025 for A, 0.125 for B, and 0.25 for C. This eliminates the need to delete images or train on all 1000 images from subject A. This method effectively balances subjects, especially when the number of images in their datasets varies. As the number and diversity of subjects increase, this approach provides better control over training, avoiding the need to discard images.
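The multipliers in this example can be derived mechanically. The following is only a minimal sketch of that arithmetic; the helper name `multipliers_for_target` is made up for illustration and is not part of EveryDream2 or sd-scripts.

```python
def multipliers_for_target(image_counts, target_per_epoch):
    """Return, for each folder, the multiplier that yields target_per_epoch images."""
    return {name: target_per_epoch / count for name, count in image_counts.items()}

# Folder sizes from the example above; 25 images per subject per epoch is the goal.
counts = {"A": 1000, "B": 200, "C": 100}
multipliers = multipliers_for_target(counts, 25)

for name, multiplier in multipliers.items():
    print(f"{name}: {counts[name]} images, multiplier {multiplier:g}")
```

Each subject then contributes the same number of images per epoch regardless of how large its folder is.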

I believe that adding this feature to Kohya is essential.

kohya-ss commented 11 months ago

Thanks for the suggestion! The feature is very interesting.

However, it will take a lot of work to implement. Although the selection is random, it should not be completely random, because some images might then go unselected for a long time during training. It would therefore require selecting images from each folder in a pre-shuffled order at the specified ratio, so that all images are eventually enumerated.
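One way to read "pre-shuffled order at the specified ratio" is sketched below. This is an assumption about what such a scheme could look like, not code from sd-scripts: the function name `rotating_selection` is hypothetical, epochs are assumed to be numbered from 0, and the folder is assumed to be non-empty.

```python
import random

def rotating_selection(files, multiplier, seed, epoch):
    """Pick round(len(files) * multiplier) images for `epoch` by walking a
    fixed, pre-shuffled order cyclically, so every image is visited in turn."""
    order = list(files)
    random.Random(seed).shuffle(order)        # identical order in every epoch
    per_epoch = round(len(order) * multiplier)
    start = (epoch * per_epoch) % len(order)  # resume where the previous epoch stopped
    return [order[(start + i) % len(order)] for i in range(per_epoch)]

files = ["A3", "B3", "C3", "D3", "E3", "F3"]
epoch_0 = rotating_selection(files, 0.5, seed=10, epoch=0)
epoch_1 = rotating_selection(files, 0.5, seed=10, epoch=1)
print(epoch_0, epoch_1)
```

With a multiplier of 0.5, two consecutive epochs together cover the whole folder exactly once, which is the full-rotation property described above.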

Also, since DataLoader runs in a multi-process fashion, we would have to synchronize the order in all processes.

Any PR would be appreciated😅

a-l-e-x-d-s-9 commented 11 months ago

@kohya-ss Thank you for your response. In ED2 the feature is pseudo-random and based on the seed used for training. I think that's good enough, and there is no need to force a full rotation of images or to sync between DataLoaders. I'll be happy to implement this feature myself, but I couldn't figure out how and when exactly the system handles repeats, which is what I would need to switch to a portion of the files in a folder. If you can point me to the specific file or functions, I would appreciate it.

kohya-ss commented 11 months ago

Thank you for your suggestion. I agree that full rotation and synchronization are not needed, especially when the dataset is small. However, if the dataset has 1,000 or more images and we train the model on it for several epochs, a certain number of images will never be used for training.

On the other hand, the repeat count method forces all images to be used in every epoch.

Therefore I think the repeat count method will be suitable to balance the different subjects.

a-l-e-x-d-s-9 commented 11 months ago

@kohya-ss There is a small misunderstanding that I want to clear up. The multiplier is an extension of the current repeat method; it doesn't change how the system handles full repeats, and users will be able to use it in the same way they do now. For example, with repeats of 2, the multiplier will be 2, and the system will add all images inside this folder for training exactly twice. There is no randomness in the integer part of the multiplier. The difference occurs only in the fractional part of the multiplier. So, for a multiplier of 2.5 - the integer 2 will add all images from the folder twice, and 0.5 will add only 50% additional images from the folder. The multiplier won't change system behavior or user experience for users who use repeats. But it will allow for much better subject balancing between multiple subjects in a way that's impossible now.

kohya-ss commented 11 months ago

Thanks for your comment. I may have misunderstood something.

Suppose the repeat count is 2. If there are 3 images (image A to C), the current method prepares 6 images, image-A (1), image-A (2), image-B (1), image-B (2), image-C (1), image-C (2), then shuffles them and trains.

If the multiplier is 2.5, the number of images per epoch is 6*2.5=15. If we use random.choice(list_of_6_images) to select a particular image, the selection will be biased.

Therefore, if shown in pseudo code, would it be implemented as follows?

if index < num_of_images_with_repeats * int_part_of_multiplier:
    image = list_of_6_images[index % num_of_images_with_repeats]
else:
    image = random.choice(list_of_6_images)

I think this will work fine if the multiplier > 1.0, but if the multiplier < 1.0, the full rotations issue still remains.

a-l-e-x-d-s-9 commented 11 months ago

@kohya-ss I added a few examples below with Python code. The fractional part is handled by a random selection of images every epoch; there is no need to enforce a deterministic rotation. I want to explain how I used this feature in ED2 with 25 subjects, where each subject had between 50 and 300 images. I calculated a multiplier for each folder (with a script, but it can be done manually) so that each subject uses only 25 images per epoch. That way no images are discarded, they can all participate in the training, and the balance between subjects is always kept. Unfortunately, that's not something that can be done with the current implementation of repeats in kohya-ss/sd-scripts.

Here are the prints from the code:

- Example of two consecutive epochs with the same folder, with multiplier 2.5, epochs 1, 2:
**************************************************
Multiplier: 2.5, sources: ['A1', 'B1', 'C1', 'D1']
Total images added due to integer part: 8
Additional images added due to fractional part: 2
Total images for training: 10
Images for training: ['A1', 'B1', 'C1', 'D1', 'A1', 'B1', 'C1', 'D1', 'D1', 'C1']
**************************************************
Multiplier: 2.5, sources: ['A1', 'B1', 'C1', 'D1']
Total images added due to integer part: 8
Additional images added due to fractional part: 2
Total images for training: 10
Images for training: ['A1', 'B1', 'C1', 'D1', 'A1', 'B1', 'C1', 'D1', 'D1', 'B1']
- Different folder, epoch 3:
**************************************************
Multiplier: 1.33, sources: ['A2', 'B2', 'C2', 'D2', 'E2', 'F2', 'G2', 'H2']
Total images added due to integer part: 8
Additional images added due to fractional part: 3
Total images for training: 11
Images for training: ['A2', 'B2', 'C2', 'D2', 'E2', 'F2', 'G2', 'H2', 'E2', 'C2', 'H2']
- Another folder with fractions only, epoch 5:
**************************************************
Multiplier: 0.5, sources: ['A3', 'B3', 'C3', 'D3', 'E3', 'F3']
Total images added due to integer part: 0
Additional images added due to fractional part: 3
Total images for training: 3
Images for training: ['B3', 'A3', 'E3']
- Another folder with fractions only, epoch 10:
**************************************************
Multiplier: 0.33, sources: ['A4', 'B4', 'C4', 'D4', 'E4', 'F4', 'G4', 'H4']
Total images added due to integer part: 0
Additional images added due to fractional part: 3
Total images for training: 3
Images for training: ['C4', 'H4', 'G4']

Code:

import random

def choose_images_for_training_for_folder(files_list, multiplier, main_seed, epochs):
    # Create an isolated instance of random
    local_random = IsolatedRandom(main_seed + epochs)

    images_for_training = []

    # Split the multiplier into integer and fractional parts
    int_multiplier = int(multiplier)
    frac_multiplier = multiplier - int_multiplier

    # Add all images for the integer part of the multiplier
    for _ in range(int_multiplier):
        images_for_training.extend(files_list)

    # Randomly select a subset of images for the fractional part of the multiplier
    num_to_select = int(len(files_list) * frac_multiplier)
    additional_files = local_random.sample(files_list, num_to_select)

    # Round up: add one extra, not-yet-selected image whenever the fractional share is not a whole number of images
    if len(files_list) * frac_multiplier > num_to_select:
        additional_file = local_random.choice([file for file in files_list if file not in additional_files])
        additional_files.append(additional_file)

    images_for_training.extend(additional_files)

    # Print the results
    print("*" * 50)
    print(f"Multiplier: {multiplier}, sources: {files_list}")
    print(f"Total images added due to integer part: {len(files_list) * int_multiplier}")
    print(f"Additional images added due to fractional part: {len(additional_files)}")
    print(f"Total images for training: {len(images_for_training)}")
    print("Images for training:", images_for_training)

    return images_for_training

class IsolatedRandom:
    def __init__(self, seed=None):
        self._random = random.Random(seed)

    def seed(self, seed=None):
        self._random.seed(seed)

    def sample(self, population, k):
        return self._random.sample(population, k)

    def random(self):
        return self._random.random()

    def choice(self, seq):
        return self._random.choice(seq)

main_seed = 10

# Test with given examples and fixed seed and epoch
print("- Example of two consecutive epochs with the same folder, with multiplier 2.5, epochs 1, 2:")
choose_images_for_training_for_folder(["A1", "B1", "C1", "D1"], 2.5, main_seed, 1)
choose_images_for_training_for_folder(["A1", "B1", "C1", "D1"], 2.5, main_seed, 2)

print("- Different folder, epoch 3:")
choose_images_for_training_for_folder(["A2", "B2", "C2", "D2", "E2", "F2", "G2", "H2"], 1.33, main_seed, 3)

print("- Another folder with fractions only, epoch 5:")
choose_images_for_training_for_folder(["A3", "B3", "C3", "D3", "E3", "F3"], 0.5, main_seed, 5)

print("- Another folder with fractions only, epoch 10:")
choose_images_for_training_for_folder(["A4", "B4", "C4", "D4", "E4", "F4", "G4", "H4"], 0.33, main_seed, 10)

kohya-ss commented 11 months ago

Thank you for the explanation. Maybe I am misunderstanding something, but it still remains unclear to me.

Multiplier: 0.5, sources: ['A3', 'B3', 'C3', 'D3', 'E3', 'F3']
Total images added due to integer part: 0
Additional images added due to fractional part: 3
Total images for training: 3
Images for training: ['B3', 'A3', 'E3']

Another folder with fractions only, epoch 10:

In this case, since all images are selected probabilistically, it could happen that one of the images is never trained.

Although the number of steps per epoch is higher, I think the same distribution could be achieved with the following settings.

repeat count 15, sources: ['A1', 'B1', 'C1', 'D1']
repeat count 8, sources: ['A2', 'B2', 'C2', 'D2', 'E2', 'F2', 'G2', 'H2']
repeat count 3, sources: ['A3', 'B3', 'C3', 'D3', 'E3', 'F3']
repeat count 2, sources: ['A4', 'B4', 'C4', 'D4', 'E4', 'F4', 'G4', 'H4']
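The comparison can be checked numerically. The sketch below only restates numbers already in this thread: the per-epoch image counts come from the printed multiplier examples (10, 11, 3, 3) and the repeat counts from the list above. The labels "folder 1" to "folder 4" are just names for the four example folders.

```python
folders = {
    # name: (images in folder, images per epoch with multiplier, proposed repeat count)
    "folder 1": (4, 10, 15),
    "folder 2": (8, 11, 8),
    "folder 3": (6, 3, 3),
    "folder 4": (8, 3, 2),
}

multiplier_counts = {name: per_epoch for name, (_, per_epoch, _) in folders.items()}
repeat_counts = {name: size * repeats for name, (size, _, repeats) in folders.items()}

def shares(counts):
    """Fraction of one epoch that each folder contributes."""
    total = sum(counts.values())
    return {name: value / total for name, value in counts.items()}

multiplier_shares = shares(multiplier_counts)
repeat_shares = shares(repeat_counts)
for name in folders:
    print(f"{name}: multiplier share {multiplier_shares[name]:.3f}, "
          f"repeat share {repeat_shares[name]:.3f}")
print("images per epoch:", sum(multiplier_counts.values()),
      "vs", sum(repeat_counts.values()))
```

The shares come out close but not identical, and an epoch grows from 27 images to 158, which matches the remark that the number of steps per epoch is higher.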

a-l-e-x-d-s-9 commented 11 months ago

@kohya-ss The chance that any particular image is never trained, when the multiplier is only a fraction, is (1-multiplier)^epochs. It can happen in certain cases, but it doesn't really matter. The important part is that balancing between subjects is enforced, all subjects utilize most of their images, and the user can easily achieve uniform training between subjects. Using the current repeats is not a very effective solution for subject balancing, for a couple of reasons:

The pace of the learning per subject isn't controlled with a huge pool of repeats of different subjects pooled together. The distribution of subjects can be very uneven and chronic episodes of overtraining for certain subjects, undertraining for others, forgetting of a subject, and other edge cases, can occur when using a huge pool of unbalanced subjects. With a multiplier, it's easy to use 25 images per subject per epoch, and get even training for all subjects.
I gave an example of a model I actually trained with 25 subjects (link to model), also I've trained another model with 50 subjects (link to model). In both of them, I used the multiplier to balance between subjects in a very precise way, which allowed me to use a subject with arbitrary amounts of images without the need to discard any image. In my second model, each subject had between 50 and 800 images, and with 50 subjects, I don't think there is an easy way to calculate the repeats per subject, and a user still needs to go over all subjects and remove images to make it work with integer repeats. Using a multiplier, the calculation is very easy: multiplier=desired_images_per_epoch/images_per_subject.
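The probability quoted at the start of this comment, (1-multiplier)^epochs, is easy to tabulate. The sketch below assumes, as that formula does, that each epoch independently keeps a given image with probability equal to the multiplier; the function name is made up for illustration.

```python
def prob_never_selected(multiplier, epochs):
    """Chance that one image is skipped in every epoch, assuming each epoch
    independently keeps it with probability `multiplier` (0 < multiplier < 1)."""
    return (1 - multiplier) ** epochs

for multiplier in (0.5, 0.25, 0.025):
    for epochs in (10, 50):
        p = prob_never_selected(multiplier, epochs)
        print(f"multiplier={multiplier}, epochs={epochs}: {p:.4f}")
```

With a multiplier of 0.5 the chance drops below 0.1% after 10 epochs, while with 0.025 (25 out of 1000 images) roughly three quarters of the images are still unseen after 10 epochs, so both points raised in this thread show up in the numbers.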

kohya-ss commented 11 months ago

Thank you for clarification. I think I understand the advantage of the multiplier method.

It is true that when using the repeat count, the bias of the data within an epoch may not be negligible. I also agree that it is difficult to balance when there are many subjects (even if that is rare).

One difficulty is aspect ratio bucketing.

Using the repeat count, the images within an epoch can be predefined, so each image is pre-sorted into one of the buckets. For each training step, a batch-size number of images is then drawn from a single bucket.

However, this is not possible with the multiplier, because the set of images changes every epoch, and images of different sizes cannot be assigned to the same batch.

Any ideas would be appreciated.

a-l-e-x-d-s-9 commented 11 months ago

@kohya-ss

Yes, with the introduction of random images per epoch, the bucketing per epoch will require changes. My idea of implementation consists of two stages:

The Initialization Stage: This is similar to the current bucketing. However, it only records to which bucket each image belongs and notes the multiplier of the image.
At the beginning of every epoch, images are added to buckets for training. Those with an integer multiplier part above 0 will always be added (according to number). The fractional part will be added based on a probabilistic decision.

To optimize the second step, we might use two lists: permanent and probabilistic. The permanent list is reserved for images whose multiplier has an integer part above 0. For example, images with a multiplier of 2.5 will always be trained twice per epoch, so we can keep them in the permanent list for each epoch's training. The probabilistic list will only contain images whose multiplier has a fractional part. So, if we only have images in the permanent list and none in the probabilistic list, the system will operate pretty much as it does now, with no runtime penalty for the user. However, if the user opts to use multipliers, the images in the probabilistic list will be added to the training set at the start of each epoch based on the fraction of the multiplier: a random value is drawn for each image, and if it is less than the multiplier's fraction, the image is added.
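The two stages and the two lists described above could be sketched roughly as follows. This is an illustration under assumptions, not sd-scripts or ED2 code: `build_lists` and `buckets_for_epoch` are hypothetical names, a bucket is represented by a plain string key, and the per-epoch generator is seeded with `seed + epoch` as in the earlier example code.

```python
import random
from collections import defaultdict

def build_lists(images):
    """Stage 1 (once): `images` is a list of (path, bucket_key, multiplier)."""
    permanent, probabilistic = [], []
    for path, bucket, multiplier in images:
        whole = int(multiplier)
        frac = multiplier - whole
        permanent.extend([(path, bucket)] * whole)  # always trained `whole` times
        if frac > 0:
            probabilistic.append((path, bucket, frac))
    return permanent, probabilistic

def buckets_for_epoch(permanent, probabilistic, seed, epoch):
    """Stage 2 (every epoch): refill the buckets from both lists."""
    rng = random.Random(seed + epoch)
    buckets = defaultdict(list)
    for path, bucket in permanent:
        buckets[bucket].append(path)
    for path, bucket, frac in probabilistic:
        if rng.random() < frac:  # keep with probability equal to the fraction
            buckets[bucket].append(path)
    return dict(buckets)

images = [("a.png", "512x512", 2.0), ("b.png", "512x768", 0.5), ("c.png", "512x512", 1.0)]
permanent, probabilistic = build_lists(images)
print(buckets_for_epoch(permanent, probabilistic, seed=10, epoch=1))
```

When every multiplier is an integer, the probabilistic list is empty and each epoch gets the same bucket contents, which mirrors the current repeat behaviour.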

As far as I understand, the ED2 implementation does something similar now, and people are using it with hundreds of thousands of images, so it's a valid solution. Regular users, with a few thousand images, definitely wouldn't notice any change in performance.

kohya-ss commented 11 months ago

Thanks for the suggestion. Sorry for the delay in responding to you.

I understand that the images would be re-sorted into the buckets in each epoch. I think the idea works.

One issue is that the DataLoader runs in a multi-process manner, and all processes must have the same buckets for each epoch, but this can probably be solved by sharing the random seed in advance, as is already implemented.
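As a rough illustration of why a shared seed is enough, the sketch below derives the per-epoch generator purely from the seed and the epoch number, so no state has to be exchanged between processes. The function names are hypothetical, and the two "workers" are simulated by two plain calls rather than real DataLoader processes.

```python
import random

def epoch_rng(shared_seed, epoch):
    # random.Random hashes a string seed deterministically, so every process
    # that knows (shared_seed, epoch) builds an identical generator.
    return random.Random(f"{shared_seed}:{epoch}")

def select_for_epoch(files, fraction, shared_seed, epoch):
    """Keep each file with probability `fraction`, reproducibly per epoch."""
    rng = epoch_rng(shared_seed, epoch)
    return [name for name in files if rng.random() < fraction]

files = [f"img_{i:03d}.png" for i in range(20)]
worker_0 = select_for_epoch(files, 0.5, shared_seed=10, epoch=3)
worker_1 = select_for_epoch(files, 0.5, shared_seed=10, epoch=3)
print(worker_0 == worker_1)
```

Combining the seed and epoch into a string, rather than adding them, is a design choice made here to keep (seed 10, epoch 2) from colliding with (seed 11, epoch 1).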

The real problem is that it would take a lot of effort to implement. However, I personally feel that other tasks have higher priority.

So, if I do implement the feature, it will be some time in the future. I appreciate your understanding.

If you are interested in creating a PR, you can find the bucketing code at: https://github.com/kohya-ss/sd-scripts/blob/2d87bb648f30adab00ceb38a0da786cd548d5ce7/library/train_util.py#L730

This process is called before the start of the training.

We also handle when the epoch changes at the following location. https://github.com/kohya-ss/sd-scripts/blob/2d87bb648f30adab00ceb38a0da786cd548d5ce7/library/train_util.py#L582

Instead of having specific information about the buckets in the BucketManager, we may have only the image information and recreate the buckets when the epoch changes. Multiplier could be specified in the .toml file.

kohya-ss / sd-scripts

[Feature Request] Multiplier-Based Image Selection Over Repeat Counts - Improved Subject Balancing #826