Open a-l-e-x-d-s-9 opened 12 months ago
Thanks for the suggestion! The feature is very interesting.
However, it will take a lot of work to implement. Although the selection is random, it should not be completely random, because some images might then go unselected for a long time during training. Therefore, it would require selecting images from each folder in a pre-shuffled order and at a specified ratio, so that all images are eventually enumerated.
Also, since DataLoader runs in a multi-process fashion, we would have to synchronize the order in all processes.
Any PR would be appreciated😅
@kohya-ss Thank you for your response. In ED2 the feature is pseudo-random and based on the seed used for training. I think that's good enough, and there is no need to force full rotation of images or to sync between DataLoaders. I'll be happy to implement this feature myself, but I couldn't figure out how and when exactly the system handles repeats, so that I can switch them to a portion of the files in a folder. If you can point me to the specific file or functions, I would appreciate it.
Thank you for your suggestion. I agree that the full rotation and the synchronization are not needed, especially when the dataset is small. However, if the dataset has 1,000 or more images and we train the model on it for only several epochs, a certain number of images will never be used for training.
On the other hand, the repeat count method forces all images to be used in every epoch.
Therefore I think the repeat count method will be suitable to balance the different subjects.
@kohya-ss There is a small misunderstanding that I want to clear up. The multiplier is an extension of the current repeat method; it doesn't change how the system handles full repeats, and users will be able to use it in the same way they do now. For example, with repeats of 2, the multiplier will be 2, and the system will add all images inside this folder for training exactly twice. There is no randomness in the integer part of the multiplier. The difference occurs only in the fractional part of the multiplier. So, for a multiplier of 2.5 - the integer 2 will add all images from the folder twice, and 0.5 will add only 50% additional images from the folder. The multiplier won't change system behavior or user experience for users who use repeats. But it will allow for much better subject balancing between multiple subjects in a way that's impossible now.
Thanks for your comment. I may have misunderstood something.
Suppose the repeat count is 2. If there are 3 images (image A to C), the current method prepares 6 images, image-A (1), image-A (2), image-B (1), image-B (2), image-C (1), image-C (2), then shuffles them and trains.
If the multiplier is 2.5, the number of images per epoch is 6*2.5=15. If we use random.choice(list_of_6_images) to select a particular image, the selection will be biased.
Therefore, if shown in pseudo code, would it be implemented as follows?
if index < num_of_images_with_repeats * int_part_of_multiplier:
    image = list_of_6_images[index % num_of_images_with_repeats]
else:
    image = random.choice(list_of_6_images)
I think this will work fine if the multiplier > 1.0, but if the multiplier < 1.0, the full rotation issue still remains.
@kohya-ss I added a few examples below, with Python code. The rotation of the fractional part is handled by a probabilistic selection of images every epoch, so there is no need to enforce deterministic rotation. I want to explain how I used this feature in ED2 with 25 subjects, where each subject had between 50 and 300 images. I calculated (with a script, but it can be done manually) a multiplier for each folder, so that each subject uses only 25 images per epoch. That way no images are discarded and all of them can participate in the training, while the balance between subjects is always kept. Unfortunately, this cannot be done with the current implementation of repeats in kohya-ss/sd-scripts.
Here are the prints from the code:
- Example of two consecutive epochs with the same folder, with multiplier 2.5, epochs 1, 2:
**************************************************
Multiplier: 2.5, sources: ['A1', 'B1', 'C1', 'D1']
Total images added due to integer part: 8
Additional images added due to fractional part: 2
Total images for training: 10
Images for training: ['A1', 'B1', 'C1', 'D1', 'A1', 'B1', 'C1', 'D1', 'D1', 'C1']
**************************************************
Multiplier: 2.5, sources: ['A1', 'B1', 'C1', 'D1']
Total images added due to integer part: 8
Additional images added due to fractional part: 2
Total images for training: 10
Images for training: ['A1', 'B1', 'C1', 'D1', 'A1', 'B1', 'C1', 'D1', 'D1', 'B1']
- Different folder, epoch 3:
**************************************************
Multiplier: 1.33, sources: ['A2', 'B2', 'C2', 'D2', 'E2', 'F2', 'G2', 'H2']
Total images added due to integer part: 8
Additional images added due to fractional part: 3
Total images for training: 11
Images for training: ['A2', 'B2', 'C2', 'D2', 'E2', 'F2', 'G2', 'H2', 'E2', 'C2', 'H2']
- Another folder with fractions only, epoch 5:
**************************************************
Multiplier: 0.5, sources: ['A3', 'B3', 'C3', 'D3', 'E3', 'F3']
Total images added due to integer part: 0
Additional images added due to fractional part: 3
Total images for training: 3
Images for training: ['B3', 'A3', 'E3']
- Another folder with fractions only, epoch 10:
**************************************************
Multiplier: 0.33, sources: ['A4', 'B4', 'C4', 'D4', 'E4', 'F4', 'G4', 'H4']
Total images added due to integer part: 0
Additional images added due to fractional part: 3
Total images for training: 3
Images for training: ['C4', 'H4', 'G4']
Code:
import random


def choose_images_for_training_for_folder(files_list, multiplier, main_seed, epochs):
    # Create an isolated instance of random, seeded per epoch
    local_random = IsolatedRandom(main_seed + epochs)
    images_for_training = []

    # Split the multiplier into integer and fractional parts
    int_multiplier = int(multiplier)
    frac_multiplier = multiplier - int_multiplier

    # Add all images for the integer part of the multiplier
    for _ in range(int_multiplier):
        images_for_training.extend(files_list)

    # Randomly select a subset of images for the fractional part of the multiplier
    num_to_select = int(len(files_list) * frac_multiplier)
    additional_files = local_random.sample(files_list, num_to_select)

    # If the fractional share is not a whole number of images, add one more image,
    # chosen at random from those not yet selected (this rounds the count up)
    if len(files_list) * frac_multiplier > num_to_select:
        additional_file = local_random.choice([file for file in files_list if file not in additional_files])
        additional_files.append(additional_file)
    images_for_training.extend(additional_files)

    # Print the results
    print("*" * 50)
    print(f"Multiplier: {multiplier}, sources: {files_list}")
    print(f"Total images added due to integer part: {len(files_list) * int_multiplier}")
    print(f"Additional images added due to fractional part: {len(additional_files)}")
    print(f"Total images for training: {len(images_for_training)}")
    print("Images for training:", images_for_training)
    return images_for_training


class IsolatedRandom:
    # Thin wrapper around random.Random so the global random state is untouched
    def __init__(self, seed=None):
        self._random = random.Random(seed)

    def seed(self, seed=None):
        self._random.seed(seed)

    def sample(self, population, k):
        return self._random.sample(population, k)

    def random(self):
        return self._random.random()

    def choice(self, seq):
        return self._random.choice(seq)


main_seed = 10

# Test with given examples and fixed seed and epoch
print("- Example of two consecutive epochs with the same folder, with multiplier 2.5, epochs 1, 2:")
choose_images_for_training_for_folder(["A1", "B1", "C1", "D1"], 2.5, main_seed, 1)
choose_images_for_training_for_folder(["A1", "B1", "C1", "D1"], 2.5, main_seed, 2)
print("- Different folder, epoch 3:")
choose_images_for_training_for_folder(["A2", "B2", "C2", "D2", "E2", "F2", "G2", "H2"], 1.33, main_seed, 3)
print("- Another folder with fractions only, epoch 5:")
choose_images_for_training_for_folder(["A3", "B3", "C3", "D3", "E3", "F3"], 0.5, main_seed, 5)
print("- Another folder with fractions only, epoch 10:")
choose_images_for_training_for_folder(["A4", "B4", "C4", "D4", "E4", "F4", "G4", "H4"], 0.33, main_seed, 10)
Thank you for the explanation. Maybe I am misunderstanding something, but it still remains unclear to me.
Multiplier: 0.5, sources: ['A3', 'B3', 'C3', 'D3', 'E3', 'F3']
Total images added due to integer part: 0
Additional images added due to fractional part: 3
Total images for training: 3
Images for training: ['B3', 'A3', 'E3']
- Another folder with fractions only, epoch 10:
In this case, since all images are selected probabilistically, it could happen that one of the images is never used for training.
Although the number of steps per epoch is higher, I think the same distribution could be achieved with the following settings.
repeat count 15, sources: ['A1', 'B1', 'C1', 'D1']
repeat count 8, sources: ['A2', 'B2', 'C2', 'D2', 'E2', 'F2', 'G2', 'H2']
repeat count 3, sources: ['A3', 'B3', 'C3', 'D3', 'E3', 'F3']
repeat count 2, sources: ['A4', 'B4', 'C4', 'D4', 'E4', 'F4', 'G4', 'H4']
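As a rough check of the settings above, the sketch below scales each multiplier from the earlier examples by a common factor and rounds the result to an integer repeat count. The factor of 6 is an assumption inferred from the listed repeat counts, not something stated in the thread, and the folder names and variable names are made up for illustration.

```python
# Folder sizes and multipliers taken from the four examples in the thread.
# Folder names are hypothetical labels.
folders = {
    "folder1": (4, 2.5),
    "folder2": (8, 1.33),
    "folder3": (6, 0.5),
    "folder4": (8, 0.33),
}

# Assumption: common scale factor inferred from the repeat counts 15, 8, 3, 2.
SCALE = 6

# Integer repeat count per folder: multiplier * SCALE, rounded.
repeat_counts = {name: round(mult * SCALE) for name, (_, mult) in folders.items()}

# Images contributed per epoch by each folder under the repeat count method.
images_per_epoch = {name: n * repeat_counts[name] for name, (n, _) in folders.items()}

print(repeat_counts)
print(sum(images_per_epoch.values()))
```

With these values an epoch contains 158 images, compared with 27 in the multiplier printouts (10 + 11 + 3 + 3), which is the higher step count mentioned above.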
@kohya-ss The chance that any particular image is never trained on with a fraction-only multiplier is (1 - multiplier)^epochs. It can happen in certain cases, but it doesn't really matter. The important part is that balancing between subjects is enforced, all subjects utilize most of their images, and the user can easily achieve uniform training between subjects. Using the current repeats is not a very effective solution for subject balancing, for a couple of reasons:
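To put numbers on the probability quoted in this comment, here is a minimal sketch. It assumes each image is picked independently in every epoch with probability equal to the fraction-only multiplier, which only approximates the sampling code posted earlier. The function name is hypothetical.

```python
def prob_never_selected(multiplier: float, epochs: int) -> float:
    """Chance that one given image is skipped in every epoch.

    Assumption: the image is picked independently in each epoch with
    probability equal to the (fraction-only) multiplier.
    """
    if not 0.0 <= multiplier <= 1.0:
        raise ValueError("expected a fraction-only multiplier in [0, 1]")
    return (1.0 - multiplier) ** epochs


# A few combinations of multiplier and epoch count.
for m in (0.5, 0.33, 0.025):
    for e in (5, 10, 50):
        print(f"multiplier={m}, epochs={e}: {prob_never_selected(m, e):.4f}")
```

Under this assumption, a multiplier of 0.5 over 10 epochs leaves roughly a 0.1% chance that a given image is never seen, while a multiplier of 0.025 over 10 epochs leaves roughly a 78% chance, so the effect depends heavily on how small the multiplier is relative to the number of epochs.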
Thank you for the clarification. I think I understand the advantage of the multiplier method.
It is true that when using the repeat count, the bias of the data within an epoch may not be negligible. I also agree that it is difficult to balance when there are many subjects (even if that is rare).
One difficulty is aspect ratio bucketing.
Using the repeat count, the images within an epoch can be predefined, so each image is pre-sorted into one of the buckets. For each training step, a batch-size number of images is then drawn from a single bucket.
However, this is not possible with a multiplier, because the images have different sizes and cannot simply be assigned to the same batch.
Any ideas would be appreciated.
@kohya-ss
Yes, with the introduction of random images per epoch, the bucketing per epoch will require changes. My idea of implementation consists of two stages:
To optimize the second step, we might use two lists: permanent and probabilistic. The permanent list is reserved for images whose multiplier has an integer part above 0. For example, images with a multiplier of 2.5 will always be trained twice per epoch, so we can keep them in the permanent list for every epoch's training. The probabilistic list will only contain images whose multiplier has a fractional part. So, if we only have images in the permanent list and none in the probabilistic list, the system will operate pretty much as it does now, with no runtime penalty for the user. However, if the user opts to use multipliers, images from the probabilistic list will be added to the training set at the start of each epoch based on the fraction of the multiplier: a random value is drawn for each image, and the image is added if that value is less than the multiplier's fraction.
As far as I understand, the implementation of ED2 is doing something similar now, and people are using it with hundreds of thousands of images, so it's a valid solution. Regular users, with a few thousand images, definitely wouldn't notice any changes in performance.
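The permanent/probabilistic split described in this comment could look roughly like the sketch below. All function names and the `(files, multiplier)` folder representation are hypothetical; this is not code from sd-scripts or ED2.

```python
import random


def build_permanent_list(folders):
    """Images repeated by the integer part of each folder's multiplier (built once)."""
    permanent = []
    for files, multiplier in folders:
        permanent.extend(files * int(multiplier))
    return permanent


def build_probabilistic_pool(folders):
    """(image, fraction) pairs for folders whose multiplier has a fractional part."""
    pool = []
    for files, multiplier in folders:
        fraction = multiplier - int(multiplier)
        if fraction > 0:
            pool.extend((f, fraction) for f in files)
    return pool


def images_for_epoch(permanent, pool, seed, epoch):
    """Permanent images plus each pooled image whose random draw is below its fraction."""
    rng = random.Random(seed + epoch)  # seeded per epoch, as in the code posted earlier
    extra = [f for f, fraction in pool if rng.random() < fraction]
    return permanent + extra


# Hypothetical folders: one with an integer multiplier, one fraction-only.
folders = [(["A1", "B1"], 2.0), (["A3", "B3", "C3", "D3"], 0.5)]
permanent = build_permanent_list(folders)
pool = build_probabilistic_pool(folders)
epoch_images = images_for_epoch(permanent, pool, seed=10, epoch=1)
print(epoch_images)
```

Note that a per-image random draw makes the number of extra images vary from epoch to epoch, unlike the sampling code posted earlier in the thread, which always adds a fixed count per folder.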
Thanks for the suggestion. Sorry for the delay in responding to you.
I understand the idea of re-sorting the images into buckets in each epoch, and I think it works.
One issue is that the DataLoader runs in a multi-process manner, and all processes must have the same buckets for each epoch, but this can probably be solved by sharing the random seed in advance, as is already implemented.
The real problem is that it would take a lot of effort to implement. However, I personally feel that other tasks have higher priority.
So, if I do implement the feature, it will be some time in the future. I appreciate your understanding.
If you are interested in creating a PR, you can find the bucketing code at: https://github.com/kohya-ss/sd-scripts/blob/2d87bb648f30adab00ceb38a0da786cd548d5ce7/library/train_util.py#L730
This process is called before the start of the training.
Epoch changes are handled at the following location: https://github.com/kohya-ss/sd-scripts/blob/2d87bb648f30adab00ceb38a0da786cd548d5ce7/library/train_util.py#L582
Instead of keeping specific information about the buckets in the BucketManager, we could keep only the image information and recreate the buckets when the epoch changes. The multiplier could be specified in the .toml file.
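The idea of keeping only the image information and recreating the buckets at each epoch change can be sketched as follows. This is a standalone illustration under stated assumptions: it does not use the actual BucketManager code, the `(name, resolution)` records and all function names are hypothetical, and the selection is seeded with `seed + epoch` so that every process computing it independently arrives at the same buckets.

```python
import random
from collections import defaultdict


def select_for_epoch(images, fraction, seed, epoch):
    """Pick a subset deterministically: same seed and epoch give the same result."""
    rng = random.Random(seed + epoch)
    k = int(len(images) * fraction)
    return rng.sample(images, k)


def rebuild_buckets(selected):
    """Group (name, resolution) records by resolution."""
    buckets = defaultdict(list)
    for name, resolution in selected:
        buckets[resolution].append(name)
    return dict(buckets)


def make_batches(buckets, batch_size):
    """Split each bucket into batches so a batch never mixes resolutions."""
    batches = []
    for resolution, names in buckets.items():
        for i in range(0, len(names), batch_size):
            batches.append((resolution, names[i:i + batch_size]))
    return batches


# Hypothetical image records: (file name, bucket resolution).
images = [
    ("A", (512, 512)), ("B", (512, 768)), ("C", (512, 512)),
    ("D", (768, 512)), ("E", (512, 768)), ("F", (512, 512)),
]

for epoch in (1, 2):
    selected = select_for_epoch(images, 0.5, seed=10, epoch=epoch)
    batches = make_batches(rebuild_buckets(selected), batch_size=2)
    print(epoch, batches)
```

Because the buckets are rebuilt from the selected images only, some buckets may end up with fewer images than the batch size in a given epoch, which a real implementation would need to handle.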
I'm used to the EveryDream2 trainer, which gives users the option to select a portion of the images from a folder using a multiplier instead of a repeat count. For instance, with a multiplier of 0.5 applied to a folder, the trainer will randomly select half of the images for each epoch. This system isn't limited to fractions; users can set multipliers such as 1.6 or any other positive value. On the other hand, using a negative multiplier will exclude the folder.
This multiplier approach retains the repeat functionality users are accustomed to while offering a significant improvement in balancing training across different subjects without having to delete any images.
To illustrate, consider three subjects: A with 1000 images, B with 200 images, and C with 100 images. If the goal is to train on 25 images per epoch, we can set multipliers of 0.025 for A, 0.125 for B, and 0.25 for C. This eliminates the need to delete images or train on all 1000 images from subject A. This method effectively balances subjects, especially when the number of images in their datasets varies. As the number and diversity of subjects increase, this approach provides better control over training, avoiding the need to discard images.
I believe that adding this feature to Kohya is essential.
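The arithmetic in the three-subject example can be reproduced with a short sketch: the multiplier for a folder is the target number of images per epoch divided by the number of images in the folder. The function name is hypothetical.

```python
def multiplier_for_target(num_images: int, target_per_epoch: int) -> float:
    """Multiplier that yields about `target_per_epoch` images from a folder."""
    if num_images <= 0:
        raise ValueError("folder must contain at least one image")
    return target_per_epoch / num_images


# Subjects and image counts from the example: A=1000, B=200, C=100.
subjects = {"A": 1000, "B": 200, "C": 100}
multipliers = {name: multiplier_for_target(n, 25) for name, n in subjects.items()}
print(multipliers)  # {'A': 0.025, 'B': 0.125, 'C': 0.25}
```

The same calculation extends to any number of subjects, which matches the per-folder script approach mentioned earlier in the thread.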