ostris / ai-toolkit

Various AI scripts. Mostly Stable Diffusion stuff.
MIT License

How to train a single LoRA with multiple people triggered by their names? #108

Closed davidmartinrius closed 3 weeks ago

davidmartinrius commented 3 weeks ago

Hi there! I was wondering if it's possible to train a single LoRA model to recognize and generate multiple specific faces or bodies of specific persons. For example, could one LoRA model be used to generate both my own face and the faces of others, based on their names? How do I manage this with the trigger words? I have a single dataset with all people tagged by their names, plus a short caption, in the .txt files.

NBSTpeterhill commented 3 weeks ago

Please give it some tries; I'm commenting here to watch for an answer.

davidmartinrius commented 3 weeks ago

Hi, sorry, I don't get it @NBSTpeterhill. What do you mean?

NBSTpeterhill commented 3 weeks ago

I'm also waiting for an answer, because many of ai-toolkit's training parameters are not actually fully exposed. So I am curious too.

bghira commented 3 weeks ago

like this: https://huggingface.co/ptx0/flux-dreambooth-lora-r16-dev

martintomov commented 3 weeks ago

For example, could one LoRA model be used to generate both my own face and the faces of others, based on their names? How do I manage this with the trigger words?

You can train multiple triggers, each with its own set of images. The key is in the captions. Simply caption enough images of yourself using the trigger word associated with your face, and caption other images with the trigger word linked to whatever else you want to train. Repeat this for as many triggers as you want and find the limit; I think it should work.
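For example (hypothetical file names and captions, one .txt per image):

david_01.txt:   photo of David, a man in a suit, standing in front of a flag
martin_01.txt:  photo of Martin, a man in a beanie, sitting at a cafe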

davidmartinrius commented 3 weeks ago

You mean the trigger_word from the .yaml file?

So I should train a LoRA with a set of images of a person and a trigger, then fine-tune this LoRA with another set of images and another trigger, and so on, fine-tuning repeatedly? Or can I train the model all at once with all the images and set up multiple triggers at the same time?

martintomov commented 3 weeks ago

Or can I train the model all at once with all the images and set up multiple triggers at the same time?

this.

I’m going to run a quick training to see if it works, but I suggest you give it a shot too. Try something simple, like using your face and a friend’s face. Give them names like “David” and “Martin.” Once you’ve got the LoRA trained, try a prompt like “Portrait of David and Martin standing next to each other... etc” and see how it turns out.

davidmartinrius commented 3 weeks ago

OK, but the .yaml file only allows one trigger word. So how do I specify multiple triggers in the same training?

I already have a dataset like:

people/David/image1.txt   -> "David in a suit and tie standing in front of a flag"
people/David/image1.jpg
...
people/Martin/image1.txt  -> "Martin standing in front of a building"
people/Martin/image1.jpg
...
people/PersonN/imageN.txt -> "Name N doing something and wearing something bla bla"
people/PersonN/imageN.jpg
...

And so on. Each person has 100 images with their corresponding captions, and each caption contains the name of that person.

I could merge all the images and captions into a single folder, instead of having a folder for each person, if that works better.

How do I specify multiple triggers when training this?

martintomov commented 3 weeks ago

Give this a try:

  1. Combine all images and captions into a single dataset folder.
  2. Set the trigger word to something like “photograph” or “photograph style.”
  3. For testing, use a learning rate of 4e-3 and set the steps to 1000. (This is just for a quick validation to see if the idea works. Later, you should lower the learning rate and increase the steps for better results.)

Edit:

  1. Also, replace the default sample prompts with ones like “Photograph of David and Martin standing next to each other…” or “Photograph of David and Martin eating cake at a restaurant…” etc. so you can monitor results during training. I find this really helpful for my use cases.
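Putting that together, the relevant parts of the .yaml would look roughly like this (a sketch only; these keys sit in their usual places in the example config, and the folder path and names are illustrative):

trigger_word: "photograph"
datasets:
  - folder_path: "/path/to/combined_dataset"  # all people and captions in one folder
train:
  lr: 4e-3    # quick-validation value; lower it for a real run
  steps: 1000
sample:
  prompts:
    - "Photograph of David and Martin standing next to each other"
    - "Photograph of David and Martin eating cake at a restaurant"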

davidmartinrius commented 3 weeks ago

Great @martintomov, I am going to try it! You have cleared up many doubts for me in one go. Thank you so much!!

martintomov commented 3 weeks ago

No worries. I hope it works out well. I’m curious too, so I’ll share my results using this technique tomorrow.

martintomov commented 3 weeks ago

@davidmartinrius, here’s what I found—results are a bit mixed:

I ran a quick test with 1000 steps and a learning rate of 4e-3 to see if this would work. The dataset had 8 images and captions for Person 1 and another 8 for Person 2. I gave each person a specific name in the dataset and manually captioned the images. Trigger word: Photograph

Person 1: [training images]

Person 2: [training images]

Results

Prompting each person separately:

Person 1 output: [image]

Person 2 output: [image]

Prompting them together:

Output 1: [image]

Output 2: [image]

Output 3: [image]

Conclusion

It's pretty interesting that prompting each person separately works flawlessly, but it struggles when you combine them. This might be due to the small dataset, too few training steps, or the learning rate being a bit too high. I’d suggest trying it again with a larger dataset, a learning rate of 1e-4, and around 4000 steps (or more?) to see if that improves things. Also, check out bghira’s model on Hugging Face—he shared a link in a comment above, and the results are worth a look: Hugging Face - ptx0/flux-dreambooth-lora-r16-dev.
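In config terms, that retry would only change the train section (values from the suggestion above; everything else as in the example .yaml):

train:
  lr: 1e-4
  steps: 4000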

Another approach to consider is LoRA Merging. I’ve found it to be pretty powerful, though I’m still exploring it:

Trained model on this dress: [image]

Trained model on this face: [image]

LoRA Merging output: [image]

davidmartinrius commented 3 weeks ago

I got bad results and I don't know why... maybe you could help me with this.

I used a dataset of 3 persons. Each person has between 80-100 images. When training (at 70%) I just got blurry/noisy images like this one:

[blurry sample image]

This is my config .yaml. Trained on an RTX A5000 (24 GB VRAM):

---
job: extension
config:
  # this name will be used for the output folder and file names
  name: "my_first_flux_lora_v1"
  process:
    - type: 'sd_trainer'
      # root folder to save training sessions/samples/weights
      training_folder: "output"
      # uncomment to see performance stats in the terminal every N steps
#      performance_log_every: 1000
      device: cuda:0
      # if a trigger word is specified, it will be added to captions of training data if it does not already exist
      # alternatively, in your captions you can add [trigger] and it will be replaced with the trigger word
      trigger_word: "photograph"
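      # note: only one trigger_word can be set here; with multiple people in one
      # dataset, the individual names have to come from the caption .txt files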
      network:
        type: "lora"
        linear: 16
        linear_alpha: 16
      save:
        dtype: float16 # precision to save
        save_every: 250 # save every this many steps
        max_step_saves_to_keep: 4 # how many intermittent saves to keep
      datasets:
        # datasets are a folder of images. captions need to be txt files with the same name as the image
        # for instance image2.jpg and image2.txt. Only jpg, jpeg, and png are supported currently
        # images will automatically be resized and bucketed into the resolution specified
        # on windows, escape back slashes with another backslash so
        # "C:\\path\\to\\images\\folder"
        - folder_path: "/workspace/ai-toolkit/dataset"
          caption_ext: "txt"
          caption_dropout_rate: 0.05  # will drop out the caption 5% of time
          shuffle_tokens: false  # shuffle caption order, split by commas
          cache_latents_to_disk: true  # leave this true unless you know what you're doing
          resolution: [ 512, 768, 1024 ]  # flux enjoys multiple resolutions
      train:
        batch_size: 1
        steps: 1000  # total number of steps to train 500 - 4000 is a good range
        gradient_accumulation_steps: 1
        train_unet: true
        train_text_encoder: false  # probably won't work with flux
        gradient_checkpointing: true  # need this on unless you have a ton of vram
        noise_scheduler: "flowmatch" # for training only
        optimizer: "adamw8bit"
        lr: 4e-3
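        # 4e-3 was the quick-test value suggested above; for a real run, lower it
        # (e.g. 1e-4) and increase the steps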
        # uncomment this to skip the pre training sample
#        skip_first_sample: true
        # uncomment to completely disable sampling
#        disable_sampling: true
        # uncomment to use new bell-curved weighting. Experimental, but may produce better results
#        linear_timesteps: true

        # ema will smooth out learning, but could slow it down. Recommended to leave on.
        ema_config:
          use_ema: true
          ema_decay: 0.99

        # will probably need this if gpu supports it for flux, other dtypes may not work correctly
        dtype: bf16
      model:
        # huggingface model name or path
        name_or_path: "black-forest-labs/FLUX.1-dev"
        is_flux: true
        quantize: true  # run 8bit mixed precision
#        low_vram: true  # uncomment this if the GPU is connected to your monitors. It will use less vram to quantize, but is slower.
      sample:
        sampler: "flowmatch" # must match train.noise_scheduler
        sample_every: 250 # sample every this many steps
        width: 1024
        height: 1024
        prompts:
          # you can add [trigger] to the prompts here and it will be replaced with the trigger word
#          - "[trigger] holding a sign that says 'I LOVE PROMPTS!'"\
          - "a [trigger] of person1 with red hair, playing chess at the park, bomb going off in the background"
          - "a [trigger] of person2 holding a coffee cup, in a beanie, sitting at a cafe"
          - "a [trigger] of person3 as a DJ at a night club, fish eye lens, smoke machine, lazer lights, holding a martini"
          - "Photograph of person1 and person3 eating cake at a restaurant"
          - "a [trigger] of person2 with a bear disguise in a log cabin in the snow covered mountains"
          - "person3 playing the guitar, on stage, singing a song, laser lights, punk rocker"
          - "person2 with a beard, building a chair, in a wood shop"
          - "photograph of person1, white background, medium shot, modeling clothing, studio lighting, white backdrop"
          - "a photograph of person2 holding a sign that says, 'I am the fucking king' as a gangsta in LA"
          - "a [trigger] of person1, in a post apocalyptic world, with a shotgun, in a leather jacket, in a desert, with a motorcycle"
        neg: ""  # not used on flux
        seed: 42
        walk_seed: true
        guidance_scale: 4
        sample_steps: 20
# you can add any additional meta info here. [name] is replaced with config name at top
meta:
  name: "[name]"
  version: '1.0'

WarAnakin commented 3 weeks ago

Great @martintomov, I am going to try it! You have cleared up many doubts for me in one go. Thank you so much!!

you can also make subfolders in the main training folder where each subfolder name is the trigger of the sub-category. That would be the proper way of doing things, though I'm not sure how well the scripts have been optimized for such a thing; I'd have to check.
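The suggested layout would look something like this (hypothetical names; this sketches the idea as described, not verified script behavior):

dataset/
  David/            <- subfolder name acting as the trigger for this sub-category
    image1.jpg
    image1.txt
  Martin/
    image1.jpg
    image1.txt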

martintomov commented 3 weeks ago

you can also make subfolders in the main training folder where each subfolder name is the trigger of the sub-category.

sounds promising, have you tried this approach?

WarAnakin commented 3 weeks ago

you can also make subfolders in the main training folder where each subfolder name is the trigger of the sub-category.

sounds promising, have you tried this approach?

Yes I have; otherwise I wouldn't have suggested it. One thing to watch out for: keep the number of images in each folder similar. Too much of a discrepancy and it might learn one thing better than the other.

kilimchoi commented 3 weeks ago

you can also make subfolders in the main training folder where each subfolder name is the trigger of the sub-category.

sounds promising, have you tried this approach?

Yes I have; otherwise I wouldn't have suggested it. One thing to watch out for: keep the number of images in each folder similar. Too much of a discrepancy and it might learn one thing better than the other.

Do we need to change anything in the .yaml if we want to do this? Or would it still look like

datasets:
  - folder_path: "/workspace/ai-toolkit/dataset"

where the dataset folder contains two subfolders like model_a_training_data and model_b_training_data?

barnaclejive commented 3 weeks ago

Have you considered making LoRAs for each person individually, so they each have their own trigger, and then using both LoRAs when generating? Otherwise I'm not sure your question is specific to this project; it's more of a general training question.

I suspect you are going to have a hard time training on images of multiple people unless you can create really good caption files that make it clear who is who, and then figure out whether that actually works. Yes, you could train on the entire set of images that contain both people and set the trigger to something more general like "photo" (with good captions), but I feel like you are going to get worse results than making multiple models/LoRAs and combining them in some way during generation.

barnaclejive commented 3 weeks ago

@WarAnakin

you can also make subfolders in the main training folder where each subfolder name is the trigger of the sub-category. That would be the proper way of doing things, though I'm not sure how well the scripts have been optimized for such a thing; I'd have to check.

This is supported and works? What do you set for the trigger_word then? Does it get ignored or somehow combined with the subfolder names? I haven't seen any documentation about it magically using subfolder names as trigger words that override the trigger_word in the config.

That would be the proper way of doing things, though I'm not sure how well the scripts have been optimized for such a thing

Maybe that would be the "proper" way, but is it actually supported in real life? I feel like the scripts either support it or they don't; it probably isn't a matter of the scripts being "optimized" for it. What are you basing this approach on? Did you find code that suggests this is a feature?

I'd have to check

So... did you check? I know you say you did this and it worked or you would not have suggested it, but it is not clear what you are basing this on other than maybe you tried something and assumed it is doing what you think it would.

One thing to watch out for is the number of images in the folder to be similar. Too much of a discrepancy and it might learn one thing better than the other.

TBH, it sounds like you just ended up training on the whole folder (with subfolders) and subfolder names had no impact, and that is why you have mixed results.

I'm not saying you're wrong, it is just very unclear where this advice is coming from, so I'm skeptical. I would assume that if this were something it actually does, it would have been documented somewhere.

Are you just wishcasting this feature or can you point to some code that shows it is an actual feature?

WarAnakin commented 3 weeks ago

@barnaclejive

Before using this tool, I used kohya, so I have a tendency to organize my folders in kohya's format ("repeats + instance + class", e.g. "7_shakir tiger"). I have trained hundreds of different things for different purposes, with datasets of all sorts and shapes. For example, everything you see on https://logodiffusion.com is running off of the models and LoRAs I trained for that client, and Juggernaut SDXL is running off of the realistic base I trained for https://rundiffusion.com. I'm not saying this with the intent to brag, but to give you an idea of the complexity of the datasets I usually work with.
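For context, kohya encodes repeats, instance, and class directly in the folder names, e.g. (hypothetical names):

train/
  7_David person/    <- 7 repeats, instance "David", class "person"
  7_Martin person/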

When I said that you can use multiple subfolders in the main folder, it wasn't an accident; I did that on purpose to check whether it works or breaks. The number of repeats doesn't work, though; that is not a feature currently supported in this tool.

Man, I feel like your post is more of an interrogation than anything else.

Maybe that would be the "proper" way, but is it actually supported in real life?

Yes, it should be a proper thing for the folder name to represent the unique trigger for that sub-category; otherwise you can just throw all the images and their respective captions in one folder and call it a day.

So... did you check? I know you say you did this and it worked or you would not have suggested it, but it is not clear what you are basing this on other than maybe you tried something and assumed it is doing what you think it would.

If I'd had the answer already, I wouldn't have said I'd have to check in the first place.

TBH, it sounds like you just ended up training on the whole folder (with subfolders) and subfolder names had no impact, and that is why you have mixed results.

Yes, I trained on a folder with subfolders, and it correctly reported the total number of images, didn't throw an error, and just proceeded as normal. Since it fetches the counts from each folder, at this point it is a matter of further extending the scripts to perform the additional tasks; beyond that, the system doesn't report anything, now does it?

TBH, it sounds like you just ended up training on the whole folder (with subfolders) and subfolder names had no impact, and that is why you have mixed results.

What I mentioned there has to do with balancing out the steps when you are training on more than one concept, in order to control overfitting; that is why the number of repeats helps (e.g. with 100 images of one person and 30 of another, repeats let the smaller set be seen a similar number of times).

Are you just wishcasting this feature or can you point to some code that shows it is an actual feature?

"Are you just wishcasting", blah blah blah... even if I were wishcasting this, yes, it would be a helpful feature now, wouldn't it?!

So... did you check?
This is supported and works?
Does it get ignored or somehow combined with the subfolder names?
Can you point to some code that shows it is an actual feature?

I was about to, but instead here I am with my back against the wall, answering questions as if I owe anyone anything.

Jokes aside, you know very well that we do this as a hobby, and lots of us have quite a lot to deal with. I wish there were 36 hours in a day and I had 4 arms, but I'm just a human. The only reason I took the time to write this long post is that you took the time to write yours. And no, other than the number of images, this is all we get: [screenshot]

Now, I am going to look at the code and see what I can do. I've forked this repository, so if I make changes you will be able to see them (I already implemented polynomial as a scheduler), but I will come post here as well.

barnaclejive commented 3 weeks ago

@WarAnakin

I'm not sure what your experience with other tools has to do with the question about this project at this time, or with the OP's question.

So... is this feature, in this particular project, at this time, supported or not? That is really all that matters.

Yes, I trained on a folder with subfolders, and it correctly reported the total number of images, didn't throw an error, and just proceeded as normal. Since it fetches the counts from each folder

Yes, it probably does find all the images in the subfolders and knows how many there are in each. That isn't the question. What it does with that information is what matters.

at this point it is a matter of further extending the scripts to perform the additional tasks; beyond that, the system doesn't report anything, now does it?

"it is matter of just..."? I will take that as a no. No, the system does not mention anything else. That was literally the OP question though and I think you have answered it now.

If your suggestion doesn't do what the OP asked, I'm not sure what the goal of your comment was. You presented your answer as if it would do what the OP wanted, not as "it could work, but doesn't unless..."

even if I were wishcasting this, yes, it would be a helpful feature now, wouldn't it?!

Yes! This would be a great feature. It would be awesome if it behaved like other tools in this way. I hope you will make it work and get it merged into the project.

My only objection is that you presented a solution to a question, and your solution isn't a real thing at this time.

davidmartinrius commented 3 weeks ago

@WarAnakin I don't see any clarification of how to make it work properly. Please, as you are an expert, could you explain step by step how to train with a dataset of multiple persons, plus an appropriate yaml configuration?

Thank you!

kilimchoi commented 3 weeks ago

@barnaclejive I tried it out as @WarAnakin suggested and it does not work.

davidmartinrius commented 3 weeks ago

@barnaclejive I tried it out as @WarAnakin suggested and it does not work.

What did you try? What dataset? What configuration? This tells us nothing either.

kilimchoi commented 3 weeks ago

@barnaclejive I tried it out as @WarAnakin suggested and it does not work.

What did you try? What dataset? What configuration? This tells us nothing either.

Essentially, I created two folders (face, style) under the main dataset folder. In the .yaml file, I added

datasets:
  - folder_path: "/workspace/ai-toolkit/dataset"

For each text file in the face folder, I added "person1". For each text file in the style folder, I added "in photography style". Since it was not clear how to add multiple trigger words, I thought this was the only way to make it obvious that these were the trigger words.
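For example, captions along these lines (hypothetical contents):

face/img001.txt:   person1, portrait photo, looking at the camera
style/img001.txt:  in photography style, city street at dusk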

davidmartinrius commented 3 weeks ago

How many images per folder? What resolutions? How many steps? Learning rate?

I trained a model and it works when using folder names as triggers. The generated people look about 80% like the real ones in my first training. I used all the default .yaml settings except steps, which I set to 3000. Since that was pretty close to the desired result, I have now trained it up to 6000 steps. I trained it with 3 different persons.

kilimchoi commented 3 weeks ago

How many images per folder? What resolutions? How many steps? Learning rate?

I trained a model and it works when using folder names as triggers. The generated people look about 80% like the real ones in my first training. I used all the default .yaml settings except steps, which I set to 3000. Since that was pretty close to the desired result, I have now trained it up to 6000 steps. I trained it with 3 different persons.

Around 32 images in each folder, at 1024x1024, with 2000 steps and a learning rate of 4e-4. How do you use folder names as triggers?

davidmartinrius commented 3 weeks ago

OK. I used 100 images for each person and a learning rate of 1e-4 (the default in the .yaml file).

I didn't set any trigger in the .yaml file; I just did as @WarAnakin said: each subfolder name is the trigger of its sub-category.

kilimchoi commented 3 weeks ago

OK. I used 100 images for each person and a learning rate of 1e-4 (the default in the .yaml file).

I didn't set any trigger in the .yaml file; I just did as @WarAnakin said: each subfolder name is the trigger of its sub-category.

So it automatically knows to set the trigger for each sub-category based on the subfolder name?

davidmartinrius commented 3 weeks ago

Yes, exactly. But I also put the name of the person in each caption, like this: "A photo of person1, wearing a suit, bla bla" (comma-separated, with the name always in the first part).

I am not sure if it is the appropriate way, but it worked.

kilimchoi commented 3 weeks ago

Yes, exactly. But I also put the name of the person in each caption, like this: "A photo of person1, wearing a suit, bla bla" (comma-separated, with the name always in the first part).

I am not sure if it is the appropriate way, but it worked.

So you named the folder "person1" and added person1 in the captions, right?

davidmartinrius commented 3 weeks ago

Exactly. And actually, I don't know if it is working because of the captions, because of the folder name, or both. But I did it this way, yes.

WarAnakin commented 2 weeks ago

Hi guys, apologies for the delay; I had a bit of an eventful weekend. I will be able to take a better look at this tomorrow.

Exactly. And actually, I don't know if it is working because of the captions, because of the folder name, or both. But I did it this way, yes.

It's working because of the captions; I have done the same experiment as you. It's not doing too good a job of differentiating between multiple people, though, and this is due to the text encoder not being trained.

Maybe, if it's possible, Ostris would consider enabling training of clip_l in order to alleviate some of these issues.

NBSTpeterhill commented 2 weeks ago

Training the T5 text encoder uses much more VRAM and time to really be useful, and it always tends to overfit. So we should either not train T5, or mark only some T5 blocks to train. Also, I found that captions of the form "a male/female whose name is XXX" work better than "XXX, a male/female"; in the latter form, XXX tends not to be picked up, or is picked up less reliably.
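For example (illustrative captions following that suggestion):

a man whose name is XXX, playing guitar on stage      <- form that worked better
XXX, a man, playing guitar on stage                   <- name picked up less reliably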