ostris / ai-toolkit

Various AI scripts. Mostly Stable Diffusion stuff.
MIT License

The slider does not work with SDXL. #19

Closed: KappaMund closed 1 month ago

KappaMund commented 1 year ago

I have already trained 3 different LoRA sliders with different positive and negative settings, but the sliders come out adjusting detail, not the concept I trained them for.

Hair Style Slider:

Fitness Level Slider:

Technological Advancement Slider:

KappaMund commented 1 year ago

Example of the settings:

---
# This is in yaml format. You can use json if you prefer
# I like both, but yaml is easier to write
# Plus it has comments, which is nice for documentation
# This is the config I use on my sliders. It is solid and tested
job: train
config:
  # the name will be used to create a folder in the output folder
  # it will also replace any [name] token in the rest of this config
  name: fitness_level_slider
  # folder will be created with the name above in the folder below
  # it can be relative to the project root or absolute
  training_folder: "output/LoRA"
  device: cuda:1  # cpu, cuda:0, etc
  # for tensorboard logging, we will make a subfolder for this job
  log_dir: "output/.tensorboard"
  # you can stack processes for other jobs, but it is not tested with sliders,
  # so just use one for now
  process:
    - type: slider  # tells runner to run the slider process
      # network is the LoRA network for a slider; I recommend leaving it as is
      network:
        # network type lierla is traditional LoRA that works everywhere, only linear layers
        type: "lierla"
        # rank / dim of the network. Bigger is not always better, especially for sliders. 8 is good
        linear: 8
        linear_alpha: 4  # do about half of rank
      # training config
      train:
        # this is also used in sampling. Stick with ddpm unless you know what you are doing
        noise_scheduler: "ddpm"  # or "ddpm", "lms", "euler_a"
        # how many steps to train. More is not always better. I rarely go over 1000
        steps: 500
        # I have had good results with 4e-4 to 1e-4 at 500 steps
        lr: 2e-4
        # enables gradient checkpointing, saves vram, leave it on
        gradient_checkpointing: true
        # train the unet. I recommend leaving this true
        train_unet: true
        # train the text encoder. I don't recommend this unless you have a special use case;
        # for sliders we are adjusting the representation of the concept (unet),
        # not the description of it (text encoder)
        train_text_encoder: false
        # same as in sd-scripts, not fully tested but should speed up training
        min_snr_gamma: 5.0
        # just leave this unless you know what you are doing
        # also supports "dadaptation", but set lr to 1 if you use that;
        # it learns too fast and I don't recommend it
        optimizer: "adamw"
        # only constant for now
        lr_scheduler: "constant"
        # we randomly denoise a random number of steps from 1 to this number
        # while training. Just leave it
        max_denoising_steps: 40
        # works great at 1. I do 1 even with my 4090.
        # higher may not work right with the newer single-batch stacking code anyway
        batch_size: 1
        # bf16 works best if your GPU supports it (modern)
        dtype: bf16  # fp32, bf16, fp16
        # if you have it, use it. It is faster and better
        # torch 2.0 doesn't need xformers anymore; only use it if you have a lower version
        xformers: true
        # I don't recommend using this unless you are trying to make a darker lora. Then do 0.1 MAX;
        # although, the way we train sliders is comparative, so it probably won't work anyway
        noise_offset: 0.0
        # noise_offset: 0.0357  # SDXL was trained with an offset of 0.0357, so use that when training on SDXL

      # the model to train the LoRA network on
      model:
        # huggingface name, path relative to the project root, or absolute path to .safetensors or .ckpt
        name_or_path: "C:/!NeuralNetwork/Downloader/Downloader_Hug/Downloaded models/SDXL/SDXL Base 1.0.safetensors"
        is_v2: false  # for v2 models
        is_v_pred: false  # for v-prediction models (most v2 models)
        # SDXL has some issues with the dual text encoder and the way we train sliders;
        # it works, but weights probably need to be higher to see the effect
        is_xl: true  # for SDXL models

      # saving config
      save:
        dtype: float16  # precision to save. I recommend float16
        save_every: 100  # save every this many steps
        # this will remove step saves beyond this number;
        # allows you to save more often in case of a crash without filling up your drive
        max_step_saves_to_keep: 2

      # sampling config
      sample:
        # must match train.noise_scheduler; this is not used here
        # but may be in the future and in other processes
        sampler: "ddpm"
        # sample every this many steps
        sample_every: 100
        # image size
        width: 1024
        height: 1024
        # prompts to use for sampling. Do as many as you want, but it slows down training
        # pick ones that will best represent the concept you are trying to adjust
        # allows some flags after the prompt:
        #   --m [number]  network multiplier (LoRA weight). -3 for the negative slide and 3 for the
        #                 positive slide are good tests; will inherit sample.network_multiplier if not set
        #   --n [string]  negative prompt; will inherit sample.neg if not set
        # only 75 tokens allowed currently
        # I like to do a wide positive and negative spread so I can see a good range and stop
        # early if the network is breaking down
        prompts:
          - "a nurse is standing in the hospital corridor, stethoscope hanging on her chest, white uniform, --m -5"
          - "a nurse is standing in the hospital corridor, stethoscope hanging on her chest, white uniform, --m -3"
          - "a nurse is standing in the hospital corridor, stethoscope hanging on her chest, white uniform, --m 3"
          - "a nurse is standing in the hospital corridor, stethoscope hanging on her chest, white uniform, --m 5"
          - "a bodybuilder is posing on stage, muscles bulging, chest oiled, --m -5"
          - "a bodybuilder is posing on stage, muscles bulging, chest oiled, --m -3"
          - "a bodybuilder is posing on stage, muscles bulging, chest oiled, --m 3"
          - "a bodybuilder is posing on stage, muscles bulging, chest oiled, --m 5"
          - "a superhero is flying through the sky, cape fluttering, emblem on chest, --m -5"
          - "a superhero is flying through the sky, cape fluttering, emblem on chest, --m -3"
          - "a superhero is flying through the sky, cape fluttering, emblem on chest, --m 3"
          - "a superhero is flying through the sky, cape fluttering, emblem on chest, --m 5"
        # negative prompt used on all prompts above as default if they don't have one
        neg: "cartoon, fake, drawing, illustration, cgi, animated, anime, monochrome"
        # seed for sampling. 42 is the answer for everything
        seed: 42
        # walks the seed so s1 is 42, s2 is 43, s3 is 44, etc
        # will start over on the next sample_every, so s1 is always seed
        # works well if you use the same prompt but want different results
        walk_seed: false
        # cfg scale (4 to 10 is good)
        guidance_scale: 7
        # sampler steps (20 to 30 is good)
        sample_steps: 20
        # default network multiplier for all prompts
        # since we are training a slider, I recommend overriding this with --m [number]
        # in the prompts above to get both sides of the slider
        network_multiplier: 1.0

      # logging information
      logging:
        log_every: 10  # log every this many steps
        use_wandb: false  # not supported yet
        verbose: false  # probably not needed unless you are debugging

      # slider training config, best for last
      slider:
        # resolutions to train on, as [ width, height ]. This is less important for sliders,
        # as we are not teaching the model anything it doesn't already know,
        # but it must be a size it understands: [ 512, 512 ] for sd_v1.5, [ 768, 768 ] for sd_v2.1,
        # and [ 1024, 1024 ] for sd_xl
        # you can do as many as you want here
        resolutions:
          - [ 512, 512 ]
          - [ 512, 768 ]
          - [ 768, 768 ]
          - [ 1024, 1024 ]
        # slider training uses 4 combined steps for a single round. This will do it in one gradient
        # step. It is highly optimized and shouldn't take any more vram than doing without it,
        # since we break down batches for gradient accumulation now. So just leave it on
        batch_full_slide: true
        # These are the concepts to train on. You can do as many as you want here,
        # but they can conflict with and outweigh each other. Other than experimenting, I recommend
        # just doing one for good results
        targets:
          # target_class is the base concept we are adjusting the representation of
          # for example, if we are adjusting the representation of a person, we would use "person";
          # if we are adjusting the representation of a cat, we would use "cat". It is not
          # necessarily a keyword but what the model understands the concept to represent.
          # "person" will affect men, women, children, etc. but will not affect cats, dogs, etc.
          # it is the model's base general understanding of the concept and everything it represents
          # you can leave it blank to affect everything. In this example, we are adjusting
          # detail, so we will leave it blank to affect everything
          - target_class: ""

            positive: "fit physique, toned body, athletic build, muscular frame, healthy look, strong appearance, vigorous physique"

            negative: "unfit physique, flabby body, out-of-shape build, weak frame, unhealthy look, frail appearance, lethargic physique"

            weight: 1.0
            # shuffle the prompts split by the comma. We will run every combination randomly;
            # this will make the LoRA more robust. You probably want this on unless prompt order
            # is important for some reason
            shuffle: true
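          # A hedged sketch (my assumption, not from the stock example config): since the
          # blank target_class above affects everything, and my sliders keep coming out as
          # detail sliders, scoping the effect to people with a non-blank target_class
          # might be worth testing, e.g.:
          # - target_class: "person"
          #   positive: "fit physique, toned body, athletic build"
          #   negative: "unfit physique, flabby body, out-of-shape build"
          #   weight: 1.0
          #   shuffle: true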

        # anchors are prompts that we will try to hold on to while training the slider
        # these are NOT necessary and can prevent the slider from converging if not done right
        # leave them off if you are having issues, but they can help lock the network
        # onto certain concepts to help prevent catastrophic forgetting
        # you want these to generate an image that is not your target_class, but close to it
        # is fine as long as it does not directly overlap it.
        # For example, if you are training on a person smiling,
        # you could use "a person with a face mask" as an anchor. It is a person, and the image is the same
        # regardless of whether they are smiling or not. However, the closer the concept is to the target_class,
        # the smaller the multiplier needs to be. Usually keep multipliers less than 1.0 for anchors;
        # for close concepts, you want to be closer to 0.1 or 0.2
        # these will slow down training. I am leaving them off for the demo

        anchors:
          - prompt: "a woman"
            neg_prompt: "animal"
            # the multiplier applied to the LoRA when this anchor is run;
            # higher will give it more weight but also helps keep the lora from collapsing
            multiplier: 1.0
          - prompt: "a man"
            neg_prompt: "animal"
            multiplier: 1.0
          - prompt: "a person"
            neg_prompt: "animal"
            multiplier: 1.0
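          # A hedged sketch (my assumption): following the note above, an anchor concept
          # close to the fitness target would use a much lower multiplier, e.g.:
          # - prompt: "a person at the gym"
          #   neg_prompt: "animal"
          #   multiplier: 0.2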

# You can put any information you want here, and it will be saved in the model.
# The below is an example, but you can put your grocery list in it if you want.
# It is saved in the model, so be aware of that. The software will include this
# plus some other information for you automatically
meta:
  # [name] gets replaced with the name above
  name: "[Kappa_Neuro]"
  version: '1.0'
  creator:
    name: Your Name
    email: your@gmail.com
    website: https://your.website
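
For reference, I launch training by passing this config to the toolkit's runner (assuming the standard ai-toolkit entry point, with the file saved under config/; adjust the path to wherever you keep it):

python run.py config/fitness_level_slider.yml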
