mlfoundations / open-diffusion

Simple large-scale training of stable diffusion with multi-node support.

Classifier-free guidance? #6

mehdidc opened 1 year ago

mehdidc commented 1 year ago

I might have missed it in the code, but I can't see whether we randomly drop the captions for classifier-free guidance (which is already used at inference).

vkramanuj commented 1 year ago

Hi, thanks for the question. I didn't notice a practical difference between no text dropout and some text dropout in my experiments, so I left it out of this repo. However, I can push a branch later today and potentially merge it after some testing. For reference, the implementation just randomly substitutes the input string with the empty string, similar to how it's done at inference time in diffusers (https://github.com/huggingface/diffusers/blob/384c83aa9a1f268e5587d5ea1ea9f4c040845167/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py#L371).
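In code, that substitution is essentially the following (an illustrative sketch, not the repo's exact implementation):

import random

def maybe_drop_caption(caption: str, text_dropout: float) -> str:
    # With probability `text_dropout`, replace the caption with the empty
    # string so the model also learns an unconditional prediction.
    if random.random() < text_dropout:
        return ""
    return caption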

mehdidc commented 1 year ago

Ah, you mean you could apply classifier-free guidance at inference even though the model never encountered empty strings during training (as done in the repo you mention)? Isn't that unexpected?

vkramanuj commented 1 year ago

Yes, the guidance_scale parameter still worked, which is surprising. However, it's possible there's some performance hit for longer training runs, so it does make sense to add it to this repo.
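For reference, at inference classifier-free guidance just combines the unconditional and conditional noise predictions linearly, along the lines of the diffusers pipeline linked above (a sketch; the tensor names are illustrative):

import torch

def guided_noise(noise_uncond: torch.Tensor,
                 noise_cond: torch.Tensor,
                 guidance_scale: float) -> torch.Tensor:
    # Classifier-free guidance: extrapolate from the unconditional
    # prediction toward the conditional one by guidance_scale.
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)

With guidance_scale = 1 this reduces to the conditional prediction; values above 1 push generations toward the prompt.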

mehdidc commented 1 year ago

I think it's interesting that it worked anyway. Which dataset were you training on?

vkramanuj commented 1 year ago

@mehdidc I've used several datasets with this setup, mostly various filtered versions of LAION-2B (e.g. LAION Aesthetics and LAION High-Res). I've added text dropout in the text-dropout branch (https://github.com/mlfoundations/open-diffusion/tree/text-dropout). Specifically, the changes are:

  1. Adding the option to the WebDataset class: https://github.com/mlfoundations/open-diffusion/blob/8090dc9121592fcb3df7b604fc63180e837209cd/data/base.py#L113
  2. Adding the conditional dropout to the data pipeline: https://github.com/mlfoundations/open-diffusion/blob/8090dc9121592fcb3df7b604fc63180e837209cd/data/base.py#L168

I haven't been able to test this code recently due to lack of resources. Let me know if you get a chance to try this out, and I can merge it into main.
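For anyone who wants to sanity-check the idea outside the repo, here's a minimal sketch of how a caption-dropout stage can be wired into a webdataset pipeline (the bucket/shard placeholders and the 0.1 rate are mine, not the branch's actual code):

import random
import webdataset as wds

def make_caption_dropper(p: float):
    # Returns a map that blanks the caption with probability p.
    def drop(caption: str) -> str:
        return "" if random.random() < p else caption
    return drop

# Hypothetical wiring; the branch's actual changes live in data/base.py.
dataset = (
    wds.WebDataset("pipe:aws s3 cp s3://<bucket>/<shards>.tar -")
    .decode("pil")
    .to_tuple("jpg", "txt")
    .map_tuple(lambda image: image, make_caption_dropper(0.1))
)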

mehdidc commented 1 year ago

Thanks @vkramanuj for the implementation, I can try to do some runs. Do you maybe have the config file you used in your tests with LAION Aesthetics and/or High-Res, so that we can compare more or less directly?

vkramanuj commented 1 year ago

Here's one. I removed my wandb and some path info for privacy reasons. You'd need to replace the webdataset path with one for LAION high-res aesthetics (either original width/height >= 512 or >= 1024). Try to make the global batch size 2048 with either more GPUs or gradient accumulation. Note that this uses the SD v1 architecture, which I found has better throughput and allows a larger per-GPU batch size.
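(For the arithmetic: global batch size = per-GPU batch_size × number of GPUs × gradient_accumulation, so with the batch_size of 32 below that means e.g. 64 GPUs at gradient_accumulation 1, or 8 GPUs at gradient_accumulation 8.)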

system:
    gradient_accumulation: 1
    batch_size: 32
    workers: 6
    dist_backend: ${distributed.dist_backend}
    dist_url: ${distributed.dist_url}

distributed:
    dist_backend: 'nccl'
    dist_url: 'env://'

experiment:
    log_dir: <path>/sd-logs
    name: "laion-2b-aesthetics-hr"
    project: "diffusion"
    num_examples_to_see: 2000000000
    save_every: 2000
    requeue: True

optimizer:
    name: adamw
    params:
        learning_rate: 0.0001
        beta1: 0.9
        beta2: 0.98 # changed from initial sd value for training stability
        weight_decay: 0.01
        epsilon: 0.00000001

model:
    vae:
        pretrained: "<path>/stable-diffusion-v1-5-fp32"

    text_encoder:
        pretrained: "<path>/stable-diffusion-v1-5-fp32"

    tokenizer:
        pretrained: "<path>/stable-diffusion-v1-5-fp32"

    scheduler:
        pretrained: "<path>/stable-diffusion-v1-5-fp32"

    unet:
        target: UNet2DConditionModel
        params:
            act_fn: "silu"
            attention_head_dim: 8
            block_out_channels: [320, 640, 1280, 1280]
            center_input_sample: False
            cross_attention_dim: 768
            down_block_types: ["CrossAttnDownBlock2D", "CrossAttnDownBlock2D", "CrossAttnDownBlock2D", "DownBlock2D"]
            downsample_padding: 1
            flip_sin_to_cos: true
            freq_shift: 0
            in_channels: 4
            layers_per_block: 2
            mid_block_scale_factor: 1
            norm_eps: 1e-05
            norm_num_groups: 32
            out_channels: 4
            sample_size: 32
            up_block_types: [
                "UpBlock2D",
                "CrossAttnUpBlock2D",
                "CrossAttnUpBlock2D",
                "CrossAttnUpBlock2D"
            ]

    use_ema: True
    mixed_precision: bf16
    gradient_checkpointing: True
    xformers: True

dataset:
    type: WebDataset
    params: 
        path: "pipe:aws s3 cp s3://s-datasets/laion5b/laion2B-data/{000000..231349}.tar -"
        batch_size: ${system.batch_size}
        workers: ${system.workers}
        num_examples_to_see: ${experiment.num_examples_to_see}
        resolution: 512
        text_dropout: 0.0

lr_scheduler:
    scheduler: "ConstantWithWarmup"
    params:
        learning_rate: ${optimizer.params.learning_rate}
        warmup_length: 500