salesforce / LAVIS

LAVIS - A One-stop Library for Language-Vision Intelligence
BSD 3-Clause "New" or "Revised" License

Blip-Diffusion, Performance Questions #515

Open Marcophono2 opened 12 months ago

Marcophono2 commented 12 months ago

Hello!

I started playing around with BD and I am very impressed! So far I have only tried the one-shot inference (which is of course not as good as what I know from a Dreambooth fine-tuned model, but for instant generation it is really, really impressive!). May I ask a few questions about settings and performance?

  1. Is 512x512 the recommended input (and output) image size?
  2. Can I change the scheduler? (probably not)
  3. With my RTX 4090 and a 512x512 image I get just 7.5 i/s. With an SD 1.5 model I get 40 i/s. I know this model works completely differently, so I think this performance is okay, but a confirmation would let me check this off. :)
  4. This is the most important question for me: Do I need a full fine-tuned model (10 GB) for every trained subject? If so, that would mean I have to load a fine-tuned model on the fly every time it is needed, since there is not enough VRAM to hold many different fine-tuned models in GPU memory.

Best regards Marc
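For reference, the one-shot (zero-shot subject-driven) generation described above follows the LAVIS demo roughly like the sketch below. The image path, subject words, and the exact generate() argument names follow the example notebook and are assumptions that may differ across LAVIS versions.

import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load BLIP-Diffusion together with its image/text preprocessors.
model, vis_preprocess, txt_preprocess = load_model_and_preprocess(
    "blip_diffusion", "base", device=device, is_eval=True
)

cond_image = Image.open("dog.png").convert("RGB")  # reference image (hypothetical path)

samples = {
    "cond_images": vis_preprocess["eval"](cond_image).unsqueeze(0).to(device),
    "cond_subject": [txt_preprocess["eval"]("dog")],  # subject in the reference image
    "tgt_subject": [txt_preprocess["eval"]("dog")],   # subject to render
    "prompt": [txt_preprocess["eval"]("painting by van gogh")],
}

# 50 denoising steps at 512x512, matching the defaults discussed below.
output = model.generate(
    samples,
    seed=88,
    guidance_scale=7.5,
    num_inference_steps=50,
    height=512,
    width=512,
)
output[0].save("output.png")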

dxli94 commented 12 months ago

Hi Marc, thanks for your kind words.

  1. Yes. And you can also use height, width > 512.
  2. Yes, other schedulers also work (see the sketch after this list).
  3. Could you please clarify a bit: do you mean generating 7.5 images per second?
  4. We fine-tuned the entire U-Net in our experiments. To generate different subjects, you need to fine-tune for each one. More VRAM-friendly fine-tuning approaches, such as LoRA or adapters, may be applicable, but we have not investigated this thoroughly.
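For point 2, a minimal sketch of swapping the sampler, assuming the model exposes its inference scheduler as a standard diffusers scheduler object (the attribute name model.scheduler is a guess; check the BlipDiffusion class source for where the sampler actually lives):

from diffusers import DPMSolverMultistepScheduler

# Rebuild a faster sampler from the existing scheduler's config so the
# noise-schedule settings carry over. DPM-Solver++ typically needs ~20-25
# steps instead of 50, which would also help the throughput in question 3.
model.scheduler = DPMSolverMultistepScheduler.from_config(model.scheduler.config)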

Thanks for your interest.

Marcophono2 commented 12 months ago

Thank you for this quick response, @dxli94 !

to 3: I mean 7.5 iterations per second. Since the default is 50 inference steps, it takes about 7 seconds (50 / 7.5 ≈ 6.7 s) to generate a 512x512 image.

to 4: You mean one full fine-tuned model per subject? Or can there be more than one subject in one fine-tuned model?

Marcophono2 commented 12 months ago

@dxli94 And sorry for this added question: how can I disable multi-GPU support? I have 3x RTX 4090 in my server; the first two GPUs are fully loaded with other models, so I only want to use cuda:2. That works when setting

device: "cuda:2"

in finetune-db-template.yaml, but train_db.sh still stops with:

(BD2) marc@MarKI:~/Desktop/AI/BD2$ CUDA_VISIBLE_DEVICES=0 sudo ./train_db.sh
/home/marc/anaconda3/envs/BD2/lib/python3.10/site-packages/diffusers/models/cross_attention.py:30: FutureWarning: Importing from cross_attention is deprecated. Please import from diffusers.models.attention_processor instead.
  deprecate(
| distributed init (rank 0, world 1): env://
2023-09-07 03:51:14,859 [INFO] 
=====  Running Parameters    =====
2023-09-07 03:51:14,859 [INFO] {
    "amp": true,
    "batch_size_eval": 1,
    "batch_size_train": 3,
    "device": "cuda:2",
    "dist_backend": "nccl",
    "dist_url": "env://",
    "distributed": true,
    "evaluate": false,
    "gpu": 0,
    "init_lr": 5e-06,
    "iters_per_inner_epoch": 40,
    "lr_sched": "constant_lr",
    "max_iters": 40,
    "min_lr": 0,
    "num_workers": 4,
    "output_dir": "/home/marc/Desktop/AI/BD2/LAVIS/projects/blip-diffusion/images/dreambooth/marcophono/output",
    "rank": 0,
    "resume_ckpt_path": null,
    "runner": "runner_iter",
    "seed": 42,
    "task": "text-to-image-generation",
    "train_splits": [
        "train"
    ],
    "weight_decay": 0.01,
    "world_size": 1
}
2023-09-07 03:51:14,859 [INFO] 
======  Dataset Attributes  ======
2023-09-07 03:51:14,859 [INFO] 
======== blip_diffusion_finetune =======
2023-09-07 03:51:14,860 [INFO] {
    "build_info": {
        "images": {
            "storage": "/home/marc/Desktop/AI/BD2/LAVIS/projects/blip-diffusion/images/dreambooth/marcophono"
        },
        "subject_text": "marcophono"
    },
    "data_type": "images",
    "kw_processor": {
        "inp_vis_processor": {
            "name": "blip_diffusion_inp_image_train"
        },
        "tgt_vis_processor": {
            "name": "blip_diffusion_tgt_image_train"
        }
    },
    "text_processor": {
        "eval": {
            "name": "blip_caption"
        },
        "train": {
            "name": "blip_caption"
        }
    }
}
2023-09-07 03:51:14,860 [INFO] 
======  Model Attributes  ======
2023-09-07 03:51:14,860 [INFO] {
    "arch": "blip_diffusion",
    "load_finetuned": false,
    "load_pretrained": true,
    "model_type": "base",
    "pretrained": "https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP-Diffusion/blip-diffusion.tar.gz",
    "qformer_cross_attention_freq": 1,
    "qformer_num_query_token": 16,
    "qformer_train": false,
    "sd_pretrained_model_name_or_path": "runwayml/stable-diffusion-v1-5",
    "sd_train_text_encoder": false,
    "vae_half_precision": true,
    "vit_model": "clip_L"
}
2023-09-07 03:51:14,860 [INFO] Building datasets...
2023-09-07 03:51:16,538 [INFO] freeze vision encoder
Cannot initialize model with low cpu memory usage because `accelerate` was not found in the environment. Defaulting to `low_cpu_mem_usage=False`. It is strongly recommended to install `accelerate` for faster and less memory-intense model loading. You can do so with: 

pip install accelerate

.
Cannot initialize model with low cpu memory usage because `accelerate` was not found in the environment. Defaulting to `low_cpu_mem_usage=False`. It is strongly recommended to install `accelerate` for faster and less memory-intense model loading. You can do so with: 

pip install accelerate

.
/home/marc/anaconda3/envs/BD2/lib/python3.10/site-packages/diffusers/configuration_utils.py:215: FutureWarning: It is deprecated to pass a pretrained model name or path to `from_config`.If you were trying to load a scheduler, please use <class 'diffusers.schedulers.scheduling_ddpm.DDPMScheduler'>.from_pretrained(...) instead. Otherwise, please make sure to pass a configuration dictionary instead. This functionality will be removed in v1.0.0.
  deprecate("config-passed-as-path", "1.0.0", deprecation_message, standard_warn=False)
2023-09-07 03:51:24,900 [INFO] Loading pretrained model from /root/.cache/torch/hub/checkpoints/blip-diffusion
No ctx_embeddings_cache found in /root/.cache/torch/hub/checkpoints/blip-diffusion
2023-09-07 03:51:26,865 [INFO] Start training, max_iters=40, in total 1 inner epochs.
2023-09-07 03:51:29,196 [INFO] dataset_ratios not specified, datasets will be concatenated (map-style datasets) or chained (webdataset.DataPipeline).
2023-09-07 03:51:29,197 [INFO] Loaded 500000 records for train split from the dataset.
2023-09-07 03:51:29,206 [INFO] number of trainable parameters: 859533252
2023-09-07 03:51:29,206 [INFO] Start training epoch 0, 40 iters per inner epoch.
Traceback (most recent call last):
  File "/home/marc/Desktop/AI/BD2/LAVIS/train.py", line 103, in <module>
    main()
  File "/home/marc/Desktop/AI/BD2/LAVIS/train.py", line 99, in main
    runner.train()
  File "/home/marc/Desktop/AI/BD2/LAVIS/lavis/runners/runner_iter.py", line 99, in train
    train_stats = self.train_iters(self.cur_epoch, start_iters)
  File "/home/marc/Desktop/AI/BD2/LAVIS/lavis/runners/runner_iter.py", line 145, in train_iters
    return self.task.train_iters(
  File "/home/marc/Desktop/AI/BD2/LAVIS/lavis/tasks/base_task.py", line 144, in train_iters
    return self._train_inner_loop(
  File "/home/marc/Desktop/AI/BD2/LAVIS/lavis/tasks/base_task.py", line 222, in _train_inner_loop
    loss, loss_dict = self.train_step(model=model, samples=samples)
  File "/home/marc/Desktop/AI/BD2/LAVIS/lavis/tasks/base_task.py", line 64, in train_step
    output = model(samples)
  File "/home/marc/anaconda3/envs/BD2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/marc/anaconda3/envs/BD2/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1148, in forward
    self._sync_buffers()
  File "/home/marc/anaconda3/envs/BD2/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1748, in _sync_buffers
    self._sync_module_buffers(authoritative_rank)
  File "/home/marc/anaconda3/envs/BD2/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1752, in _sync_module_buffers
    self._default_broadcast_coalesced(
  File "/home/marc/anaconda3/envs/BD2/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1775, in _default_broadcast_coalesced
    self._distributed_broadcast_coalesced(
  File "/home/marc/anaconda3/envs/BD2/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1689, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(
RuntimeError: Tensors must be CUDA and dense
Exception in thread Thread-1 (_pin_memory_loop):
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 607181) of binary: /home/marc/anaconda3/envs/BD2/bin/python
Traceback (most recent call last):
  File "/home/marc/anaconda3/envs/BD2/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/marc/anaconda3/envs/BD2/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/marc/anaconda3/envs/BD2/lib/python3.10/site-packages/torch/distributed/run.py", line 798, in <module>
    main()
  File "/home/marc/anaconda3/envs/BD2/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/marc/anaconda3/envs/BD2/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/marc/anaconda3/envs/BD2/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/marc/anaconda3/envs/BD2/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/marc/anaconda3/envs/BD2/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/home/marc/Desktop/AI/BD2/LAVIS/train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-09-07_03:51:31
  host      : MarKI
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 607181)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
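The RuntimeError: Tensors must be CUDA and dense is raised while the DistributedDataParallel wrapper broadcasts buffers: the run is launched in distributed mode ("distributed": true, rank 0 pinned to "gpu": 0) while the config asks for "device": "cuda:2", so the wrapper and the model disagree about the device. Note also that CUDA_VISIBLE_DEVICES=0 sudo ./train_db.sh most likely never reaches the training process, because sudo resets the environment by default (passing it through with sudo env CUDA_VISIBLE_DEVICES=2 ./train_db.sh would avoid that). The usual workaround is to make only the desired GPU visible before torch initializes, so it appears as cuda:0 inside the process, and to set device: "cuda:0" in the YAML. A minimal sketch, assuming you can edit the top of train.py (the environment variable can equally be exported in train_db.sh):

import os

# Must run before torch/CUDA is initialized: expose only the third
# physical GPU, which the process will then see as cuda:0.
os.environ["CUDA_VISIBLE_DEVICES"] = "2"

import torch

assert torch.cuda.device_count() == 1  # only the selected GPU is visible
device = torch.device("cuda:0")        # physically the third RTX 4090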