Marcophono2 opened this issue 1 year ago
Hi Marc, thanks for your kind words and for your interest.
Thank you for this quick response, @dxli94 !
Re 3: I mean 7.5 iterations per second. Since the default is 50 iterations, it takes about 7 seconds to generate a 512x512 image.
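The arithmetic behind that estimate, using the figures quoted above:

```python
# Sanity check of the quoted timing, using the numbers from the post.
default_steps = 50   # default number of inference iterations
its_per_sec = 7.5    # measured iteration rate
seconds = default_steps / its_per_sec
print(round(seconds, 1))  # ~6.7, i.e. roughly 7 seconds per 512x512 image
```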
Re 4: Do you mean one fully fine-tuned model per subject? Or can there be more than one subject in a single fine-tuned model?
@dxli94 And sorry for one more question: how can I disable multi-GPU support? I have 3x RTX 4090 in my server; the first two GPUs are fully loaded with other models, so I only want to use cuda:2. That works when setting
device: "cuda:2"
in finetune-db-template.yaml, but train_db.sh still crashes with:
(BD2) marc@MarKI:~/Desktop/AI/BD2$ CUDA_VISIBLE_DEVICES=0 sudo ./train_db.sh
/home/marc/anaconda3/envs/BD2/lib/python3.10/site-packages/diffusers/models/cross_attention.py:30: FutureWarning: Importing from cross_attention is deprecated. Please import from diffusers.models.attention_processor instead.
deprecate(
| distributed init (rank 0, world 1): env://
2023-09-07 03:51:14,859 [INFO]
===== Running Parameters =====
2023-09-07 03:51:14,859 [INFO] {
"amp": true,
"batch_size_eval": 1,
"batch_size_train": 3,
"device": "cuda:2",
"dist_backend": "nccl",
"dist_url": "env://",
"distributed": true,
"evaluate": false,
"gpu": 0,
"init_lr": 5e-06,
"iters_per_inner_epoch": 40,
"lr_sched": "constant_lr",
"max_iters": 40,
"min_lr": 0,
"num_workers": 4,
"output_dir": "/home/marc/Desktop/AI/BD2/LAVIS/projects/blip-diffusion/images/dreambooth/marcophono/output",
"rank": 0,
"resume_ckpt_path": null,
"runner": "runner_iter",
"seed": 42,
"task": "text-to-image-generation",
"train_splits": [
"train"
],
"weight_decay": 0.01,
"world_size": 1
}
2023-09-07 03:51:14,859 [INFO]
====== Dataset Attributes ======
2023-09-07 03:51:14,859 [INFO]
======== blip_diffusion_finetune =======
2023-09-07 03:51:14,860 [INFO] {
"build_info": {
"images": {
"storage": "/home/marc/Desktop/AI/BD2/LAVIS/projects/blip-diffusion/images/dreambooth/marcophono"
},
"subject_text": "marcophono"
},
"data_type": "images",
"kw_processor": {
"inp_vis_processor": {
"name": "blip_diffusion_inp_image_train"
},
"tgt_vis_processor": {
"name": "blip_diffusion_tgt_image_train"
}
},
"text_processor": {
"eval": {
"name": "blip_caption"
},
"train": {
"name": "blip_caption"
}
}
}
2023-09-07 03:51:14,860 [INFO]
====== Model Attributes ======
2023-09-07 03:51:14,860 [INFO] {
"arch": "blip_diffusion",
"load_finetuned": false,
"load_pretrained": true,
"model_type": "base",
"pretrained": "https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP-Diffusion/blip-diffusion.tar.gz",
"qformer_cross_attention_freq": 1,
"qformer_num_query_token": 16,
"qformer_train": false,
"sd_pretrained_model_name_or_path": "runwayml/stable-diffusion-v1-5",
"sd_train_text_encoder": false,
"vae_half_precision": true,
"vit_model": "clip_L"
}
2023-09-07 03:51:14,860 [INFO] Building datasets...
2023-09-07 03:51:16,538 [INFO] freeze vision encoder
Cannot initialize model with low cpu memory usage because `accelerate` was not found in the environment. Defaulting to `low_cpu_mem_usage=False`. It is strongly recommended to install `accelerate` for faster and less memory-intense model loading. You can do so with:
pip install accelerate
.
Cannot initialize model with low cpu memory usage because `accelerate` was not found in the environment. Defaulting to `low_cpu_mem_usage=False`. It is strongly recommended to install `accelerate` for faster and less memory-intense model loading. You can do so with:
pip install accelerate
.
/home/marc/anaconda3/envs/BD2/lib/python3.10/site-packages/diffusers/configuration_utils.py:215: FutureWarning: It is deprecated to pass a pretrained model name or path to `from_config`.If you were trying to load a scheduler, please use <class 'diffusers.schedulers.scheduling_ddpm.DDPMScheduler'>.from_pretrained(...) instead. Otherwise, please make sure to pass a configuration dictionary instead. This functionality will be removed in v1.0.0.
deprecate("config-passed-as-path", "1.0.0", deprecation_message, standard_warn=False)
2023-09-07 03:51:24,900 [INFO] Loading pretrained model from /root/.cache/torch/hub/checkpoints/blip-diffusion
No ctx_embeddings_cache found in /root/.cache/torch/hub/checkpoints/blip-diffusion
2023-09-07 03:51:26,865 [INFO] Start training, max_iters=40, in total 1 inner epochs.
2023-09-07 03:51:29,196 [INFO] dataset_ratios not specified, datasets will be concatenated (map-style datasets) or chained (webdataset.DataPipeline).
2023-09-07 03:51:29,197 [INFO] Loaded 500000 records for train split from the dataset.
2023-09-07 03:51:29,206 [INFO] number of trainable parameters: 859533252
2023-09-07 03:51:29,206 [INFO] Start training epoch 0, 40 iters per inner epoch.
Traceback (most recent call last):
File "/home/marc/Desktop/AI/BD2/LAVIS/train.py", line 103, in <module>
main()
File "/home/marc/Desktop/AI/BD2/LAVIS/train.py", line 99, in main
runner.train()
File "/home/marc/Desktop/AI/BD2/LAVIS/lavis/runners/runner_iter.py", line 99, in train
train_stats = self.train_iters(self.cur_epoch, start_iters)
File "/home/marc/Desktop/AI/BD2/LAVIS/lavis/runners/runner_iter.py", line 145, in train_iters
return self.task.train_iters(
File "/home/marc/Desktop/AI/BD2/LAVIS/lavis/tasks/base_task.py", line 144, in train_iters
return self._train_inner_loop(
File "/home/marc/Desktop/AI/BD2/LAVIS/lavis/tasks/base_task.py", line 222, in _train_inner_loop
loss, loss_dict = self.train_step(model=model, samples=samples)
File "/home/marc/Desktop/AI/BD2/LAVIS/lavis/tasks/base_task.py", line 64, in train_step
output = model(samples)
File "/home/marc/anaconda3/envs/BD2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/marc/anaconda3/envs/BD2/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1148, in forward
self._sync_buffers()
File "/home/marc/anaconda3/envs/BD2/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1748, in _sync_buffers
self._sync_module_buffers(authoritative_rank)
File "/home/marc/anaconda3/envs/BD2/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1752, in _sync_module_buffers
self._default_broadcast_coalesced(
File "/home/marc/anaconda3/envs/BD2/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1775, in _default_broadcast_coalesced
self._distributed_broadcast_coalesced(
File "/home/marc/anaconda3/envs/BD2/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1689, in _distributed_broadcast_coalesced
dist._broadcast_coalesced(
RuntimeError: Tensors must be CUDA and dense
Exception in thread Thread-1 (_pin_memory_loop):
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 607181) of binary: /home/marc/anaconda3/envs/BD2/bin/python
Traceback (most recent call last):
File "/home/marc/anaconda3/envs/BD2/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/marc/anaconda3/envs/BD2/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/marc/anaconda3/envs/BD2/lib/python3.10/site-packages/torch/distributed/run.py", line 798, in <module>
main()
File "/home/marc/anaconda3/envs/BD2/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/marc/anaconda3/envs/BD2/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/home/marc/anaconda3/envs/BD2/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/marc/anaconda3/envs/BD2/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/marc/anaconda3/envs/BD2/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/home/marc/Desktop/AI/BD2/LAVIS/train.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-09-07_03:51:31
host : MarKI
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 607181)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
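For what it's worth, a common way to pin a process to a single GPU is to restrict device visibility with CUDA_VISIBLE_DEVICES before CUDA is initialized; the remaining GPU is then renumbered, so the YAML would say device: "cuda:0" rather than "cuda:2". A minimal sketch (the environment-variable behavior is standard CUDA; whether LAVIS's launcher respects it in your setup is the part to verify):

```python
import os

# Make only physical GPU 2 visible to this process. This must happen
# before any CUDA-using library (e.g. torch) initializes the driver.
os.environ["CUDA_VISIBLE_DEVICES"] = "2"

# Anything imported after this point sees exactly one device, so a
# config of device: "cuda:0" now targets the third RTX 4090.
print(os.environ["CUDA_VISIBLE_DEVICES"])
```

Note also that in `CUDA_VISIBLE_DEVICES=0 sudo ./train_db.sh` the variable is set for sudo itself, and sudo typically resets the environment before running the script, so it may never reach the training process; `sudo CUDA_VISIBLE_DEVICES=2 ./train_db.sh` or `sudo -E` usually passes it through.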
Hello!
I started playing around with BD and I am very impressed! So far I have only tried the one-shot inference (which is of course not as good as a DreamBooth fine-tuned model, but for instant generation it is really, really impressive!). May I ask a few questions about settings and performance?
Best regards Marc