Open wangmiaowei opened 1 year ago
Bot detected the issue body's language is not English, translate it automatically. 👯👭🏻🧑🤝🧑👫🧑🏿🤝🧑🏻👩🏾🤝👨🏿👬🏿
Title: There are too many code bugs [BUG]:
Could you please provide more details about the errors?
First, in examples/images/diffusion/configs/Teyvat/train_colossalai_teyvat.yaml, if I set use_ema: True, I get the following error:
/opt/conda/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:136: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
"https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
Epoch 0: 1%| | 1/105 [00:12<21:38, 12.49s/it, loss=0.852, v_num=0, train/loss_simple_step=0.852, train/loss_v/opt/conda/lib/python3.7/site-packages/lightning/pytorch/strategies/ddp.py:437: UserWarning: Error handling mechanism for deadlock detection is uninitialized. Skipping check.
rank_zero_warn("Error handling mechanism for deadlock detection is uninitialized. Skipping check.")
Traceback (most recent call last):
File "/root/programs_wmw/sd_train/ColossalAI-main/examples/images/diffusion/main.py", line 847, in <module>
trainer.fit(model, data)
File "/opt/conda/lib/python3.7/site-packages/lightning/pytorch/trainer/trainer.py", line 609, in fit
self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
File "/opt/conda/lib/python3.7/site-packages/lightning/pytorch/trainer/call.py", line 38, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/lightning/pytorch/trainer/trainer.py", line 650, in _fit_impl
self._run(model, ckpt_path=self.ckpt_path)
File "/opt/conda/lib/python3.7/site-packages/lightning/pytorch/trainer/trainer.py", line 1103, in _run
results = self._run_stage()
File "/opt/conda/lib/python3.7/site-packages/lightning/pytorch/trainer/trainer.py", line 1182, in _run_stage
Summoning checkpoint.
self._run_train()
File "/opt/conda/lib/python3.7/site-packages/lightning/pytorch/trainer/trainer.py", line 1205, in _run_train
self.fit_loop.run()
File "/opt/conda/lib/python3.7/site-packages/lightning/pytorch/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/lightning/pytorch/loops/fit_loop.py", line 267, in advance
self._outputs = self.epoch_loop.run(self._data_fetcher)
File "/opt/conda/lib/python3.7/site-packages/lightning/pytorch/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/lightning/pytorch/loops/epoch/training_epoch_loop.py", line 230, in advance
self.trainer._call_lightning_module_hook("on_train_batch_end", batch_end_outputs, batch, batch_idx)
File "/opt/conda/lib/python3.7/site-packages/lightning/pytorch/trainer/trainer.py", line 1347, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File "/root/programs_wmw/sd_train/ColossalAI-main/examples/images/diffusion/ldm/models/diffusion/ddpm.py", line 500, in on_train_batch_end
self.model_ema(self.model)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/root/programs_wmw/sd_train/ColossalAI-main/examples/images/diffusion/ldm/modules/ema.py", line 46, in forward
shadow_params[sname].sub_(one_minus_decay * (shadow_params[sname] - m_param[key]))
AttributeError: 'NoneType' object has no attribute 'sub_'
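The traceback above fails on the in-place update because the shadow entry for a parameter is None. As a rough illustration only, here is a minimal sketch of the same update rule with a guard that reports missing shadow entries instead of crashing. It uses plain floats rather than tensors, and the function name `ema_update` and the "skip and report" policy are assumptions for illustration, not the repository's code or an official fix.

```python
def ema_update(shadow, params, decay=0.9999):
    """Apply shadow -= (1 - decay) * (shadow - param) for every parameter.

    Returns the names whose shadow entry is missing (None), so the caller
    can decide whether to re-initialise them or abort.
    """
    one_minus_decay = 1.0 - decay
    missing = []
    for name, value in params.items():
        current = shadow.get(name)
        if current is None:
            # Same situation as "'NoneType' object has no attribute 'sub_'".
            missing.append(name)
            continue
        shadow[name] = current - one_minus_decay * (current - value)
    return missing


# Hypothetical values: "b" has no usable shadow copy.
shadow = {"w": 1.0, "b": None}
params = {"w": 0.0, "b": 0.5}
missing = ema_update(shadow, params, decay=0.5)
print(shadow, missing)
```

Reporting the missing names makes it easier to see which parameters lost their shadow copy before deciding how to handle them.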
Besides, your README says that I need to use xformers 0.0.12. However, the code at https://github.com/hpcaitech/ColossalAI/blob/d4fb7bfda7a2da5480e1187e8d3e40884b42ba11/applications/Chat/coati/kernels/opt_attn.py#L72
uses attn_bias, which is not available in xformers 0.0.12:
attn_output = xops.memory_efficient_attention(query_states,
                                              key_states,
                                              value_states,
                                              attn_bias=xops.LowerTriangularMask(),
                                              p=self.dropout if self.training else 0.0,
                                              scale=self.scaling)
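One way to surface this kind of version mismatch early is to check whether the installed function accepts the keyword before calling it. The sketch below uses only the standard library's `inspect.signature`. The two stand-in functions `old_attention` and `new_attention` are hypothetical and only mimic a signature without and with `attn_bias`; they are not xformers code, and the check may not work on functions implemented in C.

```python
import inspect


def accepts_keyword(func, name):
    """Return True if `func` can be called with keyword argument `name`."""
    params = inspect.signature(func).parameters
    if name in params:
        return params[name].kind in (
            inspect.Parameter.POSITIONAL_OR_KEYWORD,
            inspect.Parameter.KEYWORD_ONLY,
        )
    # A **kwargs catch-all accepts any keyword.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


def require_keyword(func, name):
    """Fail with a readable message instead of a TypeError deep in training."""
    if not accepts_keyword(func, name):
        raise RuntimeError(
            f"{func.__name__} does not accept '{name}'; "
            "the installed library version is probably too old"
        )


# Hypothetical stand-ins for an older and a newer attention API.
def old_attention(query, key, value, p=0.0):
    return "old"


def new_attention(query, key, value, attn_bias=None, p=0.0, scale=None):
    return "new"


print(accepts_keyword(old_attention, "attn_bias"))
print(accepts_keyword(new_attention, "attn_bias"))
```

Raising instead of silently dropping the argument matters here, because omitting a causal mask would change the result of the attention call.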
The output project file, 2023-06-15T16-43-07-project.yaml, does not contain any "target" key, which txt2img requires, so the following error occurs:
raise KeyError("Expected key `target` to instantiate.")
KeyError: 'Expected key `target` to instantiate.'
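A quick way to locate where the key is missing is to walk the loaded config before starting inference. The sketch below assumes the YAML has already been loaded into nested dicts and lists, and it relies on a heuristic that is an assumption rather than a rule of the codebase: any section that defines `params` is expected to define `target` too. The example config and module paths are made up.

```python
def find_missing_targets(config, path="root"):
    """Return paths of dict nodes that define `params` but lack `target`."""
    missing = []
    if isinstance(config, dict):
        if "params" in config and "target" not in config:
            missing.append(path)
        for key, value in config.items():
            missing.extend(find_missing_targets(value, f"{path}.{key}"))
    elif isinstance(config, list):
        for index, item in enumerate(config):
            missing.extend(find_missing_targets(item, f"{path}[{index}]"))
    return missing


# Made-up config: "model" and "first_stage_config" lack a `target` entry.
example = {
    "model": {
        "params": {
            "unet_config": {"target": "some.module.UNet", "params": {"in_channels": 4}},
            "first_stage_config": {"params": {"embed_dim": 4}},
        }
    }
}
print(find_missing_targets(example))
```

The reported paths show which sections would need a `target` entry added by hand before the project file can be used.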
Another question: have you tried fine-tuning Stable Diffusion v1.5? In this fine-tuning code, I only see sd v2.0.ckpt being used. If I use v1.5-pruned.ckpt, a mismatch error occurs as well.
In fact, this repo has a lot of conflicts and errors. I hope your group will carefully check all of it.
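To pin down where a checkpoint mismatch like the one mentioned above comes from, it can help to compare parameter names and shapes between the checkpoint and the model. The sketch below works on plain `{name: shape}` dictionaries. How those dictionaries are built from a real checkpoint is left out, and the names and shapes in the example are illustrative only, not taken from an actual checkpoint.

```python
def diff_shapes(checkpoint_shapes, model_shapes):
    """Compare two {name: shape} mappings and group the differences."""
    ckpt_names = set(checkpoint_shapes)
    model_names = set(model_shapes)
    return {
        "missing_in_checkpoint": sorted(model_names - ckpt_names),
        "unexpected_in_checkpoint": sorted(ckpt_names - model_names),
        "shape_mismatch": sorted(
            name
            for name in ckpt_names & model_names
            if tuple(checkpoint_shapes[name]) != tuple(model_shapes[name])
        ),
    }


# Illustrative names and shapes only.
checkpoint = {"attn.to_k.weight": (320, 768), "old_only.bias": (4,)}
model = {"attn.to_k.weight": (320, 1024), "new_only.bias": (4,)}
report = diff_shapes(checkpoint, model)
print(report)
```

A report dominated by shape mismatches usually points at a config written for a different model version, while many missing or unexpected names point at a different architecture or key prefix.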
Thank you. We will address these issues.
Regarding the README asking for xformers 0.0.12 while the code passes attn_bias to xops.memory_efficient_attention: if you want to run the chat application, you can upgrade your xformers version.
Apart from the xformers problem and the missing v1.5 config file, multi-node multi-GPU training with the examples/images/diffusion code has basically never succeeded for me. It always fails with odd errors such as socket timeout, core dump, or illegal instruction. Sometimes even single-node multi-GPU training reports a socket timeout. Judging from the benchmark results you published, you must have tested successfully on multiple nodes and GPUs, so why does the code in the repository have so many bugs? The GPU memory savings and the speedup are genuinely attractive, and I really want to try training SD with ColossalAI on multiple nodes, but these bugs are a real deterrent. I suggest retesting the code before uploading it again. @jiangmingyan
@zhangvia I'm now looking at the MosaicML diffusion code: https://github.com/mosaicml/diffusion/blob/main/diffusion/train.py At least it works!
The diffusers code also works, but its GPU memory consumption is too high: with 512-pixel images the batch size can only be set to 1. How is the memory consumption of the MosaicML code, and can it run on multiple nodes with multiple GPUs?
@zhangvia It runs on a 3090. I have run it on a single machine with multiple GPUs.
@wangmiaowei OK, I'll keep tuning ColossalAI for a while and switch to MosaicML if it really doesn't work. I don't know why, but ColossalAI really does cut GPU memory usage a lot, and it is fast; there are just too many bugs. On a 4090 it can run 512-pixel images with a batch size of 16, which is impressive. It's a pity about the bugs.
@wangmiaowei Regarding fine-tuning Stable Diffusion v1.5: we updated the training process to use the new Booster API in the branch at https://github.com/hpcaitech/ColossalAI/tree/feature/stable-diffusion/applications/stable-diffusion/text_img2img, which contains the Stable Diffusion update. We re-trained Stable Diffusion v1.4, but I think training v1.5 is quite similar: you only need to change the model name and update the fine-tuning dataset in the bash script. Then you can train your model automatically.
Would the current docker container support this new branch? @tiandiao123
Not yet, but we can make one!
Where is the environment.yaml in the new branch? Is it the same as examples/images/diffusion/environment.yaml in the main repo? I didn't see the diffusers library in that environment.yaml, yet the new training script in the new branch uses diffusers. What exact version of diffusers do your new training scripts use? Or could you please share the new environment.yaml file?
@zhangvia Did you try it? How did it go?
@tiandiao123 I have tried this new branch, but you removed the EMA feature outright. Did you run into a bug that hasn't been resolved yet?
I tried it and it runs. Using that branch's ColossalAI version together with diffusers 0.17.1 works, and the results are about the same as with the previous training code. As far as I can tell, EMA has not been removed; only the line from diffusers.training_utils import EMAModel was deleted, so adding it back should probably be enough, though I haven't tried. The code in this branch also has some problems; it looks half-finished.
@zhangvia Indeed, it is half-finished; the moving average feature was simply cut.
The data loading in this new branch's code is broken: with multiple GPUs, the dataloader is not sharded. Passing the dataloader to booster.boost doesn't help either; it still isn't sharded. I looked at the source code of boost, and it seems that GeminiPlugin does not shard the dataloader at all. @tiandiao123
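For reference, "sharding" here means that each rank should iterate over its own subset of the dataset rather than the full set. The sketch below shows one common strided partition in plain Python. It is only meant to illustrate the expected behaviour: `shard_indices` is a hypothetical helper, not ColossalAI or PyTorch code, and the padding-by-wrap-around policy is an assumption.

```python
def shard_indices(num_samples, world_size, rank):
    """Return the sample indices that `rank` should process.

    Indices are padded by wrapping around so every rank gets the same
    count, then assigned with a stride of `world_size`.
    """
    if num_samples <= 0:
        raise ValueError("num_samples must be positive")
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    per_rank = -(-num_samples // world_size)  # ceiling division
    total = per_rank * world_size
    indices = [i % num_samples for i in range(total)]
    return indices[rank:total:world_size]


# 10 samples on 4 ranks: each rank sees 3 indices instead of all 10.
shards = [shard_indices(10, 4, rank) for rank in range(4)]
for rank, shard in enumerate(shards):
    print(rank, shard)
```

If every rank instead reports the full index range in a check like this, the dataloader is not being sharded, and each epoch processes the whole dataset once per GPU.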
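Whether with the lightning-colossalai integration provided by PyTorch Lightning 2.0 or the colo diffusion code provided here, I spent a great deal of effort debugging. Although I got it to run and the loss seemed to decrease normally, I never obtained a correct result. I've given up.
Also, in the config file: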
# scheduler_config: # 10000 warmup steps
# warm_up_steps: [ 1 ] # NOTE for resuming. use 10000 if starting from scratch
# cycle_lengths: [ 10000000000000 ] # incredibly large number to prevent corner cases
# f_start: [ 1.e-6 ]
# f_max: [ 1.e-4 ]
# f_min: [ 1.e-10 ]
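The initial learning rate is 1e-4, and with such a small f on top of it, the learning rate ends up in the 1e-8 range. Can these parameters really train effectively?
@wangmiaowei Have you trained with MosaicML? Can it produce correct results in a multi-node multi-GPU environment?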
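The arithmetic behind the learning-rate question can be sketched as follows, assuming the schedule value f is used as a multiplier on the base learning rate. The shape of the schedule (linear warm-up from f_start to f_max, then linear decay toward f_min over cycle_length) is an assumption about what the quoted fields mean, not a copy of the repository's scheduler. Note also that the quoted config lines are commented out, so whether they take effect depends on the config actually used.

```python
def lr_multiplier(step, warm_up_steps, f_start, f_max, f_min, cycle_length):
    """Assumed schedule: linear warm-up, then linear decay toward f_min."""
    if step < warm_up_steps:
        return f_start + (f_max - f_start) * step / warm_up_steps
    remaining = max(cycle_length - step, 0)
    return f_min + (f_max - f_min) * remaining / cycle_length


base_lr = 1.0e-4  # initial learning rate mentioned in the comment above

# Values from the quoted (commented-out) scheduler_config.
f = lr_multiplier(
    step=1,
    warm_up_steps=1,
    f_start=1.0e-6,
    f_max=1.0e-4,
    f_min=1.0e-10,
    cycle_length=10_000_000_000_000,
)
effective_lr = base_lr * f
print(f, effective_lr)
```

Under these assumptions the multiplier right after warm-up is roughly f_max, so the effective learning rate is about 1e-4 * 1e-4 = 1e-8, which matches the range the comment is worried about.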
Have you managed to run multi-GPU DDP successfully? Mine always hangs without reporting any error, which is baffling.
Which branch did you use? Were you running DDP?
Which script were you running when it hung?
train_ddp.sh in examples/images/diffusion (the config file has to be modified, with part of it removed, before it runs normally)
Has the problem of the output project file (2023-06-15T16-43-07-project.yaml) lacking the "target" key that txt2img requires, which raises KeyError: 'Expected key `target` to instantiate.', been solved? I have finished training but can't run inference.
This is outrageous. How can something like a missing target in the config file even happen? And why change the logic from the original Stable Diffusion repository? Originally, cond_stage_config let you specify the conditioning encoder through the config file; now it is hard-coded in the Python file. And even with that change, why leave it half done? The target under model still has to be filled in.
🐛 Describe the bug
There are too many problems in the code; I suggest reviewing and maintaining it again.
Environment
No response