hpcaitech / ColossalAI

Making large AI models cheaper, faster and more accessible
https://www.colossalai.org
Apache License 2.0
38.7k stars · 4.34k forks

There are too many code bugs [BUG] #4008

Open wangmiaowei opened 1 year ago

wangmiaowei commented 1 year ago

🐛 Describe the bug

There are too many problems in the code; I suggest re-reviewing and maintaining it.

Environment

No response

flybird11111 commented 1 year ago

Could you please provide more details about the errors?

wangmiaowei commented 1 year ago

Firstly, in examples/images/diffusion/configs/Teyvat/train_colossalai_teyvat.yaml, if I change use_ema: True, I get an error like:

/opt/conda/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:136: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
Epoch 0:   1%| | 1/105 [00:12<21:38, 12.49s/it, loss=0.852, v_num=0, train/loss_simple_step=0.852, train/loss_v
/opt/conda/lib/python3.7/site-packages/lightning/pytorch/strategies/ddp.py:437: UserWarning: Error handling mechanism for deadlock detection is uninitialized. Skipping check.
  rank_zero_warn("Error handling mechanism for deadlock detection is uninitialized. Skipping check.")
Summoning checkpoint.
Traceback (most recent call last):
  File "/root/programs_wmw/sd_train/ColossalAI-main/examples/images/diffusion/main.py", line 847, in <module>
    trainer.fit(model, data)
  File "/opt/conda/lib/python3.7/site-packages/lightning/pytorch/trainer/trainer.py", line 609, in fit
    self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
  File "/opt/conda/lib/python3.7/site-packages/lightning/pytorch/trainer/call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/lightning/pytorch/trainer/trainer.py", line 650, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/opt/conda/lib/python3.7/site-packages/lightning/pytorch/trainer/trainer.py", line 1103, in _run
    results = self._run_stage()
  File "/opt/conda/lib/python3.7/site-packages/lightning/pytorch/trainer/trainer.py", line 1182, in _run_stage
Summoning checkpoint.
    self._run_train()
  File "/opt/conda/lib/python3.7/site-packages/lightning/pytorch/trainer/trainer.py", line 1205, in _run_train
    self.fit_loop.run()
  File "/opt/conda/lib/python3.7/site-packages/lightning/pytorch/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/lightning/pytorch/loops/fit_loop.py", line 267, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/opt/conda/lib/python3.7/site-packages/lightning/pytorch/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/lightning/pytorch/loops/epoch/training_epoch_loop.py", line 230, in advance
    self.trainer._call_lightning_module_hook("on_train_batch_end", batch_end_outputs, batch, batch_idx)
  File "/opt/conda/lib/python3.7/site-packages/lightning/pytorch/trainer/trainer.py", line 1347, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/root/programs_wmw/sd_train/ColossalAI-main/examples/images/diffusion/ldm/models/diffusion/ddpm.py", line 500, in on_train_batch_end
    self.model_ema(self.model)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/programs_wmw/sd_train/ColossalAI-main/examples/images/diffusion/ldm/modules/ema.py", line 46, in forward
    shadow_params[sname].sub_(one_minus_decay * (shadow_params[sname] - m_param[key]))
AttributeError: 'NoneType' object has no attribute 'sub_'
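
For reference, the failing update in ldm/modules/ema.py boils down to roughly the following (a simplified sketch, not the exact code); the AttributeError means one of the EMA shadow parameters is None instead of a tensor:

    import torch

    def ema_update(shadow_params, model_params, decay=0.9999):
        # shadow_params: name -> EMA copy of each weight; model_params: name -> live weight.
        one_minus_decay = 1.0 - decay
        with torch.no_grad():
            for name, param in model_params.items():
                shadow = shadow_params[name]
                if shadow is None:
                    # This is the situation that produces
                    # "AttributeError: 'NoneType' object has no attribute 'sub_'".
                    raise RuntimeError(f"EMA shadow parameter for {name} was never initialized")
                shadow.sub_(one_minus_decay * (shadow - param))
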
wangmiaowei commented 1 year ago

Besides, in your README you say that I need to use xformers 0.0.12. However, in the code at https://github.com/hpcaitech/ColossalAI/blob/d4fb7bfda7a2da5480e1187e8d3e40884b42ba11/applications/Chat/coati/kernels/opt_attn.py#L72

you use attn_bias, which is not available in xformers 0.0.12:

        attn_output = xops.memory_efficient_attention(query_states,
                                                      key_states,
                                                      value_states,
                                                      attn_bias=xops.LowerTriangularMask(),
                                                      p=self.dropout if self.training else 0.0,
                                                      scale=self.scaling)
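
A feature check along these lines would at least fail with a clear message instead of an AttributeError (just a sketch around the snippet above, not the repo's actual code):

    import xformers.ops as xops

    # LowerTriangularMask / attn_bias only exist in newer xformers releases,
    # so check for them instead of assuming the API described in the README for 0.0.12.
    if hasattr(xops, "LowerTriangularMask"):
        attn_output = xops.memory_efficient_attention(query_states,
                                                      key_states,
                                                      value_states,
                                                      attn_bias=xops.LowerTriangularMask(),
                                                      p=self.dropout if self.training else 0.0,
                                                      scale=self.scaling)
    else:
        raise ImportError("installed xformers is too old for LowerTriangularMask; please upgrade xformers")
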
wangmiaowei commented 1 year ago

In your reference README file https://github.com/hpcaitech/ColossalAI/tree/d4fb7bfda7a2da5480e1187e8d3e40884b42ba11/examples/images/diffusion#:~:text=Teyvat/train_colossalai_teyvat.yaml-,Inference,logdir/checkpoints/last.ckpt%20%5C%0A%20%20%20%20%2D%2Dconfig%20/path/to/logdir/configs/project.yaml%20%20%5C,-usage%3A%20txt2img.py

the output project file 2023-06-15T16-43-07-project.yaml does not contain any "target" key, which is required by txt2img.py, and this error occurs:

raise KeyError("Expected key `target` to instantiate.")
KeyError: 'Expected key `target` to instantiate.'
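
For context, txt2img.py instantiates the model from the config roughly like this (a sketch of ldm.util.instantiate_from_config; the exact code may differ slightly), which is why a project.yaml without a `target` key fails:

    import importlib

    def get_obj_from_str(string):
        module, cls = string.rsplit(".", 1)
        return getattr(importlib.import_module(module), cls)

    def instantiate_from_config(config):
        # Every instantiable section of project.yaml needs a `target` key,
        # e.g. target: ldm.models.diffusion.ddpm.LatentDiffusion
        if "target" not in config:
            raise KeyError("Expected key `target` to instantiate.")
        return get_obj_from_str(config["target"])(**config.get("params", dict()))
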
wangmiaowei commented 1 year ago

Another question: have you tried to fine-tune Stable Diffusion v1.5? In this fine-tuning code I only see you use the SD v2.0 checkpoint. If I use v1.5-pruned.ckpt, a mismatch error occurs.

wangmiaowei commented 1 year ago

In fact, this repo has lots of conflicts and errors. I hope your group carefully reviews the whole thing.

flybird11111 commented 1 year ago

Thank you. We will address these issues.

flybird11111 commented 1 year ago

Besides, in your README you say that I need to use xformers 0.0.12. However, in the code at

https://github.com/hpcaitech/ColossalAI/blob/d4fb7bfda7a2da5480e1187e8d3e40884b42ba11/applications/Chat/coati/kernels/opt_attn.py#L72

you use attn_bias, which is not available in xformers 0.0.12:

        attn_output = xops.memory_efficient_attention(query_states,
                                                      key_states,
                                                      value_states,
                                                      attn_bias=xops.LowerTriangularMask(),
                                                      p=self.dropout if self.training else 0.0,
                                                      scale=self.scaling)

If you want to run the Chat application, you can upgrade your version of xformers.

zhangvia commented 1 year ago

Apart from the xformers issue and the missing v1.5 config file, multi-node multi-GPU training with the examples/images/diffusion code has basically never worked for me. It always throws all kinds of strange errors, such as socket timeout, core dump, and illegal instruction. Sometimes even single-node multi-GPU training reports socket timeout. Judging from the test results you published, you must have tested successfully on multiple nodes and GPUs, so why does the repository code have so many bugs? That said, the memory savings and speedup really are attractive, and I would very much like to try training SD with ColossalAI on multiple nodes and GPUs, but these bugs are really discouraging. I suggest re-testing the code before uploading it. @jiangmingyan

wangmiaowei commented 1 year ago

@zhangvia I'm now looking at MosaicML's diffusion repo: https://github.com/mosaicml/diffusion/blob/main/diffusion/train.py. At least it works!

zhangvia commented 1 year ago

@zhangvia I'm now looking at MosaicML's diffusion repo: https://github.com/mosaicml/diffusion/blob/main/diffusion/train.py. At least it works!

The diffusers code also works, but its memory consumption is too high: for 512x512 images the batch size can only be set to 1... How is the memory consumption of this MosaicML code, and can it run multi-node multi-GPU?

wangmiaowei commented 1 year ago

@zhangvia It can run on a 3090... I've run it single-node multi-GPU.

zhangvia commented 1 year ago

@wangmiaowei OK, I'll try tuning this ColossalAI setup a bit more; if it really doesn't work I'll switch to MosaicML. I don't know why, but ColossalAI really does cut memory usage a lot and it is fast too; it just has too many bugs. Being able to run 512x512 with batch size 16 on a 4090 is genuinely impressive. Too bad there are so many bugs.

tiandiao123 commented 1 year ago

Another question: have you tried to fine-tune Stable Diffusion v1.5? In this fine-tuning code I only see you use the SD v2.0 checkpoint. If I use v1.5-pruned.ckpt, a mismatch error occurs.

@wangmiaowei we updated the training process using the new Booster API: https://github.com/hpcaitech/ColossalAI/tree/feature/stable-diffusion/applications/stable-diffusion/text_img2img. You can check this new branch for the Stable Diffusion update. We re-trained Stable Diffusion v1.4, but training a v1.5 version should be quite similar: you only need to change the model name and update the fine-tuning dataset in the bash script, and then you can train your model.
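
Roughly, the Booster-based setup looks like this (a sketch assuming the GeminiPlugin; the actual script in that branch may differ, and the model/optimizer below are stand-ins):

    import torch
    import colossalai
    from colossalai.booster import Booster
    from colossalai.booster.plugin import GeminiPlugin

    colossalai.launch_from_torch(config={})          # run with torchrun / colossalai run

    model = torch.nn.Linear(8, 8).cuda()             # stand-in for the diffusion model
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    criterion = torch.nn.MSELoss()

    # The GeminiPlugin takes care of ZeRO/Gemini memory management under the hood.
    booster = Booster(plugin=GeminiPlugin())
    model, optimizer, criterion, _, _ = booster.boost(model, optimizer, criterion)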

Thomas2419 commented 1 year ago

Would the current docker container support this new branch? @tiandiao123

tiandiao123 commented 1 year ago

Would the current docker container support this new branch? @tiandiao123

Not yet, but we can make one!

zhangvia commented 1 year ago

Another question: have you tried to fine-tune Stable Diffusion v1.5? In this fine-tuning code I only see you use the SD v2.0 checkpoint. If I use v1.5-pruned.ckpt, a mismatch error occurs.

@wangmiaowei we updated the training process using the new Booster API: https://github.com/hpcaitech/ColossalAI/tree/feature/stable-diffusion/applications/stable-diffusion/text_img2img. You can check this new branch for the Stable Diffusion update. We re-trained Stable Diffusion v1.4, but training a v1.5 version should be quite similar: you only need to change the model name and update the fine-tuning dataset in the bash script, and then you can train your model.

Where is the environment.yaml in the new branch? Is it the same as the main repo's examples/images/diffusion/environment.yaml? I didn't see the diffusers library in that environment.yaml, yet you use diffusers in the new training script in the new branch. What is the exact version of diffusers for your new training scripts? Or could you please share the new environment.yaml file?

wangmiaowei commented 1 year ago

@zhangvia Have you tried it? How did it go?

wangmiaowei commented 1 year ago

@tiandiao123 I have tried this new branch, but you simply removed the EMA feature. Did you hit a bug there that has not been fixed yet?

zhangvia commented 1 year ago

@zhangvia Have you tried it? How did it go?

I tried it and it runs: with the ColossalAI version from that branch plus diffusers 0.17.1 it works, and the results are similar to the previous training code. As for EMA, I don't think it was removed; it's just that the line from diffusers.training_utils import EMAModel was deleted, so adding it back should be enough, though I haven't tried. The code in this branch also has some problems; it looks like a half-finished product.
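
If someone wants to try restoring it, the diffusers side would look roughly like this (an untested sketch assuming diffusers 0.17.1; unet stands for whatever model the script trains):

    from diffusers.training_utils import EMAModel

    # Keep an exponential moving average of the UNet weights.
    ema_unet = EMAModel(unet.parameters(), decay=0.9999)

    # Inside the training loop, after each optimizer.step():
    ema_unet.step(unet.parameters())

    # Before saving or running inference, copy the averaged weights back.
    ema_unet.copy_to(unet.parameters())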

wangmiaowei commented 1 year ago

@zhangvia Indeed, it's a half-finished product; the moving-average feature was simply chopped off.

zhangvia commented 1 year ago

The data loading in this new branch also has a problem: with multiple GPUs the dataloader is not sharded, and passing the dataloader into booster.boost doesn't help either, it is still not sharded. I looked at the boost source code, and it seems the GeminiPlugin simply doesn't shard the dataloader. @tiandiao123
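
A possible workaround (an untested sketch; train_dataset and batch_size are placeholders from your own script) is to shard the data yourself with a DistributedSampler instead of relying on booster.boost:

    import torch.distributed as dist
    from torch.utils.data import DataLoader
    from torch.utils.data.distributed import DistributedSampler

    # Give each rank its own shard of the dataset.
    sampler = DistributedSampler(train_dataset,
                                 num_replicas=dist.get_world_size(),
                                 rank=dist.get_rank(),
                                 shuffle=True)
    train_dataloader = DataLoader(train_dataset, batch_size=batch_size, sampler=sampler)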

densechen commented 1 year ago

Whether using the lightning-colossalai strategy shipped with PyTorch Lightning 2.0 or the Colossal-AI diffusion example provided here, I spent a huge amount of effort debugging the code. It does run, and the loss seems to decrease normally, but I never got a correct result. I'm giving up.

Also, in the configuration file:

    # scheduler_config: # 10000 warmup steps
    #   warm_up_steps: [ 1 ] # NOTE for resuming. use 10000 if starting from scratch
    #   cycle_lengths: [ 10000000000000 ] # incredibly large number to prevent corner cases
    #   f_start: [ 1.e-6 ]
    #   f_max: [ 1.e-4 ]
    #   f_min: [ 1.e-10 ]

The initial learning rate is 1e-4; with f values this small, the learning rate ends up in the 1e-8 range. Can these parameters really train effectively?
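
Concretely, assuming the scheduler multiplies the base learning rate by f (as the LambdaLR-style schedulers in the latent-diffusion configs do):

    base_lr = 1.0e-4      # initial learning rate from the config
    f_max = 1.0e-4        # from the commented-out scheduler_config above
    effective_lr = base_lr * f_max
    print(effective_lr)   # 1e-08, far too small to make real progress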

@wangmiaowei Have you trained with mosaicml? Can it produce correct results in a multi-node multi-GPU setup?

Youngon commented 10 months ago

...core dump, and illegal instruction. Sometimes even single-node multi-GPU training reports socket timeout. Judging from the test results you published, you must have tested successfully on multiple nodes and GPUs, so why does the repository code have so many bugs? That said, the memory savings and speedup really are attractive, and I would very much like to try training SD with ColossalAI on multiple nodes and GPUs, but these bugs are really discouraging. I suggest re-testing the code before uploading it. @jiangmingya

Did you manage to get multi-GPU DDP running? Mine always gets stuck, with no error reported, which is bizarre.

Youngon commented 10 months ago

@zhangvia Have you tried it? How did it go?

I tried it and it runs: with the ColossalAI version from that branch plus diffusers 0.17.1 it works, and the results are similar to the previous training code. As for EMA, I don't think it was removed; it's just that the line from diffusers.training_utils import EMAModel was deleted, so adding it back should be enough, though I haven't tried. The code in this branch also has some problems; it looks like a half-finished product.

Which branch did you use? Were you running DDP?

flybird11111 commented 10 months ago

...core dump, and illegal instruction. Sometimes even single-node multi-GPU training reports socket timeout. Judging from the test results you published, you must have tested successfully on multiple nodes and GPUs, so why does the repository code have so many bugs? That said, the memory savings and speedup really are attractive, and I would very much like to try training SD with ColossalAI on multiple nodes and GPUs, but these bugs are really discouraging. I suggest re-testing the code before uploading it. @jiangmingya

Did you manage to get multi-GPU DDP running? Mine always gets stuck, with no error reported, which is bizarre.

Which script were you running when it got stuck?

Youngon commented 10 months ago

...core dump, and illegal instruction. Sometimes even single-node multi-GPU training reports socket timeout. Judging from the test results you published, you must have tested successfully on multiple nodes and GPUs, so why does the repository code have so many bugs? That said, the memory savings and speedup really are attractive, and I would very much like to try training SD with ColossalAI on multiple nodes and GPUs, but these bugs are really discouraging. I suggest re-testing the code before uploading it. @jiangmingya

Did you manage to get multi-GPU DDP running? Mine always gets stuck, with no error reported, which is bizarre.

Which script were you running when it got stuck?

examples/images/diffusion train_ddp.sh (the config file needs to be modified and that part removed before it runs normally)

Youngon commented 8 months ago

In your reference README file https://github.com/hpcaitech/ColossalAI/tree/d4fb7bfda7a2da5480e1187e8d3e40884b42ba11/examples/images/diffusion#:~:text=Teyvat/train_colossalai_teyvat.yaml-,Inference,logdir/checkpoints/last.ckpt%20%5C%0A%20%20%20%20%2D%2Dconfig%20/path/to/logdir/configs/project.yaml%20%20%5C,-usage%3A%20txt2img.py

the output project file 2023-06-15T16-43-07-project.yaml does not contain any "target" key, which is required by txt2img.py, and this error occurs:

raise KeyError("Expected key `target` to instantiate.")
KeyError: 'Expected key `target` to instantiate.'

Is this issue solved? I have finished training but can't run inference.

Ly403 commented 6 months ago

It's absurd that something as basic as a missing `target` key in the configuration file can even happen. And why change the logic of the original stable diffusion repository? Previously cond_stage_config could select the conditional encoder through the config file, but now it is hard-coded directly in the Python file. And even after changing it like that, why leave it half done? The `target` under `model` still has to be filled in.
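
For reference, in the upstream Stable Diffusion configs the conditional encoder is selected from the config like this (a v2.x example quoted from memory; exact params vary):

    model:
      params:
        cond_stage_config:
          target: ldm.modules.encoders.modules.FrozenOpenCLIPEmbedder
          params:
            freeze: True
            layer: "penultimate"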
