mindspore-lab / mindone

one for all, Optimal generator with No Exception
Apache License 2.0

SDXL-LoRA multi-card fine-tuning fails with: The indices has duplicate elements #456

Closed Hartmon8 closed 1 month ago

Hartmon8 commented 2 months ago

Hardware Environment

Software Environment

Describe the current behavior

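Following the LoRA fine-tuning guide, single-card training runs normally.

After switching to multiple cards, training fails with the error below.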

Finish preparing normal sample in 1 attempt(s)
Dataloader num parallel workers: [16]
scheduler_config not exist, train with base_lr 0.0001 and lr_scaler 1.0
[-1, -1, -1, -1, -1, 31, 63, 63, 223, 383, 543, 703, 863, 1023, 1055, 1087, 1119, 1119, 1119, 1119, 1119]
Traceback (most recent call last):
  File "/data/sdtest/mindone/examples/stable_diffusion_xl/train.py", line 693, in <module>
    train(args)
  File "/data/sdtest/mindone/examples/stable_diffusion_xl/train.py", line 239, in train
    ms.set_auto_parallel_context(all_reduce_fusion_config=all_reduce_fusion_config)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/_checkparam.py", line 1313, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/context.py", line 876, in set_auto_parallel_context
    _set_auto_parallel_context(**kwargs)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/_checkparam.py", line 1313, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/parallel/_auto_parallel_context.py", line 1275, in _set_auto_parallel_context
    set_func(value)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/parallel/_auto_parallel_context.py", line 626, in set_all_reduce_fusion_split_indices
    raise ValueError("The indices has duplicate elements")
ValueError: The indices has duplicate elements

Here, [-1, -1, -1, -1, -1, 31, 63, 63, 223, 383, 543, 703, 863, 1023, 1055, 1087, 1119, 1119, 1119, 1119, 1119] is the output of print(all_reduce_fusion_config).
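To make the cause of the ValueError visible, the printed list can be checked for repeated values with a few lines of plain Python. This is only a diagnostic sketch: find_duplicates is a hypothetical helper name, not something from MindSpore or mindone, and the list is copied from the log above.

```python
from collections import Counter

# List copied from the print(all_reduce_fusion_config) output in the log.
all_reduce_fusion_config = [
    -1, -1, -1, -1, -1, 31, 63, 63, 223, 383, 543, 703, 863,
    1023, 1055, 1087, 1119, 1119, 1119, 1119, 1119,
]


def find_duplicates(indices):
    """Hypothetical helper: map each repeated value to its occurrence count."""
    return {value: count for value, count in Counter(indices).items() if count > 1}


duplicates = find_duplicates(all_reduce_fusion_config)
print(duplicates)  # {-1: 5, 63: 2, 1119: 5}
```

The values -1, 63 and 1119 each occur more than once, which matches the "The indices has duplicate elements" check seen in the traceback.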

Describe the expected behavior

Please describe expected outputs or functions you want to have: LoRA fine-tuning should support multi-node, multi-card training.

zhanghuiyao commented 2 months ago

Judging from the log, the error is raised during the allreduce fusion step. As a first step, try disabling that feature with --ms_enable_allreduce_fusion False
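If fusion needs to stay enabled, one possible direction is to clean the index list before it reaches ms.set_auto_parallel_context. The sketch below is an assumption, not the fix adopted in this thread: sanitize_split_indices is a hypothetical helper, and dropping the negative entries assumes they are placeholders rather than meaningful split points. Whether the cleaned split points suit the LoRA parameter layout is not confirmed here.

```python
def sanitize_split_indices(indices):
    """Hypothetical helper: drop negative entries and duplicates, sort ascending."""
    return sorted({index for index in indices if index >= 0})


# List copied from the print(all_reduce_fusion_config) output in the report.
raw_indices = [
    -1, -1, -1, -1, -1, 31, 63, 63, 223, 383, 543, 703, 863,
    1023, 1055, 1087, 1119, 1119, 1119, 1119, 1119,
]

clean_indices = sanitize_split_indices(raw_indices)
print(clean_indices)
# [31, 63, 223, 383, 543, 703, 863, 1023, 1055, 1087, 1119]
```

The cleaned list contains no repeated values, so it would at least pass the duplicate check that raised the ValueError in the traceback.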