mindspore-lab / mindone

one for all, Optimal generator with No Exception
Apache License 2.0
328 stars 62 forks

Added summary collection to LoRA training, but collection reports an error #522

Closed jxzhang789 closed 2 weeks ago

jxzhang789 commented 3 weeks ago


Hardware Environment

Software Environment

Describe the current behavior

Summary collection logs the following error:

[ERROR] ME(26861:281473712992272,ForkServerPoolWorker-1:14):2024-05-31-12:00:04.932.734 [mindspore/train/summary/_summary_adapter.py:363] The dimension of Summary tensor should be 4 or second dimension should be 1 or 3, but got tag = input_data/auto, ndim = 4, shape=(2, 512, 512, 3), which means Summary tensor is not Image.

Describe the expected behavior

Training-related data should be collected so that it can be viewed in MindInsight.

Steps to reproduce the issue

1. Add summary-collection code to the train_text_to_image.py script:

interval_1 = [x for x in range(1, 4)]
specified = {
    "collect_metric": True,
    "histogram_regular": "^conv1.|^conv2.",
    "collect_graph": True,
    "collect_dataset_graph": True,
    "collect_train_lineage": True,
    "collect_input_data": True,
    "collect_landscape": {
        "landscape_size": 40,
        "unit": "epoch",
        "create_landscape": {"train": True, "result": False},
        "num_samples": 128,
        "intervals": [interval_1],
    },
}
summary_collector = ms.SummaryCollector(
    summary_dir="./summary_dir/summary_01",
    collect_specified_data=specified,
    collect_freq=12,
    keep_default_action=False,
)

# callbacks

callback = [TimeMonitor(args.callback_size)]
ofm_cb = OverflowMonitor()
callback.append(ofm_cb)
callback.append(summary_collector)

2. Run the training script:

python train_text_to_image.py \
    --train_config configs/train/train_config_lora_v2.yaml \
    --data_path /home/ma-user/work/dataset \
    --output_path /home/ma-user/work/output0531 \
    --pretrained_model_path models/sd_v2_768_v-e12e3a9b.ckpt

Related log / screenshot

[ERROR] ME(26861:281473712992272,ForkServerPoolWorker-1:14):2024-05-31-12:00:04.932.734 [mindspore/train/summary/_summary_adapter.py:363] The dimension of Summary tensor should be 4 or second dimension should be 1 or 3, but got tag = input_data/auto, ndim = 4, shape=(2, 512, 512, 3), which means Summary tensor is not Image.
[2024-05-31 12:00:05] INFO: epoch: 1 step: 2, loss: 0.686330, loss scale: 65536, average step time: 0.422665.
[2024-05-31 12:00:05] INFO: epoch: 1 step: 3, loss: 0.615680, loss scale: 65536, average step time: 0.330626.
[2024-05-31 12:00:05] INFO: epoch: 1 step: 4, loss: 1.013419, loss scale: 65536, average step time: 0.323141.
[2024-05-31 12:00:06] INFO: epoch: 1 step: 5, loss: 0.515399, loss scale: 65536, average step time: 0.904796.
[2024-05-31 12:00:07] INFO: epoch: 1 step: 6, loss: 1.046023, loss scale: 65536, average step time: 0.332177.
Train epoch time: 611627.001 ms, per step time: 101937.834 ms
[2024-05-31 12:00:42] INFO: Checkpoint saved in /home/ma-user/work/output0531_1/ckpt/sd-1.ckpt
[2024-05-31 12:01:22] INFO: epoch: 2 step: 1, loss: 0.840738, loss scale: 65536, average step time: 75.032792.
[2024-05-31 12:01:22] INFO: epoch: 2 step: 2, loss: 0.766551, loss scale: 65536, average step time: 0.319096.
[2024-05-31 12:01:22] INFO: epoch: 2 step: 3, loss: 0.586941, loss scale: 65536, average step time: 0.316810.
[2024-05-31 12:01:23] INFO: epoch: 2 step: 4, loss: 0.998357, loss scale: 65536, average step time: 0.316289.
[2024-05-31 12:01:23] INFO: epoch: 2 step: 5, loss: 0.644005, loss scale: 65536, average step time: 0.313123.
[2024-05-31 12:01:23] INFO: epoch: 2 step: 6, loss: 1.051421, loss scale: 65536, average step time: 0.320373.
Train epoch time: 2023.618 ms, per step time: 337.270 ms
[ERROR] ME(26860:281473712992272,ForkServerPoolWorker-1:13):2024-05-31-12:01:23.821.974 [mindspore/train/summary/_summary_adapter.py:363] The dimension of Summary tensor should be 4 or second dimension should be 1 or 3, but got tag = input_data/auto, ndim = 4, shape=(2, 512, 512, 3), which means Summary tensor is not Image.
[ERROR] ME(26896:281473712992272,ForkServerPoolWorker-1:32):2024-05-31-12:01:23.866.151 [mindspore/train/summary/_summary_adapter.py:363] The dimension of Summary tensor should be 4 or second dimension should be 1 or 3, but got tag = input_data/auto, ndim = 4, shape=(2, 512, 512, 3), which means Summary tensor is not Image.
[2024-05-31 12:01:59] INFO: Checkpoint saved in /home/ma-user/work/output0531_1/ckpt/sd-2.ckpt
[2024-05-31 12:02:37] INFO: epoch: 3 step: 1, loss: 0.887232, loss scale: 65536, average step time: 73.946778.
[2024-05-31 12:02:38] INFO: epoch: 3 step: 2, loss: 0.588924, loss scale: 65536, average step time: 0.327915.
[2024-05-31 12:02:38] INFO: epoch: 3 step: 3, loss: 0.548785, loss scale: 65536, average step time: 0.312505.
[2024-05-31 12:02:38] INFO: epoch: 3 step: 4, loss: 0.569951, loss scale: 65536, average step time: 0.311562.
[2024-05-31 12:02:38] INFO: epoch: 3 step: 5, loss: 0.688446, loss scale: 65536, average step time: 0.313019.
[2024-05-31 12:02:39] INFO: epoch: 3 step: 6, loss: 1.009740, loss scale: 65536, average step time: 0.311985.
Train epoch time: 1981.923 ms, per step time: 330.320 ms
[2024-05-31 12:03:15] INFO: Checkpoint saved in /home/ma-user/work/output0531_1/ckpt/sd-3.ckpt
[2024-05-31 12:03:55] INFO: epoch: 4 step: 1, loss: 0.883718, loss scale: 65536, average step time: 76.276575.
[2024-05-31 12:03:55] INFO: epoch: 4 step: 2, loss: 0.995311, loss scale: 65536, average step time: 0.318537.
[2024-05-31 12:03:56] INFO: epoch: 4 step: 3, loss: 0.946222, loss scale: 65536, average step time: 0.325014.
[2024-05-31 12:03:56] INFO: epoch: 4 step: 4, loss: 0.800794, loss scale: 65536, average step time: 0.312907.
[2024-05-31 12:03:56] INFO: epoch: 4 step: 5, loss: 0.673399, loss scale: 65536, average step time: 0.313613.
[2024-05-31 12:03:57] INFO: epoch: 4 step: 6, loss: 0.545844, loss scale: 65536, average step time: 0.323202.
Train epoch time: 1997.177 ms, per step time: 332.863 ms
[ERROR] ME(26883:281473712992272,ForkServerPoolWorker-1:27):2024-05-31-12:03:57.229.127 [mindspore/train/summary/_summary_adapter.py:363] The dimension of Summary tensor should be 4 or second dimension should be 1 or 3, but got tag = input_data/auto, ndim = 4, shape=(2, 512, 512, 3), which means Summary tensor is not Image.
[ERROR] ME(26850:281473712992272,ForkServerPoolWorker-1:7):2024-05-31-12:03:57.254.907 [mindspore/train/summary/_summary_adapter.py:363] The dimension of Summary tensor should be 4 or second dimension should be 1 or 3, but got tag = input_data/auto, ndim = 4, shape=(2, 512, 512, 3), which means Summary tensor is not Image.
[2024-05-31 12:03:57] INFO: Checkpoint saved in /home/ma-user/work/output0531_1/ckpt/sd-4.ckpt
[WARNING] ME(24163:281473611857936,MainProcess):2024-05-31-12:04:37.455.11 [mindspore/train/callback/_summary_collector.py:909] The learning rate detected in the optimizer is not a Parameter type, so it is not recorded. Its type is '_IteratorLearningRate'.

Songyuanwei commented 3 weeks ago

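When collecting data with SummaryCollector, please set "collect_input_data": False. collect_input_data expects input data whose second dimension is the channel, whereas the SD v2 input has shape (2, 512, 512, 3), with the channel in the fourth dimension, which triggers the error above. Setting "collect_input_data": False skips collecting this data for now.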
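
For illustration, here is a minimal sketch of the shape rule as worded in the error message, together with the settings dictionary from the issue with only collect_input_data changed. The helper looks_like_image and the channels-second comparison shape are hypothetical and written only for this example; they are not MindSpore source code, and the real check in _summary_adapter.py may differ in detail.

```python
# Hypothetical helper: restates the rule from the error message
# ("dimension should be 4", "second dimension should be 1 or 3").
# This is NOT the MindSpore implementation, only a sketch of the rule.
def looks_like_image(shape):
    """Return True if the shape has 4 dims and the second one is 1 or 3."""
    return len(shape) == 4 and shape[1] in (1, 3)


# Shape reported in the log: batch of 2, 512x512, channel in the last dimension.
sd_v2_input_shape = (2, 512, 512, 3)
# Same batch with the channel as the second dimension, for comparison only.
channels_second_shape = (2, 3, 512, 512)

nhwc_ok = looks_like_image(sd_v2_input_shape)      # second dim is 512
nchw_ok = looks_like_image(channels_second_shape)  # second dim is 3

# Settings from the issue, with collect_input_data switched off as suggested.
interval_1 = [x for x in range(1, 4)]
specified = {
    "collect_metric": True,
    "histogram_regular": "^conv1.|^conv2.",
    "collect_graph": True,
    "collect_dataset_graph": True,
    "collect_train_lineage": True,
    "collect_input_data": False,  # skip input images for now
    "collect_landscape": {
        "landscape_size": 40,
        "unit": "epoch",
        "create_landscape": {"train": True, "result": False},
        "num_samples": 128,
        "intervals": [interval_1],
    },
}

print("channels-last passes the rule:", nhwc_ok)
print("channels-second passes the rule:", nchw_ok)
```

The specified dictionary would then be passed to ms.SummaryCollector through collect_specified_data exactly as in the reproduction steps; nothing else in the call needs to change.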