mindspore-lab / mindone

one for all, Optimal generator with No Exception
https://mindspore-lab.github.io/mindone/
Apache License 2.0
337 stars 63 forks

pangu draw 3.0: error when running ./run_sampling.sh #317

Open wanghuan-kunpneg opened 5 months ago

wanghuan-kunpneg commented 5 months ago

Thanks for sending an issue! Here are some tips for you:

If this is your first time, please read our contributor guidelines: https://github.com/mindspore-ai/mindspore/blob/master/CONTRIBUTING.md

Hardware Environment

Software Environment

Describe the current behavior

[root@n1 pangu_draw_v3]# ./run_sampling.sh
flash attention is available.
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
Initialized embedder #0: FrozenCnCLIPEmbedder with 115972685 params. Trainable: False
Initialized embedder #1: FrozenOpenCLIPEmbedder2 with 694665770 params. Trainable: False
Initialized embedder #2: ConcatTimestepEmbedderND with 0 params. Trainable: False
Loading model from ['/wanghuan/low_timestamp_model.ckpt']
[ERROR] ME(855962:281473315727456,MainProcess):2024-01-29-02:07:34.586.661 [mindspore/train/serialization.py:1261] Failed to read the checkpoint file /wanghuan/low_timestamp_model.ckpt. May not have permission to read it, please check the correct of the file.
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/mindspore/train/serialization.py", line 1253, in _parse_ckpt_proto
    checkpoint_list.ParseFromString(pb_content)
  File "/usr/local/lib64/python3.8/site-packages/google/protobuf/message.py", line 199, in ParseFromString
    return self.MergeFromString(serialized)
  File "/usr/local/lib64/python3.8/site-packages/google/protobuf/internal/python_message.py", line 1106, in MergeFromString
    if self._InternalParse(serialized, 0, length) != length:
  File "/usr/local/lib64/python3.8/site-packages/google/protobuf/internal/python_message.py", line 1173, in _InternalParse
    pos = field_decoder(buffer, new_pos, end, self, field_dict)
  File "/usr/local/lib64/python3.8/site-packages/google/protobuf/internal/decoder.py", line 703, in DecodeRepeatedField
    raise _DecodeError('Truncated message.')
google.protobuf.message.DecodeError: Truncated message.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "pangu_sampling.py", line 454, in <module>
    sample(args)
  File "pangu_sampling.py", line 322, in sample
    model, filter = create_model(
  File "/wanghuan/pangu_draw_v3/gm/helpers.py", line 160, in create_model
    model = load_model_from_config(config.model, checkpoints, amp_level=amp_level)
  File "/wanghuan/pangu_draw_v3/gm/helpers.py", line 285, in load_model_from_config
    _sd_dict = ms.load_checkpoint(ckpt)
  File "/usr/local/lib/python3.8/site-packages/mindspore/train/serialization.py", line 1087, in load_checkpoint
    checkpoint_list = _parse_ckpt_proto(ckpt_file_name, dec_key, dec_mode)
  File "/usr/local/lib/python3.8/site-packages/mindspore/train/serialization.py", line 1262, in _parse_ckpt_proto
    raise ValueError(err_info) from e
ValueError: Failed to read the checkpoint file /wanghuan/low_timestamp_model.ckpt. May not have permission to read it, please check the correct of the file

Describe the expected behavior

please describe expected outputs or functions you want to have:

  1. How to resolve this error

Steps to reproduce the issue

  1. code url:
  2. command that can reproduce your error, e.g. cd xx -> bash scripts/xx.sh --config xx
  3. xx

Related log / screenshot

Special notes for this issue

townwish4git commented 5 months ago

Based on the errors "Failed to read the checkpoint file /wanghuan/low_timestamp_model.ckpt. May not have permission to read it, please check the correct of the file." and "google.protobuf.message.DecodeError: Truncated message.", the file is likely corrupted. Please check whether the sha256 of your local ckpt file matches the hash suffix in the ckpt filename from the download link.
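For example, a check along those lines could look like the sketch below. This is only an illustration: the local path is the one from your log above, and expected_prefix is assumed to be the 8-character suffix from the download link's filename (127da122 for the low-timestamp model, judging by the checksums posted later in this thread); adjust both to your own download.

    # verify_ckpt.py - minimal sketch: compare a checkpoint's sha256 with the short
    # hash embedded in the released filename (e.g. "-127da122" in the download link).
    import hashlib

    path = "/wanghuan/low_timestamp_model.ckpt"   # local file from the log above
    expected_prefix = "127da122"                  # hash suffix of the file in the download link

    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # read in 1 MB chunks
            h.update(chunk)

    digest = h.hexdigest()
    print(digest)
    print("checksum ok" if digest.startswith(expected_prefix) else "MISMATCH - re-download the file")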

wanghuan-kunpneg commented 5 months ago

It still fails after the change. Two questions: 1. Is this still a problem with the ckpt files? 2. A MemoryError appears; could it be related to the runtime environment? I am currently using a single 910 card with 32 CPU cores and 64 GB of memory. Does that not meet the requirements?


I re-downloaded both files and verified the checksums:

[root@n1 ckpt]# ll
total 27032600
-rwxr-xr-x. 1 root root 13840689166 Dec 22 10:13 pangu_high_timestamp-c6344411.ckpt
-rwxr-xr-x. 1 root root 13840689166 Dec 22 10:24 pangu_low_timestamp-127da122.ckpt

[root@n1 ckpt]# pwd

/wanghuan/ckpt

[root@n1 ckpt]# sha256sum pangu_high_timestamp-c6344411.ckpt

c6344411e5f889941e6f6b9653499c476adb598b0a520877cf1a86d931e6e041 pangu_high_timestamp-c6344411.ckpt

[root@n1 ckpt]# sha256sum pangu_low_timestamp-127da122.ckpt

127da12275180c72e82e6173b8dd80d099507dcf2546fa139cdf4bde1d196965 pangu_low_timestamp-127da122.ckpt

Modified the script paths in run_sampling.sh:

export MS_PYNATIVE_GE=1
export current_dir=/wanghuan/pangu_draw_v3
export PYTHONPATH=$current_dir:$PYTHONPATH
cd $current_dir

# run script
# When the device is running low on memory, the '--offload' parameter might be effective.
python pangu_sampling.py \
  --device_target "Ascend" \
  --ms_mode 1 \
  --ms_amp_level "O2" \
  --config "configs/inference/pangu_sd_xl_base.yaml" \
  --high_solution \
  --weight "/wanghuan/ckpt/pangu_low_timestamp-c6344411.ckpt" \
  --high_timestamp_weight "/wanghuan/ckpt/pangu_high_timestamp-127da122.ckpt" \
  --prompts_file "prompts.txt"

Error message:

[root@n1 pangu_draw_v3]# ./run_sampling.sh
flash attention is available.
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
Initialized embedder #0: FrozenCnCLIPEmbedder with 115972685 params. Trainable: False
Initialized embedder #1: FrozenOpenCLIPEmbedder2 with 694665770 params. Trainable: False
Initialized embedder #2: ConcatTimestepEmbedderND with 0 params. Trainable: False
Loading model from ['/wanghuan/ckpt/pangu_low_timestamp-127da122.ckpt']
[WARNING] ME(1377551:281473775790176,MainProcess):2024-01-29-05:38:12.705.028 [mindspore/train/serialization.py:1378] For 'load_param_into_net', 2 parameters in the 'net' are not loaded, because they are not in the 'parameter_dict', please check whether the network structure is consistent when training and loading checkpoint.
[WARNING] ME(1377551:281473775790176,MainProcess):2024-01-29-05:38:12.705.204 [mindspore/train/serialization.py:1383] conditioner.embedders.0.transformer.text_model.embeddings.position_ids is not loaded.
[WARNING] ME(1377551:281473775790176,MainProcess):2024-01-29-05:38:12.705.276 [mindspore/train/serialization.py:1383] conditioner.embedders.1.model.attn_mask is not loaded.
missing keys: ['conditioner.embedders.0.transformer.text_model.embeddings.position_ids', 'conditioner.embedders.1.model.attn_mask']
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
Loading model from ['/wanghuan/ckpt/pangu_high_timestamp-c6344411.ckpt']
[ERROR] ME(1377551:281473775790176,MainProcess):2024-01-29-05:38:37.590.458 [mindspore/train/serialization.py:1261] Failed to read the checkpoint file /wanghuan/ckpt/pangu_high_timestamp-c6344411.ckpt. May not have permission to read it, please check the correct of the file.
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/mindspore/train/serialization.py", line 1253, in _parse_ckpt_proto
    checkpoint_list.ParseFromString(pb_content)
  File "/usr/local/lib64/python3.8/site-packages/google/protobuf/message.py", line 199, in ParseFromString
    return self.MergeFromString(serialized)
  File "/usr/local/lib64/python3.8/site-packages/google/protobuf/internal/python_message.py", line 1106, in MergeFromString
    if self._InternalParse(serialized, 0, length) != length:
  File "/usr/local/lib64/python3.8/site-packages/google/protobuf/internal/python_message.py", line 1173, in _InternalParse
    pos = field_decoder(buffer, new_pos, end, self, field_dict)
  File "/usr/local/lib64/python3.8/site-packages/google/protobuf/internal/decoder.py", line 705, in DecodeRepeatedField
    if value.add()._InternalParse(buffer, pos, new_pos) != new_pos:
  File "/usr/local/lib64/python3.8/site-packages/google/protobuf/internal/python_message.py", line 1173, in _InternalParse
    pos = field_decoder(buffer, new_pos, end, self, field_dict)
  File "/usr/local/lib64/python3.8/site-packages/google/protobuf/internal/decoder.py", line 726, in DecodeField
    if value._InternalParse(buffer, pos, new_pos) != new_pos:
  File "/usr/local/lib64/python3.8/site-packages/google/protobuf/internal/python_message.py", line 1173, in _InternalParse
    pos = field_decoder(buffer, new_pos, end, self, field_dict)
  File "/usr/local/lib64/python3.8/site-packages/google/protobuf/internal/decoder.py", line 632, in DecodeField
    field_dict[key] = buffer[pos:new_pos].tobytes()
MemoryError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "pangu_sampling.py", line 454, in <module>
    sample(args)
  File "pangu_sampling.py", line 333, in sample
    high_timestamp_model, _ = create_model(
  File "/wanghuan/pangu_draw_v3/gm/helpers.py", line 160, in create_model
    model = load_model_from_config(config.model, checkpoints, amp_level=amp_level)
  File "/wanghuan/pangu_draw_v3/gm/helpers.py", line 285, in load_model_from_config
    _sd_dict = ms.load_checkpoint(ckpt)
  File "/usr/local/lib/python3.8/site-packages/mindspore/train/serialization.py", line 1087, in load_checkpoint
    checkpoint_list = _parse_ckpt_proto(ckpt_file_name, dec_key, dec_mode)
  File "/usr/local/lib/python3.8/site-packages/mindspore/train/serialization.py", line 1262, in _parse_ckpt_proto
    raise ValueError(err_info) from e
ValueError: Failed to read the checkpoint file /wanghuan/ckpt/pangu_high_timestamp-c6344411.ckpt. May not have permission to read it, please check the correct of the file.

townwish4git commented 5 months ago

These are your ckpt files:

[root@n1 ckpt]# ll
total 27032600
-rwxr-xr-x. 1 root root 13840689166 Dec 22 10:13 pangu_high_timestamp-c6344411.ckpt
-rwxr-xr-x. 1 root root 13840689166 Dec 22 10:24 pangu_low_timestamp-127da122.ckpt

This is the command you ran: python pangu_sampling.py --device_target "Ascend" --ms_mode 1 --ms_amp_level "O2" --config "configs/inference/pangu_sd_xl_base.yaml" --high_solution --weight "/wanghuan/ckpt/pangu_low_timestamp-c6344411.ckpt" --high_timestamp_weight "/wanghuan/ckpt/pangu_high_timestamp-127da122.ckpt" --prompts_file "prompts.txt"

The ckpt filenames don't seem to match up: neither path in your command matches a filename in the ll output above (the hash suffixes are swapped). Could you check whether the two ckpt files are correctly named and loaded?

wanghuan-kunpneg commented 5 months ago

I pasted the wrong command above. I later found the files had been renamed incorrectly; the names are now fixed, and the error message is a bit different:

[root@n1 pangu_draw_v3]# ./run_sampling.sh
flash attention is available.
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
Initialized embedder #0: FrozenCnCLIPEmbedder with 115972685 params. Trainable: False
Initialized embedder #1: FrozenOpenCLIPEmbedder2 with 694665770 params. Trainable: False
Initialized embedder #2: ConcatTimestepEmbedderND with 0 params. Trainable: False
Loading model from ['/wanghuan/ckpt/pangu_low_timestamp-127da122.ckpt']
[WARNING] ME(1377551:281473775790176,MainProcess):2024-01-29-05:38:12.705.028 [mindspore/train/serialization.py:1378] For 'load_param_into_net', 2 parameters in the 'net' are not loaded, because they are not in the 'parameter_dict', please check whether the network structure is consistent when training and loading checkpoint.
[WARNING] ME(1377551:281473775790176,MainProcess):2024-01-29-05:38:12.705.204 [mindspore/train/serialization.py:1383] conditioner.embedders.0.transformer.text_model.embeddings.position_ids is not loaded.
[WARNING] ME(1377551:281473775790176,MainProcess):2024-01-29-05:38:12.705.276 [mindspore/train/serialization.py:1383] conditioner.embedders.1.model.attn_mask is not loaded.
missing keys: ['conditioner.embedders.0.transformer.text_model.embeddings.position_ids', 'conditioner.embedders.1.model.attn_mask']
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
Loading model from ['/wanghuan/ckpt/pangu_high_timestamp-c6344411.ckpt']
[ERROR] ME(1377551:281473775790176,MainProcess):2024-01-29-05:38:37.590.458 [mindspore/train/serialization.py:1261] Failed to read the checkpoint file /wanghuan/ckpt/pangu_high_timestamp-c6344411.ckpt. May not have permission to read it, please check the correct of the file.
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/mindspore/train/serialization.py", line 1253, in _parse_ckpt_proto
    checkpoint_list.ParseFromString(pb_content)
  File "/usr/local/lib64/python3.8/site-packages/google/protobuf/message.py", line 199, in ParseFromString
    return self.MergeFromString(serialized)
  File "/usr/local/lib64/python3.8/site-packages/google/protobuf/internal/python_message.py", line 1106, in MergeFromString
    if self._InternalParse(serialized, 0, length) != length:
  File "/usr/local/lib64/python3.8/site-packages/google/protobuf/internal/python_message.py", line 1173, in _InternalParse
    pos = field_decoder(buffer, new_pos, end, self, field_dict)
  File "/usr/local/lib64/python3.8/site-packages/google/protobuf/internal/decoder.py", line 705, in DecodeRepeatedField
    if value.add()._InternalParse(buffer, pos, new_pos) != new_pos:
  File "/usr/local/lib64/python3.8/site-packages/google/protobuf/internal/python_message.py", line 1173, in _InternalParse
    pos = field_decoder(buffer, new_pos, end, self, field_dict)
  File "/usr/local/lib64/python3.8/site-packages/google/protobuf/internal/decoder.py", line 726, in DecodeField
    if value._InternalParse(buffer, pos, new_pos) != new_pos:
  File "/usr/local/lib64/python3.8/site-packages/google/protobuf/internal/python_message.py", line 1173, in _InternalParse
    pos = field_decoder(buffer, new_pos, end, self, field_dict)
  File "/usr/local/lib64/python3.8/site-packages/google/protobuf/internal/decoder.py", line 632, in DecodeField
    field_dict[key] = buffer[pos:new_pos].tobytes()
MemoryError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "pangu_sampling.py", line 454, in <module>
    sample(args)
  File "pangu_sampling.py", line 333, in sample
    high_timestamp_model, _ = create_model(
  File "/wanghuan/pangu_draw_v3/gm/helpers.py", line 160, in create_model
    model = load_model_from_config(config.model, checkpoints, amp_level=amp_level)
  File "/wanghuan/pangu_draw_v3/gm/helpers.py", line 285, in load_model_from_config
    _sd_dict = ms.load_checkpoint(ckpt)
  File "/usr/local/lib/python3.8/site-packages/mindspore/train/serialization.py", line 1087, in load_checkpoint
    checkpoint_list = _parse_ckpt_proto(ckpt_file_name, dec_key, dec_mode)
  File "/usr/local/lib/python3.8/site-packages/mindspore/train/serialization.py", line 1262, in _parse_ckpt_proto
    raise ValueError(err_info) from e
ValueError: Failed to read the checkpoint file /wanghuan/ckpt/pangu_high_timestamp-c6344411.ckpt. May not have permission to read it, please check the correct of the file.

townwish4git commented 5 months ago

Loading these two ckpts with 64 GB of device memory should not be a problem. You could try writing a small standalone script that does from mindspore import load_checkpoint and then calls load_checkpoint(high_timestamp_model_file.ckpt), to check whether the ckpt file itself can be loaded.
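A minimal standalone check along those lines might look like this (a sketch only; the paths are taken from the ll output earlier in the thread, adjust as needed):

    # check_ckpt.py - try to parse each checkpoint on its own, outside the sampling script
    from mindspore import load_checkpoint

    for path in (
        "/wanghuan/ckpt/pangu_low_timestamp-127da122.ckpt",
        "/wanghuan/ckpt/pangu_high_timestamp-c6344411.ckpt",
    ):
        params = load_checkpoint(path)  # raises ValueError if the file cannot be read or parsed
        print(path, "->", len(params), "parameters loaded")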

wanghuan-kunpneg commented 5 months ago

Does the 910A need distributed inference? If so, how should it be run?

1. The 64 GB is host memory... on an Ascend 910A with 32 GB of device memory, the MemoryError still occurs. 2. I switched to an 8-card Atlas 800-9000 physical machine; the model now runs, but fails with the error below:

Sampling with PanGuEulerEDMSampler for 40 steps: 100%|███████████████████████████████████| 40/40 [08:37<00:00, 12.93s/it]
Sample latent Done.
Decode latent Starting...
Traceback (most recent call last):
  File "pangu_sampling.py", line 454, in <module>
    sample(args)
  File "pangu_sampling.py", line 403, in sample
    amp_level=args.ms_amp_level,
  File "pangu_sampling.py", line 203, in run_txt2img
    amp_level=amp_level,
  File "/wanghuan/pangu_draw_v3/gm/models/diffusion.py", line 347, in pangu_do_sample
    samples_x = self.decode_first_stage(samples_z)
  File "/wanghuan/pangu_draw_v3/gm/models/diffusion.py", line 91, in decode_first_stage
    out = self.first_stage_model.decode(z)
  File "/root/miniconda3/envs/mindspore_py37/lib/python3.7/site-packages/mindspore/common/api.py", line 718, in staging_specialize
    out = _MindsporeFunctionExecutor(func, hash_obj, input_signature, process_obj, jit_config)(*args, **kwargs)
  File "/root/miniconda3/envs/mindspore_py37/lib/python3.7/site-packages/mindspore/common/api.py", line 121, in wrapper
    results = fn(*arg, **kwargs)
  File "/root/miniconda3/envs/mindspore_py37/lib/python3.7/site-packages/mindspore/common/api.py", line 356, in __call__
    output = self._graph_executor(tuple(new_inputs), phase)
RuntimeError:


townwish4git commented 5 months ago

With 32 GB of device memory, loading the high and low timestamp models at the same time will run into memory pressure. You can add the --offload argument to the inference script.

ultranationalism commented 4 months ago

With 32 GB of device memory, loading the high and low timestamp models at the same time will run into memory pressure. You can add the --offload argument to the inference script.

Running with fp16, it actually only used 14 GB of device memory.