Open 2575044704 opened 2 months ago
The following values were not passed to `accelerate launch` and had defaults used instead:
`--num_machines` was set to a value of `1`
`--mixed_precision` was set to a value of `'no'`
`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
2024-06-24 20:18:22 INFO prepare tokenizers sdxl_train_util.py:134
2024-06-24 20:18:22 INFO prepare tokenizers sdxl_train_util.py:134
2024-06-24 20:18:24 INFO update token length: 225 sdxl_train_util.py:159
INFO Using DreamBooth method. train_network.py:172
2024-06-24 20:18:24 INFO update token length: 225 sdxl_train_util.py:159
INFO Using DreamBooth method. train_network.py:172
INFO prepare images. train_util.py:1572
INFO prepare images. train_util.py:1572
2024-06-24 20:19:52 INFO found directory /train3/1_data contains 4880036 image files train_util.py:1519
2024-06-24 20:19:52 INFO found directory /train3/1_data contains 4880036 image files train_util.py:1519
2024-06-24 20:21:41 WARNING No caption file found for 16580 images. Training will continue without captions for these images. If class token train_util.py:1550
exists, it will be used. /
16580枚の画像にキャプションファイルが見つかりませんでした。これらの画像についてはキャプションなしで学習を続行します
。class tokenが存在する場合はそれを使います。
WARNING /train3/1_data/10060.webp train_util.py:1557
WARNING /train3/1_data/10067.webp train_util.py:1557
WARNING /train3/1_data/10068.webp train_util.py:1557
WARNING /train3/1_data/10069.webp train_util.py:1557
WARNING /train3/1_data/10075.webp train_util.py:1557
WARNING /train3/1_data/10090.webp... and 16575 more train_util.py:1555
2024-06-24 20:21:41 WARNING No caption file found for 16580 images. Training will continue without captions for these images. If class token train_util.py:1550
exists, it will be used. /
16580枚の画像にキャプションファイルが見つかりませんでした。これらの画像についてはキャプションなしで学習を続行します
。class tokenが存在する場合はそれを使います。
WARNING /train3/1_data/10060.webp train_util.py:1557
WARNING /train3/1_data/10067.webp train_util.py:1557
WARNING /train3/1_data/10068.webp train_util.py:1557
WARNING /train3/1_data/10069.webp train_util.py:1557
WARNING /train3/1_data/10075.webp train_util.py:1557
WARNING /train3/1_data/10090.webp... and 16575 more train_util.py:1555
2024-06-24 20:21:55 INFO 4880036 train images with repeating. train_util.py:1613
INFO 0 reg images. train_util.py:1616
WARNING no regularization images / 正則化画像が見つかりませんでした train_util.py:1621
2024-06-24 20:21:55 INFO 4880036 train images with repeating. train_util.py:1613
INFO 0 reg images. train_util.py:1616
WARNING no regularization images / 正則化画像が見つかりませんでした train_util.py:1621
INFO [Dataset 0] config_util.py:565
batch_size: 16
resolution: (1024, 1024)
enable_bucket: True
network_multiplier: 1.0
min_bucket_reso: 64
max_bucket_reso: 2048
bucket_reso_steps: 64
bucket_no_upscale: False
[Subset 0 of Dataset 0]
image_dir: "/train3/1_data"
image_count: 4880036
num_repeats: 1
shuffle_caption: True
keep_tokens: 0
keep_tokens_separator: |||
secondary_separator: None
enable_wildcard: False
caption_dropout_rate: 0.0
caption_dropout_every_n_epoches: 0
caption_tag_dropout_rate: 0.1
caption_prefix: None
caption_suffix: None
color_aug: False
flip_aug: False
face_crop_aug_range: None
random_crop: False
token_warmup_min: 1,
token_warmup_step: 0,
is_reg: False
class_tokens: data
caption_extension: .txt
INFO [Dataset 0] config_util.py:571
INFO loading image sizes. train_util.py:853
INFO [Dataset 0] config_util.py:565
batch_size: 16
resolution: (1024, 1024)
enable_bucket: True
network_multiplier: 1.0
min_bucket_reso: 64
max_bucket_reso: 2048
bucket_reso_steps: 64
bucket_no_upscale: False
[Subset 0 of Dataset 0]
image_dir: "/train3/1_data"
image_count: 4880036
num_repeats: 1
shuffle_caption: True
keep_tokens: 0
keep_tokens_separator: |||
secondary_separator: None
enable_wildcard: False
caption_dropout_rate: 0.0
caption_dropout_every_n_epoches: 0
caption_tag_dropout_rate: 0.1
caption_prefix: None
caption_suffix: None
color_aug: False
flip_aug: False
face_crop_aug_range: None
random_crop: False
token_warmup_min: 1,
token_warmup_step: 0,
is_reg: False
class_tokens: data
caption_extension: .txt
INFO [Dataset 0] config_util.py:571
INFO loading image sizes. train_util.py:853
100%|██████████| 4880036/4880036 [00:47<00:00, 103732.16it/s]
2024-06-24 20:22:42 INFO make buckets train_util.py:859
2024-06-24 20:22:42 INFO make buckets train_util.py:859
2024-06-24 20:23:02 INFO number of images (including repeats) / 各bucketの画像枚数(繰り返し回数を含む) train_util.py:905
INFO bucket 0: resolution (64, 2048), count: 432 train_util.py:910
INFO bucket 1: resolution (128, 2048), count: 877 train_util.py:910
INFO bucket 2: resolution (192, 2048), count: 1172 train_util.py:910
INFO bucket 3: resolution (256, 2048), count: 1240 train_util.py:910
INFO bucket 4: resolution (320, 2048), count: 1414 train_util.py:910
INFO bucket 5: resolution (384, 2048), count: 1588 train_util.py:910
INFO bucket 6: resolution (448, 2048), count: 2826 train_util.py:910
INFO bucket 7: resolution (512, 1856), count: 2316 train_util.py:910
INFO bucket 8: resolution (512, 1920), count: 628 train_util.py:910
INFO bucket 9: resolution (512, 1984), count: 509 train_util.py:910
INFO bucket 10: resolution (512, 2048), count: 1526 train_util.py:910
INFO bucket 11: resolution (576, 1664), count: 16267 train_util.py:910
INFO bucket 12: resolution (576, 1728), count: 12123 train_util.py:910
INFO bucket 13: resolution (576, 1792), count: 13783 train_util.py:910
INFO bucket 14: resolution (640, 1536), count: 17673 train_util.py:910
INFO bucket 15: resolution (640, 1600), count: 13667 train_util.py:910
INFO bucket 16: resolution (704, 1408), count: 60986 train_util.py:910
INFO bucket 17: resolution (704, 1472), count: 30709 train_util.py:910
INFO bucket 18: resolution (768, 1280), count: 228754 train_util.py:910
INFO bucket 19: resolution (768, 1344), count: 137415 train_util.py:910
INFO bucket 20: resolution (832, 1216), count: 1792153 train_util.py:910
INFO bucket 21: resolution (896, 1152), count: 732682 train_util.py:910
INFO bucket 22: resolution (960, 1088), count: 307066 train_util.py:910
INFO bucket 23: resolution (1024, 1024), count: 417711 train_util.py:910
INFO bucket 24: resolution (1088, 960), count: 160702 train_util.py:910
INFO bucket 25: resolution (1152, 896), count: 315880 train_util.py:910
INFO bucket 26: resolution (1216, 832), count: 347554 train_util.py:910
INFO bucket 27: resolution (1280, 768), count: 81520 train_util.py:910
INFO bucket 28: resolution (1344, 768), count: 125354 train_util.py:910
INFO bucket 29: resolution (1408, 704), count: 23818 train_util.py:910
INFO bucket 30: resolution (1472, 704), count: 10988 train_util.py:910
2024-06-24 20:23:03 INFO bucket 31: resolution (1536, 640), count: 7141 train_util.py:910
INFO bucket 32: resolution (1600, 640), count: 3933 train_util.py:910
INFO bucket 33: resolution (1664, 576), count: 2466 train_util.py:910
INFO bucket 34: resolution (1728, 576), count: 1323 train_util.py:910
INFO bucket 35: resolution (1792, 576), count: 1158 train_util.py:910
INFO bucket 36: resolution (1856, 512), count: 734 train_util.py:910
INFO bucket 37: resolution (1920, 512), count: 197 train_util.py:910
INFO bucket 38: resolution (1984, 512), count: 153 train_util.py:910
INFO bucket 39: resolution (2048, 64), count: 31 train_util.py:910
INFO bucket 40: resolution (2048, 128), count: 64 train_util.py:910
INFO bucket 41: resolution (2048, 192), count: 87 train_util.py:910
INFO bucket 42: resolution (2048, 256), count: 127 train_util.py:910
INFO bucket 43: resolution (2048, 320), count: 186 train_util.py:910
INFO bucket 44: resolution (2048, 384), count: 278 train_util.py:910
INFO bucket 45: resolution (2048, 448), count: 437 train_util.py:910
INFO bucket 46: resolution (2048, 512), count: 388 train_util.py:910
2024-06-24 20:23:03 INFO number of images (including repeats) / 各bucketの画像枚数(繰り返し回数を含む) train_util.py:905
INFO bucket 0: resolution (64, 2048), count: 432 train_util.py:910
INFO mean ar error (without repeats): 0.02633931813702877 train_util.py:915
INFO bucket 1: resolution (128, 2048), count: 877 train_util.py:910
INFO bucket 2: resolution (192, 2048), count: 1172 train_util.py:910
INFO bucket 3: resolution (256, 2048), count: 1240 train_util.py:910
INFO bucket 4: resolution (320, 2048), count: 1414 train_util.py:910
INFO bucket 5: resolution (384, 2048), count: 1588 train_util.py:910
INFO bucket 6: resolution (448, 2048), count: 2826 train_util.py:910
INFO bucket 7: resolution (512, 1856), count: 2316 train_util.py:910
INFO bucket 8: resolution (512, 1920), count: 628 train_util.py:910
INFO bucket 9: resolution (512, 1984), count: 509 train_util.py:910
INFO bucket 10: resolution (512, 2048), count: 1526 train_util.py:910
INFO bucket 11: resolution (576, 1664), count: 16267 train_util.py:910
INFO bucket 12: resolution (576, 1728), count: 12123 train_util.py:910
INFO bucket 13: resolution (576, 1792), count: 13783 train_util.py:910
INFO bucket 14: resolution (640, 1536), count: 17673 train_util.py:910
INFO bucket 15: resolution (640, 1600), count: 13667 train_util.py:910
INFO bucket 16: resolution (704, 1408), count: 60986 train_util.py:910
INFO bucket 17: resolution (704, 1472), count: 30709 train_util.py:910
INFO bucket 18: resolution (768, 1280), count: 228754 train_util.py:910
INFO bucket 19: resolution (768, 1344), count: 137415 train_util.py:910
INFO bucket 20: resolution (832, 1216), count: 1792153 train_util.py:910
INFO bucket 21: resolution (896, 1152), count: 732682 train_util.py:910
INFO bucket 22: resolution (960, 1088), count: 307066 train_util.py:910
INFO bucket 23: resolution (1024, 1024), count: 417711 train_util.py:910
INFO bucket 24: resolution (1088, 960), count: 160702 train_util.py:910
INFO bucket 25: resolution (1152, 896), count: 315880 train_util.py:910
INFO bucket 26: resolution (1216, 832), count: 347554 train_util.py:910
INFO bucket 27: resolution (1280, 768), count: 81520 train_util.py:910
INFO bucket 28: resolution (1344, 768), count: 125354 train_util.py:910
INFO bucket 29: resolution (1408, 704), count: 23818 train_util.py:910
INFO bucket 30: resolution (1472, 704), count: 10988 train_util.py:910
INFO bucket 31: resolution (1536, 640), count: 7141 train_util.py:910
INFO bucket 32: resolution (1600, 640), count: 3933 train_util.py:910
INFO bucket 33: resolution (1664, 576), count: 2466 train_util.py:910
INFO bucket 34: resolution (1728, 576), count: 1323 train_util.py:910
INFO bucket 35: resolution (1792, 576), count: 1158 train_util.py:910
INFO bucket 36: resolution (1856, 512), count: 734 train_util.py:910
INFO bucket 37: resolution (1920, 512), count: 197 train_util.py:910
INFO bucket 38: resolution (1984, 512), count: 153 train_util.py:910
INFO bucket 39: resolution (2048, 64), count: 31 train_util.py:910
INFO bucket 40: resolution (2048, 128), count: 64 train_util.py:910
INFO bucket 41: resolution (2048, 192), count: 87 train_util.py:910
INFO bucket 42: resolution (2048, 256), count: 127 train_util.py:910
INFO bucket 43: resolution (2048, 320), count: 186 train_util.py:910
INFO bucket 44: resolution (2048, 384), count: 278 train_util.py:910
INFO bucket 45: resolution (2048, 448), count: 437 train_util.py:910
INFO bucket 46: resolution (2048, 512), count: 388 train_util.py:910
INFO mean ar error (without repeats): 0.02633931813702877 train_util.py:915
2024-06-24 20:23:06 INFO preparing accelerator train_network.py:225
2024-06-24 20:23:07 INFO preparing accelerator train_network.py:225
accelerator device: cuda:0
INFO loading model for process 0/2 sdxl_train_util.py:30
INFO load StableDiffusion checkpoint: ./train.safetensors sdxl_train_util.py:70
accelerator device: cuda:1
INFO building U-Net sdxl_model_util.py:192
INFO loading U-Net from checkpoint sdxl_model_util.py:196
INFO U-Net: <All keys matched successfully> sdxl_model_util.py:202
INFO building text encoders sdxl_model_util.py:205
INFO loading text encoders from checkpoint sdxl_model_util.py:258
INFO text encoder 1: <All keys matched successfully> sdxl_model_util.py:272
2024-06-24 20:23:08 INFO text encoder 2: <All keys matched successfully> sdxl_model_util.py:276
INFO building VAE sdxl_model_util.py:279
INFO loading VAE from checkpoint sdxl_model_util.py:284
INFO VAE: <All keys matched successfully> sdxl_model_util.py:287
2024-06-24 20:23:10 INFO loading model for process 1/2 sdxl_train_util.py:30
INFO load StableDiffusion checkpoint: ./train.safetensors sdxl_train_util.py:70
INFO building U-Net sdxl_model_util.py:192
INFO loading U-Net from checkpoint sdxl_model_util.py:196
INFO U-Net: <All keys matched successfully> sdxl_model_util.py:202
INFO building text encoders sdxl_model_util.py:205
INFO loading text encoders from checkpoint sdxl_model_util.py:258
INFO text encoder 1: <All keys matched successfully> sdxl_model_util.py:272
INFO text encoder 2: <All keys matched successfully> sdxl_model_util.py:276
INFO building VAE sdxl_model_util.py:279
INFO loading VAE from checkpoint sdxl_model_util.py:284
INFO VAE: <All keys matched successfully> sdxl_model_util.py:287
2024-06-24 20:23:11 INFO Enable xformers for U-Net train_util.py:2660
2024-06-24 20:23:11 INFO Enable xformers for U-Net train_util.py:2660
import network module: lycoris.kohya
2024-06-24 20:23:12|[LyCORIS]-INFO: Using rank adaptation algo: lokr
2024-06-24 20:23:12|[LyCORIS]-INFO: Use Dropout value: 0.0
2024-06-24 20:23:12|[LyCORIS]-INFO: Create LyCORIS Module
2024-06-24 20:23:12|[LyCORIS]-INFO: Using rank adaptation algo: lokr
2024-06-24 20:23:12|[LyCORIS]-INFO: Use Dropout value: 0.0
2024-06-24 20:23:12|[LyCORIS]-INFO: Create LyCORIS Module
2024-06-24 20:23:12|[LyCORIS]-INFO: Create LyCORIS Module
2024-06-24 20:23:12|[LyCORIS]-INFO: Create LyCORIS Module
2024-06-24 20:23:12|[LyCORIS]-INFO: create LyCORIS for Text Encoder: 264 modules.
2024-06-24 20:23:12|[LyCORIS]-INFO: Create LyCORIS Module
2024-06-24 20:23:13|[LyCORIS]-INFO: create LyCORIS for Text Encoder: 264 modules.
2024-06-24 20:23:13|[LyCORIS]-INFO: Create LyCORIS Module
2024-06-24 20:23:14|[LyCORIS]-INFO: create LyCORIS for U-Net: 1050 modules.
2024-06-24 20:23:14|[LyCORIS]-INFO: module type table: {'LokrModule': 1058, 'NormModule': 256}
2024-06-24 20:23:14|[LyCORIS]-INFO: enable LyCORIS for text encoder
2024-06-24 20:23:14|[LyCORIS]-INFO: enable LyCORIS for U-Net
2024-06-24 20:23:14 INFO use Lion optimizer | {'weight_decay': 0.1, 'betas': (0.9, 0.95)} train_util.py:3878
2024-06-24 20:23:15|[LyCORIS]-INFO: create LyCORIS for U-Net: 1050 modules.
2024-06-24 20:23:15|[LyCORIS]-INFO: module type table: {'LokrModule': 1058, 'NormModule': 256}
2024-06-24 20:23:15|[LyCORIS]-INFO: enable LyCORIS for text encoder
2024-06-24 20:23:15|[LyCORIS]-INFO: enable LyCORIS for U-Net
prepare optimizer, data loader etc.
2024-06-24 20:23:15 INFO use Lion optimizer | {'weight_decay': 0.1, 'betas': (0.9, 0.95)} train_util.py:3878
override steps. steps for 10 epochs is / 指定エポックまでのステップ数: 381280
enable full fp16 training.
fatal: not a git repository (or any of the parent directories): .git
running training / 学習開始
num train images * repeats / 学習画像の数×繰り返し回数: 4880036
num reg images / 正則化画像の数: 0
num batches per epoch / 1epochのバッチ数: 152512
num epochs / epoch数: 10
batch size per device / バッチサイズ: 16
gradient accumulation steps / 勾配を合計するステップ数 = 4
total optimization steps / 学習ステップ数: 381280
fatal: not a git repository (or any of the parent directories): .git
steps: 0%| | 0/381280 [00:00<?, ?it/s]
epoch 1/10
steps: 0%| | 373/381280 [1:58:19<2014:00:30, 19.03s/it, avr_loss=0.0848]
steps: 0%| | 374/381280 [1:58:25<2010:03:41, 19.00s/it, avr_loss=0.0848]
steps: 0%| | 374/381280 [1:58:25<2010:03:41, 19.00s/it, avr_loss=0.0848]
steps: 0%| | 374/381280 [1:58:30<2011:31:38, 19.01s/it, avr_loss=0.0848]
steps: 0%| | 374/381280 [1:58:35<2012:59:34, 19.03s/it, avr_loss=0.0848]
[rank1]: Traceback (most recent call last):
[rank1]: File "/sd-scripts/sdxl_train_network.py", line 185, in <module>
[rank1]: trainer.train(args)
[rank1]: File "/sd-scripts/train_network.py", line 806, in train
[rank1]: for step, batch in enumerate(train_dataloader):
[rank1]: File "/root/.conda/envs/lora/lib/python3.10/site-packages/accelerate/data_loader.py", line 458, in __iter__
[rank1]: next_batch = next(dataloader_iter)
[rank1]: File "/root/.conda/envs/lora/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
[rank1]: data = self._next_data()
[rank1]: File "/root/.conda/envs/lora/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1346, in _next_data
[rank1]: return self._process_data(data)
[rank1]: File "/root/.conda/envs/lora/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1372, in _process_data
[rank1]: data.reraise()
[rank1]: File "/root/.conda/envs/lora/lib/python3.10/site-packages/torch/_utils.py", line 705, in reraise
[rank1]: raise exception
[rank1]: OSError: Caught OSError in DataLoader worker process 4.
[rank1]: Original Traceback (most recent call last):
[rank1]: File "/root/.conda/envs/lora/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
[rank1]: data = fetcher.fetch(index) # type: ignore[possibly-undefined]
[rank1]: File "/root/.conda/envs/lora/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
[rank1]: data = [self.dataset[idx] for idx in possibly_batched_index]
[rank1]: File "/root/.conda/envs/lora/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
[rank1]: data = [self.dataset[idx] for idx in possibly_batched_index]
[rank1]: File "/root/.conda/envs/lora/lib/python3.10/site-packages/torch/utils/data/dataset.py", line 348, in __getitem__
[rank1]: return self.datasets[dataset_idx][sample_idx]
[rank1]: File "/sd-scripts/library/train_util.py", line 1207, in __getitem__
[rank1]: img, face_cx, face_cy, face_w, face_h = self.load_image_with_face_info(subset, image_info.absolute_path)
[rank1]: File "/sd-scripts/library/train_util.py", line 1092, in load_image_with_face_info
[rank1]: img = load_image(image_path)
[rank1]: File "/sd-scripts/library/train_util.py", line 2352, in load_image
[rank1]: img = np.array(image, np.uint8)
[rank1]: File "/root/.conda/envs/lora/lib/python3.10/site-packages/PIL/Image.py", line 696, in __array_interface__
[rank1]: new["data"] = self.tobytes()
[rank1]: File "/root/.conda/envs/lora/lib/python3.10/site-packages/PIL/Image.py", line 755, in tobytes
[rank1]: self.load()
[rank1]: File "/root/.conda/envs/lora/lib/python3.10/site-packages/PIL/WebPImagePlugin.py", line 160, in load
[rank1]: data, timestamp, duration = self._get_next()
[rank1]: File "/root/.conda/envs/lora/lib/python3.10/site-packages/PIL/WebPImagePlugin.py", line 127, in _get_next
[rank1]: ret = self._decoder.get_next()
[rank1]: OSError: failed to read next frame
steps: 0%| | 374/381280 [1:58:40<2014:26:18, 19.04s/it, avr_loss=0.0848]
W0624 22:22:13.858000 140247365268672 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 75699 closing signal SIGTERM
E0624 22:22:14.275000 140247365268672 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 1 (pid: 75700) of binary: /root/.conda/envs/lora/bin/python3
Traceback (most recent call last):
File "/root/.conda/envs/lora/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/root/.conda/envs/lora/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/root/.conda/envs/lora/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1027, in <module>
main()
File "/root/.conda/envs/lora/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1023, in main
launch_command(args)
File "/root/.conda/envs/lora/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1008, in launch_command
multi_gpu_launcher(args)
File "/root/.conda/envs/lora/lib/python3.10/site-packages/accelerate/commands/launch.py", line 666, in multi_gpu_launcher
distrib_run.run(args)
File "/root/.conda/envs/lora/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
elastic_launch(
File "/root/.conda/envs/lora/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/.conda/envs/lora/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
sdxl_train_network.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-06-24_22:22:13
host : intern-studio-40021203
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 75700)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
This is clearly an image read failure. Check your dataset; ideally, run every image through PIL to confirm it can be opened.
The image might be corrupted. I've updated dev branch to show the file name if OSError occurs, so please try with dev branch.
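For readers who cannot switch branches, the idea of surfacing the file name can be approximated locally. The following is only a minimal sketch of that idea, not the code on the dev branch: `load_with_path_context` is a hypothetical helper name, and `load` stands for whatever function actually decodes the image (in the traceback above, `load_image` in `train_util.py`).

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def load_with_path_context(load: Callable[[str], T], image_path: str) -> T:
    """Call load(image_path) and, on OSError, re-raise with the path in the message.

    Hypothetical helper for illustration; it is not part of sd-scripts.
    """
    try:
        return load(image_path)
    except OSError as exc:
        # Chain the original error so its message ("failed to read next frame")
        # remains visible in the traceback alongside the offending path.
        raise OSError(f"failed to load image: {image_path}") from exc
```

With a wrapper like this around the decode call, the DataLoader worker traceback would name the file that failed instead of only reporting `failed to read next frame`.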
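Look at your log: it plainly tells you that some images have no captions, yet you kept training anyway. Then the error log reports "failed to read next frame", which means there is a problem with your dataset, probably caused by corrupted images. Run a script to check the images: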
from PIL import Image
try:
    with Image.open(image_file_path) as img:
        img.verify()
except (IOError, SyntaxError) as e:
    print(f"Corrupted image file: {image_file_path}, error: {e}")
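If you want to skip the time-consuming check of the cached latents, you can modify the is_disk_cached_latents_is_expected function in sd-scripts/library/train_util.py so that it returns True directly. Good luck with your training.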
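The two dataset checks suggested in this thread (unreadable images and images without captions) could be run over the whole directory before training. Below is a minimal sketch under stated assumptions: the function names and `IMAGE_EXTENSIONS` are illustrative, and the caption lookup assumes one caption per image with the same stem plus the `.txt` extension shown in the log, which is simpler than what sd-scripts itself does. It calls `img.load()` rather than `img.verify()`, because Pillow documents `verify()` as not decoding the image data, while the traceback above fails inside `load()`.

```python
from pathlib import Path
from typing import Callable, List, Tuple

# Assumption: file extensions to treat as images; adjust to match the dataset.
IMAGE_EXTENSIONS = {".webp", ".png", ".jpg", ".jpeg", ".bmp"}


def decode_with_pillow(path: Path) -> None:
    """Fully decode one image with Pillow; raises if the pixel data is unreadable."""
    from PIL import Image  # imported lazily; requires Pillow to be installed

    with Image.open(path) as img:
        img.load()  # forces decoding of the pixel data, which verify() does not do


def list_images(image_dir: str) -> List[Path]:
    """Return the image files directly inside image_dir, sorted by name."""
    return sorted(
        p for p in Path(image_dir).iterdir() if p.suffix.lower() in IMAGE_EXTENSIONS
    )


def find_unreadable_images(
    image_dir: str, decode: Callable[[Path], None] = decode_with_pillow
) -> List[Tuple[Path, str]]:
    """Return (path, error description) for every image that fails to decode."""
    bad = []
    for path in list_images(image_dir):
        try:
            decode(path)
        except Exception as exc:  # Pillow raises different exception types per format
            bad.append((path, f"{type(exc).__name__}: {exc}"))
    return bad


def find_images_without_captions(
    image_dir: str, caption_extension: str = ".txt"
) -> List[Path]:
    """Return images that have no sibling caption file with the same stem."""
    return [
        p for p in list_images(image_dir)
        if not p.with_suffix(caption_extension).exists()
    ]
```

For example, `find_unreadable_images("/train3/1_data")` would return the files to repair or remove. This sketch is single-threaded, so for a dataset of several million images it would likely need to be parallelized.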
While I was training a model on a machine with 2x A100 80G, an error occurred some time after training started:
I hope the author can find the cause of this problem. Thanks!