kohya-ss / sd-scripts

Apache License 2.0

tensorflow.python.framework.errors_impl.ResourceExhaustedError: #1160

Closed FurkanGozukara closed 7 months ago

FurkanGozukara commented 7 months ago

I had been running a training job on RunPod for over 21 hours (21:52:51 elapsed).

Training was cancelled with the following error.

The pod still has over 100 GB of free disk space.

█████ | 32547/51360 [21:52:50<12:38:51, 2.42s/it, avr_loss=0.0996]
Traceback (most recent call last):
  File "/workspace/kohya_ss/./sdxl_train.py", line 792, in <module>
    train(args)
  File "/workspace/kohya_ss/./sdxl_train.py", line 657, in train
    accelerator.log(logs, step=global_step)
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 619, in _inner
    return PartialState().on_main_process(function)(*args, **kwargs)
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 2399, in log
    tracker.log(values, step=step, **log_kwargs.get(tracker.name, {}))
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/accelerate/tracking.py", line 79, in execute_on_main_process
    return PartialState().on_main_process(function)(self, *args, **kwargs)
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/accelerate/tracking.py", line 247, in log
    self.writer.flush()
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py", line 1200, in flush
    writer.flush()
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py", line 150, in flush
    self.event_writer.flush()
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/tensorboard/summary/writer/event_file_writer.py", line 125, in flush
    self._async_writer.flush()
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/tensorboard/summary/writer/event_file_writer.py", line 190, in flush
    self._writer.flush()
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/tensorboard/summary/writer/record_writer.py", line 43, in flush
    self._writer.flush()
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/tensorflow/python/lib/io/file_io.py", line 221, in flush
    self._writable_file.flush()
tensorflow.python.framework.errors_impl.ResourceExhaustedError: /workspace/stable-diffusion-webui/models/Stable-diffusion/sdxl_1_fp32/log/20240306030504/finetuning/events.out.tfevents.1709694327.7f4fcd28f189.2380.0; Disk quota exceeded
steps: 63%|████████████████████████████████████████ | 32547/51360 [21:52:51<12:38:51, 2.42s/it, avr_loss=0.0996]
Traceback (most recent call last):
  File "/workspace/kohya_ss/venv/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1017, in launch_command
    simple_launcher(args)
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 637, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/workspace/kohya_ss/venv/bin/python', './sdxl_train.py',
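
The TensorFlow error above ends with "Disk quota exceeded" rather than reporting a full disk, so the free space a volume reports and the ability to write to it can disagree. The following is a minimal sketch, not part of sd-scripts, for comparing the two on a given directory. The function names `probe_write` and `report` are hypothetical, and it assumes the platform reports quota failures as `errno.EDQUOT`.

```python
import errno
import os
import shutil
import tempfile

# EDQUOT may be missing on some platforms, so look it up defensively.
EDQUOT = getattr(errno, "EDQUOT", None)


def probe_write(directory: str, nbytes: int = 1024 * 1024) -> str:
    """Write nbytes to a temporary file in `directory` and classify the result.

    Returns "ok", "quota" (EDQUOT), or "full" (ENOSPC); other errors are re-raised.
    The temporary file is deleted automatically when it is closed.
    """
    try:
        with tempfile.NamedTemporaryFile(dir=directory) as f:
            f.write(b"\0" * nbytes)
            f.flush()
            os.fsync(f.fileno())  # push the data through to the filesystem
        return "ok"
    except OSError as exc:
        if EDQUOT is not None and exc.errno == EDQUOT:
            return "quota"
        if exc.errno == errno.ENOSPC:
            return "full"
        raise


def report(directory: str) -> dict:
    """Combine the reported free space with the result of a real write."""
    usage = shutil.disk_usage(directory)
    return {
        "free_gb": usage.free / 1e9,
        "write_probe": probe_write(directory),
    }
```

Calling `report()` on the log directory would show whether a write fails with a quota error while `free_gb` is still large, which is the situation described in this issue. A successful small probe does not rule out a quota that is hit only by larger or later writes.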

kohya-ss commented 7 months ago

TensorBoard logging writes data at every step, and the error says "Disk quota exceeded", so even though the disk has free space, some quota may be applied to output written to that disk. Unfortunately, there is no option to disable step-wise logging, so please disable logging entirely by removing the logging_dir option.
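
For readers willing to modify their own copy of the training script instead of turning logging off, one possible workaround is to forward only every N-th step to the logger. The sketch below is an assumption-laden illustration, not an sd-scripts feature: `ThrottledLogger` and `every_n_steps` are hypothetical names, and the wrapped callable is assumed to accept `(values, step=...)`, matching the `accelerator.log(logs, step=global_step)` call visible in the traceback.

```python
from typing import Any, Callable, Dict


class ThrottledLogger:
    """Forward a log call to `log_fn` only on every N-th step (hypothetical helper)."""

    def __init__(self, log_fn: Callable[..., Any], every_n_steps: int = 100) -> None:
        if every_n_steps < 1:
            raise ValueError("every_n_steps must be at least 1")
        self.log_fn = log_fn
        self.every_n_steps = every_n_steps

    def log(self, values: Dict[str, float], step: int) -> bool:
        """Log `values` if `step` is a multiple of N; return True if forwarded."""
        if step % self.every_n_steps != 0:
            return False
        self.log_fn(values, step=step)
        return True


# Hypothetical use inside a training loop:
#   throttled = ThrottledLogger(accelerator.log, every_n_steps=100)
#   throttled.log(logs, step=global_step)
```

This reduces how many events are written to the event file, but it does not address the quota itself, so a long enough run could still reach the limit.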

FurkanGozukara commented 7 months ago

Thanks, I will remember this.