kijai / ComfyUI-FluxTrainer

Apache License 2.0

ERROR: OOM on [FluxTrainValidate], but training runs without problems #59

Open Maelstrom2014 opened 1 month ago

Maelstrom2014 commented 1 month ago
                    ERROR    Traceback (most recent call last):                                                   execution.py:387
  File "C:\ai\comfyui\ComfyUI\execution.py", line 317, in execute
    output_data, output_ui, has_subgraph = get_output_data(obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb)
  File "C:\ai\comfyui\ComfyUI\execution.py", line 192, in get_output_data
    return_values = _map_node_over_list(obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb)
  File "C:\ai\comfyui\ComfyUI\execution.py", line 169, in _map_node_over_list
    process_inputs(input_dict, i)
  File "C:\ai\comfyui\ComfyUI\execution.py", line 158, in process_inputs
    results.append(getattr(obj, func)(**inputs))
  File "C:\ai\comfyui\ComfyUI\custom_nodes\ComfyUI-FluxTrainer\nodes.py", line 1064, in validate
    image_tensors = network_trainer.sample_images(*params)
  File "C:\ai\comfyui\ComfyUI\custom_nodes\ComfyUI-FluxTrainer\flux_train_network_comfy.py", line 290, in sample_images
    image_tensors = flux_train_utils.sample_images(
  File "C:\ai\comfyui\ComfyUI\custom_nodes\ComfyUI-FluxTrainer\library\flux_train_utils.py", line 89, in sample_images
    image_tensor = sample_image_inference(
  File "C:\ai\comfyui\ComfyUI\custom_nodes\ComfyUI-FluxTrainer\library\flux_train_utils.py", line 233, in sample_image_inference
    x = ae.decode(x)
  File "C:\ai\comfyui\ComfyUI\custom_nodes\ComfyUI-FluxTrainer\library\flux_models.py", line 348, in decode
    return self.decoder(z)
  File "C:\ai\comfyui\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\ai\comfyui\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\ai\comfyui\ComfyUI\custom_nodes\ComfyUI-FluxTrainer\library\flux_models.py", line 280, in forward
    h = self.up[i_level].block[i_block](h)
  File "C:\ai\comfyui\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\ai\comfyui\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\ai\comfyui\ComfyUI\custom_nodes\ComfyUI-FluxTrainer\library\flux_models.py", line 103, in forward
    h = swish(h)
  File "C:\ai\comfyui\ComfyUI\custom_nodes\ComfyUI-FluxTrainer\library\flux_models.py", line 53, in swish
    return x * torch.sigmoid(x)
torch.OutOfMemoryError: Allocation on device

                    ERROR    Got an OOM, unloading all loaded models.                                             execution.py:397
2024-09-12 17:32:05.327892 #8 [FluxTrainValidate]: 84.79s
                    INFO     Prompt executed in 4121.35 seconds                                                        main.py:138
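
The traceback ends in `swish`, called from the decoder's upsampling blocks during `ae.decode(x)`, which is typically where a decoder's activations are largest. As a rough back-of-the-envelope sketch of why that one line can tip a nearly full GPU over the edge: the sample resolution, channel count, and dtype below are assumptions for illustration, not values from the log, and `tensor_bytes` and `swish_extra_bytes` are hypothetical helper names.

```python
def tensor_bytes(batch, channels, height, width, bytes_per_element):
    """Size in bytes of one dense activation tensor of shape (B, C, H, W)."""
    return batch * channels * height * width * bytes_per_element


def swish_extra_bytes(batch, channels, height, width, bytes_per_element):
    """Memory newly allocated by `x * torch.sigmoid(x)` in eager mode.

    The input x already exists. sigmoid(x) and the product are each a new
    tensor of the same shape, so two extra tensors are alive at the peak.
    """
    return 2 * tensor_bytes(batch, channels, height, width, bytes_per_element)


# Assumed numbers, NOT taken from the log: one 1024x1024 sample,
# 128 channels at full resolution, float32 activations (4 bytes each).
one = tensor_bytes(1, 128, 1024, 1024, 4)
extra = swish_extra_bytes(1, 128, 1024, 1024, 4)

print(f"one activation tensor: {one / 2**20:.0f} MiB")
print(f"extra memory for swish: {extra / 2**20:.0f} MiB")
```

Under these assumptions a single activation is 512 MiB and the `swish` call needs about 1 GiB on top of it. With 2-byte activations (for example bfloat16) both figures halve. A lower validation resolution shrinks them with the pixel count.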
Maelstrom2014 commented 1 month ago

For now, the only fix is to delete all validation sampling nodes. :(

WillScarlettOhara commented 1 month ago

For validation sampling only, please let those of us with 10-12 GB VRAM GPUs use a quantized model that consumes less VRAM, such as a GGUF one. Maybe add an optional pretrained Flux model input to the Flux Train Validation Settings node. Not being able to follow how the training evolves is quite restrictive.

Thank you
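
For illustration, an optional model input on a ComfyUI node could look roughly like the sketch below. It relies only on the usual ComfyUI node conventions (`INPUT_TYPES` with `required` and `optional` keys, `RETURN_TYPES`, `FUNCTION`, `CATEGORY`). The class name, the `steps` and `validation_model` inputs, and the return type are all hypothetical and are not taken from ComfyUI-FluxTrainer.

```python
class ValidationSettingsSketch:
    """Hypothetical node, NOT the actual Flux Train Validation Settings node."""

    @classmethod
    def INPUT_TYPES(cls):
        return {
            "required": {
                # Hypothetical setting, present only to make the sketch complete.
                "steps": ("INT", {"default": 20, "min": 1, "max": 256}),
            },
            # Inputs listed under "optional" may be left unconnected.
            "optional": {
                # Hypothetical input for a lighter model used only for sampling.
                "validation_model": ("MODEL",),
            },
        }

    RETURN_TYPES = ("VALIDATION_SETTINGS_SKETCH",)  # hypothetical type name
    FUNCTION = "build"
    CATEGORY = "sketch"

    def build(self, steps, validation_model=None):
        # An unconnected optional input is simply not passed, so the
        # default of None means "sample with the model being trained".
        settings = {"steps": steps, "validation_model": validation_model}
        return (settings,)
```

The sampling code would still have to pick up `validation_model` and use it instead of the model being trained. That part is not shown here.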