Open · Hellcat1005 opened this issue 7 months ago
@Hellcat1005 It is difficult to debug this without a reproducible example. What dataset are you trying to use here? Is it a custom one? If you run with dataloader_num_workers>0 using the default dataset lambdalabs/pokemon-blip-captions, does the error still persist?
Could you also make sure you run the accelerate config command and set it up properly before starting training?
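For example, a test along those lines could look something like the command below (adapted from the command already shared in this thread; the output directory is just a placeholder):
accelerate launch --mixed_precision="fp16" train_text_to_image_lora.py --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" --dataset_name="lambdalabs/pokemon-blip-captions" --output_dir="./test-output" --dataloader_num_workers=2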
The command I ran is as follows.
accelerate launch --mixed_precision="fp16" train_text_to_image_lora.py --pretrained_model_name_or_path="D:/work/projects/huggingface_weights/models--runwayml--stable-diffusion-v1-5/snapshots/1d0c4ebf6ff58a5caecab40fa1406526bca4b5b9" --train_data_dir="D:/work/data/mouse/10" --num_train_epochs=100 --output_dir="./experiments/data10/exp1/weights" --mixed_precision="fp16" --dataloader_num_workers=2
@Hellcat1005 It is difficult to debug this without a reproducible example. What dataset are you trying to use here? Is it a custom one? If you run with dataloader_num_workers>0 using the default dataset lambdalabs/pokemon-blip-captions, does the error still persist?
I use a custom dataset. I cannot use lambdalabs/pokemon-blip-captions right now; I am waiting for the author to approve my access request, but it seems to be a bit slow. The command I run is as follows. It works with dataloader_num_workers=0.
accelerate launch --mixed_precision="fp16" train_text_to_image_lora.py --pretrained_model_name_or_path="D:/work/projects/huggingface_weights/models--runwayml--stable-diffusion-v1-5/snapshots/1d0c4ebf6ff58a5caecab40fa1406526bca4b5b9" --train_data_dir="D:/work/data/mouse/10" --num_train_epochs=100 --output_dir="./experiments/data10/exp1/weights" --mixed_precision="fp16" --dataloader_num_workers=2
@DN6 @Hellcat1005 I also ran into this issue when increasing dataloader_num_workers for pretty much any dataset with this script. My issue was solved by moving to Ubuntu/WSL, so I think this is a Windows-specific issue. The reason it happens is that preprocess_train is a local function inside main and can't be pickled when there are multiple dataloader workers (this may be specific to Windows). A similar issue is this.
If you want to make it work on Windows, the main solution is to make preprocess_train/collate_fn and the like module-level (global) functions, like in here.
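For anyone landing here, below is a minimal, self-contained sketch of that suggestion (a toy example, not the actual diffusers script; ToyDataset and the tensor shapes are made up purely for illustration). On Windows, DataLoader workers are started with spawn, so anything referenced by the dataset must be picklable, and a function defined inside main() is not.

import torch
from torch.utils.data import DataLoader, Dataset


class ToyDataset(Dataset):
    """Tiny stand-in for the training dataset; it stores the transform it was given."""

    def __init__(self, transform):
        self.transform = transform

    def __len__(self):
        return 4

    def __getitem__(self, idx):
        return self.transform(torch.randn(3))


# Module-level function: picklable, so Windows' spawn-based workers can receive it.
def preprocess_train(sample):
    return sample * 2


def main():
    # If preprocess_train were defined here, inside main(), pickling the dataset for
    # each worker process would fail on Windows with
    # "AttributeError: Can't pickle local object 'main.<locals>.preprocess_train'".
    loader = DataLoader(ToyDataset(preprocess_train), batch_size=2, num_workers=2)
    for batch in loader:
        print(batch.shape)


if __name__ == "__main__":
    main()

Moving preprocess_train (and similarly collate_fn) to module level, as suggested above, is what makes it picklable for the spawned worker processes.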
Thanks for investigating @isamu-isozaki! Hmm, so the core issue seems to be with PyTorch multiprocessing on Windows then? Perhaps @Hellcat1005 you can modify the script to move the functions @isamu-isozaki mentioned outside of main, or run the script on an Ubuntu/WSL machine?
We can look into restructuring the training scripts to avoid the issue on Windows, but since all the scripts follow a similar structure, this would be an involved task for us at the moment.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Describe the bug
I can run train_text_to_image_lora.py with dataloader_num_workers=0, but it does not work with dataloader_num_workers>0.
Reproduction
I set dataloader_num_workers=4; here is the output.
The following values were not passed to accelerate launch and had defaults used instead:
--num_processes was set to a value of 1
--num_machines was set to a value of 1
--dynamo_backend was set to a value of 'no'
To avoid this warning pass in values for each of the problematic parameters or run accelerate config.
04/12/2024 10:38:20 - INFO - __main__ - Distributed environment: DistributedType.NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda
Mixed precision type: fp16
{'prediction_type', 'timestep_spacing', 'rescale_betas_zero_snr', 'dynamic_thresholding_ratio', 'clip_sample_range', 'variance_type', 'thresholding', 'sample_max_value'} was not found in config. Values will be initialized to default values.
{'force_upcast', 'scaling_factor', 'latents_mean', 'latents_std'} was not found in config. Values will be initialized to default values.
{'only_cross_attention', 'num_attention_heads', 'encoder_hid_dim', 'dropout', 'time_cond_proj_dim', 'time_embedding_dim', 'encoder_hid_dim_type', 'attention_type', 'dual_cross_attention', 'resnet_out_scale_factor', 'projection_class_embeddings_input_dim', 'num_class_embeds', 'cross_attention_norm', 'addition_embed_type', 'time_embedding_type', 'conv_out_kernel', 'conv_in_kernel', 'transformer_layers_per_block', 'mid_block_only_cross_attention', 'use_linear_projection', 'mid_block_type', 'timestep_post_act', 'upcast_attention', 'class_embeddings_concat', 'addition_time_embed_dim', 'class_embed_type', 'resnet_skip_time_act', 'reverse_transformer_layers_per_block', 'addition_embed_type_num_heads', 'time_embedding_act_fn', 'resnet_time_scale_shift'} was not found in config. Values will be initialized to default values.
Resolving data files: 100%|██████████| 21/21 [00:00<?, ?it/s]
04/12/2024 10:38:24 - WARNING - datasets.builder - Found cached dataset imagefolder (C:/Users/HP/.cache/huggingface/datasets/imagefolder/default-f890b3e0a49a7f2c/0.0.0/37fbb85cc714a338bea574ac6c7d0b5be5aff46c1862c1989b20e0771199e93f)
100%|██████████| 1/1 [00:00<00:00, 503.46it/s]
04/12/2024 10:38:25 - INFO - __main__ - Running training
04/12/2024 10:38:25 - INFO - __main__ - Num examples = 20
04/12/2024 10:38:25 - INFO - __main__ - Num Epochs = 100
04/12/2024 10:38:25 - INFO - __main__ - Instantaneous batch size per device = 1
04/12/2024 10:38:25 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 4
04/12/2024 10:38:25 - INFO - __main__ - Gradient Accumulation steps = 4
04/12/2024 10:38:25 - INFO - __main__ - Total optimization steps = 500
Steps: 0%| | 0/500 [00:00<?, ?it/s]
Traceback (most recent call last):
File "D:\work\projects\diffusers\examples\text_to_image\train_text_to_image_lora.py", line 1014, in <module>
main()
File "D:\work\projects\diffusers\examples\text_to_image\train_text_to_image_lora.py", line 763, in main
for step, batch in enumerate(train_dataloader):
File "D:\anaconda3\envs\py312\Lib\site-packages\accelerate\data_loader.py", line 449, in iter
dataloader_iter = super().iter()
^^^^^^^^^^^^^^^^^^
File "D:\anaconda3\envs\py312\Lib\site-packages\torch\utils\data\dataloader.py", line 439, in iter
return self._get_iterator()
^^^^^^^^^^^^^^^^^^^^
File "D:\anaconda3\envs\py312\Lib\site-packages\torch\utils\data\dataloader.py", line 387, in _get_iterator
return _MultiProcessingDataLoaderIter(self)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\anaconda3\envs\py312\Lib\site-packages\torch\utils\data\dataloader.py", line 1040, in init
w.start()
File "D:\anaconda3\envs\py312\Lib\multiprocessing\process.py", line 121, in start
self._popen = self._Popen(self)
^^^^^^^^^^^^^^^^^
File "D:\anaconda3\envs\py312\Lib\multiprocessing\context.py", line 224, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\anaconda3\envs\py312\Lib\multiprocessing\context.py", line 337, in _Popen
return Popen(process_obj)
^^^^^^^^^^^^^^^^^^
File "D:\anaconda3\envs\py312\Lib\multiprocessing\popen_spawn_win32.py", line 95, in init
reduction.dump(process_obj, to_child)
File "D:\anaconda3\envs\py312\Lib\multiprocessing\reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'main.<locals>.preprocess_train'
Steps: 0%| | 0/500 [00:00<?, ?it/s]
Traceback (most recent call last):
File "", line 198, in _run_module_as_main
File "", line 88, in _run_code
File "D:\anaconda3\envs\py312\Scripts\accelerate.exe__main__.py", line 7, in
File "D:\anaconda3\envs\py312\Lib\site-packages\accelerate\commands\accelerate_cli.py", line 46, in main
args.func(args)
File "D:\anaconda3\envs\py312\Lib\site-packages\accelerate\commands\launch.py", line 1057, in launch_command
simple_launcher(args)
File "D:\anaconda3\envs\py312\Lib\site-packages\accelerate\commands\launch.py", line 673, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['D:\anaconda3\envs\py312\python.exe', 'train_text_to_image_lora.py', '--dataloader_num_workers=4']' returned non-zero exit status 1.
(py312) D:\work\projects\diffusers\examples\text_to_image>Traceback (most recent call last):
File "<string>", line 1, in <module>
File "D:\anaconda3\envs\py312\Lib\multiprocessing\spawn.py", line 122, in spawn_main
exitcode = _main(fd, parent_sentinel)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\anaconda3\envs\py312\Lib\multiprocessing\spawn.py", line 132, in _main
self = reduction.pickle.load(from_parent)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
EOFError: Ran out of input
Logs
No response
System Info
diffusers version: 0.28.0.dev0
Who can help?
No response