huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

About dataloader_num_workers in train_text_to_image_lora.py #7646

Open Hellcat1005 opened 7 months ago

Hellcat1005 commented 7 months ago

Describe the bug

I can run train_text_to_image_lora.py with dataloader_num_workers=0, but it does not work with dataloader_num_workers>0.

Reproduction

I set dataloader_num_workers=4; here is the output.

```
The following values were not passed to accelerate launch and had defaults used instead:
        --num_processes was set to a value of 1
        --num_machines was set to a value of 1
        --dynamo_backend was set to a value of 'no'
To avoid this warning pass in values for each of the problematic parameters or run accelerate config.
04/12/2024 10:38:20 - INFO - __main__ - Distributed environment: DistributedType.NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Mixed precision type: fp16

{'prediction_type', 'timestep_spacing', 'rescale_betas_zero_snr', 'dynamic_thresholding_ratio', 'clip_sample_range', 'variance_type', 'thresholding', 'sample_max_value'} was not found in config. Values will be initialized to default values.
{'force_upcast', 'scaling_factor', 'latents_mean', 'latents_std'} was not found in config. Values will be initialized to default values.
{'only_cross_attention', 'num_attention_heads', 'encoder_hid_dim', 'dropout', 'time_cond_proj_dim', 'time_embedding_dim', 'encoder_hid_dim_type', 'attention_type', 'dual_cross_attention', 'resnet_out_scale_factor', 'projection_class_embeddings_input_dim', 'num_class_embeds', 'cross_attention_norm', 'addition_embed_type', 'time_embedding_type', 'conv_out_kernel', 'conv_in_kernel', 'transformer_layers_per_block', 'mid_block_only_cross_attention', 'use_linear_projection', 'mid_block_type', 'timestep_post_act', 'upcast_attention', 'class_embeddings_concat', 'addition_time_embed_dim', 'class_embed_type', 'resnet_skip_time_act', 'reverse_transformer_layers_per_block', 'addition_embed_type_num_heads', 'time_embedding_act_fn', 'resnet_time_scale_shift'} was not found in config. Values will be initialized to default values.
Resolving data files: 100%|██████████| 21/21 [00:00<?, ?it/s]
04/12/2024 10:38:24 - WARNING - datasets.builder - Found cached dataset imagefolder (C:/Users/HP/.cache/huggingface/datasets/imagefolder/default-f890b3e0a49a7f2c/0.0.0/37fbb85cc714a338bea574ac6c7d0b5be5aff46c1862c1989b20e0771199e93f)
100%|██████████| 1/1 [00:00<00:00, 503.46it/s]
04/12/2024 10:38:25 - INFO - __main__ - ***** Running training *****
04/12/2024 10:38:25 - INFO - __main__ -   Num examples = 20
04/12/2024 10:38:25 - INFO - __main__ -   Num Epochs = 100
04/12/2024 10:38:25 - INFO - __main__ -   Instantaneous batch size per device = 1
04/12/2024 10:38:25 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 4
04/12/2024 10:38:25 - INFO - __main__ -   Gradient Accumulation steps = 4
04/12/2024 10:38:25 - INFO - __main__ -   Total optimization steps = 500
Steps:   0%|          | 0/500 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "D:\work\projects\diffusers\examples\text_to_image\train_text_to_image_lora.py", line 1014, in <module>
    main()
  File "D:\work\projects\diffusers\examples\text_to_image\train_text_to_image_lora.py", line 763, in main
    for step, batch in enumerate(train_dataloader):
  File "D:\anaconda3\envs\py312\Lib\site-packages\accelerate\data_loader.py", line 449, in __iter__
    dataloader_iter = super().__iter__()
  File "D:\anaconda3\envs\py312\Lib\site-packages\torch\utils\data\dataloader.py", line 439, in __iter__
    return self._get_iterator()
  File "D:\anaconda3\envs\py312\Lib\site-packages\torch\utils\data\dataloader.py", line 387, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "D:\anaconda3\envs\py312\Lib\site-packages\torch\utils\data\dataloader.py", line 1040, in __init__
    w.start()
  File "D:\anaconda3\envs\py312\Lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)
  File "D:\anaconda3\envs\py312\Lib\multiprocessing\context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "D:\anaconda3\envs\py312\Lib\multiprocessing\context.py", line 337, in _Popen
    return Popen(process_obj)
  File "D:\anaconda3\envs\py312\Lib\multiprocessing\popen_spawn_win32.py", line 95, in __init__
    reduction.dump(process_obj, to_child)
  File "D:\anaconda3\envs\py312\Lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'main.<locals>.preprocess_train'
Steps:   0%|          | 0/500 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "D:\anaconda3\envs\py312\Scripts\accelerate.exe\__main__.py", line 7, in <module>
  File "D:\anaconda3\envs\py312\Lib\site-packages\accelerate\commands\accelerate_cli.py", line 46, in main
    args.func(args)
  File "D:\anaconda3\envs\py312\Lib\site-packages\accelerate\commands\launch.py", line 1057, in launch_command
    simple_launcher(args)
  File "D:\anaconda3\envs\py312\Lib\site-packages\accelerate\commands\launch.py", line 673, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['D:\anaconda3\envs\py312\python.exe', 'train_text_to_image_lora.py', '--dataloader_num_workers=4']' returned non-zero exit status 1.

(py312) D:\work\projects\diffusers\examples\text_to_image>Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "D:\anaconda3\envs\py312\Lib\multiprocessing\spawn.py", line 122, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "D:\anaconda3\envs\py312\Lib\multiprocessing\spawn.py", line 132, in _main
    self = reduction.pickle.load(from_parent)
EOFError: Ran out of input
```

Logs

No response

System Info

Who can help?

No response

DN6 commented 7 months ago

@Hellcat1005 It is difficult to debug this without a reproducible example. What dataset are you trying to use here? Is it a custom one? If you try running with dataloader_num_workers>0 with the default dataset lambdalabs/pokemon-blip-captions does the error still persist?

Hellcat1005 commented 7 months ago

Could you make sure that you ran the accelerate config command and set it up properly before starting training?

The command I ran is as follows.

```
accelerate launch --mixed_precision="fp16" train_text_to_image_lora.py --pretrained_model_name_or_path="D:/work/projects/huggingface_weights/models--runwayml--stable-diffusion-v1-5/snapshots/1d0c4ebf6ff58a5caecab40fa1406526bca4b5b9" --train_data_dir="D:/work/data/mouse/10" --num_train_epochs=100 --output_dir="./experiments/data10/exp1/weights" --mixed_precision="fp16" --dataloader_num_workers=2
```

Hellcat1005 commented 7 months ago

@Hellcat1005 It is difficult to debug this without a reproducible example. What dataset are you trying to use here? Is it a custom one? If you try running with dataloader_num_workers>0 with the default dataset lambdalabs/pokemon-blip-captions does the error still persist?

I use a custom dataset. I cannot use lambdalabs/pokemon-blip-captions right now; I am waiting for the author to approve my access request, but it seems to be a bit slow. The command I ran is as follows. It works with dataloader_num_workers=0.

```
accelerate launch --mixed_precision="fp16" train_text_to_image_lora.py --pretrained_model_name_or_path="D:/work/projects/huggingface_weights/models--runwayml--stable-diffusion-v1-5/snapshots/1d0c4ebf6ff58a5caecab40fa1406526bca4b5b9" --train_data_dir="D:/work/data/mouse/10" --num_train_epochs=100 --output_dir="./experiments/data10/exp1/weights" --mixed_precision="fp16" --dataloader_num_workers=2
```

isamu-isozaki commented 6 months ago

@DN6 @Hellcat1005 I also ran into this issue when increasing dataloader_num_workers with pretty much any dataset for this script. My issue was solved by moving to Ubuntu/WSL, so I think this is a Windows-specific problem. The reason it happens is that preprocess_train is a local function inside main and cannot be pickled when the DataLoader uses multiple workers (Windows starts worker processes with the spawn method, so everything handed to them must be picklable). A similar issue is this.
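
To make the failure mode concrete, here is a minimal, self-contained sketch (the toy dataset and names are hypothetical, not taken from the training script): a function defined inside main() is handed to a multi-worker DataLoader, which must be pickled because Windows starts its workers with the spawn method.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ToyDataset(Dataset):
    """Hypothetical stand-in for the training dataset."""

    def __len__(self):
        return 4

    def __getitem__(self, idx):
        return torch.tensor([float(idx)])


def main():
    def preprocess_train(examples):  # local function -> cannot be pickled
        return torch.stack(examples)

    loader = DataLoader(
        ToyDataset(), batch_size=2, num_workers=2, collate_fn=preprocess_train
    )
    # On Windows (spawn start method) iterating raises:
    #   AttributeError: Can't pickle local object 'main.<locals>.preprocess_train'
    # On Linux (fork start method) the same code usually runs fine.
    for batch in loader:
        print(batch.shape)


if __name__ == "__main__":
    main()
```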

If you want to make it work on Windows, the main solution is to make preprocess_train, collate_fn, and the like module-level (global) functions, like in here.
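
And here is a minimal sketch of that module-level restructuring, again with hypothetical names rather than a patch to the actual script; run-specific settings are bound with functools.partial, which stays picklable as long as the underlying function is defined at module scope.

```python
from functools import partial

import torch
from torch.utils.data import DataLoader, Dataset


class ToyDataset(Dataset):
    """Hypothetical stand-in for the training dataset."""

    def __len__(self):
        return 8

    def __getitem__(self, idx):
        return {"pixel_values": torch.randn(3, 4, 4)}


def preprocess_train(example, scale):
    # Module-level: spawned worker processes can unpickle and import it.
    return {"pixel_values": example["pixel_values"] * scale}


def collate_fn(examples, scale):
    examples = [preprocess_train(e, scale) for e in examples]
    return torch.stack([e["pixel_values"] for e in examples])


def main():
    loader = DataLoader(
        ToyDataset(),
        batch_size=2,
        num_workers=2,
        collate_fn=partial(collate_fn, scale=0.5),  # picklable, unlike a closure
    )
    for batch in loader:
        print(batch.shape)


if __name__ == "__main__":
    main()
```

The same pattern applies to the script's preprocess_train/collate_fn: define them at module level and pass anything they currently capture from main (tokenizer, transforms, column names) as bound arguments.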

DN6 commented 6 months ago

Thanks for investigating @isamu-isozaki! Hmm, so the core issue seems to be with PyTorch multiprocessing on Windows then? Perhaps @Hellcat1005 you can modify the script to move the functions @isamu-isozaki mentioned outside of main, or run the script on an Ubuntu/WSL machine?

We can look into restructuring the training scripts to avoid the issue on Windows, but since all the scripts follow a similar structure, that would be an involved task for us at the moment.

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] commented 1 day ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.