huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

Multiprocessing error from data loader in train_instruct_pix2pix_sdxl.py #4639

Closed frankjiang closed 10 months ago

frankjiang commented 1 year ago

Describe the bug

Short Description

Running the training with train_instruct_pix2pix_sdxl.py fails whenever --dataloader_num_workers is greater than zero, with: RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

What I have tried

I found:

  1. Setting it to zero (loading in the main process) avoids the error.
  2. The argument --dataloader_num_workers works fine in the script train_instruct_pix2pix.py.

I've also tried adding the following code at the beginning of the script, but it still failed:

import multiprocessing
multiprocessing.set_start_method('spawn')

The main part of the log after specifying spawn:

Traceback (most recent call last):
  File "/home/frank/git/thirdparty/diffusers/train/train_instruct_pix2pix_sdxl.py", line 1224, in <module>
    main()
  File "/home/frank/git/thirdparty/diffusers/train/train_instruct_pix2pix_sdxl.py", line 959, in main
    for step, batch in enumerate(train_dataloader):
  File "/home/frank/anaconda3/envs/sd/lib/python3.10/site-packages/accelerate/data_loader.py", line 381, in __iter__
    dataloader_iter = super().__iter__()
  File "/home/frank/anaconda3/envs/sd/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 441, in __iter__
    return self._get_iterator()
  File "/home/frank/anaconda3/envs/sd/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 388, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/home/frank/anaconda3/envs/sd/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1042, in __init__
    w.start()
  File "/home/frank/anaconda3/envs/sd/lib/python3.10/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/home/frank/anaconda3/envs/sd/lib/python3.10/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/home/frank/anaconda3/envs/sd/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
    return Popen(process_obj)
  File "/home/frank/anaconda3/envs/sd/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/home/frank/anaconda3/envs/sd/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/home/frank/anaconda3/envs/sd/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/home/frank/anaconda3/envs/sd/lib/python3.10/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'main.<locals>.preprocess_train'

Reproduction

export NUM_OF_LOADERS=32  # set it to zero (load in main process) will skip the error

accelerate config default

accelerate launch --mixed_precision="fp16" train_instruct_pix2pix_sdxl.py \
    --pretrained_model_name_or_path=$MODEL_NAME \
    --dataset_name=$DATASET_ID \
    --original_image_column="original_image" \
    --edited_image_column="edited_image" \
    --edit_prompt_column="edit_prompt" \
    --enable_xformers_memory_efficient_attention \
    --allow_tf32 \
    --use_ema \
    --use_8bit_adam \
    --resolution 256 \
    --random_flip \
    --train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --num_train_epochs 10 \
    --checkpointing_steps 500 \
    --learning_rate "5e-05" \
    --max_grad_norm 1 \
    --lr_warmup_steps 0 \
    --conditioning_dropout_prob 0.05 \
    --mixed_precision "fp16" \
    --dataloader_num_workers ${NUM_OF_LOADERS} \
    --seed 42 \
    --validation_steps 500

Logs

2023-08-17 06:15:14.651678: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-08-17 06:15:14.680856: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-08-17 06:15:15.122553: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Configuration already exists at /home/frank/.cache/huggingface/accelerate/default_config.yaml, will not override. Run `accelerate config` manually or pass a different `save_location`.
2023-08-17 06:15:17.799941: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-08-17 06:15:17.828894: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-08-17 06:15:18.269743: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2023-08-17 06:15:20.389164: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-08-17 06:15:20.418189: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-08-17 06:15:20.869733: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
08/17/2023 06:15:21 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Mixed precision type: fp16

{'attention_type'} was not found in config. Values will be initialized to default values.
08/17/2023 06:15:24 - INFO - __main__ - Initializing the XL InstructPix2Pix UNet from the pretrained UNet.
/home/frank/git/thirdparty/diffusers/train/train_instruct_pix2pix_sdxl.py:658: UserWarning: weight_dtype torch.float16 may cause nan during vae encoding
  warnings.warn(f"weight_dtype {weight_dtype} may cause nan during vae encoding", UserWarning)
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
{'clip_sample_range', 'thresholding', 'variance_type', 'dynamic_thresholding_ratio'} was not found in config. Values will be initialized to default values.
08/17/2023 06:15:50 - INFO - __main__ - ***** Running training *****
08/17/2023 06:15:50 - INFO - __main__ -   Num examples = 5191
08/17/2023 06:15:50 - INFO - __main__ -   Num Epochs = 10
08/17/2023 06:15:50 - INFO - __main__ -   Instantaneous batch size per device = 4
08/17/2023 06:15:50 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 16
08/17/2023 06:15:50 - INFO - __main__ -   Gradient Accumulation steps = 4
08/17/2023 06:15:50 - INFO - __main__ -   Total optimization steps = 3250
Steps:   0%|                                                                                                                         | 0/3250 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/home/frank/git/thirdparty/diffusers/train/train_instruct_pix2pix_sdxl.py", line 1219, in <module>
    main()
  File "/home/frank/git/thirdparty/diffusers/train/train_instruct_pix2pix_sdxl.py", line 954, in main
    for step, batch in enumerate(train_dataloader):
  File "/home/frank/anaconda3/envs/sd/lib/python3.10/site-packages/accelerate/data_loader.py", line 384, in __iter__
    current_batch = next(dataloader_iter)
  File "/home/frank/anaconda3/envs/sd/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 633, in __next__
    data = self._next_data()
  File "/home/frank/anaconda3/envs/sd/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1345, in _next_data
    return self._process_data(data)
  File "/home/frank/anaconda3/envs/sd/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1371, in _process_data
    data.reraise()
  File "/home/frank/anaconda3/envs/sd/lib/python3.10/site-packages/torch/_utils.py", line 644, in reraise
    raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/frank/anaconda3/envs/sd/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/frank/anaconda3/envs/sd/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = self.dataset.__getitems__(possibly_batched_index)
  File "/home/frank/anaconda3/envs/sd/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 2807, in __getitems__
    batch = self.__getitem__(keys)
  File "/home/frank/anaconda3/envs/sd/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 2803, in __getitem__
    return self._getitem(key)
  File "/home/frank/anaconda3/envs/sd/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 2788, in _getitem
    formatted_output = format_table(
  File "/home/frank/anaconda3/envs/sd/lib/python3.10/site-packages/datasets/formatting/formatting.py", line 629, in format_table
    return formatter(pa_table, query_type=query_type)
  File "/home/frank/anaconda3/envs/sd/lib/python3.10/site-packages/datasets/formatting/formatting.py", line 400, in __call__
    return self.format_batch(pa_table)
  File "/home/frank/anaconda3/envs/sd/lib/python3.10/site-packages/datasets/formatting/formatting.py", line 515, in format_batch
    return self.transform(batch)
  File "/home/frank/git/thirdparty/diffusers/train/train_instruct_pix2pix_sdxl.py", line 834, in preprocess_train
    prompt_embeds_all, add_text_embeds_all = compute_embeddings_for_prompts(captions, text_encoders, tokenizers)
  File "/home/frank/git/thirdparty/diffusers/train/train_instruct_pix2pix_sdxl.py", line 788, in compute_embeddings_for_prompts
    prompt_embeds_all, pooled_prompt_embeds_all = encode_prompts(text_encoders, tokenizers, prompts)
  File "/home/frank/git/thirdparty/diffusers/train/train_instruct_pix2pix_sdxl.py", line 777, in encode_prompts
    prompt_embeds, pooled_prompt_embeds = encode_prompt(text_encoders, tokenizers, prompt)
  File "/home/frank/git/thirdparty/diffusers/train/train_instruct_pix2pix_sdxl.py", line 756, in encode_prompt
    text_input_ids.to(text_encoder.device),
  File "/home/frank/anaconda3/envs/sd/lib/python3.10/site-packages/torch/cuda/__init__.py", line 235, in _lazy_init
    raise RuntimeError(
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

Steps:   0%|                                                                                                                         | 0/3250 [00:06<?, ?it/s]
Traceback (most recent call last):
  File "/home/frank/anaconda3/envs/sd/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/frank/anaconda3/envs/sd/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/home/frank/anaconda3/envs/sd/lib/python3.10/site-packages/accelerate/commands/launch.py", line 979, in launch_command
    simple_launcher(args)
  File "/home/frank/anaconda3/envs/sd/lib/python3.10/site-packages/accelerate/commands/launch.py", line 628, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/frank/anaconda3/envs/sd/bin/python', 'train_instruct_pix2pix_sdxl.py', '--pretrained_model_name_or_path=stabilityai/stable-diffusion-xl-base-1.0', '--dataset_name=/home/frank/.cache/huggingface/datasets/degraded-image-pairs-6.8/default-23e2e744b3c8a7b7/0.0.0', '--original_image_column=original_image', '--edited_image_column=edited_image', '--edit_prompt_column=edit_prompt', '--enable_xformers_memory_efficient_attention', '--allow_tf32', '--use_ema', '--use_8bit_adam', '--resolution', '256', '--random_flip', '--train_batch_size', '4', '--gradient_accumulation_steps', '4', '--num_train_epochs', '10', '--checkpointing_steps', '500', '--learning_rate', '5e-05', '--max_grad_norm', '1', '--lr_warmup_steps', '0', '--conditioning_dropout_prob', '0.05', '--mixed_precision', 'fp16', '--dataloader_num_workers', '32', '--seed', '42', '--validation_steps', '500']' returned non-zero exit status 1.

System Info

diffusers-cli env

Who can help?

@williamberman @sayakpaul

bghira commented 1 year ago

I believe that for this to work, you have to write the script in such a way that the functions are not defined in the main function's scope; as the traceback at the bottom tells you, a local function cannot be pickled.

Multiprocessing is trying to pickle things and essentially serialise them for consumption in the child processes. It is quite annoying.

You will have to refactor that script to instead rely on external modules, but that isn't likely to be supported by Hugging Face, as it goes against the philosophy document.

Can you identify why it works in SD 1.5 but not SDXL?
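
For reference, a minimal sketch (separate from the training script) of the pickling constraint described above: the spawn start method has to pickle the dataset transform before handing it to a worker, and a closure defined inside main(), like preprocess_train, cannot be pickled, whereas a module-level function can.

import pickle


def module_level_fn(x):
    # Defined at module level: picklable by reference, so a 'spawn' worker can import it.
    return x * 2


def main():
    def local_fn(x):
        # Defined inside main(), like preprocess_train: a local object that cannot be pickled.
        return x * 2

    pickle.dumps(module_level_fn)  # works
    try:
        # Effectively what ForkingPickler has to do to send the transform to a worker.
        pickle.dumps(local_fn)
    except AttributeError as err:
        print(err)  # Can't pickle local object 'main.<locals>.local_fn'


if __name__ == "__main__":
    main()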

frankjiang commented 1 year ago

> I believe that for this to work, you have to write the script in such a way that the functions are not defined in the main function's scope; as the traceback at the bottom tells you, a local function cannot be pickled.
>
> Multiprocessing is trying to pickle things and essentially serialise them for consumption in the child processes. It is quite annoying.
>
> You will have to refactor that script to instead rely on external modules, but that isn't likely to be supported by Hugging Face, as it goes against the philosophy document.
>
> Can you identify why it works in SD 1.5 but not SDXL?

Following the log, the error happens in preprocess_train. Compared to the SD 1.5 version, there are differences:

# SD 1.5
captions = list(examples[edit_prompt_column])
examples["input_ids"] = tokenize_captions(captions)

# SDXL
captions = list(examples[edit_prompt_column])
prompt_embeds_all, add_text_embeds_all = compute_embeddings_for_prompts(captions, text_encoders, tokenizers)
examples["prompt_embeds"] = prompt_embeds_all
examples["add_text_embeds"] = add_text_embeds_all

in which

    # Adapted from examples.dreambooth.train_dreambooth_lora_sdxl
    # Here, we compute not just the text embeddings but also the additional embeddings
    # needed for the SD XL UNet to operate.
    def compute_embeddings_for_prompts(prompts, text_encoders, tokenizers):
        with torch.no_grad():
            prompt_embeds_all, pooled_prompt_embeds_all = encode_prompts(text_encoders, tokenizers, prompts)
            add_text_embeds_all = pooled_prompt_embeds_all

            prompt_embeds_all = prompt_embeds_all.to(accelerator.device)
            add_text_embeds_all = add_text_embeds_all.to(accelerator.device)
        return prompt_embeds_all, add_text_embeds_all

I'm not sure if the inner function is to blame.
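
Both failure modes point at that transform: under the default fork start method each DataLoader worker tries to re-initialize CUDA because the text encoders sit on accelerator.device, and under spawn the local preprocess_train cannot be pickled. A possible workaround, sketched below under assumptions (the helper name precompute_embeddings, the train_dataset variable, and the use of datasets.Dataset.map are not from the script), would be to run the text encoders once in the main process before the DataLoader is built, so workers only ever handle CPU data:

import torch

def precompute_embeddings(batch):
    # Hypothetical helper: run the SDXL text encoders on the GPU in the main process only.
    with torch.no_grad():
        prompt_embeds_all, pooled_prompt_embeds_all = encode_prompts(
            text_encoders, tokenizers, list(batch["edit_prompt"])
        )
    # Hand back CPU tensors so the DataLoader workers never need a CUDA context.
    return {
        "prompt_embeds": prompt_embeds_all.cpu(),
        "add_text_embeds": pooled_prompt_embeds_all.cpu(),
    }

# Map once, before the DataLoader is constructed; workers then only read cached columns.
train_dataset = train_dataset.map(precompute_embeddings, batched=True, batch_size=64)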

sayakpaul commented 1 year ago

This could be because of the interplay between pre-computing the prompt embeddings (which uses the GPU) and the spawning of multiple data loader worker processes.

So it's recommended not to use multiple workers here.
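
Concretely, that means setting the flag to zero in the reproduction command above:

--dataloader_num_workers 0

The pre-computation of the prompt embeddings then runs in the main process, which already holds the CUDA context.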

github-actions[bot] commented 11 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.