microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] RecursionError: maximum recursion depth exceeded while calling a Python object #4440

Closed nahidalam closed 1 month ago

nahidalam commented 1 year ago

Describe the bug Multi-GPU training with DeepSpeed gives me this error. The training works fine on a single GPU.

RecursionError: maximum recursion depth exceeded while calling a Python object

To Reproduce I get the recursion error in the training loop at this line:

outputs = model(pixel_values=batch["pixel_values"].to(device),
                input_boxes=batch["input_boxes"].to(device),
                multimask_output=False)

My code looks like the following:

import deepspeed
import torch
import wandb
from torch.utils.data import DataLoader
from tqdm import tqdm
from transformers import SamModel, SamProcessor

ds_config = {
    "train_batch_size": 8,
    "steps_per_print": 10,
}

def train_loop(model, optimizer, dataloader):
    for epoch in range(num_epochs):
        epoch_losses = []
        for batch in tqdm(dataloader):
            outputs = model(pixel_values=batch["pixel_values"].to(device),
                            input_boxes=batch["input_boxes"].to(device),
                            multimask_output=False)

            # compute loss

            # backward pass (compute gradients of parameters w.r.t. loss)
            optimizer.zero_grad()
            loss.backward()

            # optimize
            optimizer.step()

def main():
    model = SamModel.from_pretrained("facebook/sam-vit-base")
    processor = SamProcessor.from_pretrained("facebook/sam-vit-base")
    optimizer = torch.optim.Adam(model.mask_decoder.parameters(), lr=lr, weight_decay=wd)
    train_dataset = MyCustomDataset(....)
    train_dataloader = DataLoader(train_dataset, batch_size=ds_config["train_batch_size"], shuffle=True)

    # Wrap the model and optimizer with Deepspeed for model parallelism
    model, optimizer, _, _ = deepspeed.initialize(model=model, optimizer=optimizer, config=ds_config) 
    model.to(device)  
    # make sure we only compute gradients for mask decoder
    for name, param in model.named_parameters():
        if name.startswith("vision_encoder") or name.startswith("prompt_encoder"):
            param.requires_grad_(False)

    # Wrap the model with DataParallel for distributed training
    model = torch.nn.DataParallel(model)
    wandb.watch(model)
    model.train()
    train_loop(model, optimizer, train_dataloader)

Expected behavior The training should run without errors.

ds_report output

[2023-10-02 16:52:56,487] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)

DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja .................. [OKAY]

op name ................ installed .. compatible

[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]

DeepSpeed general environment info:
torch install path ............... ['/home/..../env/lib/python3.9/site-packages/torch']
torch version .................... 2.0.0+cu117
deepspeed install path ........... ['/home/..../env/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.10.3, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.5
deepspeed wheel compiled w. ...... torch 0.0, cuda 0.0
shared memory (/dev/shm) size .... 755.27 GB


Launcher context deepspeed --num_gpus=2 app.py

Docker context Conda environment with Python 3.9. requirements.txt below:

gradio==3.27.0
matplotlib==3.7.1
numpy==1.24.1
opencv_python==4.7.0.72
Pillow==9.3.0
pycocotools==2.0.6
segment_anything==1.0
torch==2.0.0
torchvision==0.15.1
tqdm==4.65.0
jupyterlab==4.0.5
jupyter-ai
jupyter==1.0.0
jupyterlab-widgets<3
ipywidgets==7.7.2
ipykernel==6.23.1
ipympl==0.9.3
jupyter-bbox-widget==0.5.0
roboflow==1.0.8
dataclasses-json==0.5.7
supervision==0.7.0
pandas
scikit-learn
wget
wandb==0.15.7
typing-extensions!=4.7.0,>=4.6.0
datasets==2.14.4
rust
cargo
monai==1.2.0
scikit-image==0.21.0
transformers==4.33.0
deepspeed==0.10.3


mrwyattii commented 1 year ago

@nahidalam can you please provide the full stack trace so we can see where the error is coming from? Thanks

nahidalam commented 1 year ago

@mrwyattii please see the full stack trace below

[2023-10-03 06:40:38,730] [INFO] [config.py:971:print]   optimizer_legacy_fusion ...... False
[2023-10-03 06:40:38,730] [INFO] [config.py:971:print]   optimizer_name ............... None
[2023-10-03 06:40:38,730] [INFO] [config.py:971:print]   optimizer_params ............. None
[2023-10-03 06:40:38,730] [INFO] [config.py:971:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2023-10-03 06:40:38,730] [INFO] [config.py:971:print]   pld_enabled .................. False
[2023-10-03 06:40:38,730] [INFO] [config.py:971:print]   pld_params ................... False
[2023-10-03 06:40:38,730] [INFO] [config.py:971:print]   prescale_gradients ........... False
[2023-10-03 06:40:38,730] [INFO] [config.py:971:print]   scheduler_name ............... None
[2023-10-03 06:40:38,730] [INFO] [config.py:971:print]   scheduler_params ............. None
[2023-10-03 06:40:38,731] [INFO] [config.py:971:print]   sparse_attention ............. None
[2023-10-03 06:40:38,731] [INFO] [config.py:971:print]   sparse_gradients_enabled ..... False
[2023-10-03 06:40:38,731] [INFO] [config.py:971:print]   steps_per_print .............. 10
[2023-10-03 06:40:38,731] [INFO] [config.py:971:print]   train_batch_size ............. 8
[2023-10-03 06:40:38,731] [INFO] [config.py:971:print]   train_micro_batch_size_per_gpu  4
[2023-10-03 06:40:38,731] [INFO] [config.py:971:print]   use_node_local_storage ....... False
[2023-10-03 06:40:38,731] [INFO] [config.py:971:print]   wall_clock_breakdown ......... False
[2023-10-03 06:40:38,731] [INFO] [config.py:971:print]   weight_quantization_config ... None
[2023-10-03 06:40:38,731] [INFO] [config.py:971:print]   world_size ................... 2
[2023-10-03 06:40:38,731] [INFO] [config.py:971:print]   zero_allow_untested_optimizer  False
[2023-10-03 06:40:38,731] [INFO] [config.py:971:print]   zero_config .................. stage=0 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
[2023-10-03 06:40:38,732] [INFO] [config.py:971:print]   zero_enabled ................. False
[2023-10-03 06:40:38,732] [INFO] [config.py:971:print]   zero_force_ds_cpu_optimizer .. True
[2023-10-03 06:40:38,732] [INFO] [config.py:971:print]   zero_optimization_stage ...... 0
[2023-10-03 06:40:38,732] [INFO] [config.py:957:print_user_config]   json = {
    "train_batch_size": 8,
    "steps_per_print": 10
}
  0%|                                                                                                        | 0/15 [00:06<?, ?it/s]
Traceback (most recent call last):
  File "/home/nahalam/ufsam/src/app.py", line 251, in <module>
    main()
  File "/home/nahalam/ufsam/src/app.py", line 242, in main
    train_loop(model, optimizer, train_dataloader)
  File "/home/nahalam/ufsam/src/app.py", line 182, in train_loop
    outputs = model(pixel_values=batch["pixel_values"].to(device),
  File "/home/nahalam/ufsam/env/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/nahalam/ufsam/env/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 171, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/nahalam/ufsam/env/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 181, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/nahalam/ufsam/env/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 89, in parallel_apply
    output.reraise()
  File "/home/nahalam/ufsam/env/lib/python3.9/site-packages/torch/_utils.py", line 644, in reraise
    raise exception
RecursionError: Caught RecursionError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/nahalam/ufsam/env/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 64, in _worker
    output = module(*input, **kwargs)
  File "/home/nahalam/ufsam/env/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/nahalam/ufsam/env/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/nahalam/ufsam/env/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1772, in forward
    if self.module.training:
  File "/home/nahalam/ufsam/env/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 461, in __getattr__
    return getattr(self, name)
  File "/home/nahalam/ufsam/env/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 461, in __getattr__
    return getattr(self, name)
  File "/home/nahalam/ufsam/env/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 461, in __getattr__
    return getattr(self, name)
  [Previous line repeated 1490 more times]
  File "/home/nahalam/ufsam/env/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 460, in __getattr__
    if name in dir(self):
  File "/home/nahalam/ufsam/env/lib/python3.9/site-packages/torch/nn/modules/module.py", line 2401, in __dir__
    module_attrs = dir(self.__class__)
RecursionError: maximum recursion depth exceeded while calling a Python object

wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
  0%|                                                                                                        | 0/15 [00:04<?, ?it/s]
Traceback (most recent call last):
  File "/home/nahalam/ufsam/src/app.py", line 251, in <module>
    main()
  File "/home/nahalam/ufsam/src/app.py", line 242, in main
    train_loop(model, optimizer, train_dataloader)
  File "/home/nahalam/ufsam/src/app.py", line 182, in train_loop
    outputs = model(pixel_values=batch["pixel_values"].to(device),
  File "/home/nahalam/ufsam/env/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/nahalam/ufsam/env/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 171, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/nahalam/ufsam/env/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 181, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/nahalam/ufsam/env/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 89, in parallel_apply
    output.reraise()
  File "/home/nahalam/ufsam/env/lib/python3.9/site-packages/torch/_utils.py", line 644, in reraise
    raise exception
RecursionError: Caught RecursionError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/nahalam/ufsam/env/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 64, in _worker
    output = module(*input, **kwargs)
  File "/home/nahalam/ufsam/env/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/nahalam/ufsam/env/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/nahalam/ufsam/env/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1772, in forward
    if self.module.training:
  File "/home/nahalam/ufsam/env/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 461, in __getattr__
    return getattr(self, name)
  File "/home/nahalam/ufsam/env/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 461, in __getattr__
    return getattr(self, name)
  File "/home/nahalam/ufsam/env/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 461, in __getattr__
    return getattr(self, name)
  [Previous line repeated 1490 more times]
  File "/home/nahalam/ufsam/env/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 460, in __getattr__
    if name in dir(self):
  File "/home/nahalam/ufsam/env/lib/python3.9/site-packages/torch/nn/modules/module.py", line 2401, in __dir__
    module_attrs = dir(self.__class__)
RecursionError: maximum recursion depth exceeded while calling a Python object

nahidalam commented 1 year ago

@mrwyattii updated the stack trace

catqaq commented 1 year ago

Same error here.

g-h-chen commented 2 months ago

hi guys, any solutions here?

Swordfish1990 commented 2 months ago

Same error. I think I've found the cause of this error, but I don't know how to solve it: #6534
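
For context, the loop in the traceback is between engine.py's __getattr__ lines 460 and 461 (if name in dir(self): / return getattr(self, name)). Below is a minimal, self-contained sketch of that Python pattern, not DeepSpeed's actual engine code: the trace shows that on the DataParallel replica, self.module can no longer be found by normal attribute lookup, so once dir() still advertises the name, getattr falls back to __getattr__ again and recurses until the interpreter gives up.

class Engine:
    """Toy stand-in for a wrapper whose __getattr__ forwards to a wrapped module."""

    def __dir__(self):
        # Like nn.Module.__dir__, advertise "module" even though the attribute
        # was never actually set on this (replicated) instance.
        return list(super().__dir__()) + ["module"]

    def __getattr__(self, name):
        # __getattr__ only runs after normal lookup has already failed, so
        # calling getattr(self, name) here just fails again -> recursion.
        if name in dir(self):
            return getattr(self, name)
        raise AttributeError(name)

engine = Engine()
try:
    engine.module  # normal lookup fails -> __getattr__ -> getattr -> __getattr__ -> ...
except RecursionError as err:
    print("RecursionError:", err)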

tohtana commented 1 month ago

Hi all, I can't run the above script because the dataset and loss function are missing, but I see you are wrapping the engine with model = torch.nn.DataParallel(model). You don't need to wrap the model yourself when you use DeepSpeed; deepspeed.initialize already returns an engine that handles data parallelism, and the traceback above in fact passes through torch/nn/parallel/data_parallel.py before hitting the recursive __getattr__ calls.
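
For example, here is a minimal sketch of what main() from the report might look like with the DataParallel wrapper removed, letting the engine returned by deepspeed.initialize drive data parallelism. MyCustomDataset, lr, wd, ds_config, and the wandb call are placeholders carried over from the original report; treat this as an illustration rather than a verified fix.

def main():
    model = SamModel.from_pretrained("facebook/sam-vit-base")

    # Freeze everything except the mask decoder before handing the model to DeepSpeed.
    for name, param in model.named_parameters():
        if name.startswith("vision_encoder") or name.startswith("prompt_encoder"):
            param.requires_grad_(False)

    optimizer = torch.optim.Adam(model.mask_decoder.parameters(), lr=lr, weight_decay=wd)
    train_dataset = MyCustomDataset(...)  # placeholder dataset from the original report
    train_dataloader = DataLoader(train_dataset,
                                  batch_size=ds_config["train_batch_size"],
                                  shuffle=True)

    # deepspeed.initialize returns an engine that already handles distribution
    # across the ranks started by `deepspeed --num_gpus=2 app.py`;
    # no torch.nn.DataParallel (and no explicit model.to(device)) is needed.
    model_engine, optimizer, _, _ = deepspeed.initialize(model=model,
                                                         optimizer=optimizer,
                                                         config=ds_config)

    wandb.watch(model_engine)
    model_engine.train()
    train_loop(model_engine, optimizer, train_dataloader)

Inside the training loop, the DeepSpeed getting-started examples likewise call model_engine.backward(loss) and model_engine.step() instead of loss.backward() / optimizer.step(), and move batches with .to(model_engine.device).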

tohtana commented 1 month ago

Closing as there has been no activity for a while. Please feel free to reopen if you have any updates.