Closed: nahidalam closed this issue 1 month ago.
@nahidalam can you please provide the full stack trace so we can see where the error is coming from? Thanks
@mrwyattii Please see the full stack trace below.
[2023-10-03 06:40:38,730] [INFO] [config.py:971:print] optimizer_legacy_fusion ...... False
[2023-10-03 06:40:38,730] [INFO] [config.py:971:print] optimizer_name ............... None
[2023-10-03 06:40:38,730] [INFO] [config.py:971:print] optimizer_params ............. None
[2023-10-03 06:40:38,730] [INFO] [config.py:971:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2023-10-03 06:40:38,730] [INFO] [config.py:971:print] pld_enabled .................. False
[2023-10-03 06:40:38,730] [INFO] [config.py:971:print] pld_params ................... False
[2023-10-03 06:40:38,730] [INFO] [config.py:971:print] prescale_gradients ........... False
[2023-10-03 06:40:38,730] [INFO] [config.py:971:print] scheduler_name ............... None
[2023-10-03 06:40:38,730] [INFO] [config.py:971:print] scheduler_params ............. None
[2023-10-03 06:40:38,731] [INFO] [config.py:971:print] sparse_attention ............. None
[2023-10-03 06:40:38,731] [INFO] [config.py:971:print] sparse_gradients_enabled ..... False
[2023-10-03 06:40:38,731] [INFO] [config.py:971:print] steps_per_print .............. 10
[2023-10-03 06:40:38,731] [INFO] [config.py:971:print] train_batch_size ............. 8
[2023-10-03 06:40:38,731] [INFO] [config.py:971:print] train_micro_batch_size_per_gpu 4
[2023-10-03 06:40:38,731] [INFO] [config.py:971:print] use_node_local_storage ....... False
[2023-10-03 06:40:38,731] [INFO] [config.py:971:print] wall_clock_breakdown ......... False
[2023-10-03 06:40:38,731] [INFO] [config.py:971:print] weight_quantization_config ... None
[2023-10-03 06:40:38,731] [INFO] [config.py:971:print] world_size ................... 2
[2023-10-03 06:40:38,731] [INFO] [config.py:971:print] zero_allow_untested_optimizer False
[2023-10-03 06:40:38,731] [INFO] [config.py:971:print] zero_config .................. stage=0 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
[2023-10-03 06:40:38,732] [INFO] [config.py:971:print] zero_enabled ................. False
[2023-10-03 06:40:38,732] [INFO] [config.py:971:print] zero_force_ds_cpu_optimizer .. True
[2023-10-03 06:40:38,732] [INFO] [config.py:971:print] zero_optimization_stage ...... 0
[2023-10-03 06:40:38,732] [INFO] [config.py:957:print_user_config] json = {
"train_batch_size": 8,
"steps_per_print": 10
}
0%| | 0/15 [00:06<?, ?it/s]
Traceback (most recent call last):
File "/home/nahalam/ufsam/src/app.py", line 251, in <module>
main()
File "/home/nahalam/ufsam/src/app.py", line 242, in main
train_loop(model, optimizer, train_dataloader)
File "/home/nahalam/ufsam/src/app.py", line 182, in train_loop
outputs = model(pixel_values=batch["pixel_values"].to(device),
File "/home/nahalam/ufsam/env/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/nahalam/ufsam/env/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 171, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/nahalam/ufsam/env/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 181, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/nahalam/ufsam/env/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 89, in parallel_apply
output.reraise()
File "/home/nahalam/ufsam/env/lib/python3.9/site-packages/torch/_utils.py", line 644, in reraise
raise exception
RecursionError: Caught RecursionError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/home/nahalam/ufsam/env/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 64, in _worker
output = module(*input, **kwargs)
File "/home/nahalam/ufsam/env/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/nahalam/ufsam/env/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/nahalam/ufsam/env/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1772, in forward
if self.module.training:
File "/home/nahalam/ufsam/env/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 461, in __getattr__
return getattr(self, name)
File "/home/nahalam/ufsam/env/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 461, in __getattr__
return getattr(self, name)
File "/home/nahalam/ufsam/env/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 461, in __getattr__
return getattr(self, name)
[Previous line repeated 1490 more times]
File "/home/nahalam/ufsam/env/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 460, in __getattr__
if name in dir(self):
File "/home/nahalam/ufsam/env/lib/python3.9/site-packages/torch/nn/modules/module.py", line 2401, in __dir__
module_attrs = dir(self.__class__)
RecursionError: maximum recursion depth exceeded while calling a Python object
wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
@mrwyattii I've updated the stack trace above.
Same error here.
Hi all, any solutions here?
Same error. I think I've found the cause of this error, but I don't know how to solve it: see #6534.
Hi all,
I can't run the above script because the dataset and loss function are missing, but I see that you are wrapping the model with `model = torch.nn.DataParallel(model)`. You don't need to wrap the model yourself when you use DeepSpeed; the engine returned by `deepspeed.initialize` already handles data parallelism across the GPUs selected by the launcher.
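For reference, a minimal sketch of what the training setup could look like without `DataParallel` (this is an assumed sketch, not your script: `build_model()`, `train_dataset`, and `compute_loss()` are placeholders for your own model, dataset, and loss; the config mirrors the one printed in your log):

```python
import torch
import deepspeed

ds_config = {"train_batch_size": 8, "steps_per_print": 10}

model = build_model()                                   # placeholder for your model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

# deepspeed.initialize returns an engine that already does data parallelism
# across the GPUs selected by the launcher (deepspeed --num_gpus=2 app.py),
# so no torch.nn.DataParallel wrap is needed anywhere.
model_engine, optimizer, train_dataloader, _ = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    training_data=train_dataset,                        # placeholder Dataset
    config=ds_config,
)

for batch in train_dataloader:
    pixel_values = batch["pixel_values"].to(model_engine.local_rank)
    outputs = model_engine(pixel_values=pixel_values)
    loss = compute_loss(outputs, batch)                 # placeholder loss
    model_engine.backward(loss)                         # instead of loss.backward()
    model_engine.step()                                 # instead of optimizer.step()
```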
Closing as there hasn't been any activity for a while. Please feel free to reopen if you have any updates.
Describe the bug Multi-GPU training with DeepSpeed gives me this error; training works fine on a single GPU.
To Reproduce I get the recursion error in the training loop at the line shown in the traceback.
My code looks roughly like the snippet below.
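(The original code block did not come through in this copy of the issue. The snippet below is an assumed reconstruction based only on the traceback, in which the DeepSpeed engine ends up replicated by `torch.nn.DataParallel`; `build_model()` and the rest of the loop body are placeholders, not the author's actual code.)

```python
# Assumed reconstruction of the failing pattern, not the original app.py.
import torch
import deepspeed

model = build_model()                                  # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

model, optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    config={"train_batch_size": 8, "steps_per_print": 10},
)
model = torch.nn.DataParallel(model)                   # <- suspected cause of the recursion
device = torch.device("cuda")

def train_loop(model, optimizer, train_dataloader):
    for batch in train_dataloader:
        outputs = model(pixel_values=batch["pixel_values"].to(device))  # fails here
        ...
```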
Expected behavior Training should run on multiple GPUs just as it does on a single GPU.
ds_report output
[2023-10-02 16:52:56,487] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
DeepSpeed C++/CUDA extension op report
NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.
JIT compiled ops requires ninja
ninja .................. [OKAY]
op name ................ installed .. compatible
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
DeepSpeed general environment info:
torch install path ............... ['/home/..../env/lib/python3.9/site-packages/torch']
torch version .................... 2.0.0+cu117
deepspeed install path ........... ['/home/..../env/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.10.3, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.5
deepspeed wheel compiled w. ...... torch 0.0, cuda 0.0
shared memory (/dev/shm) size .... 755.27 GB
Launcher context deepspeed --num_gpus=2 app.py
Docker context Conda environment with Python 3.9; requirements.txt below.