fcampagnexandr opened this issue 3 years ago
@fcampagnexandr, thanks for using DeepSpeed and reporting this issue. Can you please provide more details on how to repro?
@fcampagnexandr, are you still seeing this issue?
Yes, we are still seeing the issue. We encounter it with a PyTorch implementation of the Perceiver model architecture. I will try to come up with simple reproduction code, but no promises; things are quite busy at my end.
That is understandable. Please share at your convenience. Thanks so much.
Here's the code to reproduce: https://github.com/fac2003/repro_deepspeed_multi_modality_perceiver_stage_3/blob/main/DeepSpeedMultiModalityPerceiver.ipynb The same code works if you switch optimization_stage from 3 to 2.
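For context, here is a hedged sketch of the setup being toggled, based on the configuration DeepSpeed prints later in this thread; the stand-in model and the exact initialize keywords are illustrative, not copied from the notebook. Switching zero_optimization.stage between 3 and 2 is the only difference between the failing and working runs:

import torch
import deepspeed

# Stand-in module; in the notebook this is the MultiModalityWithTextPerceiver instance.
model = torch.nn.Linear(16, 16)

ds_config = {
    "train_batch_size": 3,
    "optimizer": {"type": "Adam", "params": {"lr": 0.001}},
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,  # training succeeds with stage 2, fails with stage 3
    },
}

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config_params=ds_config,  # inline config dict instead of a JSON file
)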
I checked the repro code against DeepSpeed 0.3.15 (latest at this time) and found that the issue no longer occurs in this release. It seems fixed.
@fac2003, thanks for sharing your experience with 0.3.15. What is strange is that I am able to repro the issue on 0.3.15, so I am curious why we have different observations :). I traced the problem to a lack of support for autocast in ZeRO 3, and I just created a PR. Can you please share your ds_report again?
This is strange indeed. Here's the report with 0.3.15 on colab:
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires one of the following commands '['llvm-config', 'llvm-config-9']', but it does not exist!
[WARNING] sparse_attn requires CUDA version 10.1+, does not currently support >=11 or <10.1
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
[WARNING] async_io requires the libraries: ['libaio-dev'] but are missing.
async_io ............... [NO] ....... [NO]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.7/dist-packages/torch']
torch version .................... 1.7.1+cu110
torch cuda version ............... 11.0
nvcc version ..................... 11.0
deepspeed install path ........... ['/usr/local/lib/python3.7/dist-packages/deepspeed']
deepspeed info ................... 0.3.15, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.8, cuda 10.1
@fac2003, I think I may have figured out the mystery, but I need your help to be sure. Can you please wrap your model creation in a deepspeed.zero.Init() context, such as below?
with deepspeed.zero.Init():
    model = MultiModalityWithTextPerceiver(
        modalities=(video_modality, image_modality),
        depth=2,  # depth of net, combined with num_latent_blocks_per_layer to produce full Perceiver
        num_latents=12,
        # number of latents, or induced set points, or centroids; different papers give it different names
        ...
Does this reproduce the problem?
Indeed, I am also getting the exception with 0.3.15 after wrapping in zero.Init: RuntimeError: expected scalar type Half but found Float
Excellent! Thanks for the confirmation. What happened is that ZeRO 3 has an optimized linear layer that does not support amp autocasting. This linear layer was previously enabled by default when using ZeRO 3, which is why you originally ran into the issue. A recent change made the deepspeed.zero.Init() context a requirement to get this ZeRO 3 linear layer, which is why you no longer saw it. Regardless, we need this PR to add support for amp autocast. Thanks so much.
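For readers following along, here is a minimal sketch of the general PyTorch mechanism for making a custom autograd linear function safe under torch.cuda.amp.autocast: the torch.cuda.amp.custom_fwd/custom_bwd decorators cast the forward inputs to a single dtype and keep backward consistent with it, so the backward matmuls never mix Half and Float tensors. This only illustrates the technique; it is not DeepSpeed's actual zero/linear.py or the PR, and the class name is made up.

import torch
from torch.cuda.amp import custom_fwd, custom_bwd

class AutocastSafeLinearFunction(torch.autograd.Function):
    # Illustrative only; not DeepSpeed's optimized ZeRO 3 linear layer.

    @staticmethod
    @custom_fwd(cast_inputs=torch.half)  # under autocast, floating CUDA inputs arrive as fp16
    def forward(ctx, input, weight, bias=None):
        ctx.has_bias = bias is not None
        ctx.save_for_backward(input, weight)
        output = input.matmul(weight.t())
        if bias is not None:
            output = output + bias
        return output

    @staticmethod
    @custom_bwd  # backward sees the same dtypes the decorated forward used
    def backward(ctx, grad_output):
        input, weight = ctx.saved_tensors
        grad_input = grad_weight = grad_bias = None
        if ctx.needs_input_grad[0]:
            grad_input = grad_output.matmul(weight)
        if ctx.needs_input_grad[1]:
            grad_weight = grad_output.reshape(-1, grad_output.shape[-1]).t().matmul(
                input.reshape(-1, input.shape[-1]))
        if ctx.has_bias and ctx.needs_input_grad[2]:
            grad_bias = grad_output.reshape(-1, grad_output.shape[-1]).sum(0)
        return grad_input, grad_weight, grad_bias

With this pattern, calling AutocastSafeLinearFunction.apply(x, w, b) inside a torch.cuda.amp.autocast() region keeps forward and backward in a single dtype, avoiding the Half/Float mismatch.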
On a separate note, to get the full benefit of ZeRO 3 and ZeRO-Infinity, you do need to wrap your model with deepspeed.zero.Init().
@fac2003, the fix is now merged. Can you please verify so this issue can be closed?
Cloning master and building from source fixed the initial error, but I am seeing a new one with stage 3 on the backward pass:
Pasting here the ds_report output and training run output from repro steps running in colab.
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires one of the following commands '['llvm-config', 'llvm-config-9']', but it does not exist!
[WARNING] sparse_attn requires CUDA version 10.1+, does not currently support >=11 or <10.1
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
[WARNING] async_io requires the libraries: ['libaio-dev'] but are missing.
async_io ............... [NO] ....... [NO]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.7/dist-packages/torch']
torch version .................... 1.7.1+cu110
torch cuda version ............... 11.0
nvcc version ..................... 11.0
deepspeed install path ........... ['/usr/local/lib/python3.7/dist-packages/deepspeed']
deepspeed info ................... 0.3.15+03d24fe, 03d24fe, master
deepspeed wheel compiled w. ...... torch 1.7, cuda 11.0
[2021-04-24 16:16:23,518] [INFO] [distributed.py:37:init_distributed] Not using the DeepSpeed or torch.distributed launchers, attempting to detect MPI environment...
[2021-04-24 16:16:23,965] [INFO] [distributed.py:89:mpi_discovery] Discovered MPI settings of world_rank=0, local_rank=0, world_size=1, master_addr=172.28.0.2, master_port=29500
[2021-04-24 16:16:23,966] [INFO] [distributed.py:47:init_distributed] Initializing torch distributed with backend: nccl
nn.functional.linear has been overridden with a more memory efficient version. This will persist unless manually reset.
[2021-04-24 16:16:27,160] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed info: version=0.3.15+03d24fe, git-hash=03d24fe, git-branch=master
[2021-04-24 16:16:27,162] [WARNING] [config.py:78:_sanity_check] DeepSpeedConfig: cpu_offload is deprecated. Please use offload_optimizer.
[2021-04-24 16:16:27,163] [WARNING] [config.py:78:_sanity_check] DeepSpeedConfig: cpu_offload_params is deprecated. Please use offload_param.
[2021-04-24 16:16:27,174] [INFO] [engine.py:80:_initialize_parameter_parallel_groups] data_parallel_size: 1, parameter_parallel_size: 1
Using /root/.cache/torch_extensions as PyTorch extensions root...
Creating extension directory /root/.cache/torch_extensions/fused_adam...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Loading extension module fused_adam...
Time to load fused_adam op: 21.77649974822998 seconds
[2021-04-24 16:16:49,666] [INFO] [engine.py:616:_configure_optimizer] Using DeepSpeed Optimizer param name adam as basic optimizer
[2021-04-24 16:16:49,667] [INFO] [engine.py:620:_configure_optimizer] DeepSpeed Basic Optimizer = FusedAdam
Checking ZeRO support for optimizer=FusedAdam type=<class 'deepspeed.ops.adam.fused_adam.FusedAdam'>
[2021-04-24 16:16:49,673] [INFO] [logging.py:60:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer
Initializing ZeRO Stage 3
[2021-04-24 16:16:49,746] [INFO] [utils.py:583:see_memory_usage] Stage 3 initialize beginning
[2021-04-24 16:16:49,749] [INFO] [utils.py:588:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB
[2021-04-24 16:16:49,753] [INFO] [utils.py:593:see_memory_usage] CPU Virtual Memory: used = 2.66 GB, percent = 20.9%
[2021-04-24 16:16:49,756] [INFO] [stage3.py:624:__init__] Reduce bucket size 300000
[2021-04-24 16:16:49,758] [INFO] [stage3.py:625:__init__] Allgather bucket size 20000
Using /root/.cache/torch_extensions as PyTorch extensions root...
Creating extension directory /root/.cache/torch_extensions/utils...
/usr/local/lib/python3.7/dist-packages/torch/cuda/memory.py:346: FutureWarning: torch.cuda.memory_cached has been renamed to torch.cuda.memory_reserved
FutureWarning)
/usr/local/lib/python3.7/dist-packages/torch/cuda/memory.py:354: FutureWarning: torch.cuda.max_memory_cached has been renamed to torch.cuda.max_memory_reserved
FutureWarning)
Emitting ninja build file /root/.cache/torch_extensions/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Loading extension module utils...
Time to load utils op: 12.045355319976807 seconds
[2021-04-24 16:17:01,878] [INFO] [utils.py:583:see_memory_usage] Before creating fp16 partitions
[2021-04-24 16:17:01,880] [INFO] [utils.py:588:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB
[2021-04-24 16:17:01,883] [INFO] [utils.py:593:see_memory_usage] CPU Virtual Memory: used = 2.66 GB, percent = 20.9%
[2021-04-24 16:17:01,885] [INFO] [stage3.py:39:print_rank_0] fp16 group 0 has 1 subgroups
[2021-04-24 16:17:01,901] [INFO] [stage3.py:39:print_rank_0] Swappable FP32 Partitions: count=0 size= 0.00 GB
[2021-04-24 16:17:01,902] [INFO] [stage3.py:39:print_rank_0] In-Memory FP32 Partitions: count=1 size= 0.00 GB
[2021-04-24 16:17:01,905] [INFO] [stage3.py:819:__init__] optimizer state initialized
[2021-04-24 16:17:01,906] [INFO] [stage3.py:39:print_rank_0] Largest partitioned param numel = 377914
[2021-04-24 16:17:01,992] [INFO] [utils.py:583:see_memory_usage] After initializing ZeRO optimizer
[2021-04-24 16:17:01,994] [INFO] [utils.py:588:see_memory_usage] MA 0.01 GB Max_MA 0.01 GB CA 0.02 GB Max_CA 0 GB
[2021-04-24 16:17:01,996] [INFO] [utils.py:593:see_memory_usage] CPU Virtual Memory: used = 2.66 GB, percent = 20.9%
[2021-04-24 16:17:01,999] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed Final Optimizer = adam
[2021-04-24 16:17:02,000] [INFO] [engine.py:451:_configure_lr_scheduler] DeepSpeed using configured LR scheduler = WarmupLR
[2021-04-24 16:17:02,002] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed LR Scheduler = <deepspeed.runtime.lr_schedules.WarmupLR object at 0x7fbba0557410>
[2021-04-24 16:17:02,004] [INFO] [logging.py:60:log_dist] [Rank 0] step=0, skipped=0, lr=[0.001], mom=[[0.8, 0.999]]
[2021-04-24 16:17:02,005] [INFO] [config.py:743:print] DeepSpeedEngine configuration:
[2021-04-24 16:17:02,007] [INFO] [config.py:747:print] activation_checkpointing_config {
"partition_activations": false,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
}
[2021-04-24 16:17:02,009] [INFO] [config.py:747:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2021-04-24 16:17:02,011] [INFO] [config.py:747:print] allreduce_always_fp32 ........ False
[2021-04-24 16:17:02,012] [INFO] [config.py:747:print] amp_enabled .................. False
[2021-04-24 16:17:02,013] [INFO] [config.py:747:print] amp_params ................... False
[2021-04-24 16:17:02,015] [INFO] [config.py:747:print] checkpoint_tag_validation_enabled True
[2021-04-24 16:17:02,016] [INFO] [config.py:747:print] checkpoint_tag_validation_fail False
[2021-04-24 16:17:02,018] [INFO] [config.py:747:print] disable_allgather ............ False
[2021-04-24 16:17:02,019] [INFO] [config.py:747:print] dump_state ................... False
[2021-04-24 16:17:02,021] [INFO] [config.py:747:print] dynamic_loss_scale_args ...... {'init_scale': 4294967296, 'scale_window': 1000, 'delayed_shift': 2, 'min_scale': 1}
[2021-04-24 16:17:02,023] [INFO] [config.py:747:print] elasticity_enabled ........... False
[2021-04-24 16:17:02,024] [INFO] [config.py:747:print] flops_profiler_config ........ {
"enabled": false,
"profile_step": 1,
"module_depth": -1,
"top_modules": 3,
"detailed": true
}
[2021-04-24 16:17:02,025] [INFO] [config.py:747:print] fp16_enabled ................. True
[2021-04-24 16:17:02,027] [INFO] [config.py:747:print] global_rank .................. 0
[2021-04-24 16:17:02,028] [INFO] [config.py:747:print] gradient_accumulation_steps .. 1
[2021-04-24 16:17:02,030] [INFO] [config.py:747:print] gradient_clipping ............ 0.0
[2021-04-24 16:17:02,031] [INFO] [config.py:747:print] gradient_predivide_factor .... 1.0
[2021-04-24 16:17:02,033] [INFO] [config.py:747:print] initial_dynamic_scale ........ 4294967296
[2021-04-24 16:17:02,035] [INFO] [config.py:747:print] loss_scale ................... 1
[2021-04-24 16:17:02,037] [INFO] [config.py:747:print] memory_breakdown ............. False
[2021-04-24 16:17:02,038] [INFO] [config.py:747:print] optimizer_legacy_fusion ...... False
[2021-04-24 16:17:02,039] [INFO] [config.py:747:print] optimizer_name ............... adam
[2021-04-24 16:17:02,041] [INFO] [config.py:747:print] optimizer_params ............. {'lr': 0.001, 'betas': [0.8, 0.999], 'eps': 1e-08, 'weight_decay': 3e-07}
[2021-04-24 16:17:02,044] [INFO] [config.py:747:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2021-04-24 16:17:02,046] [INFO] [config.py:747:print] pld_enabled .................. False
[2021-04-24 16:17:02,050] [INFO] [config.py:747:print] pld_params ................... False
[2021-04-24 16:17:02,051] [INFO] [config.py:747:print] prescale_gradients ........... False
[2021-04-24 16:17:02,053] [INFO] [config.py:747:print] scheduler_name ............... WarmupLR
[2021-04-24 16:17:02,055] [INFO] [config.py:747:print] scheduler_params ............. {'warmup_min_lr': 0, 'warmup_max_lr': 0.001, 'warmup_num_steps': 1000}
[2021-04-24 16:17:02,057] [INFO] [config.py:747:print] sparse_attention ............. None
[2021-04-24 16:17:02,059] [INFO] [config.py:747:print] sparse_gradients_enabled ..... False
[2021-04-24 16:17:02,062] [INFO] [config.py:747:print] steps_per_print .............. 2000
[2021-04-24 16:17:02,063] [INFO] [config.py:747:print] tensorboard_enabled .......... False
[2021-04-24 16:17:02,065] [INFO] [config.py:747:print] tensorboard_job_name ......... DeepSpeedJobName
[2021-04-24 16:17:02,066] [INFO] [config.py:747:print] tensorboard_output_path ......
[2021-04-24 16:17:02,070] [INFO] [config.py:747:print] train_batch_size ............. 3
[2021-04-24 16:17:02,072] [INFO] [config.py:747:print] train_micro_batch_size_per_gpu 3
[2021-04-24 16:17:02,077] [INFO] [config.py:747:print] wall_clock_breakdown ......... False
[2021-04-24 16:17:02,086] [INFO] [config.py:747:print] world_size ................... 1
/usr/local/lib/python3.7/dist-packages/torch/cuda/memory.py:346: FutureWarning: torch.cuda.memory_cached has been renamed to torch.cuda.memory_reserved
FutureWarning)
/usr/local/lib/python3.7/dist-packages/torch/cuda/memory.py:354: FutureWarning: torch.cuda.max_memory_cached has been renamed to torch.cuda.max_memory_reserved
FutureWarning)
[2021-04-24 16:17:02,097] [INFO] [config.py:747:print] zero_allow_untested_optimizer False
[2021-04-24 16:17:02,099] [INFO] [config.py:747:print] zero_config .................. {
"stage": 3,
"contiguous_gradients": false,
"reduce_scatter": false,
"reduce_bucket_size": 3.000000e+05,
"allgather_partitions": true,
"allgather_bucket_size": 5.000000e+08,
"overlap_comm": false,
"load_from_fp32_weights": true,
"elastic_checkpoint": true,
"offload_param": null,
"offload_optimizer": null,
"sub_group_size": 1.000000e+06,
"prefetch_bucket_size": 2.000000e+04,
"param_persistence_threshold": 1.000000e+04,
"max_live_parameters": 6.000000e+05,
"max_reuse_distance": 1.000000e+07,
"gather_fp16_weights_on_model_save": false
}
[2021-04-24 16:17:02,100] [INFO] [config.py:747:print] zero_enabled ................. True
[2021-04-24 16:17:02,101] [INFO] [config.py:747:print] zero_optimization_stage ...... 3
[2021-04-24 16:17:02,103] [INFO] [config.py:754:print] json = {
"train_batch_size": 3,
"steps_per_print": 2.000000e+03,
"optimizer": {
"type": "Adam",
"params": {
"lr": 0.001,
"betas": [0.8, 0.999],
"eps": 1e-08,
"weight_decay": 3e-07
}
},
"fp16": {
"enabled": true,
"loss_scale": 1,
"initial_scale_power": 32,
"loss_scale_window": 1000,
"hysteresis": 2,
"min_loss_scale": 1
},
"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": 0,
"warmup_max_lr": 0.001,
"warmup_num_steps": 1000
}
},
"wall_clock_breakdown": false,
"zero_optimization": {
"stage": 3,
"cpu_offload": false,
"cpu_offload_params": false,
"overlap_comm": false,
"contiguous_gradients": false,
"stage3_max_live_parameters": 6.000000e+05,
"stage3_max_reuse_distance": 1.000000e+07,
"stage3_prefetch_bucket_size": 2.000000e+04,
"stage3_param_persistence_threshold": 1.000000e+04,
"reduce_bucket_size": 3.000000e+05,
"sub_group_size": 1.000000e+06
}
}
Using /root/.cache/torch_extensions as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0007159709930419922 seconds
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-11-250063ddea6d> in <module>()
124 loss = outputs.mean()
125
--> 126 model_engine.backward(loss)
127 model_engine.step()
128 print("DONE")
7 frames
/usr/local/lib/python3.7/dist-packages/deepspeed/runtime/zero/linear.py in backward(ctx, grad_output)
89 if ctx.needs_input_grad[0]:
90 #print(f"Computing grad input weight {weight.shape} grad_output {grad_output.shape}")
---> 91 grad_input = grad_output.matmul(weight)
92 #print(f"Computed grad input {grad_input.shape}")
93 if ctx.needs_input_grad[1]:
RuntimeError: mat1 dim 1 must match mat2 dim 0
Here's the updated repro notebook, installing DeepSpeed from master and wrapping the model in zero.Init:
@fac2003, thanks for reporting this new issue. Can you please open a new issue report for this?
Why not continue with this one until the repro works and the issue is fully fixed? The title still matches the problem: there is an exception when turning on stage 3 for this model. I'd rather not have to provide the repro in yet another issue. Would that be OK with you?
There are a number of benefits to restricting an issue report to a single issue or bug, rather than adding multiple, even related, bugs to one report.
This guide on best practices for GitHub issues gives a number of reasons, including the principle of one issue per issue.
Besides enabling collaboration between the reporter and the fixer of an issue, the issue report also serves as an archive or historical record for future users who want to quickly know whether their issue is similar to an old one or completely new. It is a lot easier for those users to filter issue reports if each report is shorter, simpler, and describes a single error message or stack trace as much as possible.
It is also much cleaner if one PR is associated with one issue report, because it makes it much easier to find regressions and port fixes to other branches.
Regarding the title matching the problem, note that the title is overly broad, since many kinds of Python bugs manifest as exceptions in the interpreter. So it is not a good reason to group every bug that triggers a Python interpreter exception under the same issue report, since dozens of bugs would qualify.
@tjruwase It is the same issue from my point of view because neither the repro code nor the description has changed. We are now hitting a new exception further in the repro code, so this is progress, but the issue is not resolved. The fix PR you developed has perhaps only addressed the forward pass. I can open another issue if it's easier to track with PRs at your end, but I would not close this one until the repro works. Thanks for your help so far.
As requested, I created a new issue here: https://github.com/microsoft/DeepSpeed/issues/1006
@fac2003, I greatly appreciate the help. And yes, I don't expect you to close this issue until all issues are resolved to your satisfaction.
Using CUDA 11.1, PyTorch 1.8.1, and DeepSpeed 0.3.14.
The model trains with FP16 and optimization_stage 2, but fails with optimization_stage 3 with the following exception:
  model_engine.backward(loss)
  File "/opt/miniconda/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 997, in backward
    self.optimizer.backward(loss)
  File "/opt/miniconda/lib/python3.7/site-packages/deepspeed/runtime/zero/stage3.py", line 2555, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/opt/miniconda/lib/python3.7/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 53, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/opt/miniconda/lib/python3.7/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/miniconda/lib/python3.7/site-packages/torch/autograd/__init__.py", line 147, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
  File "/opt/miniconda/lib/python3.7/site-packages/torch/autograd/function.py", line 89, in apply
    return self._forward_cls.backward(self, *args)  # type: ignore
  File "/opt/miniconda/lib/python3.7/site-packages/deepspeed/runtime/zero/linear.py", line 85, in backward
    grad_weight = grad_output.t().matmul(input)
RuntimeError: expected scalar type Half but found Float
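For completeness, here is a minimal standalone sketch (an illustration under stated assumptions, not code from the repro notebook) of how a custom linear autograd Function that ignores autocast can hit this same class of error: under torch.cuda.amp.autocast the forward matmul, and therefore grad_output, is fp16, while the tensors saved for backward stay fp32, so the backward matmuls mix Half and Float.

import torch

class NaiveLinear(torch.autograd.Function):
    # Deliberately autocast-unaware, to illustrate the failure mode only.

    @staticmethod
    def forward(ctx, input, weight):
        ctx.save_for_backward(input, weight)   # saved tensors keep their original fp32 dtype
        return input.matmul(weight.t())        # under autocast this matmul runs in fp16

    @staticmethod
    def backward(ctx, grad_output):            # grad_output is fp16, matching the fp16 output
        input, weight = ctx.saved_tensors      # but these are still fp32
        grad_input = grad_output.matmul(weight)        # Half x Float -> dtype mismatch
        grad_weight = grad_output.t().matmul(input)    # same pair of matmuls as in the traceback above
        return grad_input, grad_weight

if torch.cuda.is_available():
    x = torch.randn(4, 8, device="cuda", requires_grad=True)
    w = torch.randn(16, 8, device="cuda", requires_grad=True)
    with torch.cuda.amp.autocast():
        out = NaiveLinear.apply(x, w)
    # Backward runs outside the autocast region and is expected to raise
    # something like: RuntimeError: expected scalar type Half but found Float
    out.float().sum().backward()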