microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

Exception with optimization_stage_3 #944

Open fcampagnexandr opened 3 years ago

fcampagnexandr commented 3 years ago

Using CUDA 11.1, PyTorch 1.8.1, and DeepSpeed 0.3.14.

Model trains with FP16 and optimization_stage 2, but fails with optimization_stage 3 with the following exception:

  model_engine.backward(loss)
  File "/opt/miniconda/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 997, in backward
    self.optimizer.backward(loss)
  File "/opt/miniconda/lib/python3.7/site-packages/deepspeed/runtime/zero/stage3.py", line 2555, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/opt/miniconda/lib/python3.7/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 53, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/opt/miniconda/lib/python3.7/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/miniconda/lib/python3.7/site-packages/torch/autograd/__init__.py", line 147, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
  File "/opt/miniconda/lib/python3.7/site-packages/torch/autograd/function.py", line 89, in apply
    return self._forward_cls.backward(self, *args)  # type: ignore
  File "/opt/miniconda/lib/python3.7/site-packages/deepspeed/runtime/zero/linear.py", line 85, in backward
    grad_weight = grad_output.t().matmul(input)
RuntimeError: expected scalar type Half but found Float
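
For context, the trace comes out of an ordinary DeepSpeed fp16 training loop. A minimal sketch of that setup is below; this is not the actual repro: MyModel and batches are placeholders, and the config dict only mirrors the kind of ZeRO stage 3 + fp16 JSON that appears later in this thread.

import deepspeed

ds_config = {
    "train_batch_size": 3,
    "fp16": {"enabled": True},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
    "zero_optimization": {"stage": 3},  # switching this to 2 avoids the error
}

model = MyModel()  # placeholder for the Perceiver model used in the repro

# config_params is the config keyword used by 0.3.x-era releases; newer releases also accept config=.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config_params=ds_config,
)

for batch in batches:  # placeholder data iterator
    outputs = model_engine(batch)
    loss = outputs.mean()
    model_engine.backward(loss)  # the RuntimeError above is raised here under stage 3
    model_engine.step()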

tjruwase commented 3 years ago

@fcampagnexandr, thanks for using DeepSpeed and reporting this issue. Can you please provide more details on how to repro?

tjruwase commented 3 years ago

@fcampagnexandr, are you still seeing this issue?

fcampagnexandr commented 3 years ago

Yes, we are still seeing the issue. We encounter it with a PyTorch implementation of the Perceiver model architecture. I will try to come up with simple reproduction code, but no promises; things are quite busy at my end.

tjruwase commented 3 years ago

That is understandable. Please share at your convenience. Thanks so much.

fac2003 commented 3 years ago

Here's the code to reproduce: https://github.com/fac2003/repro_deepspeed_multi_modality_perceiver_stage_3/blob/main/DeepSpeedMultiModalityPerceiver.ipynb The same code works if you switch optimization_stage from 3 to 2.

fac2003 commented 3 years ago

I checked the repro code against DeepSpeed 0.3.15 (the latest at this time) and found that the issue no longer occurs in this release. Seems fixed.

tjruwase commented 3 years ago

@fac2003, thanks for sharing your experience with 0.3.15. What is strange is that I am able to repro the issue on 0.3.15, so I am curious why we have different observations :). I traced the problem to a lack of support for autocast in ZeRO 3, and I just created a PR. Can you please share your ds_report again?
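
For context, the autocast in question is PyTorch's native mixed precision (torch.cuda.amp.autocast). A rough illustration of the pattern, separate from the repro itself:

import torch

# Under autocast, matmul-style ops run in fp16 even though the parameters are
# stored in fp32, so any custom linear op in the path has to cope with mixed dtypes.
linear = torch.nn.Linear(16, 8).cuda()  # fp32 parameters
x = torch.randn(4, 16, device="cuda")   # fp32 input

with torch.cuda.amp.autocast():
    y = linear(x)

print(y.dtype)  # torch.float16 while inside autocast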

fac2003 commented 3 years ago

This is strange indeed. Here's the report with 0.3.15 on colab:

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires one of the following commands '['llvm-config', 'llvm-config-9']', but it does not exist!
 [WARNING]  sparse_attn requires CUDA version 10.1+, does not currently support >=11 or <10.1
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
 [WARNING]  async_io requires the libraries: ['libaio-dev'] but are missing.
async_io ............... [NO] ....... [NO]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.7/dist-packages/torch']
torch version .................... 1.7.1+cu110
torch cuda version ............... 11.0
nvcc version ..................... 11.0
deepspeed install path ........... ['/usr/local/lib/python3.7/dist-packages/deepspeed']
deepspeed info ................... 0.3.15, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.8, cuda 10.1

tjruwase commented 3 years ago

@fac2003, I think I may have figured out the mystery, but I need your help to be sure. Can you please wrap your model creation in a deepspeed.zero.Init() context, such as below?

with deepspeed.zero.Init():
    model = MultiModalityWithTextPerceiver(
        modalities=(video_modality, image_modality),
        depth=2,  # depth of net, combined with num_latent_blocks_per_layer to produce full Perceiver
        num_latents=12,  # number of latents, or induced set points, or centroids; different papers give it different names
        ...

Does this repro the problem?

fac2003 commented 3 years ago

Indeed, I am also getting the exception with 0.3.15 after wrapping in zero.Init: RuntimeError: expected scalar type Half but found Float

tjruwase commented 3 years ago

Excellent! Thanks for the confirmation. What happened is that ZeRO 3 has an optimized linear layer that does not support amp autocasting. This linear layer was previously enabled by default when using ZeRO 3, which is why you originally ran into the issue. A recent change then made the deepspeed.zero.Init() context a requirement for getting this ZeRO 3 linear layer, which is why you no longer saw it. Regardless, we need this PR to add support for amp autocast. Thanks so much.
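
To make the dtype clash concrete, here is a tiny illustration (plain PyTorch, not DeepSpeed code) of how a half-precision grad_output meeting a float32 saved input in the backward matmul raises exactly this kind of error:

import torch

grad_output = torch.randn(4, 8, dtype=torch.float16)   # produced under fp16/autocast
saved_input = torch.randn(4, 16, dtype=torch.float32)  # saved activation left in fp32

try:
    grad_weight = grad_output.t().matmul(saved_input)  # same shape of computation as the zero/linear.py backward
except RuntimeError as err:
    print(err)  # e.g. "expected scalar type Half but found Float" (exact wording varies by device and torch version)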

On a separate note, to get the full benefit of ZeRO 3 and ZeRO-Infinity, you do need to wrap your model with deepspeed.zero.Init().
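
Put together, the pattern would look roughly like the sketch below, assuming the same constructor arguments as in the snippet above and a config dict (here called ds_config) with the ZeRO stage 3 + fp16 settings:

import deepspeed

# Build the model inside zero.Init so ZeRO 3 / ZeRO-Infinity can partition
# parameters at construction time, then hand the module to deepspeed.initialize.
with deepspeed.zero.Init():
    model = MultiModalityWithTextPerceiver(
        modalities=(video_modality, image_modality),
        depth=2,
        num_latents=12,
        # ... remaining constructor arguments as in the repro notebook
    )

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config_params=ds_config,  # ZeRO stage 3 + fp16 config
)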

tjruwase commented 3 years ago

@fac2003, the fix is now merged. Can you please verify so this issue can be closed?

fac2003 commented 3 years ago

Cloning master and building from source fixed the initial error, but I am seeing a new one with stage 3 on backward:

Pasting here the ds_report output and training run output from repro steps running in colab.

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires one of the following commands '['llvm-config', 'llvm-config-9']', but it does not exist!
 [WARNING]  sparse_attn requires CUDA version 10.1+, does not currently support >=11 or <10.1
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
 [WARNING]  async_io requires the libraries: ['libaio-dev'] but are missing.
async_io ............... [NO] ....... [NO]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.7/dist-packages/torch']
torch version .................... 1.7.1+cu110
torch cuda version ............... 11.0
nvcc version ..................... 11.0
deepspeed install path ........... ['/usr/local/lib/python3.7/dist-packages/deepspeed']
deepspeed info ................... 0.3.15+03d24fe, 03d24fe, master
deepspeed wheel compiled w. ...... torch 1.7, cuda 11.0

[2021-04-24 16:16:23,518] [INFO] [distributed.py:37:init_distributed] Not using the DeepSpeed or torch.distributed launchers, attempting to detect MPI environment...
[2021-04-24 16:16:23,965] [INFO] [distributed.py:89:mpi_discovery] Discovered MPI settings of world_rank=0, local_rank=0, world_size=1, master_addr=172.28.0.2, master_port=29500
[2021-04-24 16:16:23,966] [INFO] [distributed.py:47:init_distributed] Initializing torch distributed with backend: nccl
nn.functional.linear has been overridden with a more memory efficient version. This will persist unless manually reset.
[2021-04-24 16:16:27,160] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed info: version=0.3.15+03d24fe, git-hash=03d24fe, git-branch=master
[2021-04-24 16:16:27,162] [WARNING] [config.py:78:_sanity_check] DeepSpeedConfig: cpu_offload is deprecated. Please use offload_optimizer.
[2021-04-24 16:16:27,163] [WARNING] [config.py:78:_sanity_check] DeepSpeedConfig: cpu_offload_params is deprecated. Please use offload_param.
[2021-04-24 16:16:27,174] [INFO] [engine.py:80:_initialize_parameter_parallel_groups] data_parallel_size: 1, parameter_parallel_size: 1
Using /root/.cache/torch_extensions as PyTorch extensions root...
Creating extension directory /root/.cache/torch_extensions/fused_adam...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Loading extension module fused_adam...
Time to load fused_adam op: 21.77649974822998 seconds
[2021-04-24 16:16:49,666] [INFO] [engine.py:616:_configure_optimizer] Using DeepSpeed Optimizer param name adam as basic optimizer
[2021-04-24 16:16:49,667] [INFO] [engine.py:620:_configure_optimizer] DeepSpeed Basic Optimizer = FusedAdam
Checking ZeRO support for optimizer=FusedAdam type=<class 'deepspeed.ops.adam.fused_adam.FusedAdam'>
[2021-04-24 16:16:49,673] [INFO] [logging.py:60:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer
Initializing ZeRO Stage 3
[2021-04-24 16:16:49,746] [INFO] [utils.py:583:see_memory_usage] Stage 3 initialize beginning
[2021-04-24 16:16:49,749] [INFO] [utils.py:588:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.0 GB         Max_CA 0 GB 
[2021-04-24 16:16:49,753] [INFO] [utils.py:593:see_memory_usage] CPU Virtual Memory:  used = 2.66 GB, percent = 20.9%
[2021-04-24 16:16:49,756] [INFO] [stage3.py:624:__init__] Reduce bucket size 300000
[2021-04-24 16:16:49,758] [INFO] [stage3.py:625:__init__] Allgather bucket size 20000
Using /root/.cache/torch_extensions as PyTorch extensions root...
Creating extension directory /root/.cache/torch_extensions/utils...
/usr/local/lib/python3.7/dist-packages/torch/cuda/memory.py:346: FutureWarning: torch.cuda.memory_cached has been renamed to torch.cuda.memory_reserved
  FutureWarning)
/usr/local/lib/python3.7/dist-packages/torch/cuda/memory.py:354: FutureWarning: torch.cuda.max_memory_cached has been renamed to torch.cuda.max_memory_reserved
  FutureWarning)
Emitting ninja build file /root/.cache/torch_extensions/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Loading extension module utils...
Time to load utils op: 12.045355319976807 seconds
[2021-04-24 16:17:01,878] [INFO] [utils.py:583:see_memory_usage] Before creating fp16 partitions
[2021-04-24 16:17:01,880] [INFO] [utils.py:588:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.0 GB         Max_CA 0 GB 
[2021-04-24 16:17:01,883] [INFO] [utils.py:593:see_memory_usage] CPU Virtual Memory:  used = 2.66 GB, percent = 20.9%
[2021-04-24 16:17:01,885] [INFO] [stage3.py:39:print_rank_0] fp16 group 0 has 1 subgroups
[2021-04-24 16:17:01,901] [INFO] [stage3.py:39:print_rank_0] Swappable FP32 Partitions: count=0 size= 0.00 GB
[2021-04-24 16:17:01,902] [INFO] [stage3.py:39:print_rank_0] In-Memory FP32 Partitions: count=1 size= 0.00 GB
[2021-04-24 16:17:01,905] [INFO] [stage3.py:819:__init__] optimizer state initialized
[2021-04-24 16:17:01,906] [INFO] [stage3.py:39:print_rank_0] Largest partitioned param numel = 377914
[2021-04-24 16:17:01,992] [INFO] [utils.py:583:see_memory_usage] After initializing ZeRO optimizer
[2021-04-24 16:17:01,994] [INFO] [utils.py:588:see_memory_usage] MA 0.01 GB         Max_MA 0.01 GB         CA 0.02 GB         Max_CA 0 GB 
[2021-04-24 16:17:01,996] [INFO] [utils.py:593:see_memory_usage] CPU Virtual Memory:  used = 2.66 GB, percent = 20.9%
[2021-04-24 16:17:01,999] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed Final Optimizer = adam
[2021-04-24 16:17:02,000] [INFO] [engine.py:451:_configure_lr_scheduler] DeepSpeed using configured LR scheduler = WarmupLR
[2021-04-24 16:17:02,002] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed LR Scheduler = <deepspeed.runtime.lr_schedules.WarmupLR object at 0x7fbba0557410>
[2021-04-24 16:17:02,004] [INFO] [logging.py:60:log_dist] [Rank 0] step=0, skipped=0, lr=[0.001], mom=[[0.8, 0.999]]
[2021-04-24 16:17:02,005] [INFO] [config.py:743:print] DeepSpeedEngine configuration:
[2021-04-24 16:17:02,007] [INFO] [config.py:747:print]   activation_checkpointing_config  {
    "partition_activations": false, 
    "contiguous_memory_optimization": false, 
    "cpu_checkpointing": false, 
    "number_checkpoints": null, 
    "synchronize_checkpoint_boundary": false, 
    "profile": false
}
[2021-04-24 16:17:02,009] [INFO] [config.py:747:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2021-04-24 16:17:02,011] [INFO] [config.py:747:print]   allreduce_always_fp32 ........ False
[2021-04-24 16:17:02,012] [INFO] [config.py:747:print]   amp_enabled .................. False
[2021-04-24 16:17:02,013] [INFO] [config.py:747:print]   amp_params ................... False
[2021-04-24 16:17:02,015] [INFO] [config.py:747:print]   checkpoint_tag_validation_enabled  True
[2021-04-24 16:17:02,016] [INFO] [config.py:747:print]   checkpoint_tag_validation_fail  False
[2021-04-24 16:17:02,018] [INFO] [config.py:747:print]   disable_allgather ............ False
[2021-04-24 16:17:02,019] [INFO] [config.py:747:print]   dump_state ................... False
[2021-04-24 16:17:02,021] [INFO] [config.py:747:print]   dynamic_loss_scale_args ...... {'init_scale': 4294967296, 'scale_window': 1000, 'delayed_shift': 2, 'min_scale': 1}
[2021-04-24 16:17:02,023] [INFO] [config.py:747:print]   elasticity_enabled ........... False
[2021-04-24 16:17:02,024] [INFO] [config.py:747:print]   flops_profiler_config ........ {
    "enabled": false, 
    "profile_step": 1, 
    "module_depth": -1, 
    "top_modules": 3, 
    "detailed": true
}
[2021-04-24 16:17:02,025] [INFO] [config.py:747:print]   fp16_enabled ................. True
[2021-04-24 16:17:02,027] [INFO] [config.py:747:print]   global_rank .................. 0
[2021-04-24 16:17:02,028] [INFO] [config.py:747:print]   gradient_accumulation_steps .. 1
[2021-04-24 16:17:02,030] [INFO] [config.py:747:print]   gradient_clipping ............ 0.0
[2021-04-24 16:17:02,031] [INFO] [config.py:747:print]   gradient_predivide_factor .... 1.0
[2021-04-24 16:17:02,033] [INFO] [config.py:747:print]   initial_dynamic_scale ........ 4294967296
[2021-04-24 16:17:02,035] [INFO] [config.py:747:print]   loss_scale ................... 1
[2021-04-24 16:17:02,037] [INFO] [config.py:747:print]   memory_breakdown ............. False
[2021-04-24 16:17:02,038] [INFO] [config.py:747:print]   optimizer_legacy_fusion ...... False
[2021-04-24 16:17:02,039] [INFO] [config.py:747:print]   optimizer_name ............... adam
[2021-04-24 16:17:02,041] [INFO] [config.py:747:print]   optimizer_params ............. {'lr': 0.001, 'betas': [0.8, 0.999], 'eps': 1e-08, 'weight_decay': 3e-07}
[2021-04-24 16:17:02,044] [INFO] [config.py:747:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2021-04-24 16:17:02,046] [INFO] [config.py:747:print]   pld_enabled .................. False
[2021-04-24 16:17:02,050] [INFO] [config.py:747:print]   pld_params ................... False
[2021-04-24 16:17:02,051] [INFO] [config.py:747:print]   prescale_gradients ........... False
[2021-04-24 16:17:02,053] [INFO] [config.py:747:print]   scheduler_name ............... WarmupLR
[2021-04-24 16:17:02,055] [INFO] [config.py:747:print]   scheduler_params ............. {'warmup_min_lr': 0, 'warmup_max_lr': 0.001, 'warmup_num_steps': 1000}
[2021-04-24 16:17:02,057] [INFO] [config.py:747:print]   sparse_attention ............. None
[2021-04-24 16:17:02,059] [INFO] [config.py:747:print]   sparse_gradients_enabled ..... False
[2021-04-24 16:17:02,062] [INFO] [config.py:747:print]   steps_per_print .............. 2000
[2021-04-24 16:17:02,063] [INFO] [config.py:747:print]   tensorboard_enabled .......... False
[2021-04-24 16:17:02,065] [INFO] [config.py:747:print]   tensorboard_job_name ......... DeepSpeedJobName
[2021-04-24 16:17:02,066] [INFO] [config.py:747:print]   tensorboard_output_path ...... 
[2021-04-24 16:17:02,070] [INFO] [config.py:747:print]   train_batch_size ............. 3
[2021-04-24 16:17:02,072] [INFO] [config.py:747:print]   train_micro_batch_size_per_gpu  3
[2021-04-24 16:17:02,077] [INFO] [config.py:747:print]   wall_clock_breakdown ......... False
[2021-04-24 16:17:02,086] [INFO] [config.py:747:print]   world_size ................... 1
/usr/local/lib/python3.7/dist-packages/torch/cuda/memory.py:346: FutureWarning: torch.cuda.memory_cached has been renamed to torch.cuda.memory_reserved
  FutureWarning)
/usr/local/lib/python3.7/dist-packages/torch/cuda/memory.py:354: FutureWarning: torch.cuda.max_memory_cached has been renamed to torch.cuda.max_memory_reserved
  FutureWarning)
[2021-04-24 16:17:02,097] [INFO] [config.py:747:print]   zero_allow_untested_optimizer  False
[2021-04-24 16:17:02,099] [INFO] [config.py:747:print]   zero_config .................. {
    "stage": 3, 
    "contiguous_gradients": false, 
    "reduce_scatter": false, 
    "reduce_bucket_size": 3.000000e+05, 
    "allgather_partitions": true, 
    "allgather_bucket_size": 5.000000e+08, 
    "overlap_comm": false, 
    "load_from_fp32_weights": true, 
    "elastic_checkpoint": true, 
    "offload_param": null, 
    "offload_optimizer": null, 
    "sub_group_size": 1.000000e+06, 
    "prefetch_bucket_size": 2.000000e+04, 
    "param_persistence_threshold": 1.000000e+04, 
    "max_live_parameters": 6.000000e+05, 
    "max_reuse_distance": 1.000000e+07, 
    "gather_fp16_weights_on_model_save": false
}
[2021-04-24 16:17:02,100] [INFO] [config.py:747:print]   zero_enabled ................. True
[2021-04-24 16:17:02,101] [INFO] [config.py:747:print]   zero_optimization_stage ...... 3
[2021-04-24 16:17:02,103] [INFO] [config.py:754:print]   json = {
    "train_batch_size": 3, 
    "steps_per_print": 2.000000e+03, 
    "optimizer": {
        "type": "Adam", 
        "params": {
            "lr": 0.001, 
            "betas": [0.8, 0.999], 
            "eps": 1e-08, 
            "weight_decay": 3e-07
        }
    }, 
    "fp16": {
        "enabled": true, 
        "loss_scale": 1, 
        "initial_scale_power": 32, 
        "loss_scale_window": 1000, 
        "hysteresis": 2, 
        "min_loss_scale": 1
    }, 
    "scheduler": {
        "type": "WarmupLR", 
        "params": {
            "warmup_min_lr": 0, 
            "warmup_max_lr": 0.001, 
            "warmup_num_steps": 1000
        }
    }, 
    "wall_clock_breakdown": false, 
    "zero_optimization": {
        "stage": 3, 
        "cpu_offload": false, 
        "cpu_offload_params": false, 
        "overlap_comm": false, 
        "contiguous_gradients": false, 
        "stage3_max_live_parameters": 6.000000e+05, 
        "stage3_max_reuse_distance": 1.000000e+07, 
        "stage3_prefetch_bucket_size": 2.000000e+04, 
        "stage3_param_persistence_threshold": 1.000000e+04, 
        "reduce_bucket_size": 3.000000e+05, 
        "sub_group_size": 1.000000e+06
    }
}
Using /root/.cache/torch_extensions as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0007159709930419922 seconds
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-11-250063ddea6d> in <module>()
    124       loss = outputs.mean()
    125 
--> 126       model_engine.backward(loss)
    127       model_engine.step()
    128 print("DONE")

7 frames
/usr/local/lib/python3.7/dist-packages/deepspeed/runtime/zero/linear.py in backward(ctx, grad_output)
     89         if ctx.needs_input_grad[0]:
     90             #print(f"Computing grad input weight {weight.shape} grad_output {grad_output.shape}")
---> 91             grad_input = grad_output.matmul(weight)
     92             #print(f"Computed grad input {grad_input.shape}")
     93         if ctx.needs_input_grad[1]:

RuntimeError: mat1 dim 1 must match mat2 dim 0

fac2003 commented 3 years ago

Here's the updated repro notebook installing from DeepSpeed master and wrapping model in zero init:

https://github.com/fac2003/repro_deepspeed_multi_modality_perceiver_stage_3/blob/main/DeepSpeedMultiModalityPerceiver-from-master.ipynb?short_path=52c54dd

tjruwase commented 3 years ago

@fac2003, thanks for reporting this new issue. Can you please open a new issue report for this?

fcampagnexandr commented 3 years ago

Why not continue with this one until the repro works and the issue is fully fixed? The title still matches the problem: there is an exception when turning on stage 3 for this model. I'd rather not have to provide the repro in yet another issue. Would that be OK with you?

tjruwase commented 3 years ago

There are a number of benefits to restricting an issue report to a single issue or bug, rather than combining multiple, even related, bugs into one report.

  1. This guide on best practices for GitHub issues gives a number of reasons, including the principle of one issue per issue.

  2. Besides enabling collaboration between the reporter and the fixer of an issue, the issue report also serves as an archive or historical record for future users who want to quickly know whether their issue is similar to an old one or completely new. It is a lot easier for those users to filter issue reports when each report is shorter, simpler, and describes a single error message or stack trace as much as possible.

  3. It is also much cleaner if one PR is associated with one issue report, because that makes it much easier to find regressions and port fixes to other branches.

  4. Regarding the title matching the problem, note that the title is overly broad, since many kinds of Python bugs manifest as exceptions in the interpreter. So this is not a good reason to group every bug that triggers a Python exception under the same issue report; dozens of bugs would qualify.

fac2003 commented 3 years ago

@tjruwase It is the same issue from my point of view: the repro code has not changed, and neither has the description. We are now hitting a new exception further along in the repro code, so this is progress, but the issue is not resolved. The fix PR you developed perhaps only addressed the forward pass. I can open another issue if it's easier to track with PRs at your end, but I would not close this one until the repro works. Thanks for your help so far.

fac2003 commented 3 years ago

As requested, I created a new issue here: https://github.com/microsoft/DeepSpeed/issues/1006

tjruwase commented 3 years ago

@fac2003, I greatly appreciate the help. And yes, I don't expect you to close this issue until all issues are resolved to your satisfaction.