microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] "with deepspeed.zero.Init()" is not idempotent #3202

Closed eisene closed 1 year ago

eisene commented 1 year ago

Describe the bug

Intuitively, the Init() context seems like it should be idempotent: it should activate model partitioning, and entering it again should have no unexpected consequences.

However, nesting the Init() context currently causes a hard-to-debug infinite recursion. This breaks libraries that build on DeepSpeed, notably Huggingface Transformers: for example, loading a pre-trained checkpoint of a custom encoder-decoder stack built with the EncoderDecoderModel class results in a model that crashes when you train it.

It looks like this:

deepspeed train_bert_ds.py --checkpoint_dir . --num_layers 2 --h_dim 32                                      (deepspeed) 
[2023-04-12 16:22:59,798] [WARNING] [runner.py:181:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-04-12 16:22:59,806] [INFO] [runner.py:527:main] cmd = /home/eeisenst/miniconda3/envs/deepspeed/bin/python3.9 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None train_bert_ds.py --checkpoint_dir . --num_layers 2 --h_dim 32
[2023-04-12 16:23:00,536] [INFO] [launch.py:133:main] WORLD INFO DICT: {'localhost': [0]}
[2023-04-12 16:23:00,536] [INFO] [launch.py:139:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-04-12 16:23:00,536] [INFO] [launch.py:150:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-04-12 16:23:00,536] [INFO] [launch.py:151:main] dist_world_size=1
[2023-04-12 16:23:00,536] [INFO] [launch.py:153:main] Setting CUDA_VISIBLE_DEVICES=0
2023-04-12 16:23:01.543 | INFO     | __main__:log_dist:53 - [Rank 0] Creating Experiment Directory
2023-04-12 16:23:01.562 | INFO     | __main__:log_dist:53 - [Rank 0] Experiment Directory created at bert_pretrain.2023.4.12.13.23.1.addjtvxg
2023-04-12 16:23:01.562 | INFO     | __main__:log_dist:53 - [Rank 0] Creating Datasets
Reusing dataset wikitext (/home/eeisenst/.cache/huggingface/datasets/wikitext/wikitext-2-v1/1.0.0/aa5e094000ec7afeb74c3be92c88313cd6f132d564c7effd961c10fd47c76f20)
Parameter 'function'=<function create_data_iterator.<locals>.<lambda> at 0x7f2f60bf3160> of the transform datasets.arrow_dataset.Dataset.filter@2.0.1 couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.
Loading cached processed dataset at /home/eeisenst/.cache/huggingface/datasets/wikitext/wikitext-2-v1/1.0.0/aa5e094000ec7afeb74c3be92c88313cd6f132d564c7effd961c10fd47c76f20/cache-629f6fbed82c07cd.arrow
Loading cached processed dataset at /home/eeisenst/.cache/huggingface/datasets/wikitext/wikitext-2-v1/1.0.0/aa5e094000ec7afeb74c3be92c88313cd6f132d564c7effd961c10fd47c76f20/cache-e3e70682c2094cac.arrow
Special tokens have been added in the vocabulary, make sure the associated word embedding are fine-tuned or trained.
2023-04-12 16:23:02.581 | INFO     | __main__:log_dist:53 - [Rank 0] Dataset Creation Done
2023-04-12 16:23:02.581 | INFO     | __main__:log_dist:53 - [Rank 0] Creating Model
[2023-04-12 16:23:02,581] [INFO] [comm.py:586:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-04-12 16:23:03,338] [INFO] [partition_parameters.py:436:__exit__] finished initializing model with 0.00B parameters
[2023-04-12 16:23:03,338] [INFO] [partition_parameters.py:436:__exit__] finished initializing model with 0.00B parameters
2023-04-12 16:23:03.338 | INFO     | __main__:log_dist:53 - [Rank 0] Model Creation Done
2023-04-12 16:23:03.338 | INFO     | __main__:log_dist:53 - [Rank 0] Creating DeepSpeed engine
[2023-04-12 16:23:03,339] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.9.0+458e65f, git-hash=458e65f, git-branch=master
Traceback (most recent call last):
  File "/home/eeisenst/workspace/contribs/DeepSpeedExamples/training/HelloDeepSpeed/train_bert_ds.py", line 861, in <module>
    fire.Fire(train)
  File "/home/eeisenst/miniconda3/envs/deepspeed/lib/python3.9/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/eeisenst/miniconda3/envs/deepspeed/lib/python3.9/site-packages/fire/core.py", line 466, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/eeisenst/miniconda3/envs/deepspeed/lib/python3.9/site-packages/fire/core.py", line 681, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/eeisenst/workspace/contribs/DeepSpeedExamples/training/HelloDeepSpeed/train_bert_ds.py", line 799, in train
    model, _, _, _ = deepspeed.initialize(model=model,
  File "/home/eeisenst/workspace/contribs/DeepSpeed/deepspeed/__init__.py", line 156, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/home/eeisenst/workspace/contribs/DeepSpeed/deepspeed/runtime/zero/partition_parameters.py", line 377, in wrapper
    if not hasattr(module, "_ds_child_entered"):
  File "/home/eeisenst/workspace/contribs/DeepSpeed/deepspeed/runtime/engine.py", line 469, in __getattr__
    if name in dir(self):
  File "/home/eeisenst/miniconda3/envs/deepspeed/lib/python3.9/site-packages/torch/nn/modules/module.py", line 2403, in __dir__
    parameters = list(self._parameters.keys())
  File "/home/eeisenst/workspace/contribs/DeepSpeed/deepspeed/runtime/engine.py", line 469, in __getattr__
    if name in dir(self):
  File "/home/eeisenst/miniconda3/envs/deepspeed/lib/python3.9/site-packages/torch/nn/modules/module.py", line 2403, in __dir__
    parameters = list(self._parameters.keys())
  File "/home/eeisenst/workspace/contribs/DeepSpeed/deepspeed/runtime/engine.py", line 469, in __getattr__
    if name in dir(self):
  File "/home/eeisenst/miniconda3/envs/deepspeed/lib/python3.9/site-packages/torch/nn/modules/module.py", line 2403, in __dir__
    parameters = list(self._parameters.keys())
  File "/home/eeisenst/workspace/contribs/DeepSpeed/deepspeed/runtime/engine.py", line 469, in __getattr__
    if name in dir(self):
  File "/home/eeisenst/miniconda3/envs/deepspeed/lib/python3.9/site-packages/torch/nn/modules/module.py", line 2403, in __dir__
    parameters = list(self._parameters.keys())
  File "/home/eeisenst/workspace/contribs/DeepSpeed/deepspeed/runtime/engine.py", line 469, in __getattr__
    if name in dir(self):
  File "/home/eeisenst/miniconda3/envs/deepspeed/lib/python3.9/site-packages/torch/nn/modules/module.py", line 2403, in __dir__
    parameters = list(self._parameters.keys())
...
  File "/home/eeisenst/miniconda3/envs/deepspeed/lib/python3.9/site-packages/torch/nn/modules/module.py", line 2403, in __dir__
    parameters = list(self._parameters.keys())
  File "/home/eeisenst/workspace/contribs/DeepSpeed/deepspeed/runtime/engine.py", line 469, in __getattr__
    if name in dir(self):
  File "/home/eeisenst/miniconda3/envs/deepspeed/lib/python3.9/site-packages/torch/nn/modules/module.py", line 2403, in __dir__
    parameters = list(self._parameters.keys())
  File "/home/eeisenst/workspace/contribs/DeepSpeed/deepspeed/runtime/engine.py", line 469, in __getattr__
    if name in dir(self):
  File "/home/eeisenst/miniconda3/envs/deepspeed/lib/python3.9/site-packages/torch/nn/modules/module.py", line 2403, in __dir__
    parameters = list(self._parameters.keys())
  File "/home/eeisenst/workspace/contribs/DeepSpeed/deepspeed/runtime/engine.py", line 469, in __getattr__
    if name in dir(self):
  File "/home/eeisenst/miniconda3/envs/deepspeed/lib/python3.9/site-packages/torch/nn/modules/module.py", line 2403, in __dir__
    parameters = list(self._parameters.keys())
  File "/home/eeisenst/workspace/contribs/DeepSpeed/deepspeed/runtime/engine.py", line 469, in __getattr__
    if name in dir(self):
  File "/home/eeisenst/miniconda3/envs/deepspeed/lib/python3.9/site-packages/torch/nn/modules/module.py", line 2401, in __dir__
    module_attrs = dir(self.__class__)
RecursionError: maximum recursion depth exceeded while calling a Python object
[2023-04-12 16:23:04,544] [INFO] [launch.py:297:sigkill_handler] Killing subprocess 902084
[2023-04-12 16:23:04,544] [ERROR] [launch.py:303:sigkill_handler] ['/home/eeisenst/miniconda3/envs/deepspeed/bin/python3.9', '-u', 'train_bert_ds.py', '--local_rank=0', '--checkpoint_dir', '.', '--num_layers', '2', '--h_dim', '32'] exits with return code = 1
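The recursion itself comes from a __getattr__ that falls back to dir(self) while __dir__ reads an attribute that is still missing, so the two keep re-entering each other. A minimal, self-contained sketch of the pattern with placeholder names (not DeepSpeed's actual code):

    class FakeEngine:
        # Stand-in for torch.nn.Module.__dir__, which reads self._parameters.
        def __dir__(self):
            return list(self._parameters.keys())

        # Stand-in for DeepSpeedEngine.__getattr__ (engine.py:469), which consults dir(self).
        def __getattr__(self, name):
            if name in dir(self):
                return object.__getattribute__(self, name)
            raise AttributeError(name)

    # Looking up any missing attribute bounces between __getattr__ and __dir__ until
    # Python gives up. In the traceback above the missing attribute is _parameters,
    # apparently because nn.Module.__init__ has not run yet when the Init() wrapper
    # probes the half-built engine with hasattr(module, "_ds_child_entered").
    try:
        FakeEngine().forward
    except RecursionError as exc:
        print(type(exc).__name__)  # RecursionError, as in the log above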

To Reproduce

In DeepSpeedExamples, go to training/HelloDeepSpeed/train_bert_ds.py and change line 433 from

    roberta_config = RobertaConfig.from_dict(roberta_config_dict)

to

    with deepspeed.zero.Init():
        with deepspeed.zero.Init():
            roberta_config = RobertaConfig.from_dict(roberta_config_dict)

Change the DeepSpeed config to ZeRO stage 3; you may also need to delete line 174 to keep the script from crashing.
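For reference, the stage change amounts to something like this in the DeepSpeed config dict the example passes to deepspeed.initialize (all values here are placeholders; keep the rest of the example's settings as they are):

    ds_config = {
        "train_micro_batch_size_per_gpu": 8,                     # placeholder
        "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},   # placeholder
        "fp16": {"enabled": True},                               # placeholder
        "zero_optimization": {
            "stage": 3,   # the relevant change: stage 3 partitions the parameters themselves
        },
    }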

To reproduce in Huggingface, try fine-tuning an EncoderDecoderModel that's loaded from a pre-trained checkpoint with ZeRO stage 3.
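Roughly, with placeholder model names and config values, and assuming transformers' ZeRO-3 integration via HfDeepSpeedConfig:

    import deepspeed
    from transformers import EncoderDecoderModel
    from transformers.deepspeed import HfDeepSpeedConfig

    # Placeholder ZeRO stage 3 config; a real run needs batch-size and optimizer
    # settings that match the training setup.
    ds_config = {
        "train_micro_batch_size_per_gpu": 1,
        "optimizer": {"type": "Adam", "params": {"lr": 1e-5}},
        "fp16": {"enabled": True},
        "zero_optimization": {"stage": 3},
    }
    dschf = HfDeepSpeedConfig(ds_config)  # keep a reference so ZeRO-3 loading stays active

    # With ZeRO-3 active, each sub-model is loaded under deepspeed.zero.Init(),
    # which is where the nested contexts described above come from.
    model = EncoderDecoderModel.from_encoder_decoder_pretrained(
        "bert-base-uncased", "bert-base-uncased"
    )

    # Building the engine for fine-tuning then dies with the RecursionError shown earlier.
    engine, _, _, _ = deepspeed.initialize(
        model=model, model_parameters=model.parameters(), config=ds_config
    )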

Expected behavior

Entering a nested Init() context should either be a no-op or raise a descriptive error; it should not result in infinite recursion.
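One way the no-op behavior could look, sketched as a generic reentrant context manager with a nesting counter (illustration only; the method names are hypothetical and this is not the fix that eventually landed):

    import threading

    class ReentrantInit:
        # Count nesting depth so that inner with-blocks become no-ops and teardown
        # runs exactly once, on the outermost exit.
        _local = threading.local()

        def __enter__(self):
            depth = getattr(self._local, "depth", 0)
            if depth == 0:
                self._activate()      # hypothetical setup, e.g. installing partitioning hooks
            self._local.depth = depth + 1
            return self

        def __exit__(self, exc_type, exc, tb):
            self._local.depth -= 1
            if self._local.depth == 0:
                self._deactivate()    # hypothetical teardown, only for the outermost context
            return False

        def _activate(self):
            print("partitioning enabled")

        def _deactivate(self):
            print("partitioning disabled")

    with ReentrantInit():
        with ReentrantInit():         # inner entry is a no-op instead of recursing
            pass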

ds_report output

ds_report                                                                                                    (deepspeed) 
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [YES] ...... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [YES] ...... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 but detected 2.0
 [WARNING]  please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [YES] ...... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/eeisenst/miniconda3/envs/deepspeed/lib/python3.9/site-packages/torch']
torch version .................... 2.0.0.post200
deepspeed install path ........... ['/home/eeisenst/workspace/contribs/DeepSpeed/deepspeed']
deepspeed info ................... 0.9.0+458e65f, 458e65f, master
torch cuda version ............... 11.2
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.2

System info (please complete the following information):

Launcher context

deepspeed train_bert_ds.py --checkpoint_dir . --num_layers 2 --h_dim 32

Docker context

No Docker images.

stas00 commented 1 year ago

Related: https://github.com/microsoft/DeepSpeed/issues/2811 https://github.com/microsoft/DeepSpeed/issues/2812

tohtana commented 1 year ago

This issue was solved by #3592. Please feel free to reopen this if you still see the error.